How to Simulate a Dataset for Logistic Regression in R

This tutorial shows the steps for simulating a dataset for logistic regression in R. Logistic regression models the probability that Y equals 1, given X, through the logistic function:

\[ \Pr\{Y = 1 \mid X\} = \frac{1}{1 + e^{-X\beta}} \]
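
As a quick aside, this transformation (the inverse logit) is also available in base R as plogis(), so a hand-written version can be checked against it. Below is a minimal sketch; the name inv_logit is just an illustrative choice and is not used elsewhere in this tutorial.

# hand-written inverse logit: maps any real number into (0, 1)
inv_logit <- function(x) 1 / (1 + exp(-x))

# base R's plogis() computes the same quantity
inv_logit(0)         # 0.5
plogis(0)            # 0.5
inv_logit(c(-2, 2))  # approximately 0.12 and 0.88
plogis(c(-2, 2))     # same values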

In particular, the following steps simulate a dataset for logistic regression in R.

Step 1: Generate Xs

Suppose that we have two predictors, X_1 and X_2, in the logistic regression. The following R code generates them.

# set the sample size
n <- 200

# set the random seed so the simulation is reproducible
set.seed(123)

# simulate X_1 (normal distribution; mean = 1, SD = 1)
X_1 <- rnorm(n, 1, 1)

# print out the first 6 values of X_1
head(X_1)

# simulate X_2 (normal distribution; mean = 2, SD = 1)
X_2 <- rnorm(n, 2, 1)

# print out the first 6 values of X_2
head(X_2)
> head(X_1)
[1] 0.4395244 0.7698225 2.5587083
[4] 1.0705084 1.1292877 2.7150650

> head(X_2)
[1] 4.198810 3.312413 1.734855 2.543194
[5] 1.585660 1.523753
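
As an optional sanity check (not part of the original steps), you can confirm that the simulated predictors have roughly the means and standard deviations passed to rnorm():

# sample moments should be close to the specified mean and SD
mean(X_1); sd(X_1)   # should be near 1 and 1
mean(X_2); sd(X_2)   # should be near 2 and 1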

Step 2: Generate Xβ

The following code generates the linear predictor Xβ. The true values of the coefficients are 0.5 for the intercept, 0.3 for X_1, and 0.8 for X_2, so that:

\[ X\beta = 0.5 + 0.3 X_1 + 0.8 X_2 \]

# compute the linear predictor Xβ using the true coefficients
xb <- 0.5 + 0.3*X_1 + 0.8*X_2
head(xb)
> head(xb)
[1] 3.990906 3.380877 2.655496 2.855708
[5] 2.107314 2.533522
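
If you prefer the matrix notation Xβ used in the formula at the top, the same linear predictor can be computed with a design matrix and a coefficient vector. This is just an equivalent sketch of Step 2; X_mat and beta_true are illustrative names, not part of the tutorial's code.

# design matrix with an intercept column, plus the true coefficient vector
X_mat <- cbind(1, X_1, X_2)
beta_true <- c(0.5, 0.3, 0.8)

# matrix multiplication gives the same linear predictor as xb above
xb_matrix <- as.vector(X_mat %*% beta_true)
all.equal(xb, xb_matrix)   # should be TRUE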

Step 3: Generate p

The following code converts the linear predictor into the probability p using the logistic function shown above.

# apply the logistic (inverse logit) transformation
p <- 1/(1 + exp(-xb))
head(p)
> head(p)
[1] 0.9818525 0.9671015 0.9343490
[4] 0.9456130 0.8916121 0.9264587
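
As a quick optional check, base R's plogis() should reproduce the hand-coded probabilities, and range() confirms that every p lies between 0 and 1:

# plogis() is the built-in inverse logit, so it should match p
all.equal(p, plogis(xb))   # should be TRUE

# probabilities should fall strictly between 0 and 1
range(p)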

Step 4: Generate binary Y and combine into a data frame

The following code generates binary Y by drawing from a Bernoulli distribution (a binomial with size = 1) with probability p, and then combines X_1, X_2, and Y into a data frame.

# generate Y as a Bernoulli draw with probability p
Y <- rbinom(n, size = 1, prob = p)
head(Y)

# combine X_1, X_2, and Y into a data frame
df <- data.frame(X_1 = X_1, X_2 = X_2, Y = Y)
head(df)
> head(Y)
[1] 1 1 1 1 1 0

> head(df)
        X_1      X_2 Y
1 0.4395244 4.198810 1
2 0.7698225 3.312413 1
3 2.5587083 1.734855 1
4 1.0705084 2.543194 1
5 1.1292877 1.585660 1
6 2.7150650 1.523753 0
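
With these coefficients the linear predictor averages roughly 0.5 + 0.3*1 + 0.8*2 = 2.4, so most probabilities are high and most simulated Y values will be 1. An optional check of the class balance:

# tabulate the simulated outcome and compute the share of 1s
table(Y)
mean(Y)   # proportion of 1s; should be high with these coefficients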

Step 5 (optional): Run a logistic regression using simulated data

This optional step runs a logistic regression on the simulated data to see how well the true coefficients are recovered. Based on the output, the estimated equation is the following, which is slightly different from the true values set in Step 2:

\[ \widehat{X\beta} = 0.95 + 0.51 X_1 + 0.44 X_2 \]

# run logistic regression using the simulated data
results <- glm(Y ~ X_1 + X_2, data = df, family = "binomial")
summary(results)
> summary(results)

Call:
glm(formula = Y ~ X_1 + X_2, family = "binomial", data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4665   0.3153   0.4078   0.4879   0.7932  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)   0.9491     0.5517   1.720   0.0854 .
X_1           0.5063     0.2816   1.798   0.0721 .
X_2           0.4391     0.2513   1.747   0.0806 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 130.03  on 199  degrees of freedom
Residual deviance: 123.80  on 197  degrees of freedom
AIC: 129.8

Number of Fisher Scoring iterations: 5
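
To put the estimates side by side with the true values from Step 2, the following optional sketch can be used (beta_true is just an illustrative name):

# compare the fitted coefficients with the true values used in Step 2
beta_true <- c(0.5, 0.3, 0.8)
cbind(true = beta_true, estimated = coef(results))

# with a larger sample size (e.g., n = 20000), the estimates
# would typically land much closer to the true values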