How to Simulate a Dataset for Logistic Regression in R

This tutorial shows the steps for simulating a dataset for logistic regression in R. Logistic regression models the probability that Y equals 1, given X, through the logistic function:

\[ \Pr\{Y = 1 \mid X\} = \frac{1}{1 + e^{-X\beta}} \]
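
As a quick aside, this transformation (the inverse logit) is also available in base R as plogis(), so a hand-written version can be checked against it. Below is a minimal sketch; the name inv_logit is just an illustrative choice and is not used elsewhere in this tutorial.

# hand-written inverse logit: maps any real number into (0, 1)
inv_logit <- function(x) 1 / (1 + exp(-x))

# base R's plogis() computes the same quantity
inv_logit(0)         # 0.5
plogis(0)            # 0.5
inv_logit(c(-2, 2))  # approximately 0.12 and 0.88
plogis(c(-2, 2))     # same values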

In particular, the following steps simulate a dataset for logistic regression in R.

Step 1: Generate Xs

Suppose that we have two predictors, X_1 and X_2, in the logistic regression. The following R code generates them.

# set the sample size
n <- 200

# set the random seed so the simulation is reproducible
set.seed(123)

# simulate X_1 (normal distribution; mean = 1, SD = 1)
X_1 <- rnorm(n, 1, 1)

# print out the first 6 values of X_1
head(X_1)

# simulate X_2 (normal distribution; mean = 2, SD = 1)
X_2 <- rnorm(n, 2, 1)

# print out the first 6 values of X_2
head(X_2)
> head(X_1)
[1] 0.4395244 0.7698225 2.5587083
[4] 1.0705084 1.1292877 2.7150650

> head(X_2)
[1] 4.198810 3.312413 1.734855 2.543194
[5] 1.585660 1.523753
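
As an optional sanity check (not part of the original steps), you can confirm that the simulated predictors have roughly the means and standard deviations passed to rnorm():

# sample moments should be close to the specified mean and SD
mean(X_1); sd(X_1)   # should be near 1 and 1
mean(X_2); sd(X_2)   # should be near 2 and 1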

Step 2: Generate Xβ

The following code generates the linear predictor Xβ. The true values of the coefficients are 0.5 for the intercept, 0.3 for X_1, and 0.8 for X_2, so that:

\[ X\beta = 0.5 + 0.3 X_1 + 0.8 X_2 \]

# compute the linear predictor Xβ using the true coefficients
xb <- 0.5 + 0.3*X_1 + 0.8*X_2
head(xb)
> head(xb)
[1] 3.990906 3.380877 2.655496 2.855708
[5] 2.107314 2.533522
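
If you prefer the matrix notation Xβ used in the formula at the top, the same linear predictor can be computed with a design matrix and a coefficient vector. This is just an equivalent sketch of Step 2; X_mat and beta_true are illustrative names, not part of the tutorial's code.

# design matrix with an intercept column, plus the true coefficient vector
X_mat <- cbind(1, X_1, X_2)
beta_true <- c(0.5, 0.3, 0.8)

# matrix multiplication gives the same linear predictor as xb above
xb_matrix <- as.vector(X_mat %*% beta_true)
all.equal(xb, xb_matrix)   # should be TRUE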

Step 3: Generate p

The following code converts the linear predictor into the probability p using the logistic function shown above.

# apply the logistic (inverse logit) transformation
p <- 1/(1 + exp(-xb))
head(p)
> head(p)
[1] 0.9818525 0.9671015 0.9343490
[4] 0.9456130 0.8916121 0.9264587
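
As a quick optional check, base R's plogis() should reproduce the hand-coded probabilities, and range() confirms that every p lies between 0 and 1:

# plogis() is the built-in inverse logit, so it should match p
all.equal(p, plogis(xb))   # should be TRUE

# probabilities should fall strictly between 0 and 1
range(p)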

Step 4: Generate binary Y and combine into a data frame

The following code generates binary Y by drawing from a Bernoulli distribution (a binomial with size = 1) with probability p, and then combines X_1, X_2, and Y into a data frame.

# generate Y as a Bernoulli draw with probability p
Y <- rbinom(n, size = 1, prob = p)
head(Y)

# combine X_1, X_2, and Y into a data frame
df <- data.frame(X_1 = X_1, X_2 = X_2, Y = Y)
head(df)
> head(Y)
[1] 1 1 1 1 1 0

> head(df)
        X_1      X_2 Y
1 0.4395244 4.198810 1
2 0.7698225 3.312413 1
3 2.5587083 1.734855 1
4 1.0705084 2.543194 1
5 1.1292877 1.585660 1
6 2.7150650 1.523753 0
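
With these coefficients the linear predictor averages roughly 0.5 + 0.3*1 + 0.8*2 = 2.4, so most probabilities are high and most simulated Y values will be 1. An optional check of the class balance:

# tabulate the simulated outcome and compute the share of 1s
table(Y)
mean(Y)   # proportion of 1s; should be high with these coefficients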

Step 5 (optional): Run a logistic regression using simulated data

This optional step runs a logistic regression on the simulated data to see how well the true coefficients are recovered. Based on the output, the estimated equation is the following, which is slightly different from the true values set in Step 2:

\[ \widehat{X\beta} = 0.95 + 0.51 X_1 + 0.44 X_2 \]

# run logistic regression using the simulated data
results <- glm(Y ~ X_1 + X_2, data = df, family = "binomial")
summary(results)
> summary(results)

Call:
glm(formula = Y ~ X_1 + X_2, family = "binomial", data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4665   0.3153   0.4078   0.4879   0.7932  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)   0.9491     0.5517   1.720   0.0854 .
X_1           0.5063     0.2816   1.798   0.0721 .
X_2           0.4391     0.2513   1.747   0.0806 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 130.03  on 199  degrees of freedom
Residual deviance: 123.80  on 197  degrees of freedom
AIC: 129.8

Number of Fisher Scoring iterations: 5
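
To put the estimates side by side with the true values from Step 2, the following optional sketch can be used (beta_true is just an illustrative name):

# compare the fitted coefficients with the true values used in Step 2
beta_true <- c(0.5, 0.3, 0.8)
cbind(true = beta_true, estimated = coef(results))

# with a larger sample size (e.g., n = 20000), the estimates
# would typically land much closer to the true values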