This tutorial shows how to simulate a dataset for logistic regression in R. Logistic regression models the probability that Y = 1 with the logistic function, which is the inverse of the logit link:
\[ Prob \{Y=1|X \} = \frac{1}{1+e^{- X \beta}} \]
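As a small illustration of the formula above, the sketch below writes the logistic function out by hand and compares it with base R's `plogis()`. The helper name `inv_logit` and the values in `eta` are illustrative assumptions, not part of the tutorial's own code.

```r
# Logistic (inverse-logit) function written out explicitly
inv_logit <- function(eta) 1 / (1 + exp(-eta))

# Illustrative values of X beta (chosen arbitrarily for this sketch)
eta <- c(-2, 0, 2)

inv_logit(eta)        # probabilities strictly between 0 and 1
plogis(eta)           # base R's built-in logistic function gives the same values
qlogis(plogis(eta))   # the logit link maps the probabilities back to eta
```

Either form can be used in the later steps; the tutorial writes the formula out explicitly.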
The steps below simulate such a dataset, working from the predictors to the binary outcome.
Step 1: Generate Xs
Suppose the logistic regression has two predictors, X_1 and X_2. The following R code draws each of them from a normal distribution.
# set the size of the sample
n <- 200
# set seed so the results are reproducible
set.seed(123)
# simulate X_1 (normal distribution; mean = 1, SD = 1)
X_1 <- rnorm(n, mean = 1, sd = 1)
# print out the first 6 values of X_1
head(X_1)
# simulate X_2 (normal distribution; mean = 2, SD = 1)
X_2 <- rnorm(n, mean = 2, sd = 1)
# print out the first 6 values of X_2
head(X_2)
> head(X_1)
[1] 0.4395244 0.7698225 2.5587083
[4] 1.0705084 1.1292877 2.7150650
> head(X_2)
[1] 4.198810 3.312413 1.734855 2.543194
[5] 1.585660 1.523753
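A quick sanity check on the simulated predictors can confirm that they behave as intended. The sketch below repeats the draws from Step 1 and compares the sample means and standard deviations with their targets. It assumes the same seed and sample size as above, and the labels inside `c()` are illustrative.

```r
# Repeat the Step 1 draws so this block runs on its own
set.seed(123)
n <- 200
X_1 <- rnorm(n, mean = 1, sd = 1)
X_2 <- rnorm(n, mean = 2, sd = 1)

# Sample means and SDs should be close to, but not exactly, the targets
round(c(mean_X1 = mean(X_1), sd_X1 = sd(X_1),
        mean_X2 = mean(X_2), sd_X2 = sd(X_2)), 2)

# The two predictors are drawn independently, so their correlation should be near 0
cor(X_1, X_2)
```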
Step 2: Generate Xβ
Next, compute the linear predictor Xβ. The true coefficients used in this simulation are an intercept of 0.5, a slope of 0.3 for X_1, and a slope of 0.8 for X_2:
Xβ = 0.5+0.3*X_1+0.8*X_2
xb <- 0.5 + 0.3*X_1 + 0.8*X_2
head(xb)
> head(xb)
[1] 3.990906 3.380877 2.655496 2.855708
[5] 2.107314 2.533522
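The same linear predictor can also be written as a matrix product, which matches the Xβ notation more literally. This is a minimal sketch; the names `beta`, `X`, and `xb_matrix` are illustrative and do not appear in the tutorial's own code.

```r
# Repeat the Step 1 draws so this block runs on its own
set.seed(123)
n <- 200
X_1 <- rnorm(n, 1, 1)
X_2 <- rnorm(n, 2, 1)

# True coefficients from Step 2: intercept, slope for X_1, slope for X_2
beta <- c(0.5, 0.3, 0.8)

# Design matrix with a leading column of 1s for the intercept
X <- cbind(1, X_1, X_2)

# Matrix product X %*% beta; drop() turns the n x 1 result into a plain vector
xb_matrix <- drop(X %*% beta)

# Element-wise version used in the tutorial
xb <- 0.5 + 0.3 * X_1 + 0.8 * X_2

# The two versions agree up to floating-point tolerance
all.equal(xb_matrix, xb)
```

The matrix form scales more easily when there are many predictors, since only `beta` and the columns of `X` need to change.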
Step 3: Generate p
Apply the logistic function to Xβ to obtain, for each observation, the probability p that Y = 1.
p <- 1/(1 + exp(-xb))
head(p)
> head(p)
[1] 0.9818525 0.9671015 0.9343490
[4] 0.9456130 0.8916121 0.9264587
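As a check on Steps 2 and 3, the first simulated observation can be worked by hand from the printed values of X_1 and X_2. The results are rounded, so they agree with the R output only to about four decimal places.

```latex
\[
X_1\beta = 0.5 + 0.3(0.4395244) + 0.8(4.198810) \approx 3.9909,
\qquad
p_1 = \frac{1}{1 + e^{-3.9909}} \approx 0.9819
\]
```

Here the subscript 1 refers to the first row of the data, not to the predictor X_1.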
Step 4: Generate binary Y and combine into a data frame
Draw each binary Y from a Bernoulli distribution with its own probability p, and then combine X_1, X_2, and Y into a data frame.
# generate Y (one Bernoulli draw per observation)
Y <- rbinom(n, size = 1, prob = p)
head(Y)
# combine X_1, X_2, and Y into a data frame
df <- data.frame(X_1 = X_1, X_2 = X_2, Y = Y)
head(df)
> head(Y)
[1] 1 1 1 1 1 0
> head(df)
        X_1      X_2 Y
1 0.4395244 4.198810 1
2 0.7698225 3.312413 1
3 2.5587083 1.734855 1
4 1.0705084 2.543194 1
5 1.1292877 1.585660 1
6 2.7150650 1.523753 0
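Because `rbinom()` accepts a vector for `prob`, each observation gets its own success probability. The sketch below repeats Steps 1 to 4 in one place and compares the observed share of 1s with the average true probability. It is a rough check under the same seed and settings as above, not a formal test.

```r
# Repeat Steps 1 to 3 so this block runs on its own
set.seed(123)
n <- 200
X_1 <- rnorm(n, 1, 1)
X_2 <- rnorm(n, 2, 1)
xb <- 0.5 + 0.3 * X_1 + 0.8 * X_2
p <- 1 / (1 + exp(-xb))

# One Bernoulli draw per observation, each with its own probability p[i]
Y <- rbinom(n, size = 1, prob = p)

table(Y)   # counts of 0s and 1s
mean(Y)    # observed share of 1s
mean(p)    # average true probability; should be close to mean(Y)
```

With these coefficients most probabilities are close to 1, so the simulated outcome is expected to be unbalanced, with far more 1s than 0s.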
Step 5 (optional): Run a logistic regression using simulated data
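Optionally, fit a logistic regression to the simulated data to see how well it recovers the true coefficients. Based on the output, the estimated Xβ is shown below. The estimates differ from the true values in Step 2 (0.5, 0.3, and 0.8) because of sampling variability: with n = 200 and most probabilities close to 1, the standard errors are large.
Xβ = 0.95+0.51*X_1+0.44*X_2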
# run logistic regression using simulated data
results <- glm(Y ~ X_1 + X_2, data = df, family = "binomial")
summary(results)
> summary(results)
Call:
glm(formula = Y ~ X_1 + X_2, family = "binomial", data = df)
Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4665   0.3153   0.4078   0.4879   0.7932  
Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)   0.9491     0.5517   1.720   0.0854 .
X_1           0.5063     0.2816   1.798   0.0721 .
X_2           0.4391     0.2513   1.747   0.0806 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 130.03  on 199  degrees of freedom
Residual deviance: 123.80  on 197  degrees of freedom
AIC: 129.8
Number of Fisher Scoring iterations: 5
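To see the estimates move closer to the true coefficients, the whole simulation can be wrapped in a function and rerun with a larger sample. This is a minimal sketch: the function name `simulate_logit`, the objects `big_df` and `big_fit`, and the sample size of 20000 are assumptions made for illustration.

```r
# Hypothetical helper that bundles Steps 1 to 4 into one function
simulate_logit <- function(n, beta = c(0.5, 0.3, 0.8)) {
  X_1 <- rnorm(n, 1, 1)                          # Step 1: predictors
  X_2 <- rnorm(n, 2, 1)
  xb <- beta[1] + beta[2] * X_1 + beta[3] * X_2  # Step 2: linear predictor
  p <- 1 / (1 + exp(-xb))                        # Step 3: probabilities
  Y <- rbinom(n, size = 1, prob = p)             # Step 4: binary outcome
  data.frame(X_1 = X_1, X_2 = X_2, Y = Y)
}

set.seed(123)
big_df <- simulate_logit(20000)   # sample size chosen arbitrarily for this sketch
big_fit <- glm(Y ~ X_1 + X_2, data = big_df, family = "binomial")

# Compare true and estimated coefficients side by side
cbind(true = c(0.5, 0.3, 0.8), estimate = coef(big_fit))
```

Standard errors shrink roughly with the square root of the sample size, so the estimates from the larger sample should sit much closer to 0.5, 0.3, and 0.8 than those from n = 200.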
		