This tutorial shows how to simulate a dataset for logistic regression in R. Logistic regression models the probability that Y = 1 with the logistic function, which is the inverse of the logit link.
\[ Prob \{Y=1|X \} = \frac{1}{1+e^{- X \beta}} \]
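Solving this formula for Xβ gives the equivalent logit form, in which the log-odds are linear in the predictors. The short derivation below is added for orientation; writing p for Prob{Y=1|X} is a notational shorthand, and it is the same p that Step 3 computes.
\[ p = \frac{1}{1+e^{- X \beta}} \;\Longrightarrow\; \frac{p}{1-p} = e^{X \beta} \;\Longrightarrow\; \log \frac{p}{1-p} = X \beta \]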
In particular, the following are the steps for simulating a dataset for logistic regression in R.
Step 1: Generate Xs
Suppose the logistic regression has two predictors, X_1 and X_2. The following R code generates them.
# set the size of the sample
n <- 200
# set seed so the results are reproducible
set.seed(123)
# simulate X_1 (normal distribution; mean = 1, SD = 1)
X_1 <- rnorm(n, 1, 1)
# print out the first 6 values of X_1
head(X_1)
# simulate X_2 (normal distribution; mean = 2, SD = 1)
X_2 <- rnorm(n, 2, 1)
# print out the first 6 values of X_2
head(X_2)
> head(X_1)
[1] 0.4395244 0.7698225 2.5587083
[4] 1.0705084 1.1292877 2.7150650
> head(X_2)
[1] 4.198810 3.312413 1.734855 2.543194
[5] 1.585660 1.523753
Step 2: Generate Xβ
The following computes the linear predictor Xβ. The true coefficients are an intercept of 0.5, a slope of 0.3 for X_1, and a slope of 0.8 for X_2:
Xβ = 0.5+0.3*X_1+0.8*X_2
xb <- 0.5 + 0.3 * X_1 + 0.8 * X_2
head(xb)
> head(xb)
[1] 3.990906 3.380877 2.655496 2.855708
[5] 2.107314 2.533522
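The same linear predictor can also be written as a matrix product, which extends more easily to many predictors. The sketch below is one way to do it; the names X, beta, and xb_matrix are introduced here for illustration and are not part of the original code.

```r
# regenerate the predictors from Step 1 so the block runs on its own
n <- 200
set.seed(123)
X_1 <- rnorm(n, 1, 1)
X_2 <- rnorm(n, 2, 1)

# design matrix with a leading column of 1s for the intercept (illustrative name)
X <- cbind(1, X_1, X_2)
# true coefficients from Step 2: intercept, X_1, X_2 (illustrative name)
beta <- c(0.5, 0.3, 0.8)

# the matrix product is an n x 1 matrix; drop() turns it into a plain vector
xb_matrix <- drop(X %*% beta)

# compare with the element-wise formula used in Step 2
xb <- 0.5 + 0.3 * X_1 + 0.8 * X_2
all.equal(xb_matrix, xb)
```

With this layout, adding a predictor means adding one column to X and one entry to beta.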
Step 3: Generate p
The following converts Xβ into the probability p = Prob{Y=1|X} by applying the logistic function from the formula above.
p <- 1/(1 + exp(-xb))
head(p)
> head(p)
[1] 0.9818525 0.9671015 0.9343490
[4] 0.9456130 0.8916121 0.9264587
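Base R also provides plogis(), the logistic distribution function in the stats package, which computes the same quantity as the hand-written formula. The sketch below checks that the two agree; it uses a few made-up values of the linear predictor (xb_example) rather than the simulated ones, so it runs on its own.

```r
# a few illustrative values of the linear predictor (not the simulated ones)
xb_example <- c(-2, 0, 0.5, 3)

# hand-written logistic function, as in Step 3
p_manual <- 1 / (1 + exp(-xb_example))

# built-in logistic distribution function from the stats package
p_builtin <- plogis(xb_example)

# the two should agree up to floating-point error
all.equal(p_manual, p_builtin)

# qlogis() is the logit, so it maps the probabilities back to the linear predictor
all.equal(qlogis(p_builtin), xb_example)
```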
Step 4: Generate binary Y and combine into a data frame
The following draws a binary Y from a Bernoulli distribution with success probability p, and then combines X_1, X_2, and Y into a data frame.
# generate Y (size = 1 makes each draw a single Bernoulli trial)
Y <- rbinom(n, size = 1, prob = p)
head(Y)
# combine X_1, X_2, and Y into a data frame
df <- data.frame(X_1 = X_1, X_2 = X_2, Y = Y)
head(df)
> head(Y)
[1] 1 1 1 1 1 0
> head(df)
        X_1      X_2 Y
1 0.4395244 4.198810 1
2 0.7698225 3.312413 1
3 2.5587083 1.734855 1
4 1.0705084 2.543194 1
5 1.1292877 1.585660 1
6 2.7150650 1.523753 0
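Steps 1 to 4 can be collected into one function so that datasets with a different sample size or different coefficients are easy to generate. The sketch below is one possible arrangement; the function name simulate_logit and its arguments are hypothetical, and the predictor distributions are the ones assumed in Step 1.

```r
# hypothetical helper that wraps Steps 1-4; names and defaults are illustrative
simulate_logit <- function(n = 200, beta = c(0.5, 0.3, 0.8), seed = 123) {
  set.seed(seed)
  X_1 <- rnorm(n, mean = 1, sd = 1)                 # Step 1: predictors
  X_2 <- rnorm(n, mean = 2, sd = 1)
  xb  <- beta[1] + beta[2] * X_1 + beta[3] * X_2    # Step 2: linear predictor
  p   <- 1 / (1 + exp(-xb))                         # Step 3: probability of Y = 1
  Y   <- rbinom(n, size = 1, prob = p)              # Step 4: binary outcome
  data.frame(X_1 = X_1, X_2 = X_2, Y = Y)
}

df_sim <- simulate_logit()
head(df_sim)

# share of observations with Y = 1
mean(df_sim$Y)
```

With the default arguments the function makes the same sequence of random draws as the steps above, so it should reproduce the same data frame.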
Step 5 (optional): Run a logistic regression using simulated data
The following fits a logistic regression to the simulated data. This step is optional. Based on the output, the estimated Xβ is shown below. The estimates differ from the true values in Step 2 because of sampling variability in a sample of 200 observations, although each one lies within two standard errors of its true value.
Estimated Xβ = 0.95+0.51*X_1+0.44*X_2
# run logistic regression using simulated data
results <- glm(Y ~ X_1 + X_2, data = df, family = "binomial")
summary(results)
> summary(results)

Call:
glm(formula = Y ~ X_1 + X_2, family = "binomial", data = df)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.4665   0.3153   0.4078   0.4879   0.7932

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.9491     0.5517   1.720   0.0854 .
X_1           0.5063     0.2816   1.798   0.0721 .
X_2           0.4391     0.2513   1.747   0.0806 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 130.03  on 199  degrees of freedom
Residual deviance: 123.80  on 197  degrees of freedom
AIC: 129.8

Number of Fisher Scoring iterations: 5
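One way to check that the gap between the estimates and the true coefficients comes from the limited sample size is to repeat the simulation with a much larger n. The sketch below does this under the same setup; the value n_big = 50000 and the object names ending in _big are arbitrary choices for illustration, and no output is shown because it was not part of the original run.

```r
# repeat Steps 1-5 with a much larger sample (size chosen arbitrarily)
n_big <- 50000
set.seed(123)

# Step 1: predictors, stored directly in a data frame
df_big <- data.frame(X_1 = rnorm(n_big, 1, 1), X_2 = rnorm(n_big, 2, 1))

# Steps 2-3: linear predictor with the true coefficients, then probability
true_beta <- c(0.5, 0.3, 0.8)
xb_big <- true_beta[1] + true_beta[2] * df_big$X_1 + true_beta[3] * df_big$X_2
p_big <- 1 / (1 + exp(-xb_big))

# Step 4: binary outcome
df_big$Y <- rbinom(n_big, size = 1, prob = p_big)

# Step 5: fit the logistic regression
fit_big <- glm(Y ~ X_1 + X_2, data = df_big, family = "binomial")

# estimated coefficients next to the true values from Step 2
cbind(estimate = coef(fit_big), true = true_beta)

# Wald 95% confidence intervals for the estimates
confint.default(fit_big)
```

The estimates from the larger sample should sit much closer to 0.5, 0.3, and 0.8, and the confidence intervals should be correspondingly narrower.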