This tutorial shows how to simulate a dataset for Poisson regression in R.
Step 1: Determine the model
Suppose the following model holds, with known population parameters: an intercept of 0.2 and regression coefficients of 0.2 (for M) and 0.08 (for K). In practice, such parameters are usually unknown and must be estimated from data.
\[ \log\big(E[Y \mid M, K]\big) = 0.2 + 0.2 \times M + 0.08 \times K \]
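Because Poisson regression uses a log link, the coefficients act on the log of the expected count, so exponentiating them gives multiplicative effects. The following is a minimal sketch of that interpretation; the names `beta`, `rate_ratio`, `mu_at_zero`, and `mu_M_plus1` are illustrative and do not appear elsewhere in this tutorial.

```r
# True coefficients from the known model (on the log scale)
beta <- c(intercept = 0.2, M = 0.2, K = 0.08)

# Exponentiating gives multiplicative effects on the expected count
rate_ratio <- exp(beta)
round(rate_ratio, 3)

# Expected count at M = 0, K = 0, and after increasing M by one unit
mu_at_zero <- exp(beta[["intercept"]])
mu_M_plus1 <- exp(beta[["intercept"]] + beta[["M"]] * 1)

# The ratio of the two expected counts equals exp(0.2)
mu_M_plus1 / mu_at_zero
```

In other words, a one-unit increase in M multiplies the expected count by exp(0.2), holding K fixed.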
Step 2: Simulate Independent Variables (IVs) in the known model
We randomly generate M and K from two normal distributions: M with mean 2 and standard deviation 3, and K with mean 5 and standard deviation 4. Note that you can use other distributions instead, for instance binary data for M and/or K.
# set the size of the sample
n <- 500
# set seed for reproducibility
set.seed(123)
# generate M and K
M <- rnorm(n, 2, 3)
K <- rnorm(n, 5, 4)
# print out the first 6 values of M and K
head(M)
head(K)
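As a sketch of the binary alternative mentioned above, one option is to draw a predictor with rbinom(). The name `K_binary` and the success probability of 0.5 are assumptions chosen for illustration, not part of the original model.

```r
# Same sample size and seed as in the tutorial
n <- 500
set.seed(123)

# Continuous predictor, as in the tutorial
M <- rnorm(n, mean = 2, sd = 3)

# Hypothetical alternative: a binary predictor equal to 1 with probability 0.5
K_binary <- rbinom(n, size = 1, prob = 0.5)

# Count how many 0s and 1s were generated
table(K_binary)
```

A binary predictor generated this way could replace K in the linear predictor, in which case its coefficient describes the log ratio of expected counts between the two groups.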
Step 3: Simulate dependent variable (DV) in the known model
Note that Poisson regression uses a log link, so we apply its inverse, exp(), to the linear combination of the IVs (X) to obtain the expected value of the DV (Y). We then use rpois() to generate the counts.
# log link being used: the expected count is exp() of the linear predictor
mu_1 <- exp(0.2 + 0.2 * M + 0.08 * K)
Y <- rpois(n, lambda = mu_1)
# combine them into a data frame and print out the first 6 rows
data <- data.frame(M = M, K = K, Y = Y)
head(data)
The following is the output:
> head(data)
          M         K Y
1 0.3185731  2.592429 0
2 1.3094675  1.025206 0
3 6.6761249  9.107140 6
4 2.2115252  8.004245 4
5 2.3878632 -1.036666 2
6 7.1451950  4.619410 8
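Before fitting a model, a quick sanity check on the simulated counts can be useful. The sketch below repeats the simulation steps so that it runs on its own, then checks that Y consists of non-negative integers and that its sample mean is near the mean of the simulated rates. These checks are a suggestion, not part of the original procedure.

```r
# Repeat the simulation so this block is self-contained
set.seed(123)
n <- 500
M <- rnorm(n, 2, 3)
K <- rnorm(n, 5, 4)
mu_1 <- exp(0.2 + 0.2 * M + 0.08 * K)
Y <- rpois(n, lambda = mu_1)

# Poisson counts must be non-negative integers
all(Y >= 0 & Y == round(Y))

# The sample mean of Y should be close to the average simulated rate
c(mean_Y = mean(Y), mean_mu = mean(mu_1))
```

If the two means differ substantially, the most likely cause is a mistake in the link step, such as passing the linear predictor to rpois() without exp().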
Step 4: Use glm() to check if we simulate Poisson regression correctly
We can use glm() to see if the regression coefficients are close to those in the known model.
result_Poisson <- glm(Y ~ M + K, data = data, family = poisson(link = log))
result_Poisson
The following is the output. The estimated coefficient is 0.20044 for M and 0.07496 for K, and the estimated intercept is 0.24061. These are close to the true values in the known model from Step 1 (0.2, 0.08, and 0.2, respectively), which indicates that the data were simulated correctly for Poisson regression in R.
> result_Poisson

Call:  glm(formula = Y ~ M + K, family = poisson(link = log), data = data)

Coefficients:
(Intercept)            M            K
    0.24061      0.20044      0.07496

Degrees of Freedom: 499 Total (i.e. Null);  497 Residual
Null Deviance:     1307
Residual Deviance: 545.3    AIC: 1884
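To make "close" more concrete, one option is to compare the true parameters with Wald confidence intervals from the fitted model. The sketch below is self-contained and uses confint.default() from the stats package; the names `fit`, `true_beta`, `ci`, and `covered` are illustrative, and whether each interval covers its true value will vary from one simulated dataset to another.

```r
# Repeat the simulation and the fit so this block is self-contained
set.seed(123)
n <- 500
M <- rnorm(n, 2, 3)
K <- rnorm(n, 5, 4)
Y <- rpois(n, lambda = exp(0.2 + 0.2 * M + 0.08 * K))
fit <- glm(Y ~ M + K, family = poisson(link = "log"))

# True parameters from the known model, named to match coef(fit)
true_beta <- c("(Intercept)" = 0.2, M = 0.2, K = 0.08)

# 95% Wald confidence intervals based on the estimated standard errors
ci <- confint.default(fit)

# Does each interval contain the corresponding true value?
covered <- ci[, 1] <= true_beta & true_beta <= ci[, 2]
cbind(estimate = coef(fit), ci, true = true_beta, covered)
```

With a 95% interval, a true value will occasionally fall outside its interval even when the simulation is correct, so a single miss is not by itself evidence of a mistake.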