This tutorial focuses on the interaction between a categorical variable and a continuous variable in linear regression. Note that, in this tutorial, we limit the categorical variable to 2 levels. (For a categorical variable with 3 levels, please refer to my other tutorial on interaction and coding in linear regression.)
Coding Note
In this tutorial, the dummy coding uses group 2 as the reference group. Thus, the comparison is between group 1 and group 2. For the detailed reasoning, please refer to my other tutorial on Dummy and Contrast Codings in Linear Regression.
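As a quick reference, we can print the contrast matrices that the two coding schemes assign to a 2-level factor; these are the same `contr.treatment()` and `contr.sum()` calls used throughout this tutorial:

```r
# Dummy coding with level 2 as the reference:
# level 1 is coded 1 and level 2 (the reference) is coded 0.
contr.treatment(2, base = 2)
#>   1
#> 1 1
#> 2 0

# Contrast (sum-to-zero) coding:
# level 1 is coded 1 and level 2 is coded -1.
contr.sum(2)
#>   [,1]
#> 1    1
#> 2   -1
```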
Simulated Data
# set seed
set.seed(123)
# repeat a sequence of numbers
Xa <- rep(c(1, 2), times = 5)
Xa <- as.factor(Xa)
Y <- rnorm(10)
Xb <- rnorm(10)
# combine them into a data frame
df <- data.frame(Xa, Xb, Y)
print(df)
   Xa         Xb           Y
1   1  1.2240818 -0.56047565
2   2  0.3598138 -0.23017749
3   1  0.4007715  1.55870831
4   2  0.1106827  0.07050839
5   1 -0.5558411  0.12928774
6   2  1.7869131  1.71506499
7   1  0.4978505  0.46091621
8   2 -1.9666172 -1.26506123
9   1  0.7013559 -0.68685285
10  2 -0.4727914 -0.44566197
We can also calculate the means for each level of the categorical variable in R.
# calculate means by group
aggregate(df$Y, list(df$Xa), FUN = mean)
  Group.1           x
1       1  0.18031675
2       2 -0.03106546
Situation 1: Categorical IV only
In Situation 1, we focus on a regression model with a single IV, namely the categorical variable.
| Xa | Mean | | Dummy Coding | Contrast Coding |
|---|---|---|---|---|
| Group 1 | 0.1803 | Intercept | -0.0311 | (0.1803 - 0.0311)/2 = 0.0746 |
| Group 2 | -0.0311 | Coded variable | 0.1803 - (-0.0311) = 0.2114 | 0.1803 - 0.0746 = 0.1057 |
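The entries in the table can be verified with a few lines of arithmetic; the two values below are simply the group means computed by `aggregate()` above:

```r
m1 <- 0.18031675   # mean of Y in group 1
m2 <- -0.03106546  # mean of Y in group 2

# Dummy coding (group 2 as the reference group):
# the intercept is the reference-group mean, and the coded
# variable is the difference between the two group means.
dummy_intercept <- m2        # -0.0311
dummy_slope     <- m1 - m2   #  0.2114

# Contrast coding:
# the intercept is the unweighted grand mean of the group means,
# and the coded variable is group 1's distance from it.
contrast_intercept <- (m1 + m2) / 2       # 0.0746
contrast_slope     <- m1 - (m1 + m2) / 2  # 0.1057

round(c(dummy_intercept, dummy_slope, contrast_intercept, contrast_slope), 4)
# -0.0311  0.2114  0.0746  0.1057
```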
Situation 1a: Dummy Coding
# dummy coding
contrasts(df$Xa) <- contr.treatment(2, base = 2)
# linear regression with dummy coding
result <- lm(Y ~ Xa, data = df)
# summarize the result
summary(result)
Call:
lm(formula = Y ~ Xa, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2340 -0.6592 -0.1251  0.2358  1.7461 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.03107    0.44932  -0.069    0.947
Xa1          0.21138    0.63544   0.333    0.748

Residual standard error: 1.005 on 8 degrees of freedom
Multiple R-squared:  0.01364,	Adjusted R-squared:  -0.1097 
F-statistic: 0.1107 on 1 and 8 DF,  p-value: 0.7479
Situation 1b: Contrast Coding
# contrast coding
contrasts(df$Xa) <- contr.sum(2)
# linear regression with contrast coding
result <- lm(Y ~ Xa, data = df)
# summarize the result
summary(result)
Call:
lm(formula = Y ~ Xa, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2340 -0.6592 -0.1251  0.2358  1.7461 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.07463    0.31772   0.235    0.820
Xa1          0.10569    0.31772   0.333    0.748

Residual standard error: 1.005 on 8 degrees of freedom
Multiple R-squared:  0.01364,	Adjusted R-squared:  -0.1097 
F-statistic: 0.1107 on 1 and 8 DF,  p-value: 0.7479
Situation 2: Categorical IV + Continuous IV
Situation 2a: Dummy Coding
# dummy coding
contrasts(df$Xa) <- contr.treatment(2, base = 2)
# linear regression with dummy coding
result <- lm(Y ~ Xa + Xb, data = df)
# summarize the result
summary(result)
Call:
lm(formula = Y ~ Xa + Xb, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.15473 -0.35823 -0.07879  0.43272  1.40680 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.01151    0.39473  -0.029    0.978
Xa1         -0.05190    0.57614  -0.090    0.931
Xb           0.53727    0.29252   1.837    0.109

Residual standard error: 0.8823 on 7 degrees of freedom
Multiple R-squared:  0.3344,	Adjusted R-squared:  0.1442 
F-statistic: 1.758 on 2 and 7 DF,  p-value: 0.2406
Situation 2b: Contrast Coding
# contrast coding
contrasts(df$Xa) <- contr.sum(2)
# linear regression with contrast coding
result <- lm(Y ~ Xa + Xb, data = df)
# summarize the result
summary(result)
Call:
lm(formula = Y ~ Xa + Xb, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.15473 -0.35823 -0.07879  0.43272  1.40680 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.03746    0.28561  -0.131    0.899
Xa1         -0.02595    0.28807  -0.090    0.931
Xb           0.53727    0.29252   1.837    0.109

Residual standard error: 0.8823 on 7 degrees of freedom
Multiple R-squared:  0.3344,	Adjusted R-squared:  0.1442 
F-statistic: 1.758 on 2 and 7 DF,  p-value: 0.2406
Situation 2c: Dummy Coding + Centering
Here, we continue to use dummy coding for the categorical variable; in addition, we center the continuous variable.
# dummy coding
contrasts(df$Xa) <- contr.treatment(2, base = 2)
# centering the continuous variable
center_scale <- function(x) scale(x, scale = FALSE)
df$Xb <- center_scale(df$Xb)
# linear regression with dummy coding
result <- lm(Y ~ Xa + Xb, data = df)
# summarize the result
summary(result)
Call:
lm(formula = Y ~ Xa + Xb, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.15473 -0.35823 -0.07879  0.43272  1.40680 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.1006     0.4010   0.251    0.809
Xa1          -0.0519     0.5761  -0.090    0.931
Xb            0.5373     0.2925   1.837    0.109

Residual standard error: 0.8823 on 7 degrees of freedom
Multiple R-squared:  0.3344,	Adjusted R-squared:  0.1442 
F-statistic: 1.758 on 2 and 7 DF,  p-value: 0.2406
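As a sanity check that `scale(x, scale = FALSE)` only subtracts the mean, we can regenerate the simulated `Xb` and confirm that its mean is (numerically) zero after centering:

```r
set.seed(123)
Y  <- rnorm(10)  # drawn first, exactly as in the simulation above
Xb <- rnorm(10)

center_scale <- function(x) scale(x, scale = FALSE)
Xb_centered <- center_scale(Xb)

mean(Xb)           # the original mean is not zero
mean(Xb_centered)  # numerically zero after centering
```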
Situation 2d: Contrast Coding + Centering
# contrast coding
contrasts(df$Xa) <- contr.sum(2)
# centering the continuous variable
center_scale <- function(x) scale(x, scale = FALSE)
df$Xb <- center_scale(df$Xb)
# linear regression with contrast coding
result <- lm(Y ~ Xa + Xb, data = df)
# summarize the result
summary(result)
Call:
lm(formula = Y ~ Xa + Xb, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.15473 -0.35823 -0.07879  0.43272  1.40680 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.07463    0.27901   0.267    0.797
Xa1         -0.02595    0.28807  -0.090    0.931
Xb           0.53727    0.29252   1.837    0.109

Residual standard error: 0.8823 on 7 degrees of freedom
Multiple R-squared:  0.3344,	Adjusted R-squared:  0.1442 
F-statistic: 1.758 on 2 and 7 DF,  p-value: 0.2406
Situation 3: Interaction
Situation 3a: Dummy Coding
# rebuild the data frame so that Xb is the original (uncentered) variable
df <- data.frame(Xa, Xb, Y)
# dummy coding
contrasts(df$Xa) <- contr.treatment(2, base = 2)
# linear regression with dummy coding and the interaction
result <- lm(Y ~ Xa + Xb + Xa * Xb, data = df)
# summarize the result
summary(result)
Call:
lm(formula = Y ~ Xa + Xb + Xa * Xb, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.74993 -0.47098 -0.04572  0.28724  1.35337 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept) -0.003186   0.334174  -0.010   0.9927  
Xa1          0.398200   0.540029   0.737   0.4887  
Xb           0.765925   0.274210   2.793   0.0314 *
Xa1:Xb      -1.239197   0.638357  -1.941   0.1003  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7469 on 6 degrees of freedom
Multiple R-squared:  0.5912,	Adjusted R-squared:  0.3868 
F-statistic: 2.892 on 3 and 6 DF,  p-value: 0.1243
Situation 3b: Contrast Coding
# contrast coding
contrasts(df$Xa) <- contr.sum(2)
# linear regression with contrast coding and the interaction
result <- lm(Y ~ Xa + Xb + Xa * Xb, data = df)
# summarize the result
summary(result)
Call:
lm(formula = Y ~ Xa + Xb + Xa * Xb, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.74993 -0.47098 -0.04572  0.28724  1.35337 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.1959     0.2700   0.726    0.495
Xa1           0.1991     0.2700   0.737    0.489
Xb            0.1463     0.3192   0.458    0.663
Xa1:Xb       -0.6196     0.3192  -1.941    0.100

Residual standard error: 0.7469 on 6 degrees of freedom
Multiple R-squared:  0.5912,	Adjusted R-squared:  0.3868 
F-statistic: 2.892 on 3 and 6 DF,  p-value: 0.1243
Situation 3c: Dummy Coding + Centering
# dummy coding (base = 2 keeps group 2 as the reference, as above)
contrasts(df$Xa) <- contr.treatment(2, base = 2)
# centering the continuous IV
center_scale <- function(x) scale(x, scale = FALSE)
df$Xb <- center_scale(df$Xb)
# linear regression with dummy coding and the interaction
result <- lm(Y ~ Xa + Xb + Xa * Xb, data = df)
# summarize the result
summary(result)
Call:
lm(formula = Y ~ Xa + Xb + Xa * Xb, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.74993 -0.47098 -0.04572  0.28724  1.35337 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   0.1566     0.3407   0.460   0.6620  
Xa1           0.1397     0.4976   0.281   0.7884  
Xb            0.7659     0.2742   2.793   0.0314 *
Xa1:Xb       -1.2392     0.6384  -1.941   0.1003  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7469 on 6 degrees of freedom
Multiple R-squared:  0.5912,	Adjusted R-squared:  0.3868 
F-statistic: 2.892 on 3 and 6 DF,  p-value: 0.1243
Situation 3d: Contrast Coding + Centering
# contrast coding
contrasts(df$Xa) <- contr.sum(2)
# centering the continuous IV
center_scale <- function(x) scale(x, scale = FALSE)
df$Xb <- center_scale(df$Xb)
# linear regression with contrast coding and the interaction
result <- lm(Y ~ Xa + Xb + Xa * Xb, data = df)
# summarize the result
summary(result)
Call:
lm(formula = Y ~ Xa + Xb + Xa * Xb, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.74993 -0.47098 -0.04572  0.28724  1.35337 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.22644    0.24880   0.910    0.398
Xa1          0.06984    0.24880   0.281    0.788
Xb           0.14633    0.31918   0.458    0.663
Xa1:Xb      -0.61960    0.31918  -1.941    0.100

Residual standard error: 0.7469 on 6 degrees of freedom
Multiple R-squared:  0.5912,	Adjusted R-squared:  0.3868 
F-statistic: 2.892 on 3 and 6 DF,  p-value: 0.1243
Summary
| | S1 (D) | S1 (C) | S2a (D) | S2b (C) | S2c (D) | S2d (C) |
|---|---|---|---|---|---|---|
| Centering continuous IV | | | No | No | Yes | Yes |
| Intercept | -0.03107 | 0.0746 | -0.01151 | -0.0375 | 0.1006 | 0.0746 |
| p-value for intercept | 0.947 | 0.820 | 0.978 | 0.899 | 0.809 | 0.797 |
| Coefficient for categorical IV | 0.2114 | 0.1057 | -0.05190 | -0.0260 | -0.0519 | -0.0260 |
| p-value for categorical IV | 0.748 | 0.748 | 0.931 | 0.931 | 0.931 | 0.931 |
| Coefficient for continuous IV | | | 0.5373 | 0.5373 | 0.5373 | 0.5373 |
| p-value for continuous IV | | | 0.109 | 0.109 | 0.109 | 0.109 |
Situation 2: Categorical IV + Continuous IV
- S1 vs. S2: Adding a continuous IV to the regression changes both the coefficient and the p-value of the categorical IV.
- S2a vs. S2b: The coding scheme (dummy vs. contrast) changes neither the coefficient nor the p-value of the continuous IV.
- S2c vs. S2d: Same as the previous point.
- S2a vs. S2b: The coding scheme does not change the p-value of the categorical IV, but it does change the regression coefficient of the categorical IV.
- S2c vs. S2d: Same as point 4.
- S2a+S2b vs. S2c+S2d: Centering changes only the intercept (both its coefficient and its p-value); everything else stays the same.
| | S1 | S1 | S2a | S2b | S2c | S2d | S3a | S3b | S3c | S3d |
|---|---|---|---|---|---|---|---|---|---|---|
| Coding | D | C | D | C | D | C | D | C | D | C |
| Centering continuous IV | | | No | No | Yes | Yes | No | No | Yes | Yes |
| Intercept | -0.03107 | 0.0746 | -0.01151 | -0.0375 | 0.1006 | 0.0746 | -0.003186 | 0.1959 | 0.1566 | 0.2264 |
| p-value for intercept | 0.947 | 0.820 | 0.978 | 0.899 | 0.809 | 0.797 | 0.9927 | 0.495 | 0.6620 | 0.398 |
| Coefficient for categorical IV | 0.2114 | 0.1057 | -0.05190 | -0.0260 | -0.0519 | -0.0260 | 0.398200 | 0.1991 | 0.1397 | 0.0698 |
| p-value for categorical IV | 0.748 | 0.748 | 0.931 | 0.931 | 0.931 | 0.931 | 0.4887 | 0.489 | 0.7884 | 0.788 |
| Coefficient for continuous IV | | | 0.5373 | 0.5373 | 0.5373 | 0.5373 | 0.765925 | 0.1463 | 0.7659 | 0.1463 |
| p-value for continuous IV | | | 0.109 | 0.109 | 0.109 | 0.109 | 0.0314 * | 0.663 | 0.0314 * | 0.663 |
| Interaction coefficient | | | | | | | -1.2392 | -0.6196 | -1.2392 | -0.6196 |
| p-value for interaction | | | | | | | 0.100 | 0.100 | 0.100 | 0.100 |
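The pattern in the table (identical p-values, coefficients differing by a factor of 2) is not a coincidence: with 2 levels, the dummy-coded variable spans 1 unit between groups (1 vs. 0), while the sum-coded variable spans 2 units (1 vs. -1), so each dummy coefficient is exactly twice its contrast counterpart, and the t-statistics are unchanged. A quick check on the simulated data, re-fitting both interaction models from Situation 3:

```r
set.seed(123)
Xa <- as.factor(rep(c(1, 2), times = 5))
Y  <- rnorm(10)
Xb <- rnorm(10)
df <- data.frame(Xa, Xb, Y)

# interaction model under dummy coding
contrasts(df$Xa) <- contr.treatment(2, base = 2)
fit_dummy <- lm(Y ~ Xa + Xb + Xa * Xb, data = df)

# interaction model under contrast (sum) coding
contrasts(df$Xa) <- contr.sum(2)
fit_sum <- lm(Y ~ Xa + Xb + Xa * Xb, data = df)

# the dummy coefficients are exactly twice the contrast coefficients
round(coef(fit_dummy)[c("Xa1", "Xa1:Xb")] /
      coef(fit_sum)[c("Xa1", "Xa1:Xb")], 6)
```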
Situation 3: Categorical IV + Continuous IV + Interaction
- S2 vs. S3: Adding an interaction term changes all the other coefficients.
- S3a vs. S3b: The coding scheme changes both the coefficient and the p-value of the continuous IV.
- S3c vs. S3d: Same as the previous point.
- S3a vs. S3b: While the coding scheme changes the regression coefficients of the categorical IV and the interaction term, their p-values do not change.
- S3c vs. S3d: Same as point 4.
- S3a+S3b vs. S3c+S3d: Centering changes both the coefficient and the p-value of the intercept and of the categorical IV.
- S3a+S3b vs. S3c+S3d: Centering changes neither the coefficient nor the p-value of the continuous IV or of the interaction term.
Take-home message
- For a categorical variable with 2 levels, different coding schemes do not affect the p-value of the interaction effect.
- For a categorical variable with 2 levels, when the interaction effect is significant, you should examine simple main effects rather than overall main effects (e.g., using the Johnson-Neyman technique, or testing the slope of the continuous IV within each level of the categorical variable).
- When the interaction effect is significant, it is difficult to interpret any other main effect, apart from the simple main effects mentioned in point 2.
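A minimal sketch of the slope tests mentioned in point 2, using only base R: fit the regression of Y on Xb separately within each level of Xa. (More formal simple-slope or Johnson-Neyman tests are available in packages such as `interactions` or `emmeans`; those package names are suggestions, not part of this tutorial's code.)

```r
set.seed(123)
Xa <- as.factor(rep(c(1, 2), times = 5))
Y  <- rnorm(10)
Xb <- rnorm(10)
df <- data.frame(Xa, Xb, Y)

# slope of Xb (with its own t-test) within each level of Xa;
# these match the per-group slopes implied by the interaction model
by(df, df$Xa, function(d) coef(summary(lm(Y ~ Xb, data = d))))
```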