This tutorial explains the differences between dummy coding and contrast coding in linear regression, using R code examples. It focuses on the case where the categorical independent variable (IV) has 3 levels.
Short Note
Note that, in R, dummy coding by default uses the first level (in alphabetical or numeric order) as the reference group. Thus, Group 1 would be the reference level: the first dummy-coded variable would compare Group 2 to Group 1, and the second would compare Group 3 to Group 1.
With contrast coding, on the other hand, the first contrast-coded variable compares Group 1 with the overall mean, and the second compares Group 2 with the overall mean.
Therefore, to make dummy coding and contrast coding a bit more comparable (and less confusing), I change the reference level in the dummy coding to Group 3 in what follows.
In particular, in this tutorial, the first dummy-coded variable compares Group 1 to Group 3, and the second compares Group 2 to Group 3.
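To see what these two coding schemes actually look like in R, you can print the coding matrices themselves (a small illustration; the models below attach them to df$Xa via contrasts()):

# dummy coding with Group 3 as the reference level:
# each column is 1 for its own group and 0 otherwise
contr.treatment(3, base = 3)
# contrast (sum/deviation) coding:
# each column is 1 for its own group, -1 for the last group, and 0 otherwise
contr.sum(3)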
Simulated Data
# set seed
set.seed(123)
# Repeat a sequence of numbers:
Xa<-rep(c(1, 2, 3), times=5)
Xa<-as.factor(Xa)
Y<-rnorm(15)
Xb<-rnorm(15)
# combine it into a data frame
df<-data.frame(Xa,Xb,Y)
print(df)
   Xa         Xb           Y
1   1  1.7869131 -0.56047565
2   2  0.4978505 -0.23017749
3   3 -1.9666172  1.55870831
4   1  0.7013559  0.07050839
5   2 -0.4727914  0.12928774
6   3 -1.0678237  1.71506499
7   1 -0.2179749  0.46091621
8   2 -1.0260044 -1.26506123
9   3 -0.7288912 -0.68685285
10  1 -0.6250393 -0.44566197
11  2 -1.6866933  1.22408180
12  3  0.8377870  0.35981383
13  1  0.1533731  0.40077145
14  2 -1.1381369  0.11068272
15  3  1.2538149 -0.55584113
We can calculate the means for each group level using the following R code.
# calculate means by group
aggregate(df$Y, list(df$Xa), FUN=mean)
  Group.1            x
1       1 -0.014788314
2       2 -0.006237295
3       3  0.478178628
| Xa | Means | Coefficient | Dummy Coding | Contrast Coding |
|---|---|---|---|---|
| Group 1 | -0.0148 | Intercept | 0.4782 | (-0.0148 + (-0.0062) + 0.4782)/3 = 0.1524 |
| Group 2 | -0.0062 | Coded Variable 1 | -0.0148 - 0.4782 = -0.4930 | -0.0148 - 0.1524 = -0.1672 |
| Group 3 | 0.4782 | Coded Variable 2 | -0.0062 - 0.4782 = -0.4844 | -0.0062 - 0.1524 = -0.1586 |
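The numbers in this table can be reproduced directly from the group means. A minimal sketch, assuming the df created above:

# group means of Y by level of Xa
group_means <- tapply(df$Y, df$Xa, mean)
# intercept under contrast coding: the mean of the three group means
mean(group_means)
# dummy-coded variables: each group mean minus the Group 3 (reference) mean
group_means[c("1", "2")] - group_means["3"]
# contrast-coded variables: each group mean minus the mean of the group means
group_means[c("1", "2")] - mean(group_means)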
Situation 1: Categorical IV Only
In Situation 1, the linear regression has only one IV, namely Xa, which has 3 levels.
Situation 1a: Dummy Coding
# dummy coding
contrasts(df$Xa) =contr.treatment(3, base =3)
# linear regression with dummy coding
result<-lm(Y~Xa,data=df)
summary(result)
Call:
lm(formula = Y ~ Xa, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-1.2588 -0.4883  0.0853  0.4456  1.2369

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.4782     0.3918   1.221    0.246
Xa1          -0.4930     0.5540  -0.890    0.391
Xa2          -0.4844     0.5540  -0.874    0.399

Residual standard error: 0.876 on 12 degrees of freedom
Multiple R-squared:  0.07959,   Adjusted R-squared:  -0.07381
F-statistic: 0.5188 on 2 and 12 DF,  p-value: 0.608
Situation 1b: Contrast Coding
# contrast coding
contrasts(df$Xa) =contr.sum(3)
# linear regression with contrast coding
result<-lm(Y~Xa,data=df)
summary(result)
Call:
lm(formula = Y ~ Xa, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-1.2588 -0.4883  0.0853  0.4456  1.2369

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.1524     0.2262   0.674    0.513
Xa1          -0.1672     0.3199  -0.523    0.611
Xa2          -0.1586     0.3199  -0.496    0.629

Residual standard error: 0.876 on 12 degrees of freedom
Multiple R-squared:  0.07959,   Adjusted R-squared:  -0.07381
F-statistic: 0.5188 on 2 and 12 DF,  p-value: 0.608
|  | S1a | S1b |
|---|---|---|
| Coding | Dummy | Contrast |
| Centering | N. A. | N. A. |
| Intercept | 0.4782 | 0.1524 |
| p-value | 0.246 | 0.513 |
| Coded IV-1 | -0.4930 | -0.1672 |
| p-value | 0.391 | 0.611 |
| Coded IV-2 | -0.4844 | -0.1586 |
| p-value | 0.399 | 0.629 |
Summary of Situation 1 (Single Categorical IV):
- Different codings (dummy coding vs. contrast coding) lead to different meanings of the intercept. With dummy coding, the intercept is the mean of the reference group (here, Group 3). With contrast coding, the intercept is the mean of the group means (with this balanced design, the overall mean).
- For both codings, each regression coefficient tests whether a particular group differs from the baseline captured by the intercept. With dummy coding, that baseline is the reference group (Group 3 here); with contrast coding, it is the overall mean.
- When a regression has only categorical predictors, the regression coefficients are therefore just group-mean comparisons, and their tests are tests of whether those mean differences are significant (see the sketch right after this list).
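To make the "mean comparison" reading concrete, it can help to look at the design matrix lm() builds under each coding. A short sketch, assuming the df from above:

# columns of the design matrix under dummy coding (Group 3 as reference)
contrasts(df$Xa) <- contr.treatment(3, base = 3)
head(model.matrix(~ Xa, data = df))
# columns of the design matrix under contrast (sum) coding
contrasts(df$Xa) <- contr.sum(3)
head(model.matrix(~ Xa, data = df))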
Situation 2: Categorical IV + Continuous IV
In Situation 2, the linear regression model has 2 IVs, namely a categorical IV and a continuous IV.
Compared to Situation 1, once a continuous IV is added you can no longer connect the regression coefficients directly to the group means of Y. The continuous IV explains part of Y, so the coefficients for the categorical IV are now adjusted for Xb and no longer equal the raw group-mean differences.
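One way to see this, assuming the df from above: compare the raw Group 1 vs. Group 3 mean difference with the corresponding dummy-coded coefficient once Xb is in the model (these should match the Situation 1a and 2a outputs, roughly -0.49 vs. -0.20).

# raw mean difference between Group 1 and Group 3
group_means <- tapply(df$Y, df$Xa, mean)
group_means["1"] - group_means["3"]
# dummy-coded Group 1 coefficient after adjusting for Xb
contrasts(df$Xa) <- contr.treatment(3, base = 3)
coef(lm(Y ~ Xa + Xb, data = df))["Xa1"]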
Situation 2a: Dummy Coding
# dummy coding
contrasts(df$Xa) =contr.treatment(3, base =3)
# linear regression with dummy coding
result<-lm(Y~Xa+Xb,data=df)
summary(result)
Call:
lm(formula = Y ~ Xa + Xb, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-1.3704 -0.1987  0.2314  0.3548  0.9232

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.3352     0.3572   0.938    0.368
Xa1          -0.1961     0.5167  -0.380    0.712
Xa2          -0.6687     0.5035  -1.328    0.211
Xb           -0.4277     0.2131  -2.007    0.070 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7828 on 11 degrees of freedom
Multiple R-squared:  0.3263,   Adjusted R-squared:  0.1426
F-statistic: 1.776 on 3 and 11 DF,  p-value: 0.2098
Situation 2b: Contrast Coding
# contrast coding
contrasts(df$Xa) =contr.sum(3)
# linear regression with contrast coding
result<-lm(Y~Xa+Xb,data=df)
summary(result)
Call:
lm(formula = Y ~ Xa + Xb, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-1.3704 -0.1987  0.2314  0.3548  0.9232

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.04691    0.20883   0.225    0.826
Xa1          0.09217    0.31368   0.294    0.774
Xa2         -0.38042    0.30645  -1.241    0.240
Xb          -0.42773    0.21312  -2.007    0.070 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7828 on 11 degrees of freedom
Multiple R-squared:  0.3263,   Adjusted R-squared:  0.1426
F-statistic: 1.776 on 3 and 11 DF,  p-value: 0.2098
As we can see, the coding of the categorical IV does not change the slope of the continuous IV; it is -0.4277 under both codings.
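A minimal check of this, assuming the df from above (with Xb not yet centered; the object names are just for illustration):

# fit the same model under both codings and extract the Xb slope
contrasts(df$Xa) <- contr.treatment(3, base = 3)
fit_dummy <- lm(Y ~ Xa + Xb, data = df)
contrasts(df$Xa) <- contr.sum(3)
fit_contrast <- lm(Y ~ Xa + Xb, data = df)
c(dummy = coef(fit_dummy)["Xb"], contrast = coef(fit_contrast)["Xb"])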
Situation 2c: Dummy Coding + Centering
This section explores whether centering the continuous IV changes anything.
The short answer, for Situation 2 (no interaction terms), is that centering the continuous IV changes only the intercept; none of the other regression coefficients change, whether dummy or contrast coding is used.
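The reason is algebraic: writing Xb as (Xb - mean(Xb)) + mean(Xb) shows that the term slope * mean(Xb) simply gets absorbed into the intercept. A quick sketch of that identity, assuming df still contains the uncentered Xb (object names are illustrative):

contrasts(df$Xa) <- contr.treatment(3, base = 3)
fit_raw <- lm(Y ~ Xa + Xb, data = df)                     # uncentered Xb
fit_centered <- lm(Y ~ Xa + I(Xb - mean(Xb)), data = df)  # centered on the fly
# the two intercepts differ by slope * mean(Xb); everything else is unchanged
coef(fit_centered)["(Intercept)"]
coef(fit_raw)["(Intercept)"] + coef(fit_raw)["Xb"] * mean(df$Xb)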
# dummy coding
contrasts(df$Xa) =contr.treatment(3, base = 3)
# centering the continuous IV
center_scale <- function(x) { scale(x, scale = FALSE)}
df$Xb<-center_scale(df$Xb)
# linear regression with dummy coding
result<-lm(Y~Xa+Xb,data=df)
summary(result)
Call:
lm(formula = Y ~ Xa + Xb, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-1.3704 -0.1987  0.2314  0.3548  0.9232

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.4406     0.3506   1.257    0.235
Xa1          -0.1961     0.5167  -0.380    0.712
Xa2          -0.6687     0.5035  -1.328    0.211
Xb           -0.4277     0.2131  -2.007    0.070 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7828 on 11 degrees of freedom
Multiple R-squared:  0.3263,   Adjusted R-squared:  0.1426
F-statistic: 1.776 on 3 and 11 DF,  p-value: 0.2098
Situation 2d: Contrast Coding + Centering
# contrast coding
contrasts(df$Xa) =contr.sum(3)
# centering the continuous IV
center_scale <- function(x) { scale(x, scale = FALSE)}
df$Xb<-center_scale(df$Xb)
# linear regression with contrast coding
result<-lm(Y~Xa+Xb,data=df)
summary(result)
Call:
lm(formula = Y ~ Xa + Xb, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-1.3704 -0.1987  0.2314  0.3548  0.9232

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.15238    0.20211   0.754    0.467
Xa1          0.09217    0.31368   0.294    0.774
Xa2         -0.38042    0.30645  -1.241    0.240
Xb          -0.42773    0.21312  -2.007    0.070 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7828 on 11 degrees of freedom
Multiple R-squared:  0.3263,   Adjusted R-squared:  0.1426
F-statistic: 1.776 on 3 and 11 DF,  p-value: 0.2098
|  | S1a | S1b | S2a | S2b | S2c | S2d |
|---|---|---|---|---|---|---|
| Coding | Dummy | Contrast | Dummy | Contrast | Dummy | Contrast |
| Centering | N. A. | N. A. | No | No | Yes | Yes |
| Intercept | 0.4782 | 0.1524 | 0.3352 | 0.04691 | 0.4406 | 0.15238 |
| p-value | 0.246 | 0.513 | 0.368 | 0.826 | 0.235 | 0.467 |
| Coded IV-1 | -0.4930 | -0.1672 | -0.1961 | 0.09217 | -0.1961 | 0.09217 |
| p-value | 0.391 | 0.611 | 0.712 | 0.774 | 0.712 | 0.774 |
| Coded IV-2 | -0.4844 | -0.1586 | -0.6687 | -0.38042 | -0.6687 | -0.38042 |
| p-value | 0.399 | 0.629 | 0.211 | 0.240 | 0.211 | 0.240 |
| Continuous IV |  |  | -0.4277 | -0.4277 | -0.4277 | -0.4277 |
| p-value |  |  | 0.070 | 0.070 | 0.070 | 0.070 |
Summary of Situation 2 (Categorical Variable + Continuous Variable):
- When a continuous variable is added to the regression, interpretation becomes more complicated: you can no longer read the regression coefficients as simple comparisons of the mean of Y across levels of Xa. Because the continuous variable (Xb) explains part of Y, the coefficients of the categorical IV (Xa) can no longer be traced back directly to the raw group-mean differences.
- S2a vs. S2b: An interesting observation: regardless of dummy or contrast coding, the regression coefficient for Xb is the same, namely -0.4277 in this example.
- S2c vs. S2d: Same as the previous point.
- S2a+S2b vs. S2c+S2d: Another interesting observation: centering the continuous IV (Xb) only changes the intercept, not any of the other regression coefficients, regardless of dummy or contrast coding. (See the sketch after this list for a compact side-by-side check.)
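If it helps to see S2a-S2d side by side in code, here is a compact sketch (assuming a freshly created df; the column name Xb_c and the list fits are just illustrative). Keeping the centered values in a separate column also avoids overwriting df$Xb.

# a centered copy of the continuous IV
df$Xb_c <- as.numeric(scale(df$Xb, scale = FALSE))
fits <- list()
contrasts(df$Xa) <- contr.treatment(3, base = 3)
fits$S2a <- lm(Y ~ Xa + Xb,   data = df)   # dummy, uncentered
fits$S2c <- lm(Y ~ Xa + Xb_c, data = df)   # dummy, centered
contrasts(df$Xa) <- contr.sum(3)
fits$S2b <- lm(Y ~ Xa + Xb,   data = df)   # contrast, uncentered
fits$S2d <- lm(Y ~ Xa + Xb_c, data = df)   # contrast, centered
lapply(fits, coef)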
Situation 3: Interaction
Compared to Situation 2, interaction terms are added in Situation 3. Note that, since the categorical IV has 3 levels (and therefore 2 coded variables), there are 2 interaction terms.
With the 2 interaction terms included, dummy coding and contrast coding differ in the coefficient for the continuous IV as well as in the coefficients for the two interaction terms, and the corresponding p-values differ as well. This is because the meaning of the Xb coefficient changes: under dummy coding it is the slope of Xb in the reference group (Group 3), whereas under contrast (sum) coding it is the average slope across the three groups.
Situation 3a: Dummy Coding
# dummy coding
contrasts(df$Xa) =contr.treatment(3, base=3)
# linear regression with dummy coding
result<-lm(Y~Xa+Xb+Xa*Xb,data=df)
summary(result)
Call:
lm(formula = Y ~ Xa + Xb + Xa * Xb, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-1.3822 -0.2130  0.1822  0.3688  0.8626

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.4299     0.3793   1.133    0.286
Xa1          -0.3258     0.6018  -0.541    0.601
Xa2          -0.6431     0.5978  -1.076    0.310
Xb           -0.5503     0.3141  -1.752    0.114
Xa1:Xb        0.3543     0.5506   0.644    0.536
Xa2:Xb        0.1513     0.6001   0.252    0.807

Residual standard error: 0.846 on 9 degrees of freedom
Multiple R-squared:  0.3562,   Adjusted R-squared:  -0.001457
F-statistic: 0.9959 on 5 and 9 DF,  p-value: 0.4715
Situation 3b: Contrast Coding
# contrast coding
contrasts(df$Xa) =contr.sum(3)
# linear regression with contrast coding
result<-lm(Y~Xa+Xb+Xa*Xb,data=df)
summary(result)
Call:
lm(formula = Y ~ Xa + Xb + Xa * Xb, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-1.3822 -0.2130  0.1822  0.3688  0.8626

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.01277    0.26246   0.049    0.962
Xa1          0.04295    0.35414   0.121    0.906
Xa2         -0.32435    0.40941  -0.792    0.449
Xb          -0.38180    0.25047  -1.524    0.162
Xa1:Xb       0.18580    0.36180   0.514    0.620
Xa2:Xb      -0.01726    0.38715  -0.045    0.965

Residual standard error: 0.846 on 9 degrees of freedom
Multiple R-squared:  0.3562,   Adjusted R-squared:  -0.001457
F-statistic: 0.9959 on 5 and 9 DF,  p-value: 0.4715
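Although the interaction coefficients look different, the two fits are the same model written in two parameterizations, and the group-specific slopes of Xb are identical either way. A sketch of recovering them from each fit (object names are just for illustration; this assumes the df from above):

contrasts(df$Xa) <- contr.treatment(3, base = 3)
fit_dummy <- lm(Y ~ Xa + Xb + Xa*Xb, data = df)
contrasts(df$Xa) <- contr.sum(3)
fit_contrast <- lm(Y ~ Xa + Xb + Xa*Xb, data = df)
b_d <- coef(fit_dummy)
# dummy coding: Xb is the slope for the reference group (Group 3)
slopes_dummy <- c(g1 = b_d[["Xb"]] + b_d[["Xa1:Xb"]],
                  g2 = b_d[["Xb"]] + b_d[["Xa2:Xb"]],
                  g3 = b_d[["Xb"]])
b_c <- coef(fit_contrast)
# sum coding: Xb is the average slope; Group 3's deviation is minus the sum of the others
slopes_contrast <- c(g1 = b_c[["Xb"]] + b_c[["Xa1:Xb"]],
                     g2 = b_c[["Xb"]] + b_c[["Xa2:Xb"]],
                     g3 = b_c[["Xb"]] - b_c[["Xa1:Xb"]] - b_c[["Xa2:Xb"]])
rbind(slopes_dummy, slopes_contrast)   # the two rows should match

From the Situation 3a/3b outputs above, these group-specific slopes work out to roughly -0.196, -0.399, and -0.550 for Groups 1-3 under either coding.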
Situation 3c: Dummy Coding + Centering
In general, when interaction terms are present, centering the continuous IV changes the intercept and the coefficients of the coded categorical variables, but not the coefficient of the continuous IV or the interaction terms. In the output below, however, nothing changes relative to Situation 3a: the Situation 3a results appear to have been produced with an already-centered Xb (carried over from Situations 2c/2d, where df$Xb was overwritten by its centered version), so centering it again has no effect.
# dummy coding
contrasts(df$Xa) =contr.treatment(3, base=3)
# centering the continuous IV
center_scale <- function(x) { scale(x, scale = FALSE)}
df$Xb<-center_scale(df$Xb)
# linear regression with dummy coding
result<-lm(Y~Xa+Xb+Xa*Xb,data=df)
summary(result)
Call:
lm(formula = Y ~ Xa + Xb + Xa * Xb, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-1.3822 -0.2130  0.1822  0.3688  0.8626

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.4299     0.3793   1.133    0.286
Xa1          -0.3258     0.6018  -0.541    0.601
Xa2          -0.6431     0.5978  -1.076    0.310
Xb           -0.5503     0.3141  -1.752    0.114
Xa1:Xb        0.3543     0.5506   0.644    0.536
Xa2:Xb        0.1513     0.6001   0.252    0.807

Residual standard error: 0.846 on 9 degrees of freedom
Multiple R-squared:  0.3562,   Adjusted R-squared:  -0.001457
F-statistic: 0.9959 on 5 and 9 DF,  p-value: 0.4715
Situation 3d: Contrast Coding + Centering
# contrast coding
contrasts(df$Xa) =contr.sum(3)
# centering the continuous IV
center_scale <- function(x) { scale(x, scale = FALSE)}
df$Xb<-center_scale(df$Xb)
# linear regression with contrast coding
result<-lm(Y~Xa+Xb+Xa*Xb,data=df)
summary(result)
Call:
lm(formula = Y ~ Xa + Xb + Xa * Xb, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-1.3822 -0.2130  0.1822  0.3688  0.8626

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.106919   0.252900   0.423    0.682
Xa1         -0.002872   0.369765  -0.008    0.994
Xa2         -0.320094   0.367566  -0.871    0.406
Xb          -0.381798   0.250467  -1.524    0.162
Xa1:Xb       0.185803   0.361797   0.514    0.620
Xa2:Xb      -0.017262   0.387154  -0.045    0.965

Residual standard error: 0.846 on 9 degrees of freedom
Multiple R-squared:  0.3562,   Adjusted R-squared:  -0.001457
F-statistic: 0.9959 on 5 and 9 DF,  p-value: 0.4715
|  | S1a | S1b | S2a | S2b | S2c | S2d | S3a | S3b | S3c | S3d |
|---|---|---|---|---|---|---|---|---|---|---|
| Coding | Dummy | Contrast | Dummy | Contrast | Dummy | Contrast | Dummy | Contrast | Dummy | Contrast |
| Centering | N. A. | N. A. | No | No | Yes | Yes | No | No | Yes | Yes |
| Intercept | 0.4782 | 0.1524 | 0.3352 | 0.04691 | 0.4406 | 0.15238 | 0.4299 | 0.01277 | 0.4299 | 0.106919 |
| p-value | 0.246 | 0.513 | 0.368 | 0.826 | 0.235 | 0.467 | 0.286 | 0.962 | 0.286 | 0.682 |
| Coded IV-1 | -0.4930 | -0.1672 | -0.1961 | 0.09217 | -0.1961 | 0.09217 | -0.3258 | 0.04295 | -0.3258 | -0.002872 |
| p-value | 0.391 | 0.611 | 0.712 | 0.774 | 0.712 | 0.774 | 0.601 | 0.906 | 0.601 | 0.994 |
| Coded IV-2 | -0.4844 | -0.1586 | -0.6687 | -0.38042 | -0.6687 | -0.38042 | -0.6431 | -0.32435 | -0.6431 | -0.320094 |
| p-value | 0.399 | 0.629 | 0.211 | 0.240 | 0.211 | 0.240 | 0.310 | 0.449 | 0.310 | 0.406 |
| Continuous IV |  |  | -0.4277 | -0.4277 | -0.4277 | -0.4277 | -0.5503 | -0.38180 | -0.5503 | -0.381798 |
| p-value |  |  | 0.070 | 0.070 | 0.070 | 0.070 | 0.114 | 0.162 | 0.114 | 0.162 |
| Interaction with Coded IV-1 |  |  |  |  |  |  | 0.3543 | 0.18580 | 0.3543 | 0.185803 |
| p-value |  |  |  |  |  |  | 0.536 | 0.620 | 0.536 | 0.620 |
| Interaction with Coded IV-2 |  |  |  |  |  |  | 0.1513 | -0.01726 | 0.1513 | -0.017262 |
| p-value |  |  |  |  |  |  | 0.807 | 0.965 | 0.807 | 0.965 |
Summary of Situation 3 (Categorical Variable + Continuous Variable + Categorical × Continuous Interaction):
- S2 vs. S3: Adding the interaction terms changes all of the other coefficients.
- S3a vs. S3b: Coding changes both the coefficient and the p-value of the continuous IV.
- S3a vs. S3b: Coding changes both the coefficients and the p-values of the two interaction terms. (This is very important, especially the fact that the p-values differ between dummy coding and contrast coding! See the sketch after this list for an omnibus interaction test that does not depend on the coding.)
- S3c vs. S3d: Same as point 2.
- S3c vs. S3d: Same as point 3.
- S3a+S3b vs. S3c+S3d: In this table, centering changes the coefficients and p-values of the intercept and the coded categorical variables under contrast coding (S3b vs. S3d) but not under dummy coding (S3a vs. S3c). Note, however, that the S3a output appears to have been produced with an already-centered Xb (see the note in Situation 3c); in general, when interaction terms are present, centering shifts the intercept and the categorical coefficients under both codings.
- S3a+S3b vs. S3c+S3d: Centering does not change the coefficient or p-value of the continuous IV or of the interaction terms. (This is important!)
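One practical point related to this: the individual interaction coefficients and their p-values depend on the coding, but the omnibus F-test for the interaction as a whole does not, because it compares the same two nested models under either parameterization. A quick sketch, assuming the df from above:

contrasts(df$Xa) <- contr.treatment(3, base = 3)
anova(lm(Y ~ Xa + Xb + Xa*Xb, data = df))   # check the Xa:Xb row
contrasts(df$Xa) <- contr.sum(3)
anova(lm(Y ~ Xa + Xb + Xa*Xb, data = df))   # same F and p-value for Xa:Xb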
Take-home message:
When the categorical variable has 3 levels, different codings (e.g., dummy coding vs. contrast coding) can lead to different p-values for the interaction terms.
Disclaimer:
Please read the disclaimer statement.
Further Reading:
Note that there is another tutorial on dummy and contrast coding in R.