This tutorial explains the differences between dummy coding and contrast coding in linear regression using R code examples. Note that this tutorial focuses on the case where the categorical independent variable has 3 levels.
Short Note
Note that, in R, the default reference group in dummy coding is the first level in alphabetical (or numeric) order. Thus, Group 1 would be the reference level: the first dummy-coded variable would compare Group 2 to Group 1, and the second dummy-coded variable would compare Group 3 to Group 1.
In contrast coding, the first contrast-coded variable compares Group 1 with the overall mean, and the second contrast-coded variable compares Group 2 with the overall mean.
Therefore, to make dummy coding and contrast coding a bit more consistent (and less confusing) to compare, in the following I changed the reference level in the dummy coding to Group 3.
In particular, in this tutorial, the first dummy-coded variable compares Group 1 to Group 3, and the second dummy-coded variable compares Group 2 to Group 3.
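To see exactly what these two coding schemes look like, you can print the contrast matrices themselves. A minimal sketch using the base R functions contr.treatment() and contr.sum(); the commented lines show what R should print for 3 levels:
# dummy (treatment) coding with Group 3 as the reference level
contr.treatment(3, base = 3)
#   1 2
# 1 1 0
# 2 0 1
# 3 0 0
# contrast (sum-to-zero) coding: each coded variable compares a group with the overall mean
contr.sum(3)
#   [,1] [,2]
# 1    1    0
# 2    0    1
# 3   -1   -1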
Simulated Data
# set seed
set.seed(123)
# repeat the sequence 1, 2, 3 five times to create the 3-level categorical IV
Xa <- rep(c(1, 2, 3), times = 5)
Xa <- as.factor(Xa)
# simulate the dependent variable and a continuous IV
Y <- rnorm(15)
Xb <- rnorm(15)
# combine everything into a data frame
df <- data.frame(Xa, Xb, Y)
print(df)
   Xa         Xb           Y
1   1  1.7869131 -0.56047565
2   2  0.4978505 -0.23017749
3   3 -1.9666172  1.55870831
4   1  0.7013559  0.07050839
5   2 -0.4727914  0.12928774
6   3 -1.0678237  1.71506499
7   1 -0.2179749  0.46091621
8   2 -1.0260044 -1.26506123
9   3 -0.7288912 -0.68685285
10  1 -0.6250393 -0.44566197
11  2 -1.6866933  1.22408180
12  3  0.8377870  0.35981383
13  1  0.1533731  0.40077145
14  2 -1.1381369  0.11068272
15  3  1.2538149 -0.55584113
We can calculate the means for each group level using the following R code.
# calculate means by group
aggregate(df$Y, list(df$Xa), FUN=mean)
  Group.1            x
1       1 -0.014788314
2       2 -0.006237295
3       3  0.478178628
| Xa | Means | Coefficient | Dummy Coding | Contrast Coding |
|---|---|---|---|---|
| Group 1 | -0.0148 | Intercept | 0.4782 | (-0.0148 + (-0.0062) + 0.4782)/3 = 0.1524 |
| Group 2 | -0.0062 | Coded Variable 1 | -0.0148 - 0.4782 = -0.4930 | -0.0148 - 0.1524 = -0.1672 |
| Group 3 | 0.4782 | Coded Variable 2 | -0.0062 - 0.4782 = -0.4844 | -0.0062 - 0.1524 = -0.1586 |
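The numbers in this table can be reproduced directly from the group means. A quick sketch (group_means and grand_mean are just helper names used here):
# group means of Y, named by level ("1", "2", "3")
group_means <- tapply(df$Y, df$Xa, mean)
# dummy coding (reference = Group 3): intercept and the two coded variables
group_means[["3"]]                       # intercept, about 0.4782
group_means[["1"]] - group_means[["3"]]  # Coded Variable 1, about -0.4930
group_means[["2"]] - group_means[["3"]]  # Coded Variable 2, about -0.4844
# contrast coding: intercept is the mean of the three group means
grand_mean <- mean(group_means)
grand_mean                               # intercept, about 0.1524
group_means[["1"]] - grand_mean          # Coded Variable 1, about -0.1672
group_means[["2"]] - grand_mean          # Coded Variable 2, about -0.1586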
Situation 1: Categorical IV Only
In Situation 1, the linear regression has only one IV, namely Xa, which has 3 levels.
Situation 1a: Dummy Coding
# dummy coding
contrasts(df$Xa) =contr.treatment(3, base =3)
# linear regression with dummy coding
result<-lm(Y~Xa,data=df)
summary(result)
Call:
lm(formula = Y ~ Xa, data = df)
Residuals:
Min 1Q Median 3Q Max
-1.2588 -0.4883 0.0853 0.4456 1.2369
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4782 0.3918 1.221 0.246
Xa1 -0.4930 0.5540 -0.890 0.391
Xa2 -0.4844 0.5540 -0.874 0.399
Residual standard error: 0.876 on 12 degrees of freedom
Multiple R-squared: 0.07959, Adjusted R-squared: -0.07381
F-statistic: 0.5188 on 2 and 12 DF, p-value: 0.608
Situation 1b: Contrast Coding
# contrast coding
contrasts(df$Xa) =contr.sum(3)
# linear regression with contrast coding
result<-lm(Y~Xa,data=df)
summary(result)
Call:
lm(formula = Y ~ Xa, data = df)
Residuals:
Min 1Q Median 3Q Max
-1.2588 -0.4883 0.0853 0.4456 1.2369
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.1524 0.2262 0.674 0.513
Xa1 -0.1672 0.3199 -0.523 0.611
Xa2 -0.1586 0.3199 -0.496 0.629
Residual standard error: 0.876 on 12 degrees of freedom
Multiple R-squared: 0.07959, Adjusted R-squared: -0.07381
F-statistic: 0.5188 on 2 and 12 DF, p-value: 0.608
|   | S1a | S1b |
|---|---|---|
| Coding | Dummy | Contrast |
| Centering | N. A. | N. A. |
| Intercept | 0.4782 | 0.1524 |
| p-value | 0.246 | 0.513 |
| Coded IV-1 | -0.4930 | -0.1672 |
| p-value | 0.391 | 0.611 |
| Coded IV-2 | -0.4844 | -0.1586 |
| p-value | 0.399 | 0.629 |
Summary of Situation 1 (Single Categorical IV):
- Different codings (dummy coding vs. contrast coding) lead to different meanings of the intercept. With dummy coding, the intercept is the mean of the reference group (here, Group 3). With contrast coding, the intercept is the mean of all groups (i.e., the mean of the group means).
- Note that, for both dummy coding and contrast coding, each regression coefficient tests whether the comparison between a certain group and the reference point (i.e., the intercept) is significant. With dummy coding, the reference point (the value shown as the intercept) is the Group 3 mean in this case; with contrast coding, the reference point is the overall mean.
- When a regression only has a categorical IV, the regression coefficients essentially test whether these mean comparisons are significant (a quick check follows this list).
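As a quick check of this point, with only the categorical IV in the model the fitted values are simply the group means, no matter which coding is used (result here is the model fitted above):
# group means of the observed Y
tapply(df$Y, df$Xa, mean)
# group means of the fitted values: identical to the observed group means
tapply(fitted(result), df$Xa, mean)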
Situation 2: Categorical IV + Continuous IV
In Situation 2, the linear regression model has 2 IVs, namely a categorical IV and a continuous IV.
Compared to Situation 1, after adding a continuous IV you can no longer directly connect the regression coefficients to the group means of Y. This is because the continuous IV explains part of Y, so the interpretation of the regression coefficients for the categorical IV is no longer a simple comparison of raw group means.
Situation 2a: Dummy Coding
# dummy coding
contrasts(df$Xa) =contr.treatment(3, base =3)
# linear regression with dummy coding
result<-lm(Y~Xa+Xb,data=df)
summary(result)
Call:
lm(formula = Y ~ Xa + Xb, data = df)
Residuals:
Min 1Q Median 3Q Max
-1.3704 -0.1987 0.2314 0.3548 0.9232
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3352 0.3572 0.938 0.368
Xa1 -0.1961 0.5167 -0.380 0.712
Xa2 -0.6687 0.5035 -1.328 0.211
Xb -0.4277 0.2131 -2.007 0.070 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7828 on 11 degrees of freedom
Multiple R-squared: 0.3263, Adjusted R-squared: 0.1426
F-statistic: 1.776 on 3 and 11 DF, p-value: 0.2098
Situation 2b: Contrast Coding
# contrast coding
contrasts(df$Xa) =contr.sum(3)
# linear regression with contrast coding
result<-lm(Y~Xa+Xb,data=df)
summary(result)
Call:
lm(formula = Y ~ Xa + Xb, data = df)
Residuals:
Min 1Q Median 3Q Max
-1.3704 -0.1987 0.2314 0.3548 0.9232
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.04691 0.20883 0.225 0.826
Xa1 0.09217 0.31368 0.294 0.774
Xa2 -0.38042 0.30645 -1.241 0.240
Xb -0.42773 0.21312 -2.007 0.070 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7828 on 11 degrees of freedom
Multiple R-squared: 0.3263, Adjusted R-squared: 0.1426
F-statistic: 1.776 on 3 and 11 DF, p-value: 0.2098
As we can see, different codings of the categorical IV do not change the slope of the continuous IV; it is always -0.4277.
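You can verify this by keeping both fits and comparing the two Xb coefficients directly (a small sketch; fit_dummy and fit_contrast are just temporary names used here):
# refit the Situation 2 model under each coding and compare the Xb slope
contrasts(df$Xa) <- contr.treatment(3, base = 3)
fit_dummy <- lm(Y ~ Xa + Xb, data = df)
contrasts(df$Xa) <- contr.sum(3)
fit_contrast <- lm(Y ~ Xa + Xb, data = df)
coef(fit_dummy)[["Xb"]]     # about -0.4277
coef(fit_contrast)[["Xb"]]  # about -0.4277, identical to the dummy-coded fit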
Situation 2c: Dummy Coding + Centering
This section explores whether centering the continuous IV changes anything.
The short answer is that centering the continuous IV only changes the intercept, not any of the other regression coefficients, regardless of dummy or contrast coding.
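For reference, the same centering can also be done by subtracting the mean directly; unlike the scale()-based helper used below, this keeps Xb as a plain numeric vector rather than a one-column matrix (a minimal alternative sketch):
# equivalent to center_scale(df$Xb), but returns a plain vector
df$Xb <- df$Xb - mean(df$Xb)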
# dummy coding
contrasts(df$Xa) = contr.treatment(3, base = 3)
# centering the continuous IV
center_scale <- function(x) { scale(x, scale = FALSE)}
df$Xb<-center_scale(df$Xb)
# linear regression with dummy coding
result<-lm(Y~Xa+Xb,data=df)
summary(result)
Call:
lm(formula = Y ~ Xa + Xb, data = df)
Residuals:
Min 1Q Median 3Q Max
-1.3704 -0.1987 0.2314 0.3548 0.9232
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4406 0.3506 1.257 0.235
Xa1 -0.1961 0.5167 -0.380 0.712
Xa2 -0.6687 0.5035 -1.328 0.211
Xb -0.4277 0.2131 -2.007 0.070 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7828 on 11 degrees of freedom
Multiple R-squared: 0.3263, Adjusted R-squared: 0.1426
F-statistic: 1.776 on 3 and 11 DF, p-value: 0.2098
Situation 2d: Contrast Coding + Centering
# contrast coding
contrasts(df$Xa) =contr.sum(3)
# centering the continuous IV
center_scale <- function(x) { scale(x, scale = FALSE)}
df$Xb<-center_scale(df$Xb)
# linear regression with contrast coding
result<-lm(Y~Xa+Xb,data=df)
summary(result)
Call:
lm(formula = Y ~ Xa + Xb, data = df)
Residuals:
Min 1Q Median 3Q Max
-1.3704 -0.1987 0.2314 0.3548 0.9232
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.15238 0.20211 0.754 0.467
Xa1 0.09217 0.31368 0.294 0.774
Xa2 -0.38042 0.30645 -1.241 0.240
Xb -0.42773 0.21312 -2.007 0.070 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7828 on 11 degrees of freedom
Multiple R-squared: 0.3263, Adjusted R-squared: 0.1426
F-statistic: 1.776 on 3 and 11 DF, p-value: 0.2098
|   | S1a | S1b | S2a | S2b | S2c | S2d |
|---|---|---|---|---|---|---|
| Coding | Dummy | Contrast | Dummy | Contrast | Dummy | Contrast |
| Centering | N. A. | N. A. | No | No | Yes | Yes |
| Intercept | 0.4782 | 0.1524 | 0.3352 | 0.04691 | 0.4406 | 0.15238 |
| p-value | 0.246 | 0.513 | 0.368 | 0.826 | 0.235 | 0.467 |
| Coded IV-1 | -0.4930 | -0.1672 | -0.1961 | 0.09217 | -0.1961 | 0.09217 |
| p-value | 0.391 | 0.611 | 0.712 | 0.774 | 0.712 | 0.774 |
| Coded IV-2 | -0.4844 | -0.1586 | -0.6687 | -0.38042 | -0.6687 | -0.38042 |
| p-value | 0.399 | 0.629 | 0.211 | 0.240 | 0.211 | 0.240 |
| Continuous IV |  |  | -0.4277 | -0.4277 | -0.4277 | -0.4277 |
| p-value |  |  | 0.070 | 0.070 | 0.070 | 0.070 |
Summary of Situation 2 (Categorical Variable + Continuous Variable):
- When a continuous variable is added to the regression, things become more complicated: you can no longer think of the regression coefficients as simple mean comparisons of Y across the levels of X. Because the continuous variable (e.g., Xb) explains part of Y, the regression coefficients of the categorical IV (e.g., Xa) can no longer be traced back directly to comparisons of the raw group means of Y.
- S2a vs. S2b: An interesting observation: regardless of dummy or contrast coding, the regression coefficient for Xb is the same, namely -0.4277 in this example.
- S2c vs. S2d: Same as point 2.
- S2a+S2b vs. S2c+S2d: Another interesting observation: centering the continuous IV (e.g., Xb) only changes the intercept, not any of the other regression coefficients, regardless of dummy or contrast coding (the sketch after this list shows how large the intercept shift is).
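The size of the intercept change is predictable: centering Xb shifts the intercept by the Xb slope times the mean of Xb. A small arithmetic sketch, assuming Xb_raw is a copy of the uncentered Xb saved before running center_scale():
# Xb_raw <- df$Xb   # run this line BEFORE centering to keep the original values
shift <- -0.4277 * mean(Xb_raw)   # Xb slope (from the outputs above) times mean(Xb)
0.3352 + shift    # should be close to 0.4406, the intercept in S2c (dummy + centering)
0.04691 + shift   # should be close to 0.15238, the intercept in S2d (contrast + centering)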
Situation 3: Interaction
Compared to Situation 2, interaction terms are added in Situation 3. Note that, since the categorical IV has 2 coded variables, there are 2 interaction terms.
With the 2 interaction terms included, dummy coding and contrast coding differ in the slope of the continuous variable, as well as in the regression coefficients for the two interaction terms; the p-values differ as well.
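One way to see why they differ: with the interaction terms included, the model effectively gives each group its own slope for Xb. Under dummy coding, the Xb coefficient is the simple slope within the reference group (Group 3); under contrast coding, it is the average of the three group-specific slopes. A small sketch that recovers the group slopes from separate per-group regressions (the estimates match the interaction model because that model also allows one slope per group):
# slope of Xb within each group, from a separate regression per group
group_slopes <- sapply(split(df, df$Xa),
                       function(d) coef(lm(Y ~ Xb, data = d))[2])  # [2] = the Xb slope
group_slopes        # the Group 3 slope should match the dummy-coded Xb estimate (-0.5503)
mean(group_slopes)  # should match the contrast-coded Xb estimate (about -0.3818)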
Situation 3a: Dummy Coding
# dummy coding
contrasts(df$Xa) =contr.treatment(3, base=3)
# linear regression with dummy coding
result<-lm(Y~Xa+Xb+Xa*Xb,data=df)
summary(result)
Call:
lm(formula = Y ~ Xa + Xb + Xa * Xb, data = df)
Residuals:
Min 1Q Median 3Q Max
-1.3822 -0.2130 0.1822 0.3688 0.8626
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4299 0.3793 1.133 0.286
Xa1 -0.3258 0.6018 -0.541 0.601
Xa2 -0.6431 0.5978 -1.076 0.310
Xb -0.5503 0.3141 -1.752 0.114
Xa1:Xb 0.3543 0.5506 0.644 0.536
Xa2:Xb 0.1513 0.6001 0.252 0.807
Residual standard error: 0.846 on 9 degrees of freedom
Multiple R-squared: 0.3562, Adjusted R-squared: -0.001457
F-statistic: 0.9959 on 5 and 9 DF, p-value: 0.4715
Situation 3b: Contrast Coding
# contrast coding
contrasts(df$Xa) =contr.sum(3)
# linear regression with contrast coding
result<-lm(Y~Xa+Xb+Xa*Xb,data=df)
summary(result)
Call:
lm(formula = Y ~ Xa + Xb + Xa * Xb, data = df)
Residuals:
Min 1Q Median 3Q Max
-1.3822 -0.2130 0.1822 0.3688 0.8626
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.01277 0.26246 0.049 0.962
Xa1 0.04295 0.35414 0.121 0.906
Xa2 -0.32435 0.40941 -0.792 0.449
Xb -0.38180 0.25047 -1.524 0.162
Xa1:Xb 0.18580 0.36180 0.514 0.620
Xa2:Xb -0.01726 0.38715 -0.045 0.965
Residual standard error: 0.846 on 9 degrees of freedom
Multiple R-squared: 0.3562, Adjusted R-squared: -0.001457
F-statistic: 0.9959 on 5 and 9 DF, p-value: 0.4715
Situation 3c: Dummy Coding + Centering
This section and the next explore whether centering the continuous IV changes anything once the interaction terms are included. As the outputs and the summary table below show, centering can change the intercept and the two coded variables, but it does NOT change any output related to the centered variable itself, namely the continuous IV, or the interaction terms.
# dummy coding
contrasts(df$Xa) =contr.treatment(3, base=3)
# centering the continuous IV
center_scale <- function(x) { scale(x, scale = FALSE)}
df$Xb<-center_scale(df$Xb)
# linear regression with dummy coding
result<-lm(Y~Xa+Xb+Xa*Xb,data=df)
summary(result)
Call:
lm(formula = Y ~ Xa + Xb + Xa * Xb, data = df)
Residuals:
Min 1Q Median 3Q Max
-1.3822 -0.2130 0.1822 0.3688 0.8626
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4299 0.3793 1.133 0.286
Xa1 -0.3258 0.6018 -0.541 0.601
Xa2 -0.6431 0.5978 -1.076 0.310
Xb -0.5503 0.3141 -1.752 0.114
Xa1:Xb 0.3543 0.5506 0.644 0.536
Xa2:Xb 0.1513 0.6001 0.252 0.807
Residual standard error: 0.846 on 9 degrees of freedom
Multiple R-squared: 0.3562, Adjusted R-squared: -0.001457
F-statistic: 0.9959 on 5 and 9 DF, p-value: 0.4715
Situation 3d: Contrast Coding + Centering
# contrast coding
contrasts(df$Xa) =contr.sum(3)
# centering the continuous IV
center_scale <- function(x) { scale(x, scale = FALSE)}
df$Xb<-center_scale(df$Xb)
# linear regression with contrast coding
result<-lm(Y~Xa+Xb+Xa*Xb,data=df)
summary(result)
Call:
lm(formula = Y ~ Xa + Xb + Xa * Xb, data = df)
Residuals:
Min 1Q Median 3Q Max
-1.3822 -0.2130 0.1822 0.3688 0.8626
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.106919 0.252900 0.423 0.682
Xa1 -0.002872 0.369765 -0.008 0.994
Xa2 -0.320094 0.367566 -0.871 0.406
Xb -0.381798 0.250467 -1.524 0.162
Xa1:Xb 0.185803 0.361797 0.514 0.620
Xa2:Xb -0.017262 0.387154 -0.045 0.965
Residual standard error: 0.846 on 9 degrees of freedom
Multiple R-squared: 0.3562, Adjusted R-squared: -0.001457
F-statistic: 0.9959 on 5 and 9 DF, p-value: 0.4715
|   | S1a | S1b | S2a | S2b | S2c | S2d | S3a | S3b | S3c | S3d |
|---|---|---|---|---|---|---|---|---|---|---|
| Coding | Dum. | Contra. | Dum. | Contra. | Dum. | Contra. | Dum. | Contra. | Dum. | Contra. |
| Centering | N. A. | N. A. | No | No | Yes | Yes | No | No | Yes | Yes |
| Intercept | 0.4782 | 0.1524 | 0.3352 | 0.04691 | 0.4406 | 0.15238 | 0.4299 | 0.01277 | 0.4299 | 0.106919 |
| p-value | 0.246 | 0.513 | 0.368 | 0.826 | 0.235 | 0.467 | 0.286 | 0.962 | 0.286 | 0.682 |
| Coded IV-1 | -0.4930 | -0.1672 | -0.1961 | 0.09217 | -0.1961 | 0.09217 | -0.3258 | 0.04295 | -0.3258 | -0.002872 |
| p-value | 0.391 | 0.611 | 0.712 | 0.774 | 0.712 | 0.774 | 0.601 | 0.906 | 0.601 | 0.994 |
| Coded IV-2 | -0.4844 | -0.1586 | -0.6687 | -0.38042 | -0.6687 | -0.38042 | -0.6431 | -0.32435 | -0.6431 | -0.320094 |
| p-value | 0.399 | 0.629 | 0.211 | 0.240 | 0.211 | 0.240 | 0.310 | 0.449 | 0.310 | 0.406 |
| Continuous IV |  |  | -0.4277 | -0.4277 | -0.4277 | -0.4277 | -0.5503 | -0.38180 | -0.5503 | -0.381798 |
| p-value |  |  | 0.070 | 0.070 | 0.070 | 0.070 | 0.114 | 0.162 | 0.114 | 0.162 |
| Interaction with Coded IV-1 |  |  |  |  |  |  | 0.3543 | 0.18580 | 0.3543 | 0.185803 |
| p-value |  |  |  |  |  |  | 0.536 | 0.620 | 0.536 | 0.620 |
| Interaction with Coded IV-2 |  |  |  |  |  |  | 0.1513 | -0.01726 | 0.1513 | -0.017262 |
| p-value |  |  |  |  |  |  | 0.807 | 0.965 | 0.807 | 0.965 |
Summary for Situation 3 (Categorical Variable + Continuous Variable + Categorical Variable * Continuous Variable):
- S2 vs. S3: Adding the interaction terms changes all the other coefficients.
- S3a vs. S3b: Coding makes a difference in both the coefficient and the p-value for the continuous IV.
- S3a vs. S3b: Coding makes a difference in both the coefficient and the p-value for both interaction terms. (This is VERY important, especially the fact that the p-values differ between dummy coding and contrast coding!)
- S3c vs. S3d: Same as point 2.
- S3c vs. S3d: Same as point 3.
- S3a+S3b vs. S3c+S3d: Centering changes both the coefficient and the p-value for the intercept and the categorical coded variables only under contrast coding, not under dummy coding in these outputs. (This is important! The sketch after this list shows the arithmetic behind the contrast-coding changes.)
- S3a+S3b vs. S3c+S3d: Centering does not change either the coefficient or the p-value for the continuous IV or the interaction terms. (This is important!)
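The contrast-coding changes noted above follow the same centering algebra as in Situation 2: the intercept shifts by the Xb coefficient times mean(Xb), and each coded variable shifts by its interaction coefficient times mean(Xb). A small arithmetic sketch using the Situation 3b estimates (Xb_raw is again the saved uncentered Xb):
# intercept: S3b estimate plus the Xb coefficient times mean(Xb)
0.01277 + (-0.38180) * mean(Xb_raw)   # should be close to 0.106919, the S3d intercept
# coded variable 1: S3b estimate plus the Xa1:Xb coefficient times mean(Xb)
0.04295 + 0.18580 * mean(Xb_raw)      # should be close to -0.002872, the S3d Xa1 estimate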
Take-home message:
When the categorical variable has 3 levels, different codings (e.g., dummy coding vs. contrast coding) can lead to differences in the p-values of the interaction terms.
Disclaimer:
Please read the disclaimer statement.
Further Reading:
Note that there is another tutorial about dummy and contrast coding in R.