Dummy and Contrast Codings in Linear Regression

This tutorial explains the differences between dummy coding and contrast coding in linear regression, using R code examples. It focuses on the case where the categorical independent variable (IV) has 3 levels.

Short Note

Note that, in R, the default reference group in dummy coding is the first level in alphabetical (or numeric) order. Thus, by default, Group 1 is the reference level: the first dummy-coded variable compares Group 2 with Group 1, and the second dummy-coded variable compares Group 3 with Group 1.

In contrast, with contrast coding in R, the first contrast-coded variable compares Group 1 with the overall mean, and the second contrast-coded variable compares Group 2 with the overall mean. (Here the overall mean is the mean of the three group means, which equals the grand mean of Y because the groups are the same size.)

Therefore, to make dummy coding and contrast coding more consistent (and less confusing), I changed the reference level in the dummy coding to Group 3 in what follows.

In particular, in this tutorial, the first dummy-coded variable compares Group 1 with Group 3, and the second dummy-coded variable compares Group 2 with Group 3.
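To see what each coding does to the three groups, it can help to print the coding matrices themselves. The sketch below uses only the base R functions contr.treatment() and contr.sum(), which are the ones used later in this tutorial; the object names dummy and effect are illustrative assumptions.

```r
# Rows are the factor levels (Group 1, 2, 3); columns are the coded variables.

# Dummy coding with Group 3 as the reference level
dummy <- contr.treatment(3, base = 3)
dummy
#   1 2
# 1 1 0
# 2 0 1
# 3 0 0

# Sum-to-zero coding, which this tutorial calls contrast coding
effect <- contr.sum(3)
effect
#   [,1] [,2]
# 1    1    0
# 2    0    1
# 3   -1   -1
```

Each row is a group and each column is a coded variable. Under dummy coding, Group 3 is coded 0 on both variables, which is what makes it the reference. Under contrast coding, Group 3 is coded -1 on both, so the coefficients become deviations from the mean of the group means.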

Simulated Data

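# set seed (assumed to be 123, which matches the values printed below)
set.seed(123)
Y  <- rnorm(15)   # outcome
Xb <- rnorm(15)   # continuous IV

# Repeat a sequence of numbers:
Xa <- rep(c(1, 2, 3), times = 5)

# combine it into a data frame (Xa as a factor, so contrasts can be set later)
df <- data.frame(Xa = factor(Xa), Xb, Y)
df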
   Xa         Xb           Y
1   1  1.7869131 -0.56047565
2   2  0.4978505 -0.23017749
3   3 -1.9666172  1.55870831
4   1  0.7013559  0.07050839
5   2 -0.4727914  0.12928774
6   3 -1.0678237  1.71506499
7   1 -0.2179749  0.46091621
8   2 -1.0260044 -1.26506123
9   3 -0.7288912 -0.68685285
10  1 -0.6250393 -0.44566197
11  2 -1.6866933  1.22408180
12  3  0.8377870  0.35981383
13  1  0.1533731  0.40077145
14  2 -1.1381369  0.11068272
15  3  1.2538149 -0.55584113

We can calculate the means for each group level using the following R code.

# calculate means by group
aggregate(df$Y, list(df$Xa), FUN=mean) 
  Group.1            x
1       1 -0.014788314
2       2 -0.006237295
3       3  0.478178628

Xa        Means                        Dummy Coding                  Contrast Coding
Group 1   -0.0148   Intercept          0.4782                        (-0.0148 + (-0.0062) + 0.4782)/3 = 0.1524
Group 2   -0.0062   Coded Variable 1   -0.0148 - 0.4782 = -0.4930    -0.0148 - 0.1524 = -0.1672
Group 3   0.4782    Coded Variable 2   -0.0062 - 0.4782 = -0.4844    -0.0062 - 0.1524 = -0.1586

Situation 1: Categorical IV Only

In Situation 1, the linear regression has only one IV, namely Xa, which has 3 levels.

Situation 1a: Dummy Coding

# dummy coding (Group 3 as the reference level)
contrasts(df$Xa) <- contr.treatment(3, base = 3)

# linear regression with dummy coding
summary(lm(formula = Y ~ Xa, data = df))

    Min      1Q  Median      3Q     Max 
-1.2588 -0.4883  0.0853  0.4456  1.2369 

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.4782     0.3918   1.221    0.246
Xa1          -0.4930     0.5540  -0.890    0.391
Xa2          -0.4844     0.5540  -0.874    0.399

Residual standard error: 0.876 on 12 degrees of freedom
Multiple R-squared:  0.07959,	Adjusted R-squared:  -0.07381 
F-statistic: 0.5188 on 2 and 12 DF,  p-value: 0.608

Situation 1b: Contrast Coding

# contrast coding
contrasts(df$Xa) <- contr.sum(3)

# linear regression with contrast coding
summary(lm(formula = Y ~ Xa, data = df))

    Min      1Q  Median      3Q     Max 
-1.2588 -0.4883  0.0853  0.4456  1.2369 

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.1524     0.2262   0.674    0.513
Xa1          -0.1672     0.3199  -0.523    0.611
Xa2          -0.1586     0.3199  -0.496    0.629

Residual standard error: 0.876 on 12 degrees of freedom
Multiple R-squared:  0.07959,	Adjusted R-squared:  -0.07381 
F-statistic: 0.5188 on 2 and 12 DF,  p-value: 0.608

Coding          Dummy      Contrast
Centering       N.A.       N.A.
Intercept       0.4782     0.1524
p-value         0.246      0.513
Coded IV-1      -0.4930    -0.1672
p-value         0.391      0.611
Coded IV-2      -0.4844    -0.1586

Summary of Situation 1 (Single Categorical IV):

  1. Different codings (dummy coding vs. contrast coding) give the intercept different meanings. For dummy coding, the intercept is the mean of the reference group (or reference level). For contrast coding, the intercept is the mean of all groups (or all levels).
  2. For both dummy coding and contrast coding, each regression coefficient tests whether a given group differs significantly from the baseline captured by the intercept. In dummy coding, that baseline is the reference group, which is Group 3 in this case. In contrast coding, that baseline is the overall mean.
  3. When a regression has only categorical variables, the regression coefficients are essentially tests of whether mean differences are significant.
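The link between group means and coefficients summarized above can be checked numerically. The following is a minimal sketch, assuming simulated data like that used in this tutorial (the seed 123 and all object names are assumptions for illustration). It rebuilds each coefficient from the group means and compares it with what lm() returns.

```r
set.seed(123)  # assumed seed, for illustration only
df <- data.frame(Xa = factor(rep(1:3, times = 5)),
                 Y  = rnorm(15))

# Group means of Y, named "1", "2", "3"
m <- tapply(df$Y, df$Xa, mean)

# Dummy coding with Group 3 as the reference level
contrasts(df$Xa) <- contr.treatment(3, base = 3)
b_dummy <- coef(lm(Y ~ Xa, data = df))

# Contrast (sum-to-zero) coding
contrasts(df$Xa) <- contr.sum(3)
b_sum <- coef(lm(Y ~ Xa, data = df))

# Coefficients rebuilt by hand from the group means
expect_dummy <- c(m[["3"]], m[["1"]] - m[["3"]], m[["2"]] - m[["3"]])
expect_sum   <- c(mean(m), m[["1"]] - mean(m), m[["2"]] - mean(m))

# Both comparisons should agree up to numerical tolerance
all.equal(unname(b_dummy), expect_dummy)
all.equal(unname(b_sum), expect_sum)
```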

Situation 2: Categorical IV + Continuous IV

In Situation 2, the linear regression model has 2 IVs, namely a categorical IV and a continuous IV.

Compared to Situation 1, once a continuous IV is added, you can no longer directly connect the coefficients of the categorical IV to the group means of Y. This is because the continuous IV explains part of Y, so the regression coefficients for the categorical IV are now adjusted for it and are no longer simple mean differences.

Situation 2a: Dummy Coding

# dummy coding (Group 3 as the reference level)
contrasts(df$Xa) <- contr.treatment(3, base = 3)

# linear regression with dummy coding
summary(lm(formula = Y ~ Xa + Xb, data = df))

    Min      1Q  Median      3Q     Max 
-1.3704 -0.1987  0.2314  0.3548  0.9232 

            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   0.3352     0.3572   0.938    0.368  
Xa1          -0.1961     0.5167  -0.380    0.712  
Xa2          -0.6687     0.5035  -1.328    0.211  
Xb           -0.4277     0.2131  -2.007    0.070 .
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7828 on 11 degrees of freedom
Multiple R-squared:  0.3263,	Adjusted R-squared:  0.1426 
F-statistic: 1.776 on 3 and 11 DF,  p-value: 0.2098

Situation 2b: Contrast Coding

# contrast coding
contrasts(df$Xa) <- contr.sum(3)

# linear regression with contrast coding
summary(lm(formula = Y ~ Xa + Xb, data = df))

    Min      1Q  Median      3Q     Max 
-1.3704 -0.1987  0.2314  0.3548  0.9232 

            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.04691    0.20883   0.225    0.826  
Xa1          0.09217    0.31368   0.294    0.774  
Xa2         -0.38042    0.30645  -1.241    0.240  
Xb          -0.42773    0.21312  -2.007    0.070 .
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7828 on 11 degrees of freedom
Multiple R-squared:  0.3263,	Adjusted R-squared:  0.1426 
F-statistic: 1.776 on 3 and 11 DF,  p-value: 0.2098

As we can see, the coding of the categorical IV does not change the slope of the continuous IV, which is -0.4277 under both codings.
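This invariance is expected, because the two codings are different parameterizations of the same model. A minimal sketch, assuming simulated data like that used in this tutorial (the seed 123 and the object names are illustrative assumptions), fits both versions and compares the slope of Xb and the fitted values.

```r
set.seed(123)  # assumed seed, for illustration only
Y  <- rnorm(15)
Xb <- rnorm(15)
df <- data.frame(Xa = factor(rep(1:3, times = 5)), Xb, Y)

# Same model, fitted once under each coding
contrasts(df$Xa) <- contr.treatment(3, base = 3)
fit_dummy <- lm(Y ~ Xa + Xb, data = df)

contrasts(df$Xa) <- contr.sum(3)
fit_sum <- lm(Y ~ Xa + Xb, data = df)

# The slope of the continuous IV and the fitted values should be identical
slope_dummy <- coef(fit_dummy)[["Xb"]]
slope_sum   <- coef(fit_sum)[["Xb"]]
all.equal(slope_dummy, slope_sum)
all.equal(fitted(fit_dummy), fitted(fit_sum))
```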

Situation 2c: Dummy Coding + Centering

This section explores whether centering the continuous IV changes anything.

The short answer is that centering the continuous IV changes only the intercept, not any of the other regression coefficients, regardless of dummy or contrast coding.

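# dummy coding (Group 3 as the reference level)
contrasts(df$Xa) <- contr.treatment(3, base = 3)

# centering the continuous IV (in a copy, so df keeps the original Xb)
center_scale <- function(x) { as.numeric(scale(x, scale = FALSE)) }
df_centered <- df
df_centered$Xb <- center_scale(df$Xb)

# linear regression with dummy coding
summary(lm(formula = Y ~ Xa + Xb, data = df_centered))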

    Min      1Q  Median      3Q     Max 
-1.3704 -0.1987  0.2314  0.3548  0.9232 

            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   0.4406     0.3506   1.257    0.235  
Xa1          -0.1961     0.5167  -0.380    0.712  
Xa2          -0.6687     0.5035  -1.328    0.211  
Xb           -0.4277     0.2131  -2.007    0.070 .
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7828 on 11 degrees of freedom
Multiple R-squared:  0.3263,	Adjusted R-squared:  0.1426 
F-statistic: 1.776 on 3 and 11 DF,  p-value: 0.2098

Situation 2d: Contrast Coding + Centering

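# contrast coding
contrasts(df$Xa) <- contr.sum(3)

# centering the continuous IV (in a copy, so df keeps the original Xb)
center_scale <- function(x) { as.numeric(scale(x, scale = FALSE)) }
df_centered <- df
df_centered$Xb <- center_scale(df$Xb)

# linear regression with contrast coding
summary(lm(formula = Y ~ Xa + Xb, data = df_centered))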

    Min      1Q  Median      3Q     Max 
-1.3704 -0.1987  0.2314  0.3548  0.9232 

            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.15238    0.20211   0.754    0.467  
Xa1          0.09217    0.31368   0.294    0.774  
Xa2         -0.38042    0.30645  -1.241    0.240  
Xb          -0.42773    0.21312  -2.007    0.070 .
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7828 on 11 degrees of freedom
Multiple R-squared:  0.3263,	Adjusted R-squared:  0.1426 
F-statistic: 1.776 on 3 and 11 DF,  p-value: 0.2098
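
Situation       S1a        S1b        S2a        S2b        S2c        S2d
Coding          Dummy      Contrast   Dummy      Contrast   Dummy      Contrast
Centering       N.A.       N.A.       No         No         Yes        Yes
Intercept       0.4782     0.1524     0.3352     0.04691    0.4406     0.15238
p-value         0.246      0.513      0.368      0.826      0.235      0.467
Coded IV-1      -0.4930    -0.1672    -0.1961    0.09217    -0.1961    0.09217
p-value         0.391      0.611      0.712      0.774      0.712      0.774
Coded IV-2      -0.4844    -0.1586    -0.6687    -0.38042   -0.6687    -0.38042
Continuous IV   N.A.       N.A.       -0.4277    -0.4277    -0.4277    -0.4277
p-value         N.A.       N.A.       0.070      0.070      0.070      0.070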

Summary of Situation 2 (Categorical Variable + Continuous Variable):

  1. Once a continuous variable is added to the regression, the picture becomes more complicated: the regression coefficients can no longer be read as plain comparisons of the mean of Y across levels of X. Because the continuous variable (e.g., Xb) explains part of Y, the coefficients of the categorical IV (e.g., Xa) can no longer be traced back directly to differences in group means of Y.
  2. S2a vs. S2b: Regardless of dummy or contrast coding, the regression coefficient for Xb is the same, namely -0.4277 in the example.
  3. S2c vs. S2d: Same as point 2.
  4. S2a+S2b vs. S2c+S2d: Centering the continuous IV (e.g., Xb) changes only the intercept, not any of the other regression coefficients, regardless of dummy or contrast coding.
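The size of the intercept shift can be worked out: replacing Xb with Xb minus its mean moves the intercept by (slope of Xb) × mean(Xb) and leaves everything else alone. In the outputs above, for example, 0.4406 - 0.3352 and 0.15238 - 0.04691 are both about 0.105. The sketch below checks this relationship under dummy coding, assuming simulated data like that used in this tutorial (the seed 123 and the object names are illustrative assumptions).

```r
set.seed(123)  # assumed seed, for illustration only
Y  <- rnorm(15)
Xb <- rnorm(15)
df <- data.frame(Xa = factor(rep(1:3, times = 5)), Xb, Y)
contrasts(df$Xa) <- contr.treatment(3, base = 3)

# Fit with the raw continuous IV
b_raw <- coef(lm(Y ~ Xa + Xb, data = df))

# Fit with the centered continuous IV (in a copy of the data)
df_c <- df
df_c$Xb <- df$Xb - mean(df$Xb)
b_centered <- coef(lm(Y ~ Xa + Xb, data = df_c))

# Slopes are unchanged ...
keep <- c("Xa1", "Xa2", "Xb")
all.equal(b_centered[keep], b_raw[keep])

# ... and the intercept moves by slope * mean(Xb)
shifted <- b_raw[["(Intercept)"]] + b_raw[["Xb"]] * mean(df$Xb)
all.equal(b_centered[["(Intercept)"]], shifted)
```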

Situation 3: Interaction

Compared to Situation 2, Situation 3 adds interaction terms. Since the 3-level categorical IV is represented by 2 coded variables, there are 2 interaction terms.

With the 2 interaction terms in the model, dummy coding and contrast coding give different slopes for the continuous variable, as well as different regression coefficients for the two interaction terms. The corresponding p-values all differ as well.

Situation 3a: Dummy Coding

# dummy coding (Group 3 as the reference level)
contrasts(df$Xa) <- contr.treatment(3, base = 3)

# linear regression with dummy coding
summary(lm(formula = Y ~ Xa + Xb + Xa * Xb, data = df))

    Min      1Q  Median      3Q     Max 
-1.3822 -0.2130  0.1822  0.3688  0.8626 

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.4299     0.3793   1.133    0.286
Xa1          -0.3258     0.6018  -0.541    0.601
Xa2          -0.6431     0.5978  -1.076    0.310
Xb           -0.5503     0.3141  -1.752    0.114
Xa1:Xb        0.3543     0.5506   0.644    0.536
Xa2:Xb        0.1513     0.6001   0.252    0.807

Residual standard error: 0.846 on 9 degrees of freedom
Multiple R-squared:  0.3562,	Adjusted R-squared:  -0.001457 
F-statistic: 0.9959 on 5 and 9 DF,  p-value: 0.4715

Situation 3b: Contrast Coding

# contrast coding
contrasts(df$Xa) <- contr.sum(3)

# linear regression with contrast coding
summary(lm(formula = Y ~ Xa + Xb + Xa * Xb, data = df))

    Min      1Q  Median      3Q     Max 
-1.3822 -0.2130  0.1822  0.3688  0.8626 

            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.01277    0.26246   0.049    0.962
Xa1          0.04295    0.35414   0.121    0.906
Xa2         -0.32435    0.40941  -0.792    0.449
Xb          -0.38180    0.25047  -1.524    0.162
Xa1:Xb       0.18580    0.36180   0.514    0.620
Xa2:Xb      -0.01726    0.38715  -0.045    0.965

Residual standard error: 0.846 on 9 degrees of freedom
Multiple R-squared:  0.3562,	Adjusted R-squared:  -0.001457 
F-statistic: 0.9959 on 5 and 9 DF,  p-value: 0.4715

Situation 3c: Dummy Coding + Centering

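With an interaction in the model, centering changes the intercept and the two coded variables, because each of them is evaluated at Xb = 0, which after centering corresponds to the mean of Xb. Centering does NOT change any output related to the centered variable itself, namely the continuous IV and its interaction terms. Note that the output shown below is identical to Situation 3a, which is what a fit on the uncentered Xb would give; with the centered Xb, the intercept, Xa1, and Xa2 would differ from Situation 3a.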
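The effect of centering in a model with an interaction follows from substituting Xb = centered Xb + mean(Xb) into the regression equation: the intercept shifts by (slope of Xb) × mean(Xb), and each coded variable shifts by (its interaction coefficient) × mean(Xb). A minimal sketch that checks these shifts for dummy coding, assuming simulated data like that used in this tutorial (the seed 123 and the object names are illustrative assumptions):

```r
set.seed(123)  # assumed seed, for illustration only
Y  <- rnorm(15)
Xb <- rnorm(15)
df <- data.frame(Xa = factor(rep(1:3, times = 5)), Xb, Y)
contrasts(df$Xa) <- contr.treatment(3, base = 3)

# Interaction model with the raw continuous IV
b <- coef(lm(Y ~ Xa * Xb, data = df))

# Same model with the centered continuous IV (in a copy of the data)
m_xb <- mean(df$Xb)
df_c <- df
df_c$Xb <- df$Xb - m_xb
bc <- coef(lm(Y ~ Xa * Xb, data = df_c))

# Intercept and coded variables shift by (matching slope term) * mean(Xb)
all.equal(bc[["(Intercept)"]], b[["(Intercept)"]] + b[["Xb"]] * m_xb)
all.equal(bc[["Xa1"]], b[["Xa1"]] + b[["Xa1:Xb"]] * m_xb)
all.equal(bc[["Xa2"]], b[["Xa2"]] + b[["Xa2:Xb"]] * m_xb)

# The continuous IV and the interaction terms are unchanged
same <- c("Xb", "Xa1:Xb", "Xa2:Xb")
all.equal(bc[same], b[same])
```

The same substitution applies under contrast coding, which is consistent with the differences between Situation 3b and Situation 3d.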

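# dummy coding (Group 3 as the reference level)
contrasts(df$Xa) <- contr.treatment(3, base = 3)

# centering the continuous IV (in a copy, so df keeps the original Xb)
center_scale <- function(x) { as.numeric(scale(x, scale = FALSE)) }
df_centered <- df
df_centered$Xb <- center_scale(df$Xb)

# linear regression with dummy coding
summary(lm(formula = Y ~ Xa + Xb + Xa * Xb, data = df_centered))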

    Min      1Q  Median      3Q     Max 
-1.3822 -0.2130  0.1822  0.3688  0.8626 

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.4299     0.3793   1.133    0.286
Xa1          -0.3258     0.6018  -0.541    0.601
Xa2          -0.6431     0.5978  -1.076    0.310
Xb           -0.5503     0.3141  -1.752    0.114
Xa1:Xb        0.3543     0.5506   0.644    0.536
Xa2:Xb        0.1513     0.6001   0.252    0.807

Residual standard error: 0.846 on 9 degrees of freedom
Multiple R-squared:  0.3562,	Adjusted R-squared:  -0.001457 
F-statistic: 0.9959 on 5 and 9 DF,  p-value: 0.4715

Situation 3d: Contrast Coding + Centering

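# contrast coding
contrasts(df$Xa) <- contr.sum(3)

# centering the continuous IV (in a copy, so df keeps the original Xb)
center_scale <- function(x) { as.numeric(scale(x, scale = FALSE)) }
df_centered <- df
df_centered$Xb <- center_scale(df$Xb)

# linear regression with contrast coding
summary(lm(formula = Y ~ Xa + Xb + Xa * Xb, data = df_centered))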

    Min      1Q  Median      3Q     Max 
-1.3822 -0.2130  0.1822  0.3688  0.8626 

             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.106919   0.252900   0.423    0.682
Xa1         -0.002872   0.369765  -0.008    0.994
Xa2         -0.320094   0.367566  -0.871    0.406
Xb          -0.381798   0.250467  -1.524    0.162
Xa1:Xb       0.185803   0.361797   0.514    0.620
Xa2:Xb      -0.017262   0.387154  -0.045    0.965

Residual standard error: 0.846 on 9 degrees of freedom
Multiple R-squared:  0.3562,	Adjusted R-squared:  -0.001457 
F-statistic: 0.9959 on 5 and 9 DF,  p-value: 0.4715
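
Situation                     S1a        S1b        S2a        S2b        S2c        S2d        S3a        S3b        S3c        S3d
Coding                        Dummy      Contrast   Dummy      Contrast   Dummy      Contrast   Dummy      Contrast   Dummy      Contrast
Centering                     N.A.       N.A.       No         No         Yes        Yes        No         No         Yes        Yes
Intercept                     0.4782     0.1524     0.3352     0.04691    0.4406     0.15238    0.4299     0.01277    0.4299     0.106919
p-value                       0.246      0.513      0.368      0.826      0.235      0.467      0.286      0.962      0.286      0.682
Coded IV-1                    -0.4930    -0.1672    -0.1961    0.09217    -0.1961    0.09217    -0.3258    0.04295    -0.3258    -0.002872
p-value                       0.391      0.611      0.712      0.774      0.712      0.774      0.601      0.906      0.601      0.994
Coded IV-2                    -0.4844    -0.1586    -0.6687    -0.38042   -0.6687    -0.38042   -0.6431    -0.32435   -0.6431    -0.320094
Continuous IV                 N.A.       N.A.       -0.4277    -0.4277    -0.4277    -0.4277    -0.5503    -0.38180   -0.5503    -0.381798
p-value                       N.A.       N.A.       0.070      0.070      0.070      0.070      0.114      0.162      0.114      0.162
Interaction with Coded IV-1   N.A.       N.A.       N.A.       N.A.       N.A.       N.A.       0.3543     0.18580    0.3543     0.185803
Interaction with Coded IV-2   N.A.       N.A.       N.A.       N.A.       N.A.       N.A.       0.1513     -0.01726   0.1513     -0.017262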

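Summary of Situation 3 (Categorical Variable + Continuous Variable + Categorical Variable * Continuous Variable):

  1. S2 vs. S3: Adding the interaction terms changes all other coefficients.
  2. S3a vs. S3b: Coding changes both the coefficient and the p-value for the continuous IV.
  3. S3a vs. S3b: Coding changes both the coefficient and the p-value for both interaction terms. (This is VERY important, especially the fact that the p-values differ between dummy coding and contrast coding!) The two codings test different comparisons: under dummy coding, an interaction coefficient is a group's slope minus the slope in Group 3, whereas under contrast coding it is a group's slope minus the average of the three group slopes.
  4. S3c vs. S3d: Same as point 2.
  5. S3c vs. S3d: Same as point 3.
  6. S3a+S3b vs. S3c+S3d: Centering changes the coefficient and the p-value of the intercept and of the coded categorical IVs. (This is important!) In the outputs above this is visible for contrast coding (S3b vs. S3d). The dummy-coded S3c output is identical to S3a, but that is what an uncentered fit would give: algebraically, centering shifts the intercept by (slope of Xb) × mean(Xb) and each coded variable by (its interaction coefficient) × mean(Xb) under either coding.
  7. S3a+S3b vs. S3c+S3d: Centering changes neither the coefficient nor the p-value for the continuous IV or the interaction terms. (This is important!)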
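One way to see why coding changes the continuous-IV and interaction coefficients is to compare them with the slope of Y on Xb inside each group. In the outputs above, the three group slopes implied by dummy coding are -0.5503, -0.5503 + 0.3543, and -0.5503 + 0.1513, and their average is about -0.3818, which is the Xb coefficient under contrast coding. The sketch below checks this interpretation, assuming simulated data like that used in this tutorial (the seed 123 and the object names are illustrative assumptions).

```r
set.seed(123)  # assumed seed, for illustration only
Y  <- rnorm(15)
Xb <- rnorm(15)
df <- data.frame(Xa = factor(rep(1:3, times = 5)), Xb, Y)

# Slope of Y on Xb within each group, from three separate regressions
slopes <- sapply(split(df, df$Xa),
                 function(d) coef(lm(Y ~ Xb, data = d))[["Xb"]])

contrasts(df$Xa) <- contr.treatment(3, base = 3)
b_dummy <- coef(lm(Y ~ Xa * Xb, data = df))

contrasts(df$Xa) <- contr.sum(3)
b_sum <- coef(lm(Y ~ Xa * Xb, data = df))

# Dummy coding: Xb is the slope in Group 3,
# and each interaction is a slope difference from Group 3
all.equal(b_dummy[["Xb"]], slopes[["3"]])
all.equal(b_dummy[["Xa1:Xb"]], slopes[["1"]] - slopes[["3"]])

# Contrast coding: Xb is the average of the three group slopes,
# and each interaction is a deviation from that average
all.equal(b_sum[["Xb"]], mean(slopes))
all.equal(b_sum[["Xa1:Xb"]], slopes[["1"]] - mean(slopes))
```

Because the two codings compare each group's slope against different baselines, their interaction coefficients answer different questions, so their p-values need not agree.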

Take-home message:

When the categorical variable has 3 levels, different codings (e.g., dummy coding vs. contrast coding) can lead to different p-values for the interaction terms.
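Although the p-values of the individual interaction terms depend on the coding, both codings describe the same fitted model (note the identical R-squared and F-statistic in the outputs above). One coding-independent way to ask whether the interaction matters as a whole is to compare the models with and without it. The sketch below is one possible check and was not part of the analysis above; it assumes simulated data like that used in this tutorial, and the seed 123 and the helper interaction_p() are hypothetical names introduced for illustration.

```r
set.seed(123)  # assumed seed, for illustration only
Y  <- rnorm(15)
Xb <- rnorm(15)
df <- data.frame(Xa = factor(rep(1:3, times = 5)), Xb, Y)

# Hypothetical helper: p-value of the joint F test for both interaction
# terms, under a given coding of Xa
interaction_p <- function(contrast_matrix) {
  d <- df
  contrasts(d$Xa) <- contrast_matrix
  reduced <- lm(Y ~ Xa + Xb, data = d)   # no interaction
  full    <- lm(Y ~ Xa * Xb, data = d)   # with interaction
  anova(reduced, full)[["Pr(>F)"]][2]
}

p_dummy <- interaction_p(contr.treatment(3, base = 3))
p_sum   <- interaction_p(contr.sum(3))

# The joint test should not depend on the coding
all.equal(p_dummy, p_sum)
```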


