Dummy and Contrast Codings in R

“Dummy” or “treatment” coding creates dichotomous (0/1) variables in which each level of a categorical variable is contrasted with a specified reference level.

Basic Syntax of Dummy and Contrast Coding

1. Dummy Coding

The syntax for dummy coding in R is shown below, followed by an example for a variable with 3 levels.

contr.treatment( number_of_level_of_X )

contr.treatment(3)
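For reference, here is a minimal sketch of what this call returns and how the reference level could be changed. The `base` argument is part of `contr.treatment()`; choosing level 2 as the reference is only an illustrative assumption, and the names `m` and `m2` are placeholders.

```r
# Default treatment (dummy) coding for a 3-level factor:
# level 1 is the reference, so its row is all zeros.
m <- contr.treatment(3)
print(m)
#   2 3
# 1 0 0
# 2 1 0
# 3 0 1

# Illustrative assumption: use level 2 as the reference instead.
m2 <- contr.treatment(3, base = 2)
print(m2)
```

Each column corresponds to one non-reference level, so a factor with k levels yields k - 1 coded variables.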

2. Contrast Coding

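Contrast coding here refers to sum-to-zero (deviation) coding, which R provides as contr.sum(). The syntax is shown below, followed by an example for a variable with 3 levels and its output.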

contr.sum( number_of_level_of_X )

contr.sum(3)

  [,1] [,2]
1    1    0
2    0    1
3   -1   -1
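One way to read this matrix is that every column sums to zero, which is where the name "sum coding" comes from. The short check below is a sketch; the variable names `s` and `col_totals` are placeholders.

```r
# Sum (deviation) coding for a 3-level factor.
s <- contr.sum(3)

# Each column sums to zero, which is what makes the intercept
# the unweighted mean of the level means.
col_totals <- colSums(s)
print(col_totals)

# The last level never gets its own column; it is coded -1 in every column.
print(s[3, ])
```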

Difference between Dummy and Contrast Codings

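In dummy coding, the intercept is the mean of the reference level of the categorical variable, and each remaining coefficient is the difference between another level's mean and the reference mean.

In contrast coding, the intercept is the unweighted mean of the level means of the categorical variable. Because every level's mean gets equal weight, the intercept does not take group sizes into account; it equals the overall mean of Y only when the groups are the same size.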
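The point about group sizes can be illustrated with a small unbalanced example. The data frame `toy` below is hypothetical and made up for this sketch only; it is not part of the simulation used in the rest of the article.

```r
# Hypothetical unbalanced data: group "a" has 4 observations, group "b" has 2.
toy <- data.frame(
  g = factor(c("a", "a", "a", "a", "b", "b")),
  y = c(0, 1, 1, 2, 4, 6)
)

group_means   <- tapply(toy$y, toy$g, mean)  # a = 1, b = 5
mean_of_means <- mean(group_means)           # (1 + 5) / 2 = 3
overall_mean  <- mean(toy$y)                 # 14 / 6, about 2.33

# Under sum coding, the intercept matches the mean of means,
# not the overall (size-weighted) mean.
contrasts(toy$g) <- contr.sum(2)
fit <- lm(y ~ g, data = toy)
intercept <- unname(coef(fit)[1])

print(c(mean_of_means = mean_of_means,
        overall_mean  = overall_mean,
        intercept     = intercept))
```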

To make the difference concrete, we can simulate data with a continuous Y and a categorical X with 3 levels. The following R code does this.

# set seed
set.seed(123)

# X: repeat the sequence 1, 2, 3 five times and convert it to a factor
X <- rep(c(1, 2, 3), times = 5)
X <- as.factor(X)

# Y: 15 draws from a standard normal distribution
Y <- rnorm(15)

# combine it into a data frame
df<-data.frame(X,Y)
print(df)

   X           Y
1  1 -0.56047565
2  2 -0.23017749
3  3  1.55870831
4  1  0.07050839
5  2  0.12928774
6  3  1.71506499
7  1  0.46091621
8  2 -1.26506123
9  3 -0.68685285
10 1 -0.44566197
11 2  1.22408180
12 3  0.35981383
13 1  0.40077145
14 2  0.11068272
15 3 -0.55584113

For these data, the following table shows how dummy coding and contrast coding relate to the 3 group means. Note that, by default, R uses the first level of the factor as the reference level, and factor levels are sorted (alphabetically for character data).

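X	Mean	Coefficient	Dummy Coding	Contrast Coding
Group 1	-0.0148	Intercept	-0.0148	(-0.0148+(-0.0062)+0.4782)/3 = 0.1524
Group 2	-0.0062	Coded Variable 1	-0.0062-(-0.0148) = 0.0086	-0.0148-0.1524 = -0.1672
Group 3	0.4782	Coded Variable 2	0.4782-(-0.0148) = 0.4930	-0.0062-0.1524 = -0.1586

Under contrast coding with contr.sum(3), the two coded variables are the deviations of Group 1 and Group 2 from the mean of means. The deviation of Group 3, 0.4782-0.1524 = 0.3258, is not estimated directly; it equals the negative of the sum of the other two deviations.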
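The quantities in the table can be recomputed directly from the simulated data. The sketch below recreates the data frame so that it runs on its own; `group_means`, `grand_mean`, `dummy_values`, and `sum_values` are names chosen for this illustration.

```r
# Recreate the simulated data from above.
set.seed(123)
X <- as.factor(rep(c(1, 2, 3), times = 5))
Y <- rnorm(15)
df <- data.frame(X, Y)

# Mean of Y within each level of X.
group_means <- tapply(df$Y, df$X, mean)
grand_mean  <- mean(group_means)  # unweighted mean of the three means

# Dummy coding: reference mean, then differences from the reference.
dummy_values <- c(group_means[1], group_means[-1] - group_means[1])

# Sum coding: mean of means, then deviations of levels 1 and 2 from it.
sum_values <- c(grand_mean, group_means[1:2] - grand_mean)

print(round(group_means, 4))
print(round(unname(dummy_values), 4))
print(round(unname(sum_values), 4))
```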

Dummy Coding in Linear Regression

The following linear regression uses dummy coding. The intercept is -0.014788, which is the mean of Group 1, and the coefficients X2 and X3 are the differences between the means of Groups 2 and 3 and the mean of Group 1.

# dummy coding
contrasts(df$X) <- contr.treatment(3)

# linear regression with dummy coding
result<-lm(Y~X,data=df)

# summarize the result
summary(result)

Call:
lm(formula = Y ~ X, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2588 -0.4883  0.0853  0.4456  1.2369 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.014788   0.391751  -0.038    0.971
X2           0.008551   0.554020   0.015    0.988
X3           0.492967   0.554020   0.890    0.391

Residual standard error: 0.876 on 12 degrees of freedom
Multiple R-squared:  0.07959,	Adjusted R-squared:  -0.07381 
F-statistic: 0.5188 on 2 and 12 DF,  p-value: 0.608
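As a check on this interpretation, the coefficients can be compared against the group means, and the reference level can be changed with `relevel()`. This is a sketch under the assumption that the data are simulated exactly as above; the column `X_ref3` and the choice of level 3 as reference are illustrative only.

```r
# Recreate the simulated data.
set.seed(123)
df <- data.frame(X = as.factor(rep(c(1, 2, 3), times = 5)), Y = rnorm(15))

# Illustrative assumption: a copy of X with level 3 as the reference.
df$X_ref3 <- relevel(df$X, ref = "3")

# Dummy coding with level 1 as the reference.
contrasts(df$X) <- contr.treatment(3)
fit_dummy <- lm(Y ~ X, data = df)

# Intercept = mean of level 1; X2 and X3 = differences from level 1.
group_means <- tapply(df$Y, df$X, mean)
expected <- as.numeric(c(group_means[1], group_means[2:3] - group_means[1]))
print(all.equal(unname(coef(fit_dummy)), expected))

# With level 3 as the reference, the intercept becomes the mean of level 3.
fit_ref3 <- lm(Y ~ X_ref3, data = df)
print(coef(fit_ref3))
```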

Contrast Coding in Linear Regression

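The following linear regression uses contrast coding. The intercept is 0.1524, which is the unweighted mean of the 3 group means, and the coefficients X1 and X2 are the deviations of the Group 1 and Group 2 means from that value. The residuals, R-squared, and F-statistic match those of the dummy-coded model, because only the parameterization differs.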

# contrast coding
contrasts(df$X) <- contr.sum(3)

# linear regression with contrast coding
result<-lm(Y~X,data=df)

# summarize the result
summary(result)

Call:
lm(formula = Y ~ X, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2588 -0.4883  0.0853  0.4456  1.2369 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.1524     0.2262   0.674    0.513
X1           -0.1672     0.3199  -0.523    0.611
X2           -0.1586     0.3199  -0.496    0.629

Residual standard error: 0.876 on 12 degrees of freedom
Multiple R-squared:  0.07959,	Adjusted R-squared:  -0.07381 
F-statistic: 0.5188 on 2 and 12 DF,  p-value: 0.608
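The output reports deviations only for levels 1 and 2. A minimal sketch of how the deviation for level 3 could be recovered, and of checking that both codings describe the same fitted model, is given below. It passes the coding through the `contrasts` argument of `lm()` rather than modifying the factor; the object names are placeholders.

```r
# Recreate the simulated data.
set.seed(123)
df <- data.frame(X = as.factor(rep(c(1, 2, 3), times = 5)), Y = rnorm(15))

# Fit the same model under both codings.
fit_dummy <- lm(Y ~ X, data = df, contrasts = list(X = "contr.treatment"))
fit_sum   <- lm(Y ~ X, data = df, contrasts = list(X = "contr.sum"))

# The deviations sum to zero, so level 3's deviation is minus the other two.
b <- coef(fit_sum)
dev_level3 <- -(b[["X1"]] + b[["X2"]])
print(round(dev_level3, 4))

# Both codings give identical fitted values; only the coefficients differ.
same_fit <- isTRUE(all.equal(fitted(fit_dummy), fitted(fit_sum)))
print(same_fit)
```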

Reference

Contrast coding in R (Marissa Barlaz)

Setting and Keeping Contrasts (Samuel E. Buttrey)