“Dummy” (or “treatment”) coding creates dichotomous variables in which each level of the categorical variable is contrasted with a specified reference level.
Basic Syntax of Dummy and Contrast Coding
1. Dummy Coding
The following is the syntax to do dummy coding in R.
contr.treatment( number_of_level_of_X )
contr.treatment(3)
  2 3
1 0 0
2 1 0
3 0 1
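The reference level need not be the first one. As a minimal sketch, the `base` argument of `contr.treatment()` selects which level becomes the all-zero reference row. The choice of level 2 here is purely illustrative, and the names `m` and `reference_row` are introduced only for this example.

```r
# Dummy coding for a 3-level factor with level 2 as the reference
# (base = 2 is an illustrative choice, not something the data requires)
m <- contr.treatment(3, base = 2)
print(m)

# The reference level is the row whose entries are all zero
reference_row <- which(rowSums(m) == 0)
print(unname(reference_row))
```

For an existing factor, `relevel()` offers another way to change the reference level before fitting a model.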
2. Contrast Coding
The following is the syntax to do contrast coding (also called sum or deviation coding) in R.
contr.sum( number_of_level_of_X )
contr.sum(3)
  [,1] [,2]
1    1    0
2    0    1
3   -1   -1
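One property worth checking is that each column of the sum-coding matrix adds up to zero, which is why the last level is coded -1 in every column. The sketch below verifies this and uses `model.matrix()` to show how a factor is expanded into coded columns. The factor `f` is a made-up example and is not part of the data used later.

```r
# Each column of the sum-coding matrix sums to zero
cs <- contr.sum(3)
print(colSums(cs))

# Hypothetical 3-level factor with one observation per level
f <- factor(c("a", "b", "c"))

# Expand f into an intercept column plus two sum-coded columns
mm <- model.matrix(~ f, contrasts.arg = list(f = "contr.sum"))
print(mm)
```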
Difference between Dummy and Contrast Coding
In dummy coding, the intercept is the mean of the reference level in the categorical variable.
In contrast coding, the intercept is the unweighted mean of the level means, that is, the mean of all means from the different levels of the categorical variable. Because every level's mean counts equally, the intercept does not take different group sizes into account.
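To make the point about group sizes concrete, the sketch below uses a small unbalanced data set. The values and the names `g`, `y`, and `d` are invented for illustration. With dummy coding the intercept equals the mean of the reference group, while with contrast coding it equals the unweighted mean of the group means, which differs from the overall mean of `y` when group sizes differ.

```r
# Hypothetical unbalanced data: group "a" has 4 observations, "b" and "c" have 1 each
g <- factor(c("a", "a", "a", "a", "b", "c"))
y <- c(1, 2, 3, 4, 10, 20)
d <- data.frame(g, y)

# Mean of y within each group
group_means <- tapply(d$y, d$g, mean)

# Dummy (treatment) coding: intercept is the mean of the reference group "a"
fit_dummy <- lm(y ~ g, data = d, contrasts = list(g = "contr.treatment"))

# Contrast (sum) coding: intercept is the unweighted mean of the group means
fit_sum <- lm(y ~ g, data = d, contrasts = list(g = "contr.sum"))

print(coef(fit_dummy)["(Intercept)"])  # mean of group "a"
print(coef(fit_sum)["(Intercept)"])    # unweighted mean of the group means
print(mean(group_means))               # same quantity, computed by hand
print(mean(d$y))                       # overall mean, weighted by group size
```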
To see the difference concretely, we can simulate data with Y (continuous data) and X (categorical data with 3 levels). The following is the R code to simulate it.
# set seed
set.seed(123)
# Repeat a sequence of numbers:
X<-rep(c(1, 2, 3), times=5)
X<-as.factor(X)
Y<-rnorm(15)
# combine it into a data frame
df<-data.frame(X,Y)
print(df)
   X           Y
1  1 -0.56047565
2  2 -0.23017749
3  3  1.55870831
4  1  0.07050839
5  2  0.12928774
6  3  1.71506499
7  1  0.46091621
8  2 -1.26506123
9  3 -0.68685285
10 1 -0.44566197
11 2  1.22408180
12 3  0.35981383
13 1  0.40077145
14 2  0.11068272
15 3 -0.55584113
For these data, the following shows how dummy coding and contrast coding work with the 3 group means. Note that, by default, R uses the first level of the factor as the reference level; the levels are sorted (alphabetically for text, numerically for numbers) when the factor is created.
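X | Mean
Group 1 | -0.0148
Group 2 | -0.0062
Group 3 | 0.4782

Coefficient | Dummy Coding | Contrast Coding
Intercept | -0.0148 | (-0.0148 + (-0.0062) + 0.4782)/3 = 0.1524
Coded Variable 1 | -0.0062 - (-0.0148) = 0.0086 | -0.0148 - 0.1524 = -0.1672
Coded Variable 2 | 0.4782 - (-0.0148) = 0.4930 | -0.0062 - 0.1524 = -0.1586

In contrast coding with contr.sum, Coded Variable 1 and Coded Variable 2 are the deviations of Group 1 and Group 2 from the mean of means. The deviation of Group 3, 0.4782 - 0.1524 = 0.3258, is not estimated directly; it equals the negative sum of the other two deviations.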
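The group means and the coefficients implied by each coding scheme can be computed directly, without fitting a model. The sketch below regenerates the simulated data with the same seed and does the arithmetic by hand. The names `means`, `grand`, `dummy_coefs`, and `sum_coefs` are introduced here for illustration.

```r
# Regenerate the simulated data with the same seed
set.seed(123)
X <- as.factor(rep(c(1, 2, 3), times = 5))
Y <- rnorm(15)

# Mean of Y within each level of X
means <- tapply(Y, X, mean)

# Dummy coding: reference mean, then differences from the reference
dummy_coefs <- c(means[1], means[2] - means[1], means[3] - means[1])

# Contrast (sum) coding: mean of the group means,
# then deviations of Groups 1 and 2 from that mean
grand <- mean(means)
sum_coefs <- c(grand, means[1] - grand, means[2] - grand)

print(round(unname(dummy_coefs), 4))
print(round(unname(sum_coefs), 4))
```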
Dummy Coding in Linear Regression
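The following linear regression uses dummy coding. As we can see, the intercept is -0.014788, which is the mean of Group 1. The coefficients X2 and X3 are the differences between the means of Groups 2 and 3 and the mean of Group 1. This result is consistent with the discussion above.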
# dummy coding
contrasts(df$X) <- contr.treatment(3)
# linear regression with dummy coding
result<-lm(Y~X,data=df)
# summarize the result
summary(result)
Call:
lm(formula = Y ~ X, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-1.2588 -0.4883  0.0853  0.4456  1.2369

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.014788   0.391751  -0.038    0.971
X2           0.008551   0.554020   0.015    0.988
X3           0.492967   0.554020   0.890    0.391

Residual standard error: 0.876 on 12 degrees of freedom
Multiple R-squared:  0.07959,	Adjusted R-squared:  -0.07381
F-statistic: 0.5188 on 2 and 12 DF,  p-value: 0.608
Contrast Coding in Linear Regression
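The following linear regression uses contrast coding. As we can see, the intercept is 0.1524, which is the mean of all means from the 3 different levels. The coefficients X1 and X2 are the deviations of the Group 1 and Group 2 means from that mean of means. This result is consistent with the discussion above.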
# contrast coding
contrasts(df$X) <- contr.sum(3)
# linear regression with contrast coding
result<-lm(Y~X,data=df)
# summarize the result
summary(result)
Call:
lm(formula = Y ~ X, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-1.2588 -0.4883  0.0853  0.4456  1.2369

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.1524     0.2262   0.674    0.513
X1           -0.1672     0.3199  -0.523    0.611
X2           -0.1586     0.3199  -0.496    0.629

Residual standard error: 0.876 on 12 degrees of freedom
Multiple R-squared:  0.07959,	Adjusted R-squared:  -0.07381
F-statistic: 0.5188 on 2 and 12 DF,  p-value: 0.608
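Although the coefficients differ, the two coding schemes describe the same model fit, which is why the residuals, R-squared, and F-statistic are identical in both summaries. As a rough check, the sketch below refits both models on the simulated data (regenerated with the same seed) and compares their fitted values. It also recovers the deviation of Group 3 under contrast coding, which is not reported directly but equals the negative sum of the other two deviations. The object names `fit_dummy`, `fit_sum`, and `group3_dev` are introduced here for illustration.

```r
# Regenerate the simulated data with the same seed
set.seed(123)
df <- data.frame(X = as.factor(rep(c(1, 2, 3), times = 5)), Y = rnorm(15))

# Fit the same model under both coding schemes
fit_dummy <- lm(Y ~ X, data = df, contrasts = list(X = "contr.treatment"))
fit_sum <- lm(Y ~ X, data = df, contrasts = list(X = "contr.sum"))

# The fitted values agree, so only the parameterization differs
print(all.equal(fitted(fit_dummy), fitted(fit_sum)))

# Deviation of Group 3 from the mean of means under contrast coding
b <- coef(fit_sum)
group3_dev <- -(b["X1"] + b["X2"])
print(unname(group3_dev))
```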