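In R, you can change the reference level used in dummy coding with contr.treatment(). Its first argument is the number of levels in the factor, and base is the index of the level to use as the reference.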
contr.treatment(total_levels, base = Number_reference_level)
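As a quick illustration of the general form above, the sketch below builds the coding matrix for a hypothetical factor with 4 levels and level 2 as the reference. The names total_levels and reference_level, and the values 4 and 2, are placeholders chosen for this example only; they are not part of the data used later.

```r
# Hypothetical values, for illustration only
total_levels <- 4      # number of levels in the factor
reference_level <- 2   # index of the level to use as the reference

# Build the dummy (treatment) coding matrix
coding <- contr.treatment(total_levels, base = reference_level)
print(coding)

# The reference level is the row whose entries are all zero,
# and its column is dropped from the matrix
reference_row <- which(rowSums(coding) == 0)
print(reference_row)
```

The reference level is the one row with all zeros, so every other level is compared against it.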
Step 1: Prepare Data
The following R code generates a sample data set with a three-level factor X and a numeric outcome Y.
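# set the seed so the random numbers are reproducible
set.seed(123)
# repeat the sequence 1, 2, 3 five times and convert it to a factor
X <- rep(c(1, 2, 3), times = 5)
X <- as.factor(X)
# draw 15 values from a standard normal distribution
Y <- rnorm(15)
# combine both variables into a data frame
df <- data.frame(X, Y)
print(df)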
   X           Y
1  1 -0.56047565
2  2 -0.23017749
3  3  1.55870831
4  1  0.07050839
5  2  0.12928774
6  3  1.71506499
7  1  0.46091621
8  2 -1.26506123
9  3 -0.68685285
10 1 -0.44566197
11 2  1.22408180
12 3  0.35981383
13 1  0.40077145
14 2  0.11068272
15 3 -0.55584113
Step 2: Check Default Reference Level
For the data above, we can calculate the mean of Y for each of the three levels. The right-most column in the table below shows how the default dummy coding works: the intercept is the mean of the reference group, and each coded variable is the difference between another group's mean and the reference group's mean.
X | Mean | Coefficient | Default dummy coding (reference: Group 1, i.e., the intercept)
---|---|---|---
Group 1 | -0.0148 | Intercept | -0.0148
Group 2 | -0.0062 | Coded Variable 1 | -0.0062 - (-0.0148) = 0.0086
Group 3 | 0.4782 | Coded Variable 2 | 0.4782 - (-0.0148) = 0.4930
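The group means and differences in the table can be reproduced by hand. The sketch below rebuilds the sample data and uses tapply() to compute the means; the object names group_means and default_coefs are introduced here only for illustration.

```r
# Rebuild the sample data from Step 1
set.seed(123)
X <- as.factor(rep(c(1, 2, 3), times = 5))
Y <- rnorm(15)
df <- data.frame(X, Y)

# Mean of Y within each level of X
group_means <- tapply(df$Y, df$X, mean)
print(round(group_means, 4))

# Default dummy coding: intercept = mean of group 1,
# other coefficients = difference from group 1
default_coefs <- c(
  Intercept = group_means[["1"]],
  Coded1    = group_means[["2"]] - group_means[["1"]],
  Coded2    = group_means[["3"]] - group_means[["1"]]
)
print(round(default_coefs, 4))
```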
We can use the R function contr.treatment() to apply dummy coding in a linear regression as follows.
# dummy coding with the default reference level (level 1)
contrasts(df$X) <- contr.treatment(3)
# linear regression with dummy coding
result <- lm(Y ~ X, data = df)
# summarize the result
summary(result)
Call:
lm(formula = Y ~ X, data = df)
Residuals:
    Min      1Q  Median      3Q     Max
-1.2588 -0.4883  0.0853  0.4456  1.2369
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.014788   0.391751  -0.038    0.971
X2           0.008551   0.554020   0.015    0.988
X3           0.492967   0.554020   0.890    0.391
Residual standard error: 0.876 on 12 degrees of freedom
Multiple R-squared:  0.07959,  Adjusted R-squared:  -0.07381
F-statistic: 0.5188 on 2 and 12 DF,  p-value: 0.608
As we can see from the output above, the default reference level is level 1 (or, group 1) given that the intercept is the mean of group 1.
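The claim that the intercept equals the mean of group 1 can also be checked programmatically rather than by reading the printed output. This is a minimal sketch that rebuilds the data and model; the names intercept and mean_group1 are introduced here for illustration.

```r
# Rebuild the sample data and fit the model with default dummy coding
set.seed(123)
df <- data.frame(X = as.factor(rep(c(1, 2, 3), times = 5)), Y = rnorm(15))
contrasts(df$X) <- contr.treatment(3)
result <- lm(Y ~ X, data = df)

# Intercept estimated by the regression
intercept <- unname(coef(result)["(Intercept)"])

# Mean of Y in group 1, computed directly from the data
mean_group1 <- mean(df$Y[df$X == "1"])

# all.equal() compares the two values up to numerical tolerance
print(all.equal(intercept, mean_group1))
```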
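Step 3: Change Default Reference Level
We can change the reference level to level 3 (or, group 3) by passing base = 3 to contr.treatment(). The following code first prints both coding matrices to show how the coding changes; in each matrix, the reference level is the row that contains only zeros.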
# dummy coding by default, using group 1 as the reference
contr.treatment(3)
# changing the reference group by adding base = <level number>
contr.treatment(3, base = 3)
> contr.treatment(3)
  2 3
1 0 0
2 1 0
3 0 1
> contr.treatment(3, base = 3)
  1 2
1 1 0
2 0 1
3 0 0
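To see how the base = 3 coding turns into regression columns, the sketch below prints one design-matrix row per level using model.matrix(). The data frame here contains only the factor X, and the name design is introduced for illustration.

```r
# Factor with the same three levels as the sample data
df <- data.frame(X = as.factor(rep(c(1, 2, 3), times = 5)))

# Use group 3 as the reference level
contrasts(df$X) <- contr.treatment(3, base = 3)

# One design-matrix row per level: intercept plus the two coded variables
design <- unique(model.matrix(~ X, data = df))
print(design)
```

Rows for group 3 have zeros in both coded columns, so only the intercept describes that group.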
X | Mean | Coefficient | Default dummy coding (reference: Group 1) | Changed dummy coding (reference: Group 3)
---|---|---|---|---
Group 1 | -0.0148 | Intercept | -0.0148 (Group 1 mean) | 0.4782 (Group 3 mean)
Group 2 | -0.0062 | Coded Variable 1 | -0.0062 - (-0.0148) = 0.0086 (Group 2 vs. Group 1) | -0.0148 - 0.4782 = -0.4930 (Group 1 vs. Group 3)
Group 3 | 0.4782 | Coded Variable 2 | 0.4782 - (-0.0148) = 0.4930 (Group 3 vs. Group 1) | -0.0062 - 0.4782 = -0.4844 (Group 2 vs. Group 3)
The following R code applies the change and refits the model.
# dummy coding, with the reference level changed to group 3 (level 3)
contrasts(df$X) <- contr.treatment(3, base = 3)
# linear regression with dummy coding
result <- lm(Y ~ X, data = df)
# summarize the result
summary(result)
Call:
lm(formula = Y ~ X, data = df)
Residuals:
    Min      1Q  Median      3Q     Max
-1.2588 -0.4883  0.0853  0.4456  1.2369
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.4782     0.3918   1.221    0.246
X1           -0.4930     0.5540  -0.890    0.391
X2           -0.4844     0.5540  -0.874    0.399
Residual standard error: 0.876 on 12 degrees of freedom
Multiple R-squared:  0.07959,  Adjusted R-squared:  -0.07381
F-statistic: 0.5188 on 2 and 12 DF,  p-value: 0.608
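As a cross-check, a different technique, relevel(), should give the same estimates. It is not the method used above: instead of changing the contrast matrix, it reorders the factor so that level 3 comes first. The sketch assumes R's default contrasts option for unordered factors (contr.treatment) is in effect, and the object names are introduced here for illustration.

```r
# Rebuild the sample data
set.seed(123)
df <- data.frame(X = as.factor(rep(c(1, 2, 3), times = 5)), Y = rnorm(15))

# Approach from the text: change the contrast matrix
df_contr <- df
contrasts(df_contr$X) <- contr.treatment(3, base = 3)
fit_contr <- lm(Y ~ X, data = df_contr)

# Alternative: make level "3" the first level of the factor
# (assumes the default treatment contrasts are in effect)
df_relevel <- df
df_relevel$X <- relevel(df_relevel$X, ref = "3")
fit_relevel <- lm(Y ~ X, data = df_relevel)

# Compare the coefficient estimates from both fits
print(coef(fit_contr))
print(coef(fit_relevel))
same_estimates <- isTRUE(all.equal(unname(coef(fit_contr)),
                                   unname(coef(fit_relevel))))
print(same_estimates)
```

Changing the contrasts leaves the factor's level order untouched, whereas relevel() changes the order itself, which also affects tables and plots built from the factor.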