# Interaction in Linear Regression

This tutorial focuses on the interaction between a categorical variable and a continuous variable in linear regression. Note that, in this tutorial, we limit the categorical variable to 2 levels. (For a categorical variable with 3 levels, please refer to my other tutorial on interaction and coding in linear regression.)

## Coding Note

In this tutorial, the dummy coding uses group 2 as the reference group (via `base = 2`). Thus, the dummy-coded comparison is between group 1 and group 2. For the detailed reasoning, please refer to my other tutorial on Dummy and Contrast Codings in Linear Regression.
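To see what the two coding schemes used below actually do, you can print their contrast matrices directly. These are the same matrices assigned via `contrasts()` throughout this tutorial:

```r
# Dummy coding with level 2 as the reference group:
# the single column compares level 1 against level 2.
contr.treatment(2, base = 2)
#   1
# 1 1
# 2 0

# Sum (contrast) coding: level 1 is +1 and level 2 is -1,
# so the intercept becomes the unweighted mean of the two group means.
contr.sum(2)
#   [,1]
# 1    1
# 2   -1
```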

## Simulated Data

```r
# set seed
set.seed(123)

# repeat a sequence of numbers
Xa <- rep(c(1, 2), times = 5)
Xa <- as.factor(Xa)
Y <- rnorm(10)
Xb <- rnorm(10)

# combine into a data frame
df <- data.frame(Xa, Xb, Y)
print(df)
```

```
   Xa         Xb           Y
1   1  1.2240818 -0.56047565
2   2  0.3598138 -0.23017749
3   1  0.4007715  1.55870831
4   2  0.1106827  0.07050839
5   1 -0.5558411  0.12928774
6   2  1.7869131  1.71506499
7   1  0.4978505  0.46091621
8   2 -1.9666172 -1.26506123
9   1  0.7013559 -0.68685285
10  2 -0.4727914 -0.44566197
```

We can also calculate the means for each level of the categorical variable in R.

```r
# calculate means by group
aggregate(df$Y, list(df$Xa), FUN = mean)
```

```
  Group.1           x
1       1  0.18031675
2       2 -0.03106546
```

## Situation 1: Categorical IV only

In Situation 1, we focus on a regression model with a single IV, namely the categorical variable.

### Situation 1a: Dummy Coding

```r
# dummy coding
contrasts(df$Xa) <- contr.treatment(2, base = 2)

# linear regression with dummy coding
result <- lm(Y ~ Xa, data = df)

# summarize the result
summary(result)
```

```
Call:
lm(formula = Y ~ Xa, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2340 -0.6592 -0.1251  0.2358  1.7461 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.03107    0.44932  -0.069    0.947
Xa1          0.21138    0.63544   0.333    0.748

Residual standard error: 1.005 on 8 degrees of freedom
Multiple R-squared:  0.01364,	Adjusted R-squared:  -0.1097
F-statistic: 0.1107 on 1 and 8 DF,  p-value: 0.7479
```

### Situation 1b: Contrast Coding

```r
# contrast coding
contrasts(df$Xa) <- contr.sum(2)

# linear regression with contrast coding
result <- lm(Y ~ Xa, data = df)

# summarize the result
summary(result)
```

```
Call:
lm(formula = Y ~ Xa, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2340 -0.6592 -0.1251  0.2358  1.7461 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.07463    0.31772   0.235    0.820
Xa1          0.10569    0.31772   0.333    0.748

Residual standard error: 1.005 on 8 degrees of freedom
Multiple R-squared:  0.01364,	Adjusted R-squared:  -0.1097
F-statistic: 0.1107 on 1 and 8 DF,  p-value: 0.7479
```

## Situation 2: Categorical IV + Continuous IV

### Situation 2a: Dummy Coding

```r
# dummy coding
contrasts(df$Xa) <- contr.treatment(2, base = 2)

# linear regression with dummy coding
result <- lm(Y ~ Xa + Xb, data = df)

# summarize the result
summary(result)
```

```
Call:
lm(formula = Y ~ Xa + Xb, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.15473 -0.35823 -0.07879  0.43272  1.40680 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.01151    0.39473  -0.029    0.978
Xa1         -0.05190    0.57614  -0.090    0.931
Xb           0.53727    0.29252   1.837    0.109

Residual standard error: 0.8823 on 7 degrees of freedom
Multiple R-squared:  0.3344,	Adjusted R-squared:  0.1442
F-statistic: 1.758 on 2 and 7 DF,  p-value: 0.2406
```

### Situation 2b: Contrast Coding

```r
# contrast coding
contrasts(df$Xa) <- contr.sum(2)

# linear regression with contrast coding
result <- lm(Y ~ Xa + Xb, data = df)

# summarize the result
summary(result)
```

```
Call:
lm(formula = Y ~ Xa + Xb, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.15473 -0.35823 -0.07879  0.43272  1.40680 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.03746    0.28561  -0.131    0.899
Xa1         -0.02595    0.28807  -0.090    0.931
Xb           0.53727    0.29252   1.837    0.109

Residual standard error: 0.8823 on 7 degrees of freedom
Multiple R-squared:  0.3344,	Adjusted R-squared:  0.1442
F-statistic: 1.758 on 2 and 7 DF,  p-value: 0.2406
```

### Situation 2c: Dummy Coding + Centering

Here, we continue to use dummy coding for the categorical variable. In addition, we center the continuous variable.

```r
# dummy coding
contrasts(df$Xa) <- contr.treatment(2, base = 2)

# centering the continuous variable
center_scale <- function(x) scale(x, scale = FALSE)
df$Xb <- center_scale(df$Xb)

# linear regression with dummy coding
result <- lm(Y ~ Xa + Xb, data = df)

# summarize the result
summary(result)
```

```
Call:
lm(formula = Y ~ Xa + Xb, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.15473 -0.35823 -0.07879  0.43272  1.40680 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.1006     0.4010   0.251    0.809
Xa1          -0.0519     0.5761  -0.090    0.931
Xb            0.5373     0.2925   1.837    0.109

Residual standard error: 0.8823 on 7 degrees of freedom
Multiple R-squared:  0.3344,	Adjusted R-squared:  0.1442
F-statistic: 1.758 on 2 and 7 DF,  p-value: 0.2406
```

### Situation 2d: Contrast Coding + Centering

```r
# contrast coding
contrasts(df$Xa) <- contr.sum(2)

# centering the continuous variable
center_scale <- function(x) scale(x, scale = FALSE)
df$Xb <- center_scale(df$Xb)

# linear regression with contrast coding
result <- lm(Y ~ Xa + Xb, data = df)

# summarize the result
summary(result)
```

```
Call:
lm(formula = Y ~ Xa + Xb, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.15473 -0.35823 -0.07879  0.43272  1.40680 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.07463    0.27901   0.267    0.797
Xa1         -0.02595    0.28807  -0.090    0.931
Xb           0.53727    0.29252   1.837    0.109

Residual standard error: 0.8823 on 7 degrees of freedom
Multiple R-squared:  0.3344,	Adjusted R-squared:  0.1442
F-statistic: 1.758 on 2 and 7 DF,  p-value: 0.2406
```

## Situation 3: Interaction

### Situation 3a: Dummy Coding

```r
# dummy coding
contrasts(df$Xa) <- contr.treatment(2, base = 2)

# linear regression with dummy coding
result <- lm(Y ~ Xa + Xb + Xa * Xb, data = df)

# summarize the result
summary(result)
```

```
Call:
lm(formula = Y ~ Xa + Xb + Xa * Xb, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.74993 -0.47098 -0.04572  0.28724  1.35337 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept) -0.003186   0.334174  -0.010   0.9927  
Xa1          0.398200   0.540029   0.737   0.4887  
Xb           0.765925   0.274210   2.793   0.0314 *
Xa1:Xb      -1.239197   0.638357  -1.941   0.1003  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7469 on 6 degrees of freedom
Multiple R-squared:  0.5912,	Adjusted R-squared:  0.3868
F-statistic: 2.892 on 3 and 6 DF,  p-value: 0.1243
```

### Situation 3b: Contrast Coding

```r
# contrast coding
contrasts(df$Xa) <- contr.sum(2)

# linear regression with contrast coding
result <- lm(Y ~ Xa + Xb + Xa * Xb, data = df)

# summarize the result
summary(result)
```

```
Call:
lm(formula = Y ~ Xa + Xb + Xa * Xb, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.74993 -0.47098 -0.04572  0.28724  1.35337 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.1959     0.2700   0.726    0.495
Xa1           0.1991     0.2700   0.737    0.489
Xb            0.1463     0.3192   0.458    0.663
Xa1:Xb       -0.6196     0.3192  -1.941    0.100

Residual standard error: 0.7469 on 6 degrees of freedom
Multiple R-squared:  0.5912,	Adjusted R-squared:  0.3868
F-statistic: 2.892 on 3 and 6 DF,  p-value: 0.1243
```

### Situation 3c: Dummy Coding + Centering

```r
# dummy coding (base = 2, consistent with the other dummy-coded models;
# the coefficient name Xa1 in the output below confirms level 2 is the reference)
contrasts(df$Xa) <- contr.treatment(2, base = 2)

# centering the continuous IV
center_scale <- function(x) scale(x, scale = FALSE)
df$Xb <- center_scale(df$Xb)

# linear regression with dummy coding
result <- lm(Y ~ Xa + Xb + Xa * Xb, data = df)

# summarize the result
summary(result)
```

```
Call:
lm(formula = Y ~ Xa + Xb + Xa * Xb, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.74993 -0.47098 -0.04572  0.28724  1.35337 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   0.1566     0.3407   0.460   0.6620  
Xa1           0.1397     0.4976   0.281   0.7884  
Xb            0.7659     0.2742   2.793   0.0314 *
Xa1:Xb       -1.2392     0.6384  -1.941   0.1003  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7469 on 6 degrees of freedom
Multiple R-squared:  0.5912,	Adjusted R-squared:  0.3868
F-statistic: 2.892 on 3 and 6 DF,  p-value: 0.1243
```

### Situation 3d: Contrast Coding + Centering

```r
# contrast coding
contrasts(df$Xa) <- contr.sum(2)

# centering the continuous IV
center_scale <- function(x) scale(x, scale = FALSE)
df$Xb <- center_scale(df$Xb)

# linear regression with contrast coding
result <- lm(Y ~ Xa + Xb + Xa * Xb, data = df)

# summarize the result
summary(result)
```

```
Call:
lm(formula = Y ~ Xa + Xb + Xa * Xb, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.74993 -0.47098 -0.04572  0.28724  1.35337 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.22644    0.24880   0.910    0.398
Xa1          0.06984    0.24880   0.281    0.788
Xb           0.14633    0.31918   0.458    0.663
Xa1:Xb      -0.61960    0.31918  -1.941    0.100

Residual standard error: 0.7469 on 6 degrees of freedom
Multiple R-squared:  0.5912,	Adjusted R-squared:  0.3868
F-statistic: 2.892 on 3 and 6 DF,  p-value: 0.1243
```

## Summary

Situation 2: Categorical IV + Continuous IV

1. S1 vs. S2: Adding a continuous IV to the regression changes both the coefficient and the p-value of the categorical variable.
2. S2a vs. S2b: Coding (dummy vs. contrast) changes nothing for the continuous IV.
3. S2c vs. S2d: Same as point 2.
4. S2a vs. S2b: Coding does not change the p-value of the categorical variable, but it does change its regression coefficient.
5. S2c vs. S2d: Same as point 4.
6. S2a+S2b vs. S2c+S2d: Centering changes only the intercept (both its coefficient and p-value); everything else stays the same.
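Point 6 can be verified directly: in the additive model, centering the continuous IV shifts the intercept by the slope times the mean of Xb and leaves all slopes untouched. A minimal sketch, re-creating the simulated data from above (the `fit_raw`/`fit_centered` names are illustrative):

```r
# re-create the tutorial's simulated data
set.seed(123)
Xa <- as.factor(rep(c(1, 2), times = 5))
Y <- rnorm(10)
Xb <- rnorm(10)
df <- data.frame(Xa, Xb, Y)
contrasts(df$Xa) <- contr.treatment(2, base = 2)

# fit the additive model before and after centering the continuous IV
fit_raw <- lm(Y ~ Xa + Xb, data = df)
df$Xb <- df$Xb - mean(df$Xb)
fit_centered <- lm(Y ~ Xa + Xb, data = df)

# slopes are identical; only the intercept differs
coef(fit_raw) - coef(fit_centered)
```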

Situation 3: Categorical IV + Continuous IV + Categorical IV × Continuous IV

1. S2 vs. S3: Adding an interaction term changes all the other coefficients.
2. S3a vs. S3b: Coding changes both the coefficient and the p-value of the continuous IV.
3. S3c vs. S3d: Same as point 2.
4. S3a vs. S3b: While coding changes the regression coefficients of the categorical IV and the interaction term, their p-values do not change.
5. S3c vs. S3d: Same as point 4.
6. S3a+S3b vs. S3c+S3d: Centering changes both the coefficient and the p-value of the intercept and of the categorical IV.
7. S3a+S3b vs. S3c+S3d: Centering changes neither the coefficient nor the p-value of the continuous IV or the interaction term.
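The pattern in points 4 and 7 reflects a general fact: dummy and sum coding of a two-level factor are reparameterizations of the same model, so the fitted values and overall fit are identical even though individual coefficients differ. A quick check, re-creating the simulated data (the `fit_dummy`/`fit_sum` names are illustrative):

```r
# re-create the tutorial's simulated data
set.seed(123)
Xa <- as.factor(rep(c(1, 2), times = 5))
Y <- rnorm(10)
Xb <- rnorm(10)
df <- data.frame(Xa, Xb, Y)

# same interaction model under dummy coding...
contrasts(df$Xa) <- contr.treatment(2, base = 2)
fit_dummy <- lm(Y ~ Xa * Xb, data = df)

# ...and under sum (contrast) coding
contrasts(df$Xa) <- contr.sum(2)
fit_sum <- lm(Y ~ Xa * Xb, data = df)

# identical fitted values and R-squared under either coding
all.equal(fitted(fit_dummy), fitted(fit_sum))
c(summary(fit_dummy)$r.squared, summary(fit_sum)$r.squared)
```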

## Take-home message

1. For a categorical variable with 2 levels, different codings do not affect the p-value of the interaction effect.
2. For a categorical variable with 2 levels, when the interaction effect is significant and you want to examine main effects, it is better to look at the simple main effects (e.g., using the Johnson-Neyman procedure, or testing the slope under each level of the categorical variable).
3. When the interaction effect is significant, it is difficult to interpret any other main effect, apart from the simple main effects mentioned in point 2.
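As a simple alternative to the Johnson-Neyman procedure mentioned in point 2, the slope of the continuous IV can be tested separately within each level of the categorical variable. A minimal sketch, re-creating the simulated data (the `slope_g1`/`slope_g2` names are illustrative, not from the original tutorial):

```r
# re-create the tutorial's simulated data
set.seed(123)
Xa <- as.factor(rep(c(1, 2), times = 5))
Y <- rnorm(10)
Xb <- rnorm(10)
df <- data.frame(Xa, Xb, Y)

# slope of Xb within each level of the categorical variable
slope_g1 <- lm(Y ~ Xb, data = subset(df, Xa == 1))
slope_g2 <- lm(Y ~ Xb, data = subset(df, Xa == 2))

summary(slope_g1)$coefficients["Xb", ]
summary(slope_g2)$coefficients["Xb", ]
```

With dummy coding (`base = 2`), the `Xb` coefficient in the full interaction model equals the within-group slope for group 2, and `Xb` plus the `Xa1:Xb` coefficient equals the slope for group 1; only the standard errors differ, because each subgroup fit uses its own residual variance.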