Factor codings for models in R

I am holding an exercise on generalised models these days. Preparing a task on factor coding in generalised linear models, I realised that the help on the internet on that is not so easy to understand. At least what I found. So in order to help people who find this topic confusing, I want to help out a little here.

Consider the following data

set.seed(29)
x <- gl(5, 10) # 5 factor's levels, each replicated 10 times
y <- rnorm(n = length(x), mean = as.integer(x), sd = 0.1)   # means= 1,2,3,4,5
dd <- data.frame(x = x, y = y)

library(ggplot2)
ggplot(aes(x = x, y = y), data = dd) + geom_boxplot()

means <- tapply(y, x, mean)
means
##         1         2         3         4         5 
## 0.9962925 2.0227058 3.0230409 3.9482690 5.0189041

R2

Now we want to model the mean of y given x using the lm() function with the following codings: dummy-coding, treatment-coding (where the reference category is 5), effect-coding and split-coding.

To make the theory more general, we have a categorical variable \( X \) with \( K \) categories \( (a_1, \dots, a_K) \)

Dummy-coding

Look at each level separately: \[ E(Y|X=a_{k}) = \beta_k, \quad k=1,\dots,K \]

# dummy (each level singularly)
lm_dummy <- lm( y ~ x - 1, contrasts = list(x = contr.treatment(5)))
summary(lm_dummy)
## 
## Call:
## lm(formula = y ~ x - 1, contrasts = list(x = contr.treatment(5)))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.191692 -0.065303 -0.002686  0.054485  0.214752 
## 
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)    
## x1  0.99629    0.03293   30.25   <2e-16 ***
## x2  2.02271    0.03293   61.43   <2e-16 ***
## x3  3.02304    0.03293   91.80   <2e-16 ***
## x4  3.94827    0.03293  119.90   <2e-16 ***
## x5  5.01890    0.03293  152.41   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1041 on 45 degrees of freedom
## Multiple R-squared:  0.9991,	Adjusted R-squared:  0.999 
## F-statistic: 1.014e+04 on 5 and 45 DF,  p-value: < 2.2e-16
coef(lm_dummy)
##        x1        x2        x3        x4        x5 
## 0.9962925 2.0227058 3.0230409 3.9482690 5.0189041
means
##         1         2         3         4         5 
## 0.9962925 2.0227058 3.0230409 3.9482690 5.0189041

Treatment-coding

Compare each category to the dummy-category \( a_d \):

\[ E(Y|X=a_{d}) = \beta_0 \] and \[ E(Y|X=a_k) = \beta_0 + \beta_k, \quad k\neq d \]

# treatment (restrict one level to constant term, all other difference from it)
lm_treatment <- lm( y ~ x, contrast = list(x = contr.treatment(5, base = 5)))
summary(lm_treatment)
## 
## Call:
## lm(formula = y ~ x, contrasts = list(x = contr.treatment(5, base = 5)))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.191692 -0.065303 -0.002686  0.054485  0.214752 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.01890    0.03293  152.41   <2e-16 ***
## x1          -4.02261    0.04657  -86.38   <2e-16 ***
## x2          -2.99620    0.04657  -64.34   <2e-16 ***
## x3          -1.99586    0.04657  -42.86   <2e-16 ***
## x4          -1.07064    0.04657  -22.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1041 on 45 degrees of freedom
## Multiple R-squared:  0.9951,	Adjusted R-squared:  0.9947 
## F-statistic:  2293 on 4 and 45 DF,  p-value: < 2.2e-16
coef(lm_treatment)
## (Intercept)          x1          x2          x3          x4 
##    5.018904   -4.022612   -2.996198   -1.995863   -1.070635
c(means[5], means[1:4] - means[5]) 
##         5         1         2         3         4 
##  5.018904 -4.022612 -2.996198 -1.995863 -1.070635

Effect-coding

Compare each category to the mean:

\[ E(Y|X=a_k) = \beta_0 + \beta_k, \quad k=1,\dots,K-1 \] and \[ E(Y|X=a_K) = \beta_0 - \sum\limits_{j=1}^{K-1} \beta_j \]

# effect(deviation from overall average)
lm_effect <- lm(y ~ x, contrast = list(x = contr.sum(5)))
summary(lm_effect)
## 
## Call:
## lm(formula = y ~ x, contrasts = list(x = contr.sum(5)))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.191692 -0.065303 -0.002686  0.054485  0.214752 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.00184    0.01473  203.84   <2e-16 ***
## x1          -2.00555    0.02945  -68.09   <2e-16 ***
## x2          -0.97914    0.02945  -33.24   <2e-16 ***
## x3           0.02120    0.02945    0.72    0.475    
## x4           0.94643    0.02945   32.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1041 on 45 degrees of freedom
## Multiple R-squared:  0.9951,	Adjusted R-squared:  0.9947 
## F-statistic:  2293 on 4 and 45 DF,  p-value: < 2.2e-16
coef(lm_effect)
## (Intercept)          x1          x2          x3          x4 
##  3.00184245 -2.00554992 -0.97913668  0.02119841  0.94642653
c(mean(means), means[1:4] - mean(means))
##                       1           2           3           4 
##  3.00184245 -2.00554992 -0.97913668  0.02119841  0.94642653

Split-coding

Compare each category to the previous category (for ordered categories):

\[ E(Y|X=a_1) = \beta_0 \] and \[ E(Y|X=a_k) = \beta_0 + \sum\limits_{j=1}^{k-1} \beta_j, \quad k=2,\dots,K \]

# split coding
c <- rbind( c(0,0,0,0),
            c(1,0,0,0),
            c(1,1,0,0),
            c(1,1,1,0),
            c(1,1,1,1) )

lm_split <- lm(y ~ x, contrast = list(x = c))
summary(lm_split)
## 
## Call:
## lm(formula = y ~ x, contrasts = list(x = c))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.191692 -0.065303 -0.002686  0.054485  0.214752 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.99629    0.03293   30.25   <2e-16 ***
## x1           1.02641    0.04657   22.04   <2e-16 ***
## x2           1.00034    0.04657   21.48   <2e-16 ***
## x3           0.92523    0.04657   19.87   <2e-16 ***
## x4           1.07064    0.04657   22.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1041 on 45 degrees of freedom
## Multiple R-squared:  0.9951,	Adjusted R-squared:  0.9947 
## F-statistic:  2293 on 4 and 45 DF,  p-value: < 2.2e-16
coef(lm_split)
## (Intercept)          x1          x2          x3          x4 
##   0.9962925   1.0264132   1.0003351   0.9252281   1.0706351
c(means[1], diff(means))
##         1         2         3         4         5 
## 0.9962925 1.0264132 1.0003351 0.9252281 1.0706351