01 Nov 2015
I am holding an exercise on generalised models these days. Preparing a task on factor coding
in generalised linear models, I realised that the help on the internet on that is not so
easy to understand. At least what I found. So in order to help people who find this topic
confusing, I want to help out a little here.
Consider the following data
set.seed ( 29 )
x <- gl ( 5 , 10 ) # 5 factor's levels, each replicated 10 times
y <- rnorm ( n = length ( x ), mean = as.integer ( x ), sd = 0.1 ) # means= 1,2,3,4,5
dd <- data.frame ( x = x , y = y )
library ( ggplot2 )
ggplot ( aes ( x = x , y = y ), data = dd ) + geom_boxplot ()
means <- tapply ( y , x , mean )
means
## 1 2 3 4 5
## 0.9962925 2.0227058 3.0230409 3.9482690 5.0189041
Now we want to model the mean of y given x using the lm()
function with the following codings:
dummy-coding , treatment-coding (where the reference category is 5), effect-coding
and split-coding .
To make the theory more general, we have a categorical variable \( X \) with \( K \) categories \( (a_1, \dots, a_K) \)
Dummy-coding
Look at each level separately:
\[ E(Y|X=a_{k}) = \beta_k, \quad k=1,\dots,K \]
# dummy (each level singularly)
lm_dummy <- lm ( y ~ x - 1 , contrasts = list ( x = contr.treatment ( 5 )))
summary ( lm_dummy )
##
## Call:
## lm(formula = y ~ x - 1, contrasts = list(x = contr.treatment(5)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.191692 -0.065303 -0.002686 0.054485 0.214752
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x1 0.99629 0.03293 30.25 <2e-16 ***
## x2 2.02271 0.03293 61.43 <2e-16 ***
## x3 3.02304 0.03293 91.80 <2e-16 ***
## x4 3.94827 0.03293 119.90 <2e-16 ***
## x5 5.01890 0.03293 152.41 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1041 on 45 degrees of freedom
## Multiple R-squared: 0.9991, Adjusted R-squared: 0.999
## F-statistic: 1.014e+04 on 5 and 45 DF, p-value: < 2.2e-16
coef ( lm_dummy )
## x1 x2 x3 x4 x5
## 0.9962925 2.0227058 3.0230409 3.9482690 5.0189041
means
## 1 2 3 4 5
## 0.9962925 2.0227058 3.0230409 3.9482690 5.0189041
Treatment-coding
Compare each category to the dummy-category \( a_d \):
\[ E(Y|X=a_{d}) = \beta_0 \]
and
\[ E(Y|X=a_k) = \beta_0 + \beta_k, \quad k\neq d \]
# treatment (restrict one level to constant term, all other difference from it)
lm_treatment <- lm ( y ~ x , contrast = list ( x = contr.treatment ( 5 , base = 5 )))
summary ( lm_treatment )
##
## Call:
## lm(formula = y ~ x, contrasts = list(x = contr.treatment(5, base = 5)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.191692 -0.065303 -0.002686 0.054485 0.214752
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.01890 0.03293 152.41 <2e-16 ***
## x1 -4.02261 0.04657 -86.38 <2e-16 ***
## x2 -2.99620 0.04657 -64.34 <2e-16 ***
## x3 -1.99586 0.04657 -42.86 <2e-16 ***
## x4 -1.07064 0.04657 -22.99 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1041 on 45 degrees of freedom
## Multiple R-squared: 0.9951, Adjusted R-squared: 0.9947
## F-statistic: 2293 on 4 and 45 DF, p-value: < 2.2e-16
coef ( lm_treatment )
## (Intercept) x1 x2 x3 x4
## 5.018904 -4.022612 -2.996198 -1.995863 -1.070635
c ( means [ 5 ], means [ 1 : 4 ] - means [ 5 ])
## 5 1 2 3 4
## 5.018904 -4.022612 -2.996198 -1.995863 -1.070635
Effect-coding
Compare each category to the mean:
\[ E(Y|X=a_k) = \beta_0 + \beta_k, \quad k=1,\dots,K-1 \]
and
\[ E(Y|X=a_K) = \beta_0 - \sum\limits_{j=1}^{K-1} \beta_j \]
# effect(deviation from overall average)
lm_effect <- lm ( y ~ x , contrast = list ( x = contr.sum ( 5 )))
summary ( lm_effect )
##
## Call:
## lm(formula = y ~ x, contrasts = list(x = contr.sum(5)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.191692 -0.065303 -0.002686 0.054485 0.214752
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.00184 0.01473 203.84 <2e-16 ***
## x1 -2.00555 0.02945 -68.09 <2e-16 ***
## x2 -0.97914 0.02945 -33.24 <2e-16 ***
## x3 0.02120 0.02945 0.72 0.475
## x4 0.94643 0.02945 32.13 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1041 on 45 degrees of freedom
## Multiple R-squared: 0.9951, Adjusted R-squared: 0.9947
## F-statistic: 2293 on 4 and 45 DF, p-value: < 2.2e-16
coef ( lm_effect )
## (Intercept) x1 x2 x3 x4
## 3.00184245 -2.00554992 -0.97913668 0.02119841 0.94642653
c ( mean ( means ), means [ 1 : 4 ] - mean ( means ))
## 1 2 3 4
## 3.00184245 -2.00554992 -0.97913668 0.02119841 0.94642653
Split-coding
Compare each category to the previous category (for ordered categories):
\[ E(Y|X=a_1) = \beta_0 \]
and
\[ E(Y|X=a_k) = \beta_0 + \sum\limits_{j=1}^{k-1} \beta_j, \quad k=2,\dots,K \]
# split coding
c <- rbind ( c ( 0 , 0 , 0 , 0 ),
c ( 1 , 0 , 0 , 0 ),
c ( 1 , 1 , 0 , 0 ),
c ( 1 , 1 , 1 , 0 ),
c ( 1 , 1 , 1 , 1 ) )
lm_split <- lm ( y ~ x , contrast = list ( x = c ))
summary ( lm_split )
##
## Call:
## lm(formula = y ~ x, contrasts = list(x = c))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.191692 -0.065303 -0.002686 0.054485 0.214752
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.99629 0.03293 30.25 <2e-16 ***
## x1 1.02641 0.04657 22.04 <2e-16 ***
## x2 1.00034 0.04657 21.48 <2e-16 ***
## x3 0.92523 0.04657 19.87 <2e-16 ***
## x4 1.07064 0.04657 22.99 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1041 on 45 degrees of freedom
## Multiple R-squared: 0.9951, Adjusted R-squared: 0.9947
## F-statistic: 2293 on 4 and 45 DF, p-value: < 2.2e-16
coef ( lm_split )
## (Intercept) x1 x2 x3 x4
## 0.9962925 1.0264132 1.0003351 0.9252281 1.0706351
c ( means [ 1 ], diff ( means ))
## 1 2 3 4 5
## 0.9962925 1.0264132 1.0003351 0.9252281 1.0706351