If we extend your example a little to include a third level in the race category (say Asian) and choose White as the reference, then you would have:
- $\hat{\beta}_0 = \bar{x}_{\text{White}}$
- $\hat{\beta}_{\text{Black}} = \bar{x}_{\text{Black}} - \bar{x}_{\text{White}}$
- $\hat{\beta}_{\text{Asian}} = \bar{x}_{\text{Asian}} - \bar{x}_{\text{White}}$
In this case, the interpretation of all the $\hat{\beta}$ is easy and finding the mean of any level of the category is straightforward. For example:
- $\bar{x}_{\text{Asian}} = \hat{\beta}_{\text{Asian}} + \hat{\beta}_0$
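The single-category case can be checked with a minimal R sketch. The data frame `d1` and its values are illustrative assumptions (they reuse the `y` values of the numerical example further down), and the model relies on R's default treatment contrasts with White as the first factor level.

```r
# Illustrative toy data: two observations per race level (values are assumed).
d1 <- data.frame(
  Race = factor(rep(c("White", "Black", "Asian"), each = 2),
                levels = c("White", "Black", "Asian")),  # first level = reference
  y    = c(0, 3, 7, 8, 9, 10)
)

# Mean of y within each level of Race, as a named numeric vector.
m <- sapply(split(d1$y, d1$Race), mean)

# Coefficients of the one-factor model under the default treatment contrasts.
b <- coef(lm(y ~ Race, data = d1))

# The intercept is the reference-level mean; the others are differences from it.
all.equal(b[["(Intercept)"]], m[["White"]])
all.equal(b[["RaceBlack"]], m[["Black"]] - m[["White"]])
all.equal(b[["RaceAsian"]], m[["Asian"]] - m[["White"]])

# Recover a level mean from the coefficients.
b[["RaceAsian"]] + b[["(Intercept)"]]  # same value as m[["Asian"]]
```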
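Unfortunately, with multiple categorical variables the correct interpretation of the intercept is no longer as clear (see the note at the end). When there are $n$ categorical variables, each with multiple levels and one reference level (e.g. White and Male in your example), the general form for the intercept is:
$\hat{\beta}_0 = \sum_{i=1}^{n} \bar{x}_{\text{reference},i} - (n-1)\,\bar{x},$
where
$\bar{x}_{\text{reference},i}$ is the mean of the reference level of the $i$-th categorical variable,
$\bar{x}$ is the mean of the whole data set.
This form holds exactly when the design is balanced, i.e. every combination of levels has the same number of observations, as in the numerical example below.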
The other $\hat{\beta}$ are the same as with a single category: each is the difference between the mean of that level of the category and the mean of the reference level of the same category.
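As a rough sketch of where the intercept formula comes from, assume a balanced design (the same number of observations for every combination of levels) and a model with main effects only. The notation $\hat{y}$ for a fitted value and $\bar{x}_{\ell_i,i}$ for the mean of level $\ell_i$ of the $i$-th variable is introduced here purely for illustration. Under those assumptions, the least-squares fitted value for a combination of levels is the grand mean plus each variable's marginal deviation:

$$
\hat{y}_{\ell_1,\dots,\ell_n}
  = \bar{x} + \sum_{i=1}^{n}\left(\bar{x}_{\ell_i,i} - \bar{x}\right)
  = \sum_{i=1}^{n}\bar{x}_{\ell_i,i} - (n-1)\,\bar{x}
$$

With treatment contrasts the intercept is the fitted value at the all-reference combination, so setting every $\ell_i$ to its reference level gives the expression for the intercept. In an unbalanced design this identity should not be expected to hold exactly.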
If we go back to your example, we would get:
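- $\hat{\beta}_0 = \bar{x}_{\text{White}} + \bar{x}_{\text{Male}} - \bar{x}$
- $\hat{\beta}_{\text{Black}} = \bar{x}_{\text{Black}} - \bar{x}_{\text{White}}$
- $\hat{\beta}_{\text{Asian}} = \bar{x}_{\text{Asian}} - \bar{x}_{\text{White}}$
- $\hat{\beta}_{\text{Female}} = \bar{x}_{\text{Female}} - \bar{x}_{\text{Male}}$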
You will notice that the means of the cross categories (e.g. White males) are not present in any of the $\hat{\beta}$. As a matter of fact, you cannot calculate these means exactly from the results of this type of regression.
The reason is that the number of estimated coefficients (i.e. the $\hat{\beta}$) is smaller than the number of cross categories (as long as you have more than one categorical variable), so a perfect fit is not always possible. If we go back to your example, the number of coefficients is 4 (i.e. $\hat{\beta}_0$, $\hat{\beta}_{\text{Black}}$, $\hat{\beta}_{\text{Asian}}$ and $\hat{\beta}_{\text{Female}}$) while the number of cross categories is 6.
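The counting argument can be illustrated with a short R sketch. It uses the same toy data as the numerical example that follows; the object names `fit` and `cells` are assumptions made for this illustration.

```r
# Toy data: one observation for each of the six Sex x Race combinations.
d <- data.frame(
  Sex  = factor(rep(c("Male", "Female"), times = 3), levels = c("Male", "Female")),
  Race = factor(rep(c("White", "Black", "Asian"), each = 2),
                levels = c("White", "Black", "Asian")),
  y    = c(0, 3, 7, 8, 9, 10)
)

# Main-effects model under the default treatment contrasts.
fit <- lm(y ~ Sex + Race, data = d)

# Four coefficients have to describe six cross categories.
length(coef(fit))                    # 4
nrow(unique(d[, c("Sex", "Race")]))  # 6

# With one observation per cell, y is also the observed cell mean.
# Non-zero residuals show that the cell means are not reproduced exactly.
cells <- cbind(d, fitted = fitted(fit), residual = residuals(fit))
cells
```

In this data the fitted value for White males is the intercept (about 0.67), whereas the observed value in that cell is 0.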
Numerical Example
Let me borrow from @Gung for a canned numerical example:
d = data.frame(Sex=factor(rep(c("Male","Female"),times=3), levels=c("Male","Female")),
Race =factor(rep(c("White","Black","Asian"),each=2),levels=c("White","Black","Asian")),
y =c(0, 3, 7, 8, 9, 10))
d
# Sex Race y
# 1 Male White 0
# 2 Female White 3
# 3 Male Black 7
# 4 Female Black 8
# 5 Male Asian 9
# 6 Female Asian 10
In this case, the various means that go into the calculation of the $\hat{\beta}$ are:
aggregate(y~1, d, mean)
# y
# 1 6.166667
aggregate(y~Sex, d, mean)
# Sex y
# 1 Male 5.333333
# 2 Female 7.000000
aggregate(y~Race, d, mean)
# Race y
# 1 White 1.5
# 2 Black 7.5
# 3 Asian 9.5
We can compare these numbers with the results of the regression:
summary(lm(y~Sex+Race, d))
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.6667 0.6667 1.000 0.4226
# SexFemale 1.6667 0.6667 2.500 0.1296
# RaceBlack 6.0000 0.8165 7.348 0.0180
# RaceAsian 8.0000 0.8165 9.798 0.0103
As you can see, the various $\hat{\beta}$ estimated from the regression all match the formulas given above. For example, $\hat{\beta}_0$ is given by:
$\hat{\beta}_0 = \bar{x}_{\text{White}} + \bar{x}_{\text{Male}} - \bar{x}$
Which gives:
1.5 + 5.333333 - 6.166667
# 0.66666
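The same check can be extended to all four coefficients. The following is a sketch only: the helper names `grand`, `by_sex`, `by_race` and `from_means` are assumptions introduced for this illustration, and the data frame is rebuilt so that the block runs on its own.

```r
# Same toy data as above.
d <- data.frame(
  Sex  = factor(rep(c("Male", "Female"), times = 3), levels = c("Male", "Female")),
  Race = factor(rep(c("White", "Black", "Asian"), each = 2),
                levels = c("White", "Black", "Asian")),
  y    = c(0, 3, 7, 8, 9, 10)
)

# Grand mean and marginal means.
grand   <- mean(d$y)
by_sex  <- sapply(split(d$y, d$Sex), mean)
by_race <- sapply(split(d$y, d$Race), mean)

# Coefficients rebuilt from the means, following the formulas in the text.
from_means <- c(
  "(Intercept)" = by_race[["White"]] + by_sex[["Male"]] - grand,
  SexFemale     = by_sex[["Female"]] - by_sex[["Male"]],
  RaceBlack     = by_race[["Black"]] - by_race[["White"]],
  RaceAsian     = by_race[["Asian"]] - by_race[["White"]]
)

# Coefficients estimated by the regression (default treatment contrasts).
b <- coef(lm(y ~ Sex + Race, data = d))

# Compare the two, matching entries by name.
all.equal(from_means, b[names(from_means)])
```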
Note on the choice of contrast
One last note on this subject: all the results discussed above relate to categorical regressions using treatment contrasts (the default type of contrast in R). Other types of contrast can be used (notably Helmert and sum), and they change the interpretation of the various $\hat{\beta}$. However, they do not change the final predictions of the regression (e.g. the prediction for White males is the same whichever type of contrast you use).
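A minimal R sketch of this invariance, using the toy data from the numerical example. The object names `fit_trt`, `fit_sum`, `fit_hel` and `white_male` are assumptions made for this illustration, and the first fit assumes R's default `options("contrasts")`.

```r
# Same toy data as in the numerical example.
d <- data.frame(
  Sex  = factor(rep(c("Male", "Female"), times = 3), levels = c("Male", "Female")),
  Race = factor(rep(c("White", "Black", "Asian"), each = 2),
                levels = c("White", "Black", "Asian")),
  y    = c(0, 3, 7, 8, 9, 10)
)

# The same model fitted under three different contrast codings.
fit_trt <- lm(y ~ Sex + Race, data = d)  # default treatment contrasts
fit_sum <- lm(y ~ Sex + Race, data = d,
              contrasts = list(Sex = "contr.sum", Race = "contr.sum"))
fit_hel <- lm(y ~ Sex + Race, data = d,
              contrasts = list(Sex = "contr.helmert", Race = "contr.helmert"))

# The coefficients differ between codings...
coef(fit_trt)
coef(fit_sum)
coef(fit_hel)

# ...but the fitted values are the same.
all.equal(fitted(fit_trt), fitted(fit_sum))
all.equal(fitted(fit_trt), fitted(fit_hel))

# Prediction for White males under each coding.
white_male <- data.frame(
  Sex  = factor("Male",  levels = levels(d$Sex)),
  Race = factor("White", levels = levels(d$Race))
)
predict(fit_trt, white_male)
predict(fit_sum, white_male)
predict(fit_hel, white_male)
```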
My favourite is the sum contrast, as I feel that the interpretation of the $\hat{\beta}^{\text{contr.sum}}$ generalises better when there are multiple categories. For this type of contrast there is no reference level, or rather the reference is the mean of the whole sample, and you have the following $\hat{\beta}^{\text{contr.sum}}$:
- $\hat{\beta}^{\text{contr.sum}}_0 = \bar{x}$
- $\hat{\beta}^{\text{contr.sum}}_i = \bar{x}_i - \bar{x}$
If we go back to the previous example, you would have:
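- $\hat{\beta}^{\text{contr.sum}}_0 = \bar{x}$
- $\hat{\beta}^{\text{contr.sum}}_{\text{White}} = \bar{x}_{\text{White}} - \bar{x}$
- $\hat{\beta}^{\text{contr.sum}}_{\text{Black}} = \bar{x}_{\text{Black}} - \bar{x}$
- $\hat{\beta}^{\text{contr.sum}}_{\text{Asian}} = \bar{x}_{\text{Asian}} - \bar{x}$
- $\hat{\beta}^{\text{contr.sum}}_{\text{Male}} = \bar{x}_{\text{Male}} - \bar{x}$
- $\hat{\beta}^{\text{contr.sum}}_{\text{Female}} = \bar{x}_{\text{Female}} - \bar{x}$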
You will notice that because White and Male are no longer reference levels, their $\hat{\beta}^{\text{contr.sum}}$ are no longer 0. The fact that these are 0 is specific to treatment contrasts.
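These sum-contrast coefficients can be checked on the toy data from the numerical example. This is a sketch under the assumption of a balanced design; the helper names are hypothetical. Note that R reports coefficients only for the first levels of each factor (`Sex1`, `Race1`, `Race2`, in the order of the factor levels); the coefficient of the last level is minus the sum of the others.

```r
# Same toy data as in the numerical example.
d <- data.frame(
  Sex  = factor(rep(c("Male", "Female"), times = 3), levels = c("Male", "Female")),
  Race = factor(rep(c("White", "Black", "Asian"), each = 2),
                levels = c("White", "Black", "Asian")),
  y    = c(0, 3, 7, 8, 9, 10)
)

# Main-effects model with sum contrasts on both factors.
fit_sum <- lm(y ~ Sex + Race, data = d,
              contrasts = list(Sex = "contr.sum", Race = "contr.sum"))
b <- coef(fit_sum)

# Grand mean and marginal means.
grand   <- mean(d$y)
by_sex  <- sapply(split(d$y, d$Sex), mean)
by_race <- sapply(split(d$y, d$Race), mean)

# The intercept is the grand mean; each coefficient is a deviation from it.
all.equal(b[["(Intercept)"]], grand)
all.equal(b[["Sex1"]],  by_sex[["Male"]]   - grand)
all.equal(b[["Race1"]], by_race[["White"]] - grand)
all.equal(b[["Race2"]], by_race[["Black"]] - grand)

# Last levels are not reported: they are minus the sum of the reported ones.
beta_female <- -b[["Sex1"]]
beta_asian  <- -(b[["Race1"]] + b[["Race2"]])
all.equal(beta_female, by_sex[["Female"]] - grand)
all.equal(beta_asian,  by_race[["Asian"]] - grand)
```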