多元回归的例子
Swiss fertility data
?swiss
?swiss
Description
Standardized fertility measure and socio-economic indicators for each of 47 French-speaking provinces of Switzerland at about 1888.
A data frame with 47 observations on 6 variables, each of which is in percent, i.e., in [0, 100].
- [,1] Fertility Ig, ‘ common standardized fertility measure’
- [,2] Agriculture % of males involved in agriculture as occupation
- [,3] Examination % draftees receiving highest mark on army examination
- [,4] Education % education beyond primary school for draftees.
- [,5] Catholic % ‘catholic’ (as opposed to ‘protestant’).
- [,6] Infant.Mortality live births who live less than 1 year.
All variables but ‘Fertility’ give proportions of the population.
Calling
lm
lm
summary(lm(Fertility ~ . , data = swiss))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 66.9152 10.70604 6.250 1.906e-07
Agriculture -0.1721 0.07030 -2.448 1.873e-02
Examination -0.2580 0.25388 -1.016 3.155e-01
Education -0.8709 0.18303 -4.758 2.431e-05
Catholic 0.1041 0.03526 2.953 5.190e-03
Infant.Mortality 1.0770 0.38172 2.822 7.336e-03
Example interpretation
- Agriculture is expressed in percentages (0 – 100)
- Estimate is -0.1721.
- We estimate an expected 0.17 decrease in standardized fertility for every 1\% increase in percentage of males involved in agriculture in holding the remaining variables constant.
- The t-test for $H_0: \beta_{Agri} = 0$ versus $H_a: \beta_{Agri} \neq 0$ is significant.
- Interestingly, the unadjusted estimate is
Estimate Std. Error t value Pr(>|t|)
(Intercept) 60.3044 4.25126 14.185 3.216e-18
Agriculture 0.1942 0.07671 2.532 1.492e-02
How can adjustment reverse the sign of an effect? Let’s try a simulation.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.618 1.200 1.349 1.806e-01
x1 95.854 2.058 46.579 1.153e-68
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0003683 0.0020141 0.1829 8.553e-01
x1 -1.0215256 0.0166372 -61.4001 1.922e-79
x2 1.0001909 0.0001681 5950.1818 1.369e-271
plot(x1, y, pch=21,col=”black”,bg=topo.colors(n)[x2], frame = FALSE, cex = 1.5)
Back to this data set
- The sign reverses itself with the inclusion of Examination and Education, but of which are negatively correlated with Agriculture.
-
The percent of males in the province working in agriculture is negatively related to educational attainment (correlation of -0.6395) and Education and Examination (correlation of 0.6984) are obviously measuring similar things.
- Is the positive marginal an artifact for not having accounted for, say, Education level? (Education does have a stronger effect, by the way.)
- At the minimum, anyone claiming that provinces that are more agricultural have higher fertility rates would immediately be open to criticism.
What if we include an unnecessary variable?
z adds no new linear information, since it’s a linear combination of variables already included. R just drops terms that are linear combinations of other terms.
Call:
lm(formula = Fertility ~ . + z, data = swiss)
Coefficients:
(Intercept) Agriculture Examination Education Catholic
66.915 -0.172 -0.258 -0.871 0.104
Infant.Mortality z
1.077 NA
Dummy variables are smart
- Consider the linear model $$ Y_i = \beta_0 + X_{i1} \beta_1 + \epsilon_{i} $$ where each $X_{i1}$ is binary so that it is a 1 if measurement $i$ is in a group and 0 otherwise. (Treated versus not in a clinical trial, for example.)
- Then for people in the group $E[Y_i] = \beta_0 + \beta_1$
- And for people not in the group $E[Y_i] = \beta_0$
- The LS fits work out to be $\hat \beta_0 + \hat \beta_1$ is the mean for those in the group and $\hat \beta_0$ is the mean for those not in the group.
- $\beta_1$ is interpretted as the increase or decrease in the mean comparing those in the group to those not.
- Note including a binary variable that is 1 for those not in the group would be redundant. It would create three parameters to describe two means.
More than 2 levels
- Consider a multilevel factor level. For didactic reasons, let’s say a three level factor (example, US political party affiliation: Republican, Democrat, Independent)
- $Y_i = \beta_0 + X_{i1} \beta_1 + X_{i2} \beta_2 + \epsilon_i$.
- $X_{i1}$ is 1 for Republicans and 0 otherwise.
- $X_{i2}$ is 1 for Democrats and 0 otherwise.
- If $i$ is Republican $E[Y_i] = \beta_0 +\beta_1$
- If $i$ is Democrat $E[Y_i] = \beta_0 + \beta_2$.
- If $i$ is Independent $E[Y_i] = \beta_0$.
- $\beta_1$ compares Republicans to Independents.
- $\beta_2$ compares Democrats to Independents.
- $\beta_1 – \beta_2$ compares Republicans to Democrats.
- (Choice of reference category changes the interpretation.)
Insect Sprays
Linear model fit, group A is the reference
之所以没用A,是因为如上A作为了参考。即成为了intercept,如系数sprayB为正表明比A大,反之则较小
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.5000 1.132 12.8074 1.471e-19
sprayB 0.8333 1.601 0.5205 6.045e-01
sprayC -12.4167 1.601 -7.7550 7.267e-11
sprayD -9.5833 1.601 -5.9854 9.817e-08
sprayE -11.0000 1.601 -6.8702 2.754e-09
sprayF 2.1667 1.601 1.3532 1.806e-01
Hard coding the dummy variables
mupliply 是为了将因子型转为数值型
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.5000 1.132 12.8074 1.471e-19
I(1 * (spray == "B")) 0.8333 1.601 0.5205 6.045e-01
I(1 * (spray == "C")) -12.4167 1.601 -7.7550 7.267e-11
I(1 * (spray == "D")) -9.5833 1.601 -5.9854 9.817e-08
I(1 * (spray == "E")) -11.0000 1.601 -6.8702 2.754e-09
I(1 * (spray == "F")) 2.1667 1.601 1.3532 1.806e-01
What if we include all 6?
因为x6=1-x1-x2-x3-x4-x5。故而当前面五个系数存在时,第六个系数则为NA
Call:
lm(formula = count ~ I(1 * (spray == "B")) + I(1 * (spray ==
"C")) + I(1 * (spray == "D")) + I(1 * (spray == "E")) + I(1 *
(spray == "F")) + I(1 * (spray == "A")), data = InsectSprays)
Coefficients:
(Intercept) I(1 * (spray == "B")) I(1 * (spray == "C")) I(1 * (spray == "D"))
14.500 0.833 -12.417 -9.583
I(1 * (spray == "E")) I(1 * (spray == "F")) I(1 * (spray == "A"))
-11.000 2.167 NA
What if we omit the intercept?
可见当省略截距时,sprayA的系数与未省截距时同,而其他各系数则并不同,可见此时系数等于各spray的mean,而且刚好能与前面对上,sprayB(有截距)=sprayB(无截距)-sprayA(无截距),依此类推。
Estimate Std. Error t value Pr(>|t|)
sprayA 14.500 1.132 12.807 1.471e-19
版权声明:本文为u014596936原创文章,遵循 CC 4.0 BY-SA 版权协议,转载请附上原文出处链接和本声明。