线性回归week3 – 小飞侠

多元回归的例子

Swiss fertility data

library(datasets); data(swiss); require(stats); require(graphics)
pairs(swiss, panel = panel.smooth, main = "Swiss data", col = 3 + (swiss$Catholic > 50))

`?swiss`

Description

Standardized fertility measure and socio-economic indicators for each of 47 French-speaking provinces of Switzerland at about 1888.

A data frame with 47 observations on 6 variables, each of which is in percent, i.e., in [0, 100].

[,1] Fertility Ig, ‘ common standardized fertility measure’
[,2] Agriculture % of males involved in agriculture as occupation
[,3] Examination % draftees receiving highest mark on army examination
[,4] Education % education beyond primary school for draftees.
[,5] Catholic % ‘catholic’ (as opposed to ‘protestant’).
[,6] Infant.Mortality live births who live less than 1 year.

All variables but ‘Fertility’ give proportions of the population.

Calling
`lm`

summary(lm(Fertility ~ . , data = swiss))

                 Estimate Std. Error t value  Pr(>|t|)
(Intercept)       66.9152   10.70604   6.250 1.906e-07
Agriculture       -0.1721    0.07030  -2.448 1.873e-02
Examination       -0.2580    0.25388  -1.016 3.155e-01
Education         -0.8709    0.18303  -4.758 2.431e-05
Catholic           0.1041    0.03526   2.953 5.190e-03
Infant.Mortality   1.0770    0.38172   2.822 7.336e-03

Example interpretation

Agriculture is expressed in percentages (0 – 100)
Estimate is -0.1721.
We estimate an expected 0.17 decrease in standardized fertility for every 1\% increase in percentage of males involved in agriculture in holding the remaining variables constant.
The t-test for $H_0: \beta_{Agri} = 0$ versus $H_a: \beta_{Agri} \neq 0$ is significant.
Interestingly, the unadjusted estimate is

summary(lm(Fertility ~ Agriculture, data = swiss))$coefficients

            Estimate Std. Error t value  Pr(>|t|)
(Intercept)  60.3044    4.25126  14.185 3.216e-18
Agriculture   0.1942    0.07671   2.532 1.492e-02

How can adjustment reverse the sign of an effect? Let’s try a simulation.

n <- 100; x2 <- 1 : n; x1 <- .01 * x2 + runif(n, -.1, .1); y = -x1 + x2 + rnorm(n, sd = .01)
summary(lm(y ~ x1))$coef

            Estimate Std. Error t value  Pr(>|t|)
(Intercept)    1.618      1.200   1.349 1.806e-01
x1            95.854      2.058  46.579 1.153e-68

summary(lm(y ~ x1 + x2))$coef

              Estimate Std. Error   t value   Pr(>|t|)
(Intercept)  0.0003683  0.0020141    0.1829  8.553e-01
x1          -1.0215256  0.0166372  -61.4001  1.922e-79
x2           1.0001909  0.0001681 5950.1818 1.369e-271

plot(x1, y, pch=21,col=”black”,bg=topo.colors(n)[x2], frame = FALSE, cex = 1.5)

Back to this data set

The sign reverses itself with the inclusion of Examination and Education, but of which are negatively correlated with Agriculture.
The percent of males in the province working in agriculture is negatively related to educational attainment (correlation of -0.6395) and Education and Examination (correlation of 0.6984) are obviously measuring similar things.
- Is the positive marginal an artifact for not having accounted for, say, Education level? (Education does have a stronger effect, by the way.)
At the minimum, anyone claiming that provinces that are more agricultural have higher fertility rates would immediately be open to criticism.

What if we include an unnecessary variable?

z adds no new linear information, since it’s a linear combination of variables already included. R just drops terms that are linear combinations of other terms.

z <- swiss$Agriculture + swiss$Education
lm(Fertility ~ . + z, data = swiss)


Call:
lm(formula = Fertility ~ . + z, data = swiss)

Coefficients:
     (Intercept)       Agriculture       Examination         Education          Catholic  
          66.915            -0.172            -0.258            -0.871             0.104  
Infant.Mortality                 z  
           1.077                NA

Dummy variables are smart

Consider the linear model $$ Y_i = \beta_0 + X_{i1} \beta_1 + \epsilon_{i} $$ where each $X_{i1}$ is binary so that it is a 1 if measurement $i$ is in a group and 0 otherwise. (Treated versus not in a clinical trial, for example.)
Then for people in the group $E[Y_i] = \beta_0 + \beta_1$
And for people not in the group $E[Y_i] = \beta_0$
The LS fits work out to be $\hat \beta_0 + \hat \beta_1$ is the mean for those in the group and $\hat \beta_0$ is the mean for those not in the group.
$\beta_1$ is interpretted as the increase or decrease in the mean comparing those in the group to those not.
Note including a binary variable that is 1 for those not in the group would be redundant. It would create three parameters to describe two means.

More than 2 levels

Consider a multilevel factor level. For didactic reasons, let’s say a three level factor (example, US political party affiliation: Republican, Democrat, Independent)
$Y_i = \beta_0 + X_{i1} \beta_1 + X_{i2} \beta_2 + \epsilon_i$.
$X_{i1}$ is 1 for Republicans and 0 otherwise.
$X_{i2}$ is 1 for Democrats and 0 otherwise.
If $i$ is Republican $E[Y_i] = \beta_0 +\beta_1$
If $i$ is Democrat $E[Y_i] = \beta_0 + \beta_2$.
If $i$ is Independent $E[Y_i] = \beta_0$.
$\beta_1$ compares Republicans to Independents.
$\beta_2$ compares Democrats to Independents.
$\beta_1 – \beta_2$ compares Republicans to Democrats.
(Choice of reference category changes the interpretation.)

Insect Sprays

Linear model fit, group A is the reference

之所以没用A,是因为如上A作为了参考。即成为了intercept，如系数sprayB为正表明比A大，反之则较小

summary(lm(count ~ spray, data = InsectSprays))$coef

            Estimate Std. Error t value  Pr(>|t|)
(Intercept)  14.5000      1.132 12.8074 1.471e-19
sprayB        0.8333      1.601  0.5205 6.045e-01
sprayC      -12.4167      1.601 -7.7550 7.267e-11
sprayD       -9.5833      1.601 -5.9854 9.817e-08
sprayE      -11.0000      1.601 -6.8702 2.754e-09
sprayF        2.1667      1.601  1.3532 1.806e-01

Hard coding the dummy variables

mupliply 是为了将因子型转为数值型

summary(lm(count ~ 
             I(1 * (spray == 'B')) + I(1 * (spray == 'C')) + 
             I(1 * (spray == 'D')) + I(1 * (spray == 'E')) +
             I(1 * (spray == 'F'))
           , data = InsectSprays))$coef

                      Estimate Std. Error t value  Pr(>|t|)
(Intercept)            14.5000      1.132 12.8074 1.471e-19
I(1 * (spray == "B"))   0.8333      1.601  0.5205 6.045e-01
I(1 * (spray == "C")) -12.4167      1.601 -7.7550 7.267e-11
I(1 * (spray == "D"))  -9.5833      1.601 -5.9854 9.817e-08
I(1 * (spray == "E")) -11.0000      1.601 -6.8702 2.754e-09
I(1 * (spray == "F"))   2.1667      1.601  1.3532 1.806e-01

What if we include all 6?

因为x6=1-x1-x2-x3-x4-x5。故而当前面五个系数存在时，第六个系数则为NA

lm(count ~ 
   I(1 * (spray == 'B')) + I(1 * (spray == 'C')) +  
   I(1 * (spray == 'D')) + I(1 * (spray == 'E')) +
   I(1 * (spray == 'F')) + I(1 * (spray == 'A')), data = InsectSprays)


Call:
lm(formula = count ~ I(1 * (spray == "B")) + I(1 * (spray == 
    "C")) + I(1 * (spray == "D")) + I(1 * (spray == "E")) + I(1 * 
    (spray == "F")) + I(1 * (spray == "A")), data = InsectSprays)

Coefficients:
          (Intercept)  I(1 * (spray == "B"))  I(1 * (spray == "C"))  I(1 * (spray == "D"))  
               14.500                  0.833                -12.417                 -9.583  
I(1 * (spray == "E"))  I(1 * (spray == "F"))  I(1 * (spray == "A"))  
              -11.000                  2.167                     NA

What if we omit the intercept?

可见当省略截距时，sprayA的系数与未省截距时同，而其他各系数则并不同，可见此时系数等于各spray的mean,而且刚好能与前面对上，sprayB(有截距)=sprayB(无截距)-sprayA(无截距),依此类推。

summary(lm(count ~ spray - 1, data = InsectSprays))$coef

       Estimate Std. Error t value  Pr(>|t|)
sprayA   14.500      1.132  12.807 1.471e-19

原文链接：https://blog.csdn.net/u014596936/article/details/38756995