We make the following assumptions:
$$y^{(i)} = \theta^\top x^{(i)} + \epsilon^{(i)}$$
where $\epsilon^{(i)} \sim N(0, \sigma^2)$ is an error term that captures unmodeled effects and random noise.
That is, the density of $\epsilon^{(i)}$ is

$$P(\epsilon^{(i)}) = \dfrac{1}{\sqrt{2\pi}\sigma} \exp \left(-\dfrac{(\epsilon^{(i)})^2}{2\sigma^2} \right)$$
and the $\epsilon^{(i)}$ are independent and identically distributed (IID).
These assumptions imply:

$$P(y^{(i)} \mid x^{(i)} ; \theta) = \dfrac{1}{\sqrt{2\pi}\sigma} \exp \left(-\dfrac{(y^{(i)}-\theta^\top x^{(i)})^2}{2\sigma^2} \right)$$
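The conditional density can be evaluated directly as the noise density at the residual $y^{(i)} - \theta^\top x^{(i)}$. The following is a minimal sketch using only the Python standard library; the function names and the values of `theta`, `x`, and `sigma` are hypothetical choices for illustration and do not come from the original notes.

```python
import math

def gaussian_pdf(eps, sigma):
    """Density of N(0, sigma^2) evaluated at eps."""
    return math.exp(-eps ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def conditional_density(y, x, theta, sigma):
    """P(y | x; theta): the noise density evaluated at the residual y - theta^T x."""
    residual = y - sum(t * xj for t, xj in zip(theta, x))
    return gaussian_pdf(residual, sigma)

# Hypothetical values, chosen only for illustration.
theta = [1.0, 2.0]
x = [1.0, 3.0]      # x[0] = 1 plays the role of the intercept feature
sigma = 0.5

peak = conditional_density(7.0, x, theta, sigma)  # y equals theta^T x, so the residual is 0
off = conditional_density(8.0, x, theta, sigma)   # residual of 1, i.e. two standard deviations
print(peak, off)
```

The density is largest when $y^{(i)}$ equals $\theta^\top x^{(i)}$ and decays symmetrically as the residual grows.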
We estimate $\theta$ by maximum likelihood estimation (MLE).
Let $L(\theta)$ denote the likelihood of $\theta$:
$$\begin{aligned} L(\theta) & = P(\vec y \mid \vec x ; \theta) = \prod_{i=1}^m P(y^{(i)} \mid x^{(i)} ; \theta) \\ & = \prod_{i=1}^m \frac{1}{\sqrt{2\pi} \sigma} \exp \left(-\dfrac{(y^{(i)}-\theta^\top x^{(i)})^2}{2\sigma^2} \right) \end{aligned}$$
Let $l(\theta)$ denote the log-likelihood:
$$\begin{aligned} l(\theta) & = \log L(\theta) \\ & = \log \prod_{i=1}^m \frac{1}{\sqrt{2\pi} \sigma} \exp \left(-\dfrac{(y^{(i)}-\theta^\top x^{(i)})^2}{2\sigma^2} \right) \\ & = \sum_{i=1}^m \log \left[ \frac{1}{\sqrt{2\pi} \sigma} \exp \left(-\dfrac{(y^{(i)}-\theta^\top x^{(i)})^2}{2\sigma^2} \right) \right] \\ & = m \log \frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^m (y^{(i)}-\theta^\top x^{(i)})^2 \end{aligned}$$
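As a numerical sanity check on this derivation, the sketch below computes the likelihood as a product of Gaussian densities and compares its logarithm with the closed form $m \log \frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^m (y^{(i)}-\theta^\top x^{(i)})^2$. The data set, `theta`, and `sigma` are made-up values for illustration only.

```python
import math

def predict(theta, x):
    """theta^T x for plain Python lists."""
    return sum(t * xj for t, xj in zip(theta, x))

def likelihood(theta, xs, ys, sigma):
    """L(theta): product of the per-example Gaussian densities."""
    prod = 1.0
    for x, y in zip(xs, ys):
        r = y - predict(theta, x)
        prod *= math.exp(-r ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
    return prod

def log_likelihood(theta, xs, ys, sigma):
    """l(theta) in closed form: a constant term minus the scaled sum of squared errors."""
    m = len(ys)
    sse = sum((y - predict(theta, x)) ** 2 for x, y in zip(xs, ys))
    return m * math.log(1 / (math.sqrt(2 * math.pi) * sigma)) - sse / (2 * sigma ** 2)

# Made-up data: each x is (1, x1), so theta = (intercept, slope).
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [1.1, 2.9, 5.2, 6.8]
theta = [1.0, 2.0]
sigma = 0.5

direct = math.log(likelihood(theta, xs, ys, sigma))
closed_form = log_likelihood(theta, xs, ys, sigma)
print(direct, closed_form)  # the two values should agree up to rounding error
```

In practice the closed form is preferable: for large $m$ the raw product underflows to zero in floating point, while the sum of logs stays well scaled.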
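Because $\log$ is monotonically increasing and the first term of $l(\theta)$ does not depend on $\theta$, making $L(\theta)$ as large as possible is equivalent to making $\sum_{i=1}^m (y^{(i)}-\theta^\top x^{(i)})^2$ as small as possible. This is exactly the least-squares objective, and the minimizing $\theta$ does not depend on the value of $\sigma$.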
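The equivalence between maximizing the likelihood and minimizing the sum of squared errors can be illustrated numerically. The sketch below fits a line with the standard closed-form least-squares formulas for one feature plus an intercept (these formulas are brought in for illustration and are not derived in these notes), then checks that the resulting $\theta$ has a higher log-likelihood than nearby perturbed values for two different choices of $\sigma$. The data and helper names are hypothetical.

```python
import math

def sse(theta, xs, ys):
    """Sum of squared errors for theta = (intercept, slope) and x = (1, x1)."""
    return sum((y - (theta[0] * x[0] + theta[1] * x[1])) ** 2 for x, y in zip(xs, ys))

def log_likelihood(theta, xs, ys, sigma):
    """l(theta) under the Gaussian noise model."""
    m = len(ys)
    return m * math.log(1 / (math.sqrt(2 * math.pi) * sigma)) - sse(theta, xs, ys) / (2 * sigma ** 2)

def least_squares_line(xs, ys):
    """Closed-form least-squares fit for a single feature with intercept."""
    m = len(ys)
    x1 = [x[1] for x in xs]
    mean_x = sum(x1) / m
    mean_y = sum(ys) / m
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x1, ys)) / sum((a - mean_x) ** 2 for a in x1)
    return [mean_y - slope * mean_x, slope]

# Made-up data for illustration.
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [1.1, 2.9, 5.2, 6.8]

theta_hat = least_squares_line(xs, ys)

# Compare the least-squares estimate against perturbed parameters for two noise levels.
perturbations = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1), (0.1, -0.1)]
is_maximum = all(
    log_likelihood(theta_hat, xs, ys, sigma)
    > log_likelihood([theta_hat[0] + d0, theta_hat[1] + d1], xs, ys, sigma)
    for sigma in (0.5, 2.0)
    for d0, d1 in perturbations
)
print(theta_hat, is_maximum)
```

Changing `sigma` rescales the log-likelihood but leaves its maximizer unchanged, which is why the least-squares fit can be computed without knowing the noise variance.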