Machine Learning (1): Regression Algorithms
1. What is a Regression Algorithm
- A regression algorithm is a supervised learning algorithm.
- Regression is one of the most commonly used machine learning techniques. It models the relationship between the "explanatory" variables (independent variables X) and the observed values (dependent variable Y). From a machine learning perspective, it builds a model (function) that maps the attributes (X) to the label (Y); during training, the algorithm tries to find a function $h:R^d \rightarrow R$ that fits the relationship between the variables as well as possible.
- The final output of a regression model is a continuous value; the input (the attribute values) is a d-dimensional attribute/numeric vector.
2. Linear Regression, Maximum Likelihood Estimation, and Least Squares
Linear Regression

$$y^{(i)}=\theta^Tx^{(i)}+\epsilon^{(i)}$$

- The error terms $\epsilon^{(i)}\ (1\le i \le n)$ are independent and identically distributed, following a Gaussian distribution with mean 0 and some fixed variance $\delta^2$.
- Reason: the Central Limit Theorem.
  - In practice, many random phenomena can be viewed as the combined effect of a large number of independent factors, and such combined effects tend to follow a normal distribution.
Likelihood Function

$$y^{(i)}=\theta^Tx^{(i)}+\epsilon^{(i)}$$

$$p(\epsilon^{(i)})=\frac{1}{\delta \sqrt{2\pi}}\exp\left(-\frac{(\epsilon^{(i)})^2}{2\delta^2}\right)$$

$$p(y^{(i)}|x^{(i)};\theta)=\frac{1}{\delta \sqrt{2\pi}}\exp\left(-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\delta^2}\right)$$

$$L(\theta)=\prod \limits_{i=1}^m p(y^{(i)}|x^{(i)};\theta)=\prod \limits_{i=1}^m\frac{1}{\delta \sqrt{2\pi}}\exp\left(-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\delta^2}\right)$$

We want $L(\theta)$ to be as large as possible.
Taking the Logarithm

$$\begin{aligned} \mathcal{l}(\theta)=&\log L(\theta)\\=&\log\prod_{i=1}^m\frac{1}{\delta \sqrt{2\pi}}\exp\left(-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\delta^2}\right)\\=&m\log\frac{1}{\delta \sqrt{2\pi}}-\frac{1}{\delta^2}\cdot\frac{1}{2}\sum_{i=1}^m(y^{(i)}-\theta^Tx^{(i)})^2 \end{aligned}$$

Maximizing $\mathcal{l}(\theta)$ is therefore equivalent to minimizing the squared-error term, which gives the loss function

$$loss(y_j,\hat{y_j})=J(\theta)=\frac{1}{2}\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2$$
Solving for $\theta$

$$\begin{aligned} J(\theta)=&\frac{1}{2}\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2\\=&\frac{1}{2}(X\theta-Y)^T(X\theta-Y) \rightarrow \min_{\theta}J(\theta) \end{aligned}$$

Setting the gradient to zero:

$$\begin{aligned} \nabla_{\theta} J(\theta)=&\nabla_{\theta}\frac{1}{2}(X\theta-Y)^T(X\theta-Y) \\=& \nabla_{\theta}\frac{1}{2}\left((\theta^TX^T-Y^T)(X\theta-Y)\right) \\=&\nabla_{\theta}\left(\frac{1}{2}(\theta^TX^TX\theta - \theta^TX^TY - Y^TX\theta+Y^TY)\right)\\ =& \frac{1}{2}(2X^TX\theta-X^TY-(Y^TX)^T) \\=& X^TX\theta-X^TY \end{aligned}$$

$$\theta=(X^TX)^{-1}X^TY$$
Optimal Parameter Solution with Least Squares
- Closed-form solution for the parameters:

$$\theta=(X^TX)^{-1}X^TY$$

- The least-squares solution requires the matrix $X^TX$ to be invertible. To guard against non-invertibility and against overfitting, an extra penalty term can be added so that the resulting matrix is always invertible:

$$\theta=(X^TX+\lambda I)^{-1}X^TY$$

- The main difficulty of solving least squares directly: computing the matrix inverse.
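As a minimal sketch of the two closed-form solutions above (NumPy and the synthetic data are assumptions for illustration; `lam` plays the role of $\lambda$):

```python
import numpy as np

# Synthetic data: y = 2*x1 - 3*x2 + 5 + noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2 * X[:, 0] - 3 * X[:, 1] + 5 + rng.normal(scale=0.1, size=100)

# Append a column of ones so the intercept is part of theta
Xb = np.hstack([X, np.ones((X.shape[0], 1))])

# Ordinary least squares: theta = (X^T X)^(-1) X^T y
theta_ols = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y

# Regularized solution: theta = (X^T X + lambda*I)^(-1) X^T y
lam = 0.1
theta_ridge = np.linalg.inv(Xb.T @ Xb + lam * np.eye(Xb.shape[1])) @ Xb.T @ y

print(theta_ols, theta_ridge)
```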
3. Objective Functions (loss/cost functions)
- 0-1 loss function:

$$J(\theta)=\left\{ \begin{aligned}1,\quad Y\neq f(X)\\ 0,\quad Y=f(X) \end{aligned} \right.$$

- Perceptron loss function:

$$J(\theta)=\left\{ \begin{aligned}1,\quad |Y-f(X)|>t\\ 0,\quad |Y-f(X)|\leq t \end{aligned} \right.$$

- Squared-error loss function:

$$J(\theta)=\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2$$

- Absolute-value loss function:

$$J(\theta)=\sum_{i=1}^m|h_{\theta}(x^{(i)})-y^{(i)}|$$
- Log loss function:

$$J(\theta)=-\sum_{i=1}^m y^{(i)}\log\left(h_{\theta}(x^{(i)})\right)$$
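As an illustrative sketch (the tolerance `t` and the helper names are assumptions, not part of the original), several of these loss functions can be written with NumPy as follows:

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    # Counts 1 where the prediction differs from the label, 0 otherwise
    return np.sum(y_true != y_pred)

def perceptron_loss(y_true, y_pred, t=0.5):
    # Counts 1 where the absolute error exceeds the tolerance t
    return np.sum(np.abs(y_true - y_pred) > t)

def squared_loss(y_true, y_pred):
    return np.sum((y_pred - y_true) ** 2)

def absolute_loss(y_true, y_pred):
    return np.sum(np.abs(y_pred - y_true))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.4])
print(squared_loss(y_true, y_pred), absolute_loss(y_true, y_pred))
```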
4. Overfitting in Linear Regression
- Objective function:

$$J(\theta)=\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2$$

- To prevent overfitting, i.e., to keep the values of $\theta$ from becoming too large or too small over the sample space, a squared penalty can be added on top of the objective function:

$$J(\theta)=\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2+\lambda\sum_{j=1}^n\theta_j^2$$

- Regularization term (norm): $\lambda\sum_{j=1}^n\theta_j^2$
- L2-norm:

$$J(\theta)=\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2+\lambda\sum_{j=1}^n\theta_j^2\quad\lambda>0$$

- L1-norm:

$$J(\theta)=\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2+\lambda\sum_{j=1}^n|\theta_j|\quad\lambda>0$$

Comparison of Ridge (L2-norm) and LASSO (L1-norm)
- With the L2-norm, the parameters of all dimensions are shrunk within a ball, so no parameter can be driven exactly to 0 and no sparse solution is produced. In practice the data dimensions contain noise and redundancy; a sparse solution can pick out the useful dimensions and drop the redundant ones, improving the accuracy and robustness of the regression (reducing overfitting). The L1-norm can achieve this sparsity of the final solution.
- The Ridge model offers high accuracy, robustness, and stability; the LASSO model solves faster and is mainly used for feature selection.
- If both stability and solving speed matter, use Elastic Net.
Elastic Net
- A linear regression model that uses both L1 and L2 regularization is called the Elastic Net algorithm:

$$J(\theta)=\frac{1}{2}\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2+\lambda\left(p\sum_{j=1}^n|\theta_j|+(1-p)\sum_{j=1}^n\theta_j^2\right)$$

$$\left\{\begin{aligned}&\lambda > 0\\&p\in [0,1]\end{aligned}\right.$$
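As a hedged sketch of how these regularized models are commonly used in practice (scikit-learn and the toy data are assumptions, not something the text above specifies), Ridge, Lasso, and ElasticNet share the same fit/predict interface:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Toy data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

models = {
    "ridge": Ridge(alpha=1.0),                    # L2 penalty
    "lasso": Lasso(alpha=0.1),                    # L1 penalty, sparse coefficients
    "enet": ElasticNet(alpha=0.1, l1_ratio=0.5),  # l1_ratio plays the role of p
}
for name, model in models.items():
    model.fit(X, y)
    print(name, np.round(model.coef_, 2))
```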
5. Evaluating Model Performance
- MSE: the mean squared error; the closer to 0, the better the model fits the training data.
- RMSE: the square root of MSE; interpreted the same way as MSE.
- $R^2$: takes values in $(-\infty, 1]$; larger values mean the model fits the training data better, with 1 being optimal. It can be negative when the model's predictions are essentially random; if the prediction is always the sample mean, $R^2$ is 0.
- TSS: the Total Sum of Squares, which measures the variability among the samples; it equals m times the (biased) variance.
- RSS: the Residual Sum of Squares, which measures the discrepancy between the predictions and the sample values; it equals m times the MSE.

$$\begin{aligned} &MSE=\frac{1}{m}\sum_{i=1}^m(y_i-\hat{y_i})^2\\ &RMSE=\sqrt{MSE}=\sqrt{\frac{1}{m}\sum_{i=1}^m(y_i-\hat{y_i})^2}\\ &R^2=1-\frac{RSS}{TSS}=1-\frac{\sum_{i=1}^m(y_i-\hat{y_i})^2}{\sum_{i=1}^m(y_i-\bar{y})^2} \end{aligned}$$
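A minimal sketch of these three metrics with NumPy (the variable names and sample values are assumptions for illustration):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, and R^2 for a set of predictions."""
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)                # RSS / m
    rmse = np.sqrt(mse)
    rss = np.sum(residuals ** 2)                 # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1 - rss / tss
    return mse, rmse, r2

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
print(regression_metrics(y_true, y_pred))
```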
6. Gradient Descent
- Objective function for solving $\theta$:

$$J(\theta)=\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2$$

- Initialize $\theta$ (randomly, or simply to 0).
- Iterate along the negative gradient direction, so that the updated $\theta$ makes $J(\theta)$ smaller:

$$\theta=\theta-\alpha\cdot \frac{\partial J(\theta)}{\partial \theta}$$

$\alpha$: the learning rate (step size).
Gradient Direction

For a single sample:

$$\begin{aligned} \frac{\partial}{\partial \theta_j}J(\theta)=&\frac{\partial}{\partial \theta_j}\frac{1}{2}(h_{\theta}(x)-y)^2\\ =&2\cdot \frac{1}{2}(h_{\theta}(x)-y)\cdot \frac{\partial}{\partial \theta_j}(h_{\theta}(x)-y)=(h_{\theta}(x)-y)\frac{\partial}{\partial \theta_j}\left(\sum_{i=1}^n\theta_ix_i-y\right)\\ =&(h_{\theta}(x)-y)x_j \end{aligned}$$

For the full objective:

$$J(\theta)=\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2$$
Batch Gradient Descent (BGD)

$$\frac{\partial}{\partial \theta_j}J(\theta)=(h_{\theta}(x)-y)x_j$$

$$\frac{\partial J(\theta)}{\partial \theta_j}=\sum_{i=1}^m x_j^{(i)}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)=\sum_{i=1}^m\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_j^{(i)}$$

$$\theta_j=\theta_j+\alpha\sum_{i=1}^m\left(y^{(i)}-h_{\theta}(x^{(i)})\right)x_j^{(i)}$$
Stochastic Gradient Descent (SGD)

$$\frac{\partial}{\partial \theta_j}J(\theta)=(h_{\theta}(x)-y)x_j$$

for i = 1 to m, {

$$\theta_j=\theta_j+\alpha\left(y^{(i)}-h_{\theta}(x^{(i)})\right)x_j^{(i)}$$

}

Comparison of BGD and SGD
- SGD is faster than BGD (it requires fewer iterations).
- In some cases (multiple local optima exist globally / $J(\theta)$ is not quadratic), SGD may jump out of small local optima, so it is not necessarily worse than BGD.
- BGD is guaranteed to reach a local optimum (for the linear regression model, this is the global optimum); because of its randomness, SGD may end up with a worse final result than BGD.
- SGD is usually preferred.
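A minimal sketch of both update rules for linear regression (the data, learning rate, and epoch counts are illustrative assumptions; the batch gradient is averaged over the m samples here for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(100, 1)), np.ones((100, 1))])  # feature + bias column
y = 3 * X[:, 0] + 1 + rng.normal(scale=0.1, size=100)

def bgd(X, y, alpha=0.1, epochs=1000):
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(epochs):
        # Batch update: use the (averaged) gradient over all m samples at once
        theta += alpha * X.T @ (y - X @ theta) / m
    return theta

def sgd(X, y, alpha=0.01, epochs=50):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in range(X.shape[0]):
            # Stochastic update: one sample at a time
            theta += alpha * (y[i] - X[i] @ theta) * X[i]
    return theta

print(bgd(X, y), sgd(X, y))
```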
Gradient Descent in Practice
- Because gradient descent follows the negative gradient direction, it may converge to a local optimum, so a few tuning strategies are commonly applied:
- **Choice of learning rate:** if the learning rate is too large, each update changes the parameters a lot and may overshoot the optimum; if it is too small, convergence becomes very slow and training may take a long time to finish.
- **Choice of initial values:** different initial values may lead to different minima, since gradient descent only finds local optima. It is common to run the algorithm several times with different initial values and keep the result with the smallest loss.
- **Standardization:** because different features have different value ranges, the iteration speed along different parameters may differ. To reduce the influence of feature scales, the features can be standardized.
Summary of Linear Regression
- Models: linear regression (Linear), ridge regression (Ridge), LASSO regression, Elastic Net.
- Regularization: L1-norm, L2-norm.
- Loss/objective function:

$$J(\theta)=\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2 \rightarrow \min_{\theta}J(\theta)$$

- Ways to solve for $\theta$: least squares (direct computation, when the objective is the squared-error loss) and gradient descent (BGD/SGD/MBGD).
Supplementary Material

Locally Weighted Regression: Loss Function
- Loss function of ordinary linear regression:

$$J(\theta)=\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2$$

- Loss function of locally weighted regression:

$$J(\theta)=\sum_{i=1}^mw^{(i)}(h_{\theta}(x^{(i)})-y^{(i)})^2$$

Locally Weighted Regression: Setting the Weights
- $w^{(i)}$ is a weight assigned to each point in the dataset according to its distance from the point to be predicted: the farther a point is from the query point, the smaller its weight. A common choice is:

$$w^{(i)}=\exp\left(-\frac{(x^{(i)}-\bar{x})^2}{2k^2}\right)$$

- This is called an exponential decay function; the bandwidth parameter k controls how quickly the weight falls off with distance.
- This idea is mainly about measuring similarity between samples; it is revisited later in the context of SVMs (kernel functions).
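A minimal sketch of locally weighted regression under these definitions (the data and the bandwidth `k` are illustrative assumptions); each query point gets its own weighted least-squares fit:

```python
import numpy as np

def lwr_predict(x_query, X, y, k=0.5):
    """Predict y at x_query with locally weighted linear regression."""
    # Weights fall off exponentially with distance from the query point
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * k ** 2))
    W = np.diag(w)
    # Weighted normal equation: theta = (X^T W X)^(-1) X^T W y
    theta = np.linalg.pinv(X.T @ W @ X) @ X.T @ W @ y
    return x_query @ theta

rng = np.random.default_rng(0)
X = np.hstack([np.linspace(0, 3, 60).reshape(-1, 1), np.ones((60, 1))])
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=60)
print(lwr_predict(np.array([1.5, 1.0]), X, y))
```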
Logistic Regression
- The Logistic/sigmoid function:

$$p=h_{\theta}(x)=g(\theta^Tx)=\frac{1}{1+e^{-\theta^Tx}}$$

or, equivalently,

$$g(z)=\frac{1}{1+e^{-z}}$$
The corresponding decision rule is:

$$y=\left\{\begin{aligned}&1, && h_{\theta}(x)\ge 0.5\\&0, && h_{\theta}(x)< 0.5\end{aligned}\right.$$
$$g'(z)=\left(\frac{1}{1+e^{-z}}\right)'=\frac{e^{-z}}{(1+e^{-z})^2}=\frac{1}{1+e^{-z}}\cdot\frac{e^{-z}}{1+e^{-z}}=\frac{1}{1+e^{-z}}\cdot \left(1-\frac{1}{1+e^{-z}}\right)$$
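A small sketch of the sigmoid and its derivative, using the identity $g'(z)=g(z)(1-g(z))$ derived above (the sample inputs are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # g'(z) = g(z) * (1 - g(z))
    g = sigmoid(z)
    return g * (1 - g)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), sigmoid_grad(z))
```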
Logistic Regression and the Likelihood Function
- Assumptions:

$$P(y=1|x;\theta)=h_{\theta}(x)$$

$$P(y=0|x;\theta)=1-h_{\theta}(x)$$

$$P(y|x;\theta)=(h_{\theta}(x))^y(1-h_{\theta}(x))^{(1-y)}$$
(Here, for $y=1$ the probability $p(y|x)$ is $\theta$, and for $y=0$ it is $1-\theta$.)
- Likelihood function:

$$\begin{aligned}L(\theta)=&p(\vec{y}|X;\theta)=\prod\limits_{i=1}^mp(y^{(i)}|x^{(i)};\theta)\\=&\prod\limits_{i=1}^m(h_{\theta}(x^{(i)}))^{y^{(i)}}(1-h_{\theta}(x^{(i)}))^{1-y^{(i)}}\end{aligned}$$

- Log-likelihood function:

$$\mathcal{l}(\theta)=\log L(\theta)=\sum_{i=1}^m\left(y^{(i)}\log h_{\theta}(x^{(i)})+(1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))\right)$$
Stochastic Gradient of the (Maximum) Likelihood Function

$$\begin{aligned} \frac{\partial \mathcal{l}(\theta)}{\partial \theta_j}=&\sum_{i=1}^m\left(\frac{y^{(i)}}{h_{\theta}(x^{(i)})}-\frac{1-y^{(i)}}{1-h_{\theta}(x^{(i)})}\right)\cdot \frac{\partial h_{\theta}(x^{(i)})}{\partial \theta_{j}}\\ =&\sum_{i=1}^m\left(\frac{y^{(i)}}{g(\theta^Tx^{(i)})}-\frac{1-y^{(i)}}{1-g(\theta^Tx^{(i)})}\right)\cdot \frac{\partial g(\theta^Tx^{(i)})}{\partial \theta_{j}}\\ =&\sum_{i=1}^m\left(\frac{y^{(i)}}{g(\theta^Tx^{(i)})}-\frac{1-y^{(i)}}{1-g(\theta^Tx^{(i)})}\right)\cdot g(\theta^Tx^{(i)})\left(1-g(\theta^Tx^{(i)})\right)\cdot \frac{\partial \theta^Tx^{(i)}}{\partial \theta_{j}}\\ =&\sum_{i=1}^m\left(y^{(i)}(1-g(\theta^Tx^{(i)}))-(1-y^{(i)})g(\theta^Tx^{(i)})\right)\cdot x_j^{(i)}=\sum_{i=1}^m\left(y^{(i)}-g(\theta^Tx^{(i)})\right)\cdot x_j^{(i)} \end{aligned}$$
Maximum Likelihood Estimation and the Logistic Regression Loss Function

$$L(\theta)=\prod \limits_{i=1}^mp(y^{(i)}|x^{(i)};{\theta})=\prod \limits_{i=1}^mp_i^{y^{(i)}}(1-p_i)^{(1-y^{(i)})}$$

$$p_{i}=h_{\theta}(x^{(i)})=\frac{1}{1+e^{-\theta^Tx^{(i)}}}$$

$$\mathcal{l}(\theta)=\log L(\theta)=\sum_{i=1}^m\ln\left[p_i^{y^{(i)}}(1-p_i)^{(1-y^{(i)})}\right]$$

$$\begin{aligned}loss=&-\mathcal{l}(\theta)\\=&-\sum_{i=1}^m\left[y^{(i)}\ln(p_i)+(1-y^{(i)})\ln(1-p_i)\right]\\=&\sum_{i=1}^m\left[-y^{(i)}\ln(h_{\theta}(x^{(i)}))-(1-y^{(i)})\ln(1-h_{\theta}(x^{(i)}))\right]\end{aligned}$$
Logistic Regression Example
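As a hedged sketch of such an example (the data, learning rate, and iteration count are illustrative assumptions, not from the original), the following trains logistic regression by gradient descent on the loss derived above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary-classification data (illustrative only)
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(200, 2)), np.ones((200, 1))])  # bias column
y = (X[:, 0] + X[:, 1] > 0).astype(float)

theta = np.zeros(X.shape[1])
alpha, epochs = 0.1, 500
for _ in range(epochs):
    p = sigmoid(X @ theta)
    # Gradient of the (averaged) negative log-likelihood: X^T (p - y) / m
    theta -= alpha * X.T @ (p - y) / len(y)

pred = (sigmoid(X @ theta) >= 0.5).astype(float)
print("training accuracy:", (pred == y).mean())
```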
Softmax Regression
- Softmax regression is a generalization of logistic regression to K-class classification. The parameters of class k form a vector $\theta_k$, and together they form a two-dimensional parameter matrix $\theta_{k\times n}$.
- The softmax function essentially compresses (maps) a K-dimensional vector of arbitrary real values into another K-dimensional real vector whose elements all lie in (0,1).
- The softmax regression probability function is:

$$p(y=k|x;\theta)=\frac{e^{\theta^T_kx}}{\sum_{l=1}^Ke^{\theta^T_lx}},\quad k=1,2,\dots,K$$
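A minimal sketch of this probability function for a single sample (the parameter matrix `Theta` and the input vector are illustrative assumptions):

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability; probabilities sum to 1
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

# K = 3 classes, n = 4 features: one parameter vector theta_k per class
Theta = np.array([[0.2, -0.1, 0.4, 0.0],
                  [0.5,  0.3, -0.2, 0.1],
                  [-0.3, 0.2, 0.1, 0.4]])
x = np.array([1.0, 0.5, -1.0, 2.0])

probs = softmax(Theta @ x)   # p(y=k|x; theta) for k = 1..K
print(probs, probs.sum())
```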
Summary
- Linear models are generally used for regression problems; Logistic and Softmax models are generally used for classification problems.
- The main way to solve for $\theta$ is gradient descent, an important tool for optimizing parameters; SGD in particular is well suited to online learning and to escaping local minima.
- Logistic/Softmax regression is among the most important practical methods for classification problems.
- Generalized linear models do not require the samples to follow a normal distribution, only a distribution from the exponential family (binomial, Bernoulli, etc.); their explanatory variables may be continuous or discrete.
Example Code for Regression Algorithms
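As a hedged end-to-end sketch (scikit-learn and the synthetic dataset are assumptions), the following compares the models discussed above on the same data and reports the evaluation metrics from section 5:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 1.5 * X[:, 0] - 2.0 * X[:, 2] + 0.5 + rng.normal(scale=0.2, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "LASSO": Lasso(alpha=0.05),
    "ElasticNet": ElasticNet(alpha=0.05, l1_ratio=0.5),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"{name}: MSE={mean_squared_error(y_test, y_pred):.4f}, "
          f"R2={r2_score(y_test, y_pred):.4f}")
```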