Machine Learning (1): Regression Algorithms





1. What is a Regression Algorithm?

  • A regression algorithm is a supervised learning algorithm.
  • Regression is one of the most commonly used machine learning methods. It models the relationship between "explanatory" variables (the features X) and an observed value (the target Y). From a machine learning perspective, it builds a model (a function) that maps attributes (X) to a label (Y); during learning the algorithm searches for a function $h:R^d \rightarrow R$ that fits this relationship as well as possible.
  • The final output of a regression model is a continuous value, and the input (the attribute values) is a d-dimensional attribute/numeric vector.



2. Linear Regression, Maximum Likelihood Estimation, and Least Squares



Linear regression




$$y^{(i)}=\theta^Tx^{(i)}+\epsilon^{(i)}$$

  • The errors $\epsilon^{(i)}\ (1\le i \le n)$ are independent and identically distributed, following a Gaussian (normal) distribution with mean 0 and some fixed variance $\delta^2$.
    • Reason: the central limit theorem.
  • In practice, many random phenomena can be viewed as the combined effect of a large number of independent factors, and such quantities tend to follow a normal distribution.



Likelihood function




$$y^{(i)}=\theta^Tx^{(i)}+\epsilon^{(i)}$$

$$p(\epsilon^{(i)})=\frac{1}{\delta \sqrt{2\pi}}e^{-\frac{(\epsilon^{(i)})^2}{2\delta^2}}$$

$$p(y^{(i)}|x^{(i)};\theta)=\frac{1}{\delta \sqrt{2\pi}}\exp\left(-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\delta^2}\right)$$

$$L(\theta)=\prod\limits_{i=1}^m p(y^{(i)}|x^{(i)};\theta)=\prod\limits_{i=1}^m\frac{1}{\delta \sqrt{2\pi}}\exp\left(-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\delta^2}\right)$$
We want $L(\theta)$ to be as large as possible.

Taking the logarithm:





$$\begin{aligned} \mathcal{l}(\theta)=&\log L(\theta)\\=&\log\prod_{i=1}^m\frac{1}{\delta \sqrt{2\pi}}\exp\left(-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\delta^2}\right)\\=&m\log\frac{1}{\delta \sqrt{2\pi}}-\frac{1}{\delta^2}\cdot\frac{1}{2}\sum_{i=1}^m(y^{(i)}-\theta^Tx^{(i)})^2 \end{aligned}$$

Maximizing $\mathcal{l}(\theta)$ is therefore equivalent to minimizing the squared-error term, which gives the least-squares loss:

$$loss(y_j,\hat{y_j})=J(\theta)=\frac{1}{2}\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2$$

Solving for $\theta$





$$\begin{aligned} J(\theta)=&\frac{1}{2}\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2\\=&\frac{1}{2}(X\theta-Y)^T(X\theta-Y) \rightarrow \min_{\theta}J(\theta) \end{aligned}$$

$$\begin{aligned} \nabla_{\theta} J(\theta)=&\nabla_{\theta}\frac{1}{2}(X\theta-Y)^T(X\theta-Y) \\=& \nabla_{\theta}\frac{1}{2}\left((\theta^TX^T-Y^T)(X\theta-Y)\right) \\=&\nabla_{\theta}\left(\frac{1}{2}(\theta^TX^TX\theta - \theta^TX^TY - Y^TX\theta+Y^TY)\right)\\ =& \frac{1}{2}\left(2X^TX\theta-X^TY-(Y^TX)^T\right) \\=& X^TX\theta-X^TY \end{aligned}$$

Setting this gradient to zero gives

$$\theta=(X^TX)^{-1}X^TY$$

Optimal parameters via least squares

  • Closed-form solution:
    $$\theta=(X^TX)^{-1}X^TY$$
  • Least squares requires the matrix $X^TX$ to be invertible. To guard against non-invertibility (and against overfitting), an extra term can be added so that the resulting matrix is always invertible:
    $$\theta=(X^TX+\lambda I)^{-1}X^Ty$$
  • The main difficulty of solving least squares directly: computing the matrix inverse (see the sketch below).
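Below is a minimal NumPy sketch of the two closed-form solutions above, on synthetic data (the data, variable names, and the value of $\lambda$ are illustrative assumptions, not from the original post). In practice `np.linalg.solve` or `np.linalg.lstsq` is preferred over forming the inverse explicitly, which relates to the difficulty noted above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2*x1 - 3*x2 + 5 + noise
m, d = 100, 2
X = rng.normal(size=(m, d))
X = np.hstack([X, np.ones((m, 1))])          # append a bias column
true_theta = np.array([2.0, -3.0, 5.0])
Y = X @ true_theta + rng.normal(scale=0.1, size=m)

# Ordinary least squares: theta = (X^T X)^{-1} X^T Y
theta_ols = np.linalg.inv(X.T @ X) @ X.T @ Y

# Regularized (ridge-style) solution: theta = (X^T X + lambda*I)^{-1} X^T Y
lam = 0.1
theta_ridge = np.linalg.inv(X.T @ X + lam * np.eye(X.shape[1])) @ X.T @ Y

print(theta_ols)
print(theta_ridge)
```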



3. Objective Functions (loss/cost functions)

  • 0-1 loss:
    $$J(\theta)=\left\{ \begin{aligned}1,\quad Y\neq f(X)\\ 0,\quad Y=f(X) \end{aligned} \right.$$
  • Perceptron loss:
    $$J(\theta)=\left\{ \begin{aligned}1,\quad |Y-f(X)|>t\\ 0,\quad |Y-f(X)|\leq t \end{aligned} \right.$$
  • Squared (sum-of-squares) loss:
    $$J(\theta)=\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2$$
  • Absolute-value loss:
    $$J(\theta)=\sum_{i=1}^m|h_{\theta}(x^{(i)})-y^{(i)}|$$
  • Log loss (negative log-likelihood):
    $$J(\theta)=-\sum_{i=1}^m\log P(y^{(i)}|x^{(i)};\theta)$$

A small numeric illustration of the squared and absolute-value losses follows.
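A minimal sketch with made-up predictions (the numbers are illustrative only); the 0-1 and perceptron losses apply the same element-wise idea with an indicator instead of a power:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.8, 3.5, 3.9])

squared_loss = np.sum((y_pred - y_true) ** 2)     # sum-of-squares loss
absolute_loss = np.sum(np.abs(y_pred - y_true))   # absolute-value loss

print(squared_loss, absolute_loss)
```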

4. Overfitting in Linear Regression

  • Objective function:
    $$J(\theta)=\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2$$
  • To prevent overfitting — that is, to keep the values of $\theta$ from becoming too large or too small over the sample space — a squared penalty can be added on top of the objective:
    $$J(\theta)=\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2+\lambda\sum_{j=1}^n\theta_j^2$$
  • Regularization term (norm): $\lambda\sum_{j=1}^n\theta_j^2$
    • L2-norm:
      $$J(\theta)=\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2+\lambda\sum_{j=1}^n\theta_j^2\quad\lambda>0$$
    • L1-norm:
      $$J(\theta)=\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2+\lambda\sum_{j=1}^n|\theta_j|\quad\lambda>0$$

Comparison of Ridge (L2-norm) and LASSO (L1-norm)

  • With the L2-norm, the parameters in all dimensions are shrunk within a ball, so no parameter is driven exactly to 0 and the solution is never sparse. In practice the feature dimensions contain noise and redundancy; a sparse solution can pick out the useful dimensions and drop the redundant ones, improving the accuracy and robustness of the regression (reducing overfitting). It is the L1-norm that achieves this sparsity in the final solution.
  • Ridge models offer higher accuracy, robustness, and stability; LASSO solves faster and is mainly used for feature selection.
  • If both stability and solving speed matter, use Elastic Net.



Elastic Net

  • A linear regression model that uses both L1 and L2 regularization at the same time is called the Elastic Net algorithm.




  • $$J(\theta)=\frac{1}{2}\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2+\lambda\left(p\sum_{j=1}^n|\theta_j|+(1-p)\sum_{j=1}^n\theta_j^2\right)\qquad \left\{\begin{aligned}&\lambda > 0\\&p\in [0,1]\end{aligned}\right.$$
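For reference (not from the original post), scikit-learn provides ready-made implementations of the three regularized models. A minimal usage sketch, where `alpha` plays the role of $\lambda$ and `l1_ratio` the role of $p$ in the Elastic Net objective; the data and hyperparameter values are arbitrary assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

models = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "elastic_net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}
for name, model in models.items():
    model.fit(X, y)
    # LASSO and Elastic Net tend to zero out the irrelevant coefficients
    print(name, np.round(model.coef_, 2))
```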
5. Evaluating Model Performance

  • MSE: mean squared error; the closer to 0, the better the model fits the training data.
  • RMSE: the square root of MSE; interpreted the same way as MSE.
  • $R^2$: ranges over $(-\infty, 1]$; larger values mean a better fit to the training data, and 1 is the optimum. It can be negative when the model predicts essentially random values; if the prediction is always the sample mean, $R^2$ is 0.
  • TSS: Total Sum of Squares, which measures the variability among the samples themselves (m times the biased sample variance).
  • RSS: Residual Sum of Squares, which measures the difference between predictions and sample values (m times the MSE).





$$\begin{aligned} &MSE=\frac{1}{m}\sum_{i=1}^m(y_i-\hat{y_i})^2\\ &RMSE=\sqrt{MSE}=\sqrt{\frac{1}{m}\sum_{i=1}^m(y_i-\hat{y_i})^2}\\ &R^2=1-\frac{RSS}{TSS}=1-\frac{\sum_{i=1}^m(y_i-\hat{y_i})^2}{\sum_{i=1}^m(y_i-\bar{y})^2} \end{aligned}$$
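These three metrics can be computed directly from the predictions; a minimal NumPy sketch (function and array names are illustrative):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE and R^2 as defined above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    rmse = np.sqrt(mse)
    rss = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - rss / tss
    return mse, rmse, r2

print(regression_metrics([3.0, 5.0, 7.0], [2.8, 5.3, 6.9]))
```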

6. Gradient Descent

  • Objective: solve for $\theta$ in
    $$J(\theta)=\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2$$
  • Initialize $\theta$ (randomly; it can also be initialized to 0).
  • Iterate along the negative gradient direction, so that the updated $\theta$ makes $J(\theta)$ smaller:
    $$\theta=\theta-\alpha\cdot \frac{\partial J(\theta)}{\partial \theta}$$
    where $\alpha$ is the learning rate (step size).



Gradient direction





$$\begin{aligned} \frac{\partial}{\partial \theta_j}J(\theta)=&\frac{\partial}{\partial \theta_j}\frac{1}{2}(h_{\theta}(x)-y)^2\\ =&2\cdot \frac{1}{2}(h_{\theta}(x)-y)\cdot \frac{\partial}{\partial \theta_j}(h_{\theta}(x)-y)=(h_{\theta}(x)-y)\frac{\partial}{\partial \theta_j}\left(\sum_{i=1}^n\theta_ix_i-y\right)\\ =&(h_{\theta}(x)-y)x_j \end{aligned}$$

$$J(\theta)=\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2$$

Batch Gradient Descent (BGD)





For a single sample:
$$\frac{\partial}{\partial \theta_j}J(\theta)=(h_{\theta}(x)-y)x_j$$

Summing over all m samples:
$$\frac{\partial J(\theta)}{\partial \theta_j}=\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)}$$

so the BGD update for each parameter is
$$\theta_j=\theta_j+\alpha\sum_{i=1}^m(y^{(i)}-h_{\theta}(x^{(i)}))x_j^{(i)}$$

Stochastic Gradient Descent (SGD)





$$\frac{\partial}{\partial \theta_j}J(\theta)=(h_{\theta}(x)-y)x_j$$

for i = 1 to m, {
$$\theta_j=\theta_j+\alpha\,(y^{(i)}-h_{\theta}(x^{(i)}))\,x_j^{(i)}$$
}
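A minimal NumPy sketch of both update rules for linear regression on synthetic data (the dataset, learning rates, and iteration counts are arbitrary assumptions; the BGD gradient is averaged over the batch only to keep the step size independent of m):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 200, 3
X = rng.normal(size=(m, d))
theta_true = np.array([1.5, -2.0, 0.5])
y = X @ theta_true + rng.normal(scale=0.1, size=m)

def bgd(X, y, alpha=0.1, epochs=500):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (X @ theta - y) / len(y)   # full-batch (averaged) gradient
        theta -= alpha * grad
    return theta

def sgd(X, y, alpha=0.01, epochs=50):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):       # one sample at a time
            err = X[i] @ theta - y[i]
            theta -= alpha * err * X[i]
    return theta

print("BGD:", np.round(bgd(X, y), 3))
print("SGD:", np.round(sgd(X, y), 3))
```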



Comparing BGD and SGD

  • SGD is faster than BGD (fewer iterations over the full data are needed).
  • In some cases (when there are multiple local optima, i.e. $J(\theta)$ is not quadratic), SGD may jump out of small local optima, so it is not necessarily worse than BGD.
  • BGD is guaranteed to reach a local optimum (and, for a linear regression model, the global optimum); because of its randomness, SGD may end up with a worse result than BGD.
  • SGD is usually the preferred choice.



Tuning Gradient Descent

  • Because gradient descent uses the negative gradient as the direction of change, the final solution may only be a local optimum, so some tuning is usually needed when applying it:
    • **Choice of learning rate:** if the learning rate is too large, each update changes the parameters a lot and may jump over the optimum; if it is too small, iteration becomes very slow and may take a long time to finish.
    • **Choice of initial values:** different initial values can lead to different minima, since gradient descent finds local optima. In practice, run the algorithm several times from different initial values and return the result with the smallest loss.
    • **Standardization:** features with different value ranges can make the iteration speed differ across parameters; to reduce this effect, standardize the features (a small sketch follows this list).



Linear Regression Summary

  • Models: linear regression (Linear), ridge regression (Ridge), LASSO regression, Elastic Net.
  • Regularization: L1-norm, L2-norm.
  • Loss / objective function:
    $$J(\theta)=\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2 \rightarrow \min_{\theta}J(\theta)$$
  • Ways to solve for $\theta$: least squares (direct computation, with the squared loss as the objective) and gradient descent (BGD / SGD / MBGD).



Additional Topics



Locally Weighted Regression: Loss Function

  • Ordinary linear regression loss:
    $$J(\theta)=\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2$$
  • Locally weighted regression loss:
    $$J(\theta)=\sum_{i=1}^m w^{(i)}(h_{\theta}(x^{(i)})-y^{(i)})^2$$

Locally Weighted Regression: Setting the Weights




  • $w^{(i)}$ is a weight assigned to each point in the dataset according to its distance from the point to be predicted: the farther a point is from the query point, the smaller its weight, and vice versa. A common choice is
    $$w^{(i)}=\exp\left(-\frac{(x^{(i)}-\bar{x})^2}{2k^2}\right)$$
  • This is called an exponential decay function; k is the bandwidth parameter and controls how quickly the weight falls off with distance.
  • This idea is mainly about similarity between samples; it comes up again with kernel functions in SVMs.

A short NumPy sketch of a locally weighted prediction follows.
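A minimal sketch built from the two formulas above: for each query point, compute the weights and then solve the weighted version of the least-squares problem (the weighted normal equation is not derived in the post but follows the same steps; the bandwidth value, data, and names are illustrative assumptions):

```python
import numpy as np

def lwr_predict(x_query, X, y, k=0.5):
    """Predict y at x_query with locally weighted linear regression."""
    # weights fall off with squared distance from the query point
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * k ** 2))
    W = np.diag(w)
    Xb = np.hstack([X, np.ones((len(X), 1))])       # add bias column
    xq = np.append(x_query, 1.0)
    # weighted normal equation: theta = (X^T W X)^{-1} X^T W y
    theta = np.linalg.pinv(Xb.T @ W @ Xb) @ Xb.T @ W @ y
    return xq @ theta

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=100)
print(lwr_predict(np.array([1.0]), X, y))
```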



Logistic Regression

  • Logistic/sigmoid function:
    $$p=h_{\theta}(x)=g(\theta^Tx)=\frac{1}{1+e^{-\theta^Tx}}$$
    or, written in terms of $z$,
    $$g(z)=\frac{1}{1+e^{-z}}$$
    Its derivative is
    $$g'(z)=\left(\frac{1}{1+e^{-z}}\right)'=\frac{e^{-z}}{(1+e^{-z})^2}=\frac{1}{1+e^{-z}}\cdot\frac{e^{-z}}{1+e^{-z}}=\frac{1}{1+e^{-z}}\cdot \left(1-\frac{1}{1+e^{-z}}\right)=g(z)(1-g(z))$$

A tiny numerical check of this identity follows.
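A minimal sketch comparing a finite-difference derivative of the sigmoid with the closed form $g(z)(1-g(z))$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4.0, 4.0, 9)
h = 1e-5
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)   # central difference
closed_form = sigmoid(z) * (1 - sigmoid(z))
print(np.allclose(numeric, closed_form))                # True
```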


Logistic Regression and the Likelihood Function

  • Assumptions:
    $$P(y=1|x;\theta)=h_{\theta}(x)$$
    $$P(y=0|x;\theta)=1-h_{\theta}(x)$$
    which can be combined into a single expression covering both $y=1$ and $y=0$:
    $$P(y|x;\theta)=(h_{\theta}(x))^y(1-h_{\theta}(x))^{(1-y)}$$
  • Likelihood function:
    $$\begin{aligned}L(\theta)=&p(\vec{y}|X;\theta)=\prod\limits_{i=1}^m p(y^{(i)}|x^{(i)};\theta)\\=&\prod\limits_{i=1}^m(h_{\theta}(x^{(i)}))^{y^{(i)}}(1-h_{\theta}(x^{(i)}))^{1-y^{(i)}}\end{aligned}$$

  • Log-likelihood:
    $$\mathcal{l}(\theta)=\log L(\theta)=\sum_{i=1}^m\left(y^{(i)}\log h_{\theta}(x^{(i)})+(1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))\right)$$

Stochastic Gradient of the Maximum Likelihood Function





$$\begin{aligned} \frac{\partial \mathcal{l}(\theta)}{\partial \theta_j}=&\sum_{i=1}^m\left(\frac{y^{(i)}}{h_{\theta}(x^{(i)})}-\frac{1-y^{(i)}}{1-h_{\theta}(x^{(i)})}\right)\cdot \frac{\partial h_{\theta}(x^{(i)})}{\partial \theta_{j}}\\ =&\sum_{i=1}^m\left(\frac{y^{(i)}}{g(\theta^Tx^{(i)})}-\frac{1-y^{(i)}}{1-g(\theta^Tx^{(i)})}\right)\cdot \frac{\partial g(\theta^Tx^{(i)})}{\partial \theta_{j}}\\ =&\sum_{i=1}^m\left(\frac{y^{(i)}}{g(\theta^Tx^{(i)})}-\frac{1-y^{(i)}}{1-g(\theta^Tx^{(i)})}\right)\cdot g(\theta^Tx^{(i)})\left(1-g(\theta^Tx^{(i)})\right)\cdot \frac{\partial \theta^Tx^{(i)}}{\partial \theta_{j}}\\ =&\sum_{i=1}^m\left(y^{(i)}\left(1-g(\theta^Tx^{(i)})\right)-(1-y^{(i)})g(\theta^Tx^{(i)})\right)\cdot x_j^{(i)}=\sum_{i=1}^m\left(y^{(i)}-g(\theta^Tx^{(i)})\right)\cdot x_j^{(i)} \end{aligned}$$
Maximum Likelihood Estimation and the Logistic Regression Loss Function




$$L(\theta)=\prod \limits_{i=1}^m p(y^{(i)}|x^{(i)};{\theta})=\prod \limits_{i=1}^m p_i^{y^{(i)}}(1-p_i)^{(1-y^{(i)})}$$

where
$$p_{i}=h_{\theta}(x^{(i)})=\frac{1}{1+e^{-\theta^Tx^{(i)}}}$$

$$\mathcal{l}(\theta)=\log L(\theta)=\sum_{i=1}^m \ln\left[p_i^{y^{(i)}}(1-p_i)^{(1-y^{(i)})}\right]$$

The loss is the negative log-likelihood:
$$\begin{aligned}loss=&-\mathcal{l}(\theta)\\=&-\sum_{i=1}^m\left[y^{(i)}\ln(p_i)+(1-y^{(i)})\ln(1-p_i)\right]\\=&\sum_{i=1}^m\left[-y^{(i)}\ln(h_{\theta}(x^{(i)}))-(1-y^{(i)})\ln(1-h_{\theta}(x^{(i)}))\right]\end{aligned}$$

Logistic Regression Example
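The original post leaves this section empty; below is a minimal sketch of logistic regression trained with the gradient derived above, $\sum_i (y^{(i)}-g(\theta^Tx^{(i)}))\,x_j^{(i)}$, on synthetic data (all names, data, and hyperparameters are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic binary classification data
m = 500
X = rng.normal(size=(m, 2))
Xb = np.hstack([X, np.ones((m, 1))])                  # bias column
theta_true = np.array([2.0, -1.0, 0.5])
y = (sigmoid(Xb @ theta_true) > rng.uniform(size=m)).astype(float)

# Gradient ascent on the log-likelihood (equivalently, descent on the loss)
theta = np.zeros(3)
alpha = 0.1
for _ in range(2000):
    grad = Xb.T @ (y - sigmoid(Xb @ theta))           # sum_i (y_i - g(theta^T x_i)) x_i
    theta += alpha * grad / m                         # averaged for a stable step size
print("theta:", np.round(theta, 2))

# Predicted probabilities and training accuracy
p = sigmoid(Xb @ theta)
print("train accuracy:", np.mean((p >= 0.5) == y))
```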



Softmax Regression

  • Softmax regression is the generalization of logistic regression to K-class classification; the parameters for class k form a vector $\theta_k$, and together they form a matrix $\theta_{k*n}$.
  • The essence of the softmax function is to map an arbitrary K-dimensional real vector to another K-dimensional vector whose entries all lie in (0, 1).
  • The softmax regression probability function is:
    $$p(y=k|x;\theta)=\frac{e^{\theta^T_kx}}{\sum_{l=1}^Ke^{\theta^T_lx}},\quad k=1,2,\dots,K$$

A small sketch of this function follows.
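A minimal sketch of the softmax probability function above (the max-subtraction is a standard numerical-stability trick and does not change the result; the parameter matrix and input are made up):

```python
import numpy as np

def softmax(scores):
    """Map a K-dimensional real vector to probabilities in (0, 1) that sum to 1."""
    shifted = scores - np.max(scores)    # numerical stability; the value is unchanged
    exp = np.exp(shifted)
    return exp / exp.sum()

def softmax_probs(Theta, x):
    """p(y=k|x; Theta) for a K x n parameter matrix Theta."""
    return softmax(Theta @ x)

Theta = np.array([[1.0, -0.5],
                  [0.2,  0.3],
                  [-1.0, 0.8]])          # K=3 classes, n=2 features
x = np.array([0.5, 2.0])
probs = softmax_probs(Theta, x)
print(probs, probs.sum())
```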

Summary

  • Linear models are generally used for regression problems; Logistic and Softmax models are generally used for classification problems.
  • The main way to solve for $\theta$ is gradient descent, an important tool for optimizing parameters; SGD in particular suits online learning and can escape local minima.
  • Logistic/Softmax regression is among the most important methods for solving classification problems in practice.
  • Generalized linear models do not require the samples to follow a normal distribution, only a distribution from the exponential family (binomial, Bernoulli, etc.); their explanatory variables can be either continuous or discrete.




Regression Algorithm Example Code
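The original post leaves this section empty as well; below is a small end-to-end sketch tying the pieces together: standardize the features, fit by the normal equation and by batch gradient descent, then report the evaluation metrics (all data and settings are synthetic and illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# 1. Synthetic data
m = 300
X = rng.uniform(-5, 5, size=(m, 2))
y = 4.0 * X[:, 0] - 2.5 * X[:, 1] + 7.0 + rng.normal(scale=0.5, size=m)

# 2. Standardize features, add bias column
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
Xb = np.hstack([X_std, np.ones((m, 1))])

# 3. Closed-form least squares
theta_ls = np.linalg.pinv(Xb.T @ Xb) @ Xb.T @ y

# 4. Batch gradient descent on the squared loss
theta_gd = np.zeros(Xb.shape[1])
alpha = 0.1
for _ in range(1000):
    grad = Xb.T @ (Xb @ theta_gd - y) / m
    theta_gd -= alpha * grad

# 5. Evaluation: MSE, RMSE, R^2
def metrics(y_true, y_pred):
    mse = np.mean((y_true - y_pred) ** 2)
    rmse = np.sqrt(mse)
    r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return mse, rmse, r2

for name, theta in [("least squares", theta_ls), ("gradient descent", theta_gd)]:
    print(name, np.round(theta, 3), [round(v, 4) for v in metrics(y, Xb @ theta)])
```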



Copyright notice: this is an original article by VIAww, licensed under the CC 4.0 BY-SA agreement. Please include a link to the original source and this notice when reposting.