Chapter 8 Ensemble Learning




8.1 Individuals and Ensembles


Ensemble learning accomplishes a learning task by constructing and combining multiple learners; it is sometimes also called a multi-classifier system.

Individual learners are usually produced from the training data by an existing learning algorithm.

An ensemble that contains only individual learners of the same type is homogeneous. The individual learners in a homogeneous ensemble are called base learners, and the corresponding learning algorithm is called the base learning algorithm.

An ensemble that contains individual learners of different types is heterogeneous. Its individual learners are generated by different learning algorithms and are then commonly called component learners.


Ensemble methods fall into two categories:

Sequential methods, in which strong dependencies exist between the individual learners, which must therefore be generated serially (the representative is Boosting).

Parallel methods, in which no strong dependencies exist between the individual learners, which can therefore be generated simultaneously (representatives are Bagging and Random Forest).



8.2 Boosting

Boosting is a family of algorithms that can boost weak learners into strong learners.


Working mechanism

First train a base learner from the initial training set; then adjust the distribution of the training samples according to that base learner's performance, so that the training samples misclassified by earlier base learners receive more attention later; then train the next base learner on the adjusted sample distribution. This is repeated until the number of base learners reaches a pre-specified value T, and finally the T base learners are combined with weights.

AdaBoost can be understood through the additive model, i.e., a linear combination of base learners

$$H(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$$

which is trained to minimize the exponential loss function

$$\ell_{\exp}(H \mid \mathcal{D}) = \mathbb{E}_{x\sim\mathcal{D}}\left[ e^{-f(x)H(x)} \right]$$






If $H(x)$ minimizes the exponential loss, consider the partial derivative of the loss with respect to $H(x)$:

$$\frac{\partial \ell_{\exp}(H \mid \mathcal{D})}{\partial H(x)} = -e^{-H(x)}\, P(f(x)=1 \mid x) + e^{H(x)}\, P(f(x)=-1 \mid x)$$

Setting this derivative to zero gives

$$H(x) = \frac{1}{2}\ln\frac{P(f(x)=1 \mid x)}{P(f(x)=-1 \mid x)}$$

Therefore,

$$\operatorname{sign}(H(x)) = \operatorname{sign}\!\left( \frac{1}{2}\ln\frac{P(f(x)=1 \mid x)}{P(f(x)=-1 \mid x)} \right) = \begin{cases} 1, & P(f(x)=1 \mid x) > P(f(x)=-1 \mid x) \\ -1, & P(f(x)=1 \mid x) < P(f(x)=-1 \mid x) \end{cases} = \underset{y \in \{-1,1\}}{\arg\max}\, P(f(x)=y \mid x)$$

which means $\operatorname{sign}(H(x))$ achieves the Bayes-optimal error rate, so the exponential loss is a consistent surrogate for the 0/1 loss.
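As a quick numerical sanity check on this derivation, the snippet below (a minimal sketch using numpy; the grid search over $H$ is purely illustrative) minimizes the pointwise expected exponential loss at a fixed $x$ with $P(f(x)=1 \mid x) = 0.8$ and recovers the half log-odds:

```python
import numpy as np

p = 0.8                                  # P(f(x) = 1 | x) at some fixed x
H = np.linspace(-3.0, 3.0, 100001)       # candidate values of H(x)

# Expected exponential loss at x: p*e^{-H} + (1-p)*e^{H}
loss = p * np.exp(-H) + (1 - p) * np.exp(H)

H_star = H[np.argmin(loss)]
print(H_star)                            # ~0.6931
print(0.5 * np.log(p / (1 - p)))         # 0.5 * ln(4) = 0.6931...
```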







In the AdaBoost algorithm, the first base learner $h_1$ is obtained by applying the base learning algorithm directly to the initial data distribution; thereafter $h_t$ and $\alpha_t$ are generated iteratively. Once the base classifier $h_t$ has been produced from the distribution $\mathcal{D}_t$, its weight $\alpha_t$ should make $\alpha_t h_t$ minimize the exponential loss function

$$\begin{aligned} \ell_{\exp}(\alpha_t h_t \mid \mathcal{D}_t) &= \mathbb{E}_{x\sim\mathcal{D}_t}\!\left[ e^{-f(x)\alpha_t h_t(x)} \right] \\ &= \mathbb{E}_{x\sim\mathcal{D}_t}\!\left[ e^{-\alpha_t}\,\mathbb{I}(f(x)=h_t(x)) + e^{\alpha_t}\,\mathbb{I}(f(x)\neq h_t(x)) \right] \\ &= e^{-\alpha_t}(1-\epsilon_t) + e^{\alpha_t}\epsilon_t \end{aligned}$$

where $\epsilon_t = P_{x\sim\mathcal{D}_t}(f(x) \neq h_t(x))$.
Consider the derivative of this exponential loss:

$$\frac{\partial \ell_{\exp}(\alpha_t h_t \mid \mathcal{D}_t)}{\partial \alpha_t} = -e^{-\alpha_t}(1-\epsilon_t) + e^{\alpha_t}\epsilon_t$$
Setting it equal to zero yields the classifier weight update formula:

$$\alpha_t = \frac{1}{2}\ln\!\left( \frac{1-\epsilon_t}{\epsilon_t} \right)$$
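A small numeric illustration of this update rule (plain Python; the error values are arbitrary): $\alpha_t$ is positive only when $\epsilon_t < 0.5$, and it grows as the base classifier becomes more accurate:

```python
import math

def adaboost_alpha(eps):
    """Classifier weight from the weighted error rate (needs 0 < eps < 0.5 for a positive weight)."""
    return 0.5 * math.log((1 - eps) / eps)

for eps in (0.45, 0.30, 0.10, 0.01):
    print(f"eps={eps:.2f}  alpha={adaboost_alpha(eps):.3f}")
# eps=0.45 -> 0.100, eps=0.30 -> 0.424, eps=0.10 -> 1.099, eps=0.01 -> 2.298
```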







An ideal $h_t$ would correct all the mistakes of $H_{t-1}$, i.e., it should minimize

$$\ell_{\exp}(H_{t-1} + h_t \mid \mathcal{D}) = \mathbb{E}_{x\sim\mathcal{D}}\!\left[ e^{-f(x)\left( H_{t-1}(x) + h_t(x) \right)} \right] = \mathbb{E}_{x\sim\mathcal{D}}\!\left[ e^{-f(x)H_{t-1}(x)}\, e^{-f(x)h_t(x)} \right]$$
Since $f^2(x) = h_t^2(x) = 1$, a Taylor expansion of $e^{-f(x)h_t(x)}$ gives the approximation

$$\ell_{\exp}(H_{t-1}+h_t \mid \mathcal{D}) \simeq \mathbb{E}_{x\sim\mathcal{D}}\!\left[ e^{-f(x)H_{t-1}(x)}\!\left( 1 - f(x)h_t(x) + \frac{f^2(x)h_t^2(x)}{2} \right) \right] = \mathbb{E}_{x\sim\mathcal{D}}\!\left[ e^{-f(x)H_{t-1}(x)}\!\left( 1 - f(x)h_t(x) + \frac{1}{2} \right) \right]$$
The ideal learner is then

$$\begin{aligned} h_t(x) &= \underset{h}{\arg\min}\; \ell_{\exp}(H_{t-1}+h \mid \mathcal{D}) \\ &= \underset{h}{\arg\min}\; \mathbb{E}_{x\sim\mathcal{D}}\!\left[ e^{-f(x)H_{t-1}(x)}\!\left( 1 - f(x)h(x) + \frac{1}{2} \right) \right] \\ &= \underset{h}{\arg\max}\; \mathbb{E}_{x\sim\mathcal{D}}\!\left[ e^{-f(x)H_{t-1}(x)}\, f(x)h(x) \right] \end{aligned}$$

Noting that $\mathbb{E}_{x\sim\mathcal{D}}\!\left[ e^{-f(x)H_{t-1}(x)} \right]$ is a constant independent of $h$, let $\mathcal{D}_t$ denote the distribution

$$\mathcal{D}_t(x) = \frac{\mathcal{D}(x)\, e^{-f(x)H_{t-1}(x)}}{\mathbb{E}_{x\sim\mathcal{D}}\left[ e^{-f(x)H_{t-1}(x)} \right]}$$

Then the ideal base learner can be rewritten as

$$h_t(x) = \underset{h}{\arg\max}\; \mathbb{E}_{x\sim\mathcal{D}}\!\left[ \frac{e^{-f(x)H_{t-1}(x)}}{\mathbb{E}_{x\sim\mathcal{D}}\left[ e^{-f(x)H_{t-1}(x)} \right]}\, f(x)h(x) \right] = \underset{h}{\arg\max}\; \mathbb{E}_{x\sim\mathcal{D}_t}\!\left[ f(x)h(x) \right]$$

f

(

x

)

,

h

(

x

)

{

1

,

1

}

f\left( x \right),h\left( x \right) \in \left\{ – 1,1 \right\}






f





(


x


)





,




h





(


x


)















{






1


,




1


}






,有





f

(

x

)

h

(

x

)

=

1

2

I

(

f

(

x

)

h

(

x

)

)

f\left( x \right)h\left( x \right) = 1 – 2\mathbb{I}\left( f\left( x \right) \neq h\left( x \right) \right)






f





(


x


)





h





(


x


)





=








1













2



I






(


f





(


x


)


















̸





















=





h





(


x


)



)






so the ideal base learner is

$$h_t(x) = \underset{h}{\arg\min}\; \mathbb{E}_{x\sim\mathcal{D}_t}\!\left[ \mathbb{I}\left( f(x) \neq h(x) \right) \right]$$
That is, the ideal $h_t$ minimizes the classification error under the distribution $\mathcal{D}_t$. Considering the relationship between $\mathcal{D}_t$ and $\mathcal{D}_{t+1}$, the sample distribution update formula is

$$\begin{aligned} \mathcal{D}_{t+1}(x) &= \frac{\mathcal{D}(x)\, e^{-f(x)H_t(x)}}{\mathbb{E}_{x\sim\mathcal{D}}\left[ e^{-f(x)H_t(x)} \right]} = \frac{\mathcal{D}(x)\, e^{-f(x)H_{t-1}(x)}\, e^{-f(x)\alpha_t h_t(x)}}{\mathbb{E}_{x\sim\mathcal{D}}\left[ e^{-f(x)H_t(x)} \right]} \\ &= \mathcal{D}_t(x)\cdot e^{-f(x)\alpha_t h_t(x)}\, \frac{\mathbb{E}_{x\sim\mathcal{D}}\left[ e^{-f(x)H_{t-1}(x)} \right]}{\mathbb{E}_{x\sim\mathcal{D}}\left[ e^{-f(x)H_t(x)} \right]} \end{aligned}$$
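Putting the three ingredients together (the additive combination $H(x)=\sum_t \alpha_t h_t(x)$, the weight $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$, and the distribution update above), here is a minimal AdaBoost sketch in Python. It uses scikit-learn's `DecisionTreeClassifier` as a decision-stump base learner through its `sample_weight` argument; labels are assumed to be numpy arrays with values in {-1, +1}, and for simplicity the sketch just stops in the degenerate cases:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """Minimal AdaBoost sketch; y must be a numpy array with values in {-1, +1}."""
    m = len(y)
    D = np.full(m, 1.0 / m)                       # initial sample distribution D_1
    learners, alphas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1)   # decision stump
        h.fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = D[pred != y].sum()                  # weighted error epsilon_t under D_t
        if eps <= 0 or eps >= 0.5:                # perfect, or no better than random
            break
        alpha = 0.5 * np.log((1 - eps) / eps)     # classifier weight alpha_t
        D *= np.exp(-alpha * y * pred)            # up-weight mistakes, down-weight hits
        D /= D.sum()                              # renormalize (the expectation ratio)
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    """H(x) = sign(sum_t alpha_t * h_t(x))."""
    scores = sum(a * h.predict(X) for h, a in zip(learners, alphas))
    return np.sign(scores)
```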



























8.3 Bagging and Random Forest



8.3.1 Bagging

Bagging is a parallel ensemble learning method based on bootstrap sampling.

Given a dataset of m samples, first draw one sample at random and put it into the sampled set, then return it to the initial dataset, so that it may be selected again in the next draw. After m random draws we obtain a sampled set of m samples.
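A consequence of sampling with replacement, easy to check numerically (numpy, illustrative): each sample is missed by one sampled set with probability $(1 - 1/m)^m \approx 1/e \approx 0.368$, so roughly 63.2% of the original samples appear in any given sampled set:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000
idx = rng.integers(0, m, size=m)         # m draws with replacement
in_bag = np.unique(idx)

print(len(in_bag) / m)                   # ~0.632: fraction of samples that were drawn
print(1 - (1 - 1 / m) ** m)              # theoretical value, -> 1 - 1/e
```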


Bagging procedure

Draw T sampled sets of m training samples each, train one base learner on each sampled set, and then combine these base learners.

Bagging typically uses simple voting for classification tasks and simple averaging for regression tasks.


Assuming the computational complexity of a base learner is $O(m)$, the complexity of Bagging is roughly $T(O(m) + O(s))$. Here $O(s)$ is the small cost of sampling and of the voting or averaging step, and $T$ is the number of base learners, so training a Bagging ensemble has the same order of complexity as training a single base learner.



Let $\mathcal{D}_t$ denote the set of training samples actually used by $h_t(x)$, and let $H^{\text{oob}}(x)$ denote the out-of-bag prediction for $x$, i.e., the prediction that considers only those base learners not trained on $x$:

$$H^{\text{oob}}(x) = \underset{y\in\mathcal{Y}}{\arg\max}\, \sum_{t=1}^{T} \mathbb{I}\left( h_t(x) = y \right)\cdot \mathbb{I}\left( x \notin \mathcal{D}_t \right)$$
The out-of-bag estimate of Bagging's generalization error is then

$$\epsilon^{\text{oob}} = \frac{1}{|D|} \sum_{(x,y)\in D} \mathbb{I}\left( H^{\text{oob}}(x) \neq y \right)$$
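A minimal Bagging sketch with the out-of-bag estimate above (numpy plus scikit-learn trees; as simplifications, ties in the OOB vote are broken by `argmax`'s first-match rule and samples that never fall out of bag are skipped):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_oob_error(X, y, T=25, seed=0):
    """Train T trees on bootstrap samples; estimate error from OOB votes only."""
    rng = np.random.default_rng(seed)
    m = len(y)
    classes = np.unique(y)
    votes = np.zeros((m, len(classes)))           # OOB vote counts per sample
    for _ in range(T):
        idx = rng.integers(0, m, size=m)          # bootstrap sample D_t
        oob = np.setdiff1d(np.arange(m), idx)     # samples with x not in D_t
        tree = DecisionTreeClassifier().fit(X[idx], y[idx])
        pred = tree.predict(X[oob])
        for ci, c in enumerate(classes):
            votes[oob[pred == c], ci] += 1
    covered = votes.sum(axis=1) > 0               # samples with at least one OOB vote
    h_oob = classes[np.argmax(votes[covered], axis=1)]
    return np.mean(h_oob != y[covered])           # epsilon^oob
```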










8.3.2 Random Forest


Random Forest (RF) is an extension of Bagging. On top of building a Bagging ensemble with decision trees as base learners, RF further introduces random attribute selection into the training of the decision trees.

For each node of a base decision tree, RF first randomly selects a subset of k attributes from that node's attribute set, and then chooses the optimal attribute from this subset for splitting. The commonly recommended value of the parameter is $k = \log_2 d$, where $d$ is the total number of attributes.
Advantages: Random Forest is simple, easy to implement, and computationally cheap.
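In scikit-learn, this per-node random attribute selection is exposed through the `max_features` parameter of `RandomForestClassifier`; a short usage sketch (the dataset is just a stand-in):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# max_features="log2" implements k = log2(d) candidate attributes per split.
rf = RandomForestClassifier(n_estimators=100, max_features="log2", random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())
```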



8.4 Combination Strategies

Benefits of combining learners:

1. Reduces the risk of the poor generalization that may come from relying on a single learner
2. Reduces the risk of getting stuck in a bad local minimum
3. Expands the hypothesis space



8.4.1 Averaging


Simple averaging:

$$H(x) = \frac{1}{T}\sum_{i=1}^{T} h_i(x)$$

Weighted averaging:

$$H(x) = \sum_{i=1}^{T} \omega_i h_i(x)$$

where $\omega_i$ is the weight of the individual learner $h_i$, usually required to satisfy $\omega_i \geq 0$ and $\sum_{i=1}^{T}\omega_i = 1$.
Weighted averaging is a basic starting point of ensemble learning research: for a given set of base learners, different ensemble methods can be viewed as different ways of determining the base-learner weights in the weighted average.
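For regression outputs both rules are one line each; a minimal numpy sketch, where `preds` (a T-by-n array of individual predictions) and the weights are made-up illustrative values:

```python
import numpy as np

preds = np.array([[2.0, 0.5, 1.0],       # h_1 on three samples
                  [1.6, 0.7, 1.4],       # h_2
                  [2.4, 0.3, 0.9]])      # h_3

H_simple = preds.mean(axis=0)            # H(x) = (1/T) * sum_i h_i(x)

w = np.array([0.5, 0.3, 0.2])            # w_i >= 0, sum_i w_i = 1
H_weighted = w @ preds                   # H(x) = sum_i w_i * h_i(x)
print(H_simple, H_weighted)
```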



8.4.2 Voting

A learner $h_i$ predicts a label from the class set $\{ c_1, c_2, \ldots, c_N \}$. Denote the prediction of $h_i$ on sample $x$ as an N-dimensional vector $\left( h_i^1(x);\, h_i^2(x);\, \ldots;\, h_i^N(x) \right)$, where $h_i^j(x)$ is the output of $h_i$ on the class label $c_j$.


Majority voting:

$$H(x) = \begin{cases} c_j, & \text{if } \sum_{i=1}^{T} h_i^j(x) > 0.5 \sum_{k=1}^{N} \sum_{i=1}^{T} h_i^k(x) \\ \text{reject}, & \text{otherwise} \end{cases}$$

That is, if some label receives more than half of the votes, predict that label; otherwise refuse to predict.


Plurality voting:

$$H(x) = c_{\arg\max_j \sum_{i=1}^{T} h_i^j(x)}$$

That is, predict the label with the most votes; if several labels tie for the highest vote count, pick one of them at random.


Weighted voting:

$$H(x) = c_{\arg\max_j \sum_{i=1}^{T} \omega_i h_i^j(x)}$$

where $\omega_i$ is the weight of the individual learner $h_i$, usually with $\omega_i \geq 0$ and $\sum_{i=1}^{T}\omega_i = 1$.

Hard voting uses class labels: $h_i^j(x) \in \{0, 1\}$, taking the value 1 if $h_i$ predicts sample $x$ as class $c_j$, and 0 otherwise.


Soft voting uses class probabilities: $h_i^j(x) \in [0, 1]$, which can be read as an estimate of the posterior probability $P(c_j \mid x)$.
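The voting rules and the hard/soft distinction fit in one short numpy sketch (the vote matrices below are made up; rows are learners, columns are classes):

```python
import numpy as np

# Soft votes: h_i^j(x) in [0, 1], one row per learner, one column per class c_j.
soft = np.array([[0.7, 0.2, 0.1],
                 [0.4, 0.5, 0.1],
                 [0.6, 0.3, 0.1]])

# Hard votes: h_i^j(x) in {0, 1}, 1 only for each learner's predicted class.
hard = np.eye(3)[soft.argmax(axis=1)]

totals = hard.sum(axis=0)                     # per-class vote counts
if totals.max() > 0.5 * totals.sum():         # majority voting
    print("majority:", totals.argmax())
else:
    print("majority: reject")
print("plurality:", totals.argmax())          # plurality voting

w = np.array([0.5, 0.25, 0.25])               # weighted (soft) voting
print("weighted soft:", (w @ soft).argmax())
```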



8.4.3 Combining by Learning

Combination is done through another learner: the individual learners are called first-level learners, while the learner used for combination is called the second-level learner or meta-learner. The canonical representative of this approach is Stacking.

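A minimal sketch of this idea in the style of Stacking (scikit-learn; the choice of first-level learners and the logistic-regression meta-learner are illustrative assumptions). Building the meta-level training set from cross-validated predictions, rather than in-sample predictions, is the usual guard against the meta-learner overfitting the first level:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
first_level = [DecisionTreeClassifier(random_state=0), SVC(probability=True)]

# Meta-features: out-of-fold predicted probabilities from each first-level learner.
Z = np.column_stack([
    cross_val_predict(h, X, y, cv=5, method="predict_proba")[:, 1]
    for h in first_level
])

meta = LogisticRegression().fit(Z, y)   # second-level (meta) learner
for h in first_level:                   # refit first-level learners on all data
    h.fit(X, y)
```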



8.5 Diversity

8.5.1 Error-Ambiguity Decomposition

Suppose the individual learners $h_1, h_2, \ldots, h_T$ are combined by weighted averaging into an ensemble that performs the regression task $f: \mathbb{R}^d \rightarrow \mathbb{R}$. For an example $x$, define the ambiguity of learner $h_i$ as

$$A(h_i \mid x) = \left( h_i(x) - H(x) \right)^2$$
Then the ambiguity of the ensemble is

$$\bar{A}(h \mid x) = \sum_{i=1}^{T} \omega_i\, A(h_i \mid x) = \sum_{i=1}^{T} \omega_i \left( h_i(x) - H(x) \right)^2$$

The ambiguity term characterizes the disagreement of the individual learners on sample $x$. The squared errors of the individual learner $h_i$ and of the ensemble $H$ are, respectively,

$$E(h_i \mid x) = \left( f(x) - h_i(x) \right)^2$$

$$E(H \mid x) = \left( f(x) - H(x) \right)^2$$

Let $\bar{E}(h \mid x) = \sum_{i=1}^{T} \omega_i\, E(h_i \mid x)$

denote the weighted mean of the individual learners' errors; then

$$\bar{A}(h \mid x) = \sum_{i=1}^{T} \omega_i\, E(h_i \mid x) - E(H \mid x) = \bar{E}(h \mid x) - E(H \mid x)$$

Let $p(x)$ denote the probability density of samples; over the full sample space we then have

$$\sum_{i=1}^{T} \omega_i \int A(h_i \mid x)\, p(x)\, \mathrm{d}x = \sum_{i=1}^{T} \omega_i \int E(h_i \mid x)\, p(x)\, \mathrm{d}x - \int E(H \mid x)\, p(x)\, \mathrm{d}x$$
The generalization error and the ambiguity of the individual learner $h_i$ over the full sample space are, respectively,

$$E_i = \int E(h_i \mid x)\, p(x)\, \mathrm{d}x$$

$$A_i = \int A(h_i \mid x)\, p(x)\, \mathrm{d}x$$
and the generalization error of the ensemble is

$$E = \int E(H \mid x)\, p(x)\, \mathrm{d}x$$
Further let $\bar{E} = \sum_{i=1}^{T} \omega_i E_i$ denote the weighted mean of the individual learners' generalization errors, and $\bar{A} = \sum_{i=1}^{T} \omega_i A_i$ the weighted ambiguity of the individual learners. Then

$$E = \bar{E} - \bar{A}$$
The higher the accuracy and the greater the diversity of the individual learners, the better the ensemble.
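The decomposition is easy to verify numerically at a single point $x$ (numpy; predictions and weights are arbitrary illustrative values):

```python
import numpy as np

f_x = 1.0                                # true value f(x)
h = np.array([0.8, 1.3, 0.6])            # individual predictions h_i(x)
w = np.array([0.5, 0.3, 0.2])            # weights, sum to 1

H_x = w @ h                              # weighted-average ensemble H(x)
E_bar = w @ (f_x - h) ** 2               # weighted individual error E_bar
A_bar = w @ (h - H_x) ** 2               # weighted ambiguity A_bar
E = (f_x - H_x) ** 2                     # ensemble error

print(E, E_bar - A_bar)                  # identical: E = E_bar - A_bar
```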

8.5.2 Diversity Measures

A diversity measure quantifies the diversity of the individual classifiers in an ensemble, i.e., it estimates the degree to which the individual learners are diversified.

Given a dataset $D = \{ (x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m) \}$, for a binary classification task with $y_i \in \{-1, +1\}$, the contingency table of the predictions of classifiers $h_i$ and $h_j$ is:

| | $h_i = +1$ | $h_i = -1$ |
| --- | --- | --- |
| $h_j = +1$ | $a$ | $c$ |
| $h_j = -1$ | $b$ | $d$ |

where $a$ is the number of samples that both $h_i$ and $h_j$ predict as positive, $b$, $c$ and $d$ are defined analogously, and $a + b + c + d = m$.

Disagreement measure:

$$\mathrm{dis}_{ij} = \frac{b + c}{m}$$

The range of $\mathrm{dis}_{ij}$ is $[0, 1]$; the larger the value, the greater the diversity.


Correlation coefficient:

$$\rho_{ij} = \frac{ad - bc}{\sqrt{(a+b)(a+c)(c+d)(b+d)}}$$

The range of $\rho_{ij}$ is $[-1, 1]$. The value is 0 if $h_i$ and $h_j$ are unrelated, positive if they are positively correlated, and negative otherwise.


Q-statistic:

$$Q_{ij} = \frac{ad - bc}{ad + bc}$$

$Q_{ij}$ has the same sign as the correlation coefficient $\rho_{ij}$, and $\left| Q_{ij} \right| \geq \left| \rho_{ij} \right|$.

$\kappa$-statistic:

$$\kappa = \frac{p_1 - p_2}{1 - p_2}$$

where $p_1$ is the probability that the two classifiers agree and $p_2$ is the probability that they agree by chance; both can be estimated from the dataset $D$:

$$p_1 = \frac{a + d}{m}$$

$$p_2 = \frac{(a+b)(a+c) + (c+d)(b+d)}{m^2}$$

If $h_i$ and $h_j$ agree completely on $D$, then $\kappa = 1$; if they agree only by chance, then $\kappa = 0$. $\kappa$ is usually non-negative.
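All four pairwise measures follow directly from the contingency table counts; a compact sketch (numpy; the two prediction vectors are illustrative, and no guard is added for degenerate denominators):

```python
import numpy as np

def diversity_measures(hi, hj):
    """Pairwise diversity from two {-1, +1} prediction vectors of length m."""
    m = len(hi)
    a = np.sum((hi == 1) & (hj == 1))    # both predict positive
    b = np.sum((hi == 1) & (hj == -1))
    c = np.sum((hi == -1) & (hj == 1))
    d = np.sum((hi == -1) & (hj == -1))  # both predict negative
    dis = (b + c) / m
    rho = (a * d - b * c) / np.sqrt((a + b) * (a + c) * (c + d) * (b + d))
    q = (a * d - b * c) / (a * d + b * c)
    p1 = (a + d) / m
    p2 = ((a + b) * (a + c) + (c + d) * (b + d)) / m**2
    kappa = (p1 - p2) / (1 - p2)
    return dis, rho, q, kappa

hi = np.array([1, 1, -1, 1, -1, -1, 1, 1])
hj = np.array([1, -1, -1, 1, 1, -1, 1, -1])
print(diversity_measures(hi, hj))        # (0.375, ~0.258, 0.5, 0.25)
```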



8.5.3 Diversity Enhancement


Data sample perturbation

Usually based on sampling methods; this approach is simple, efficient, and the most widely used.


Input attribute perturbation

Training samples are usually described by a set of attributes, and different subspaces provide different views of the data.


The random subspace algorithm

Relies on input attribute perturbation: it extracts several attribute subsets from the initial attribute set and then trains one base learner on each attribute subset.

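A minimal random subspace sketch (numpy plus scikit-learn trees; the subset size `k` and the tree base learner are illustrative choices, and integer class labels >= 0 are assumed for the final vote):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_subspace_fit(X, y, T=20, k=5, seed=0):
    """Train T base learners, each on a random subset of k attributes."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    ensemble = []
    for _ in range(T):
        feats = rng.choice(d, size=min(k, d), replace=False)  # attribute subset
        h = DecisionTreeClassifier(random_state=0).fit(X[:, feats], y)
        ensemble.append((feats, h))
    return ensemble

def random_subspace_predict(ensemble, X):
    """Combine the base learners by simple (plurality) voting."""
    votes = np.array([h.predict(X[:, feats]) for feats, h in ensemble])
    # Most frequent label among the T votes per sample (labels must be ints >= 0).
    return np.array([np.bincount(col).argmax() for col in votes.T])
```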


Output representation perturbation

The basic idea is to manipulate the output representation in order to enhance diversity.


Algorithm parameter perturbation

Base learning algorithms generally have parameters that must be set; randomly choosing different parameter settings often produces noticeably different individual learners.



Copyright notice: this is an original article by williananjhon, released under the CC 4.0 BY-SA license; please attach a link to the original source and this notice when reposting.