A Summary of Clustering Algorithms





Introduction

Clustering algorithms generate a fixed-length feature vector for each sample and, through mathematical operations, gather similar samples in the feature space into groups that satisfy the clustering criterion — what we call clusters. Since clustering is usually unsupervised learning and needs no sample labels, it is relatively cheap to apply and is widely used in similarity-oriented data mining; applied to recommendation, it can group users and products. Before introducing the common clustering algorithms, we first present the usual similarity criteria (a short NumPy sketch implementing them follows the list):


  • Cosine Similarity: the cosine of the angle between two vectors in space, measuring whether their directions agree;





    Cosine\_Similarity(x,y)=\frac{x\cdot y}{||x||\cdot ||y||}

  • Euclidean Distance: the shortest distance between two points, a combined measure of vector magnitude and direction;





    Euclidean(x,y)=||x-y||=\big(\sum_i|x_i-y_i|^2\big)^{\frac{1}{2}}

  • Manhattan Distance: the city-block ("chessboard") distance between two points, notably effective in certain scenarios;





    Manhattan(x,y)=\sum_i|x_i-y_i|

  • Minkowski Distance: a generalization of the Euclidean and Manhattan distances;





    Minkowski(x,y)=\big(\sum_i|x_i-y_i|^p\big)^{\frac{1}{p}}

  • Correlation: a common concept in statistics, measuring how the elements of two comparable vectors vary together;





    Correlation(x,y)=\frac{Cov(x,y)}{\sigma_x\cdot \sigma_y}=\frac{E(xy)-E(x)E(y)}{\sqrt{Var(x)\cdot Var(y)}}

  • KL Divergence (Kullback-Leibler Divergence): also known as relative entropy, an important concept in information theory that measures the one-directional similarity between two probability distributions;





    KL\_Divergence(p||q)=\sum_{x\in q}p(x)\log\frac{p(x)}{q(x)}

  • Jaccard Similarity: intersection size divided by union size, measuring the similarity of two sets or Boolean sequences;





    Jaccard\_Similarity(A,B)=\frac{|A\cap B|}{|A\cup B|}
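As mentioned above, here is a minimal NumPy sketch of these measures (the function names are my own, and the KL implementation assumes both distributions are strictly positive on the same support):

    import numpy as np

    def cosine_similarity(x, y):
        # Cosine of the angle between x and y.
        return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

    def minkowski(x, y, p=2):
        # p = 2 gives the Euclidean distance, p = 1 the Manhattan distance.
        return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

    def correlation(x, y):
        # Pearson correlation coefficient.
        return np.corrcoef(x, y)[0, 1]

    def kl_divergence(p, q):
        # One-directional: KL(p || q) != KL(q || p) in general.
        return np.sum(p * np.log(p / q))

    def jaccard_similarity(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b)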




























K-Means


The most common clustering algorithm: over multiple iterations, every point in the space is gathered into the nearest cluster, yielding K clusters.


Algorithm KMeans(D, K):
    Input: Samples D = {x_i | i = 1, ..., N}, number of clusters K.
    Output: Cluster index list of samples I = {y_i | i = 1, ..., N}.
    initialize K central points
    obtain I by clustering samples (compute the Euclidean distance between each sample and each central point)
    while not converged do
        update the K central points (compute the mean vector of each cluster)
        update I
        if I doesn't change sufficiently in this epoch then
            break the loop
    return I
The drawbacks of K-Means clustering are obvious: the value of K is hard to determine, and the algorithm cannot handle non-convex clusters.
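For a concrete feel, here is a minimal sketch using scikit-learn's KMeans on made-up two-blob data (data and parameter values are illustrative only):

    import numpy as np
    from sklearn.cluster import KMeans

    # Toy data: two Gaussian blobs in a 2-D feature space.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                   rng.normal(3.0, 0.5, size=(50, 2))])

    # K must be chosen up front; here we assume K = 2.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_[:5])        # cluster index of the first five samples
    print(kmeans.cluster_centers_)   # the K mean vectors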



BIRCH


abbr. Balanced Iterative Reducing and Clustering Using Hierarchies, one of the mainstream hierarchical clustering algorithms. It builds a Clustering Feature Tree (CF Tree) that routes each sample's feature vector from the root node along internal paths into one of the leaf-node clusters. Every clustering feature (CF) holds a triple: suppose the node covers N samples with M-dimensional feature vectors x_i=(x_{i1},…,x_{iM}), i=1,…,N; then the triple is

    <N, (\sum_i x_{i1},…,\sum_i x_{iM}), (\sum_i x^2_{i1},…,\sum_i x^2_{iM})>

This layout makes it easy to match a sample against a cluster's aggregate information and route it layer by layer into deeper nodes, and it also simplifies optimizing the tree once construction is finished.
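To make the CF triple concrete, here is a small NumPy sketch (the helper names are hypothetical, not from the original post); note that CF triples are additive, which is exactly what makes inserting samples and merging subclusters cheap:

    import numpy as np

    def clustering_feature(X):
        # CF triple <N, linear sum, squared sum> of the samples in X (N x M).
        return len(X), X.sum(axis=0), (X ** 2).sum(axis=0)

    def merge_cf(cf_a, cf_b):
        # Two CFs merge by element-wise addition of their triples.
        return cf_a[0] + cf_b[0], cf_a[1] + cf_b[1], cf_a[2] + cf_b[2]

    X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    print(clustering_feature(X))  # (3, array([ 9., 12.]), array([35., 56.]))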


Algorithm BIRCH(D, R, Max_Child):
    Input: Samples D = {x_i | i = 1, ..., N}, cluster radius R, max num of child nodes or samples in a CF Max_Child.
    Output: Clustering tree object T.
    initialize clustering tree object T
    for each sample vector do
        # --- Insert Operation --- #
        search the closest leaf cluster from the top node
        if the vector falls in no existing cluster (its distance to each candidate exceeds R) then
            create a new leaf
        else
            insert the vector into the matched cluster
        # --- Backward Adjustment --- #
        for each internal node on the path from leaf to top do
            if the number of child nodes exceeds Max_Child then
                create a new leaf in the parent node and accept the vector
    # --- (Optional) Outlier Exclusion --- #
    delete clusters with just one sample
    # --- (Optional) Cluster Merging --- #
    merge close clusters by backward searching (ignoring Max_Child)
    return T
Like K-Means, BIRCH cannot handle non-convex clusters, but setting the hypersphere radius R ultimately makes it convenient for the model to identify outliers.
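A minimal scikit-learn sketch along these lines; in sklearn's Birch the cluster radius appears as threshold and the fan-out limit as branching_factor (values here are illustrative):

    import numpy as np
    from sklearn.cluster import Birch

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
                   rng.normal(2.0, 0.3, size=(50, 2))])

    # threshold plays the role of R, branching_factor that of Max_Child above.
    birch = Birch(threshold=0.5, branching_factor=50, n_clusters=None).fit(X)
    print(birch.labels_[:5])               # cluster index of the first five samples
    print(len(birch.subcluster_centers_))  # number of leaf subclusters in the CF Tree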



DBSCAN

[Figure: core points, border points, and noise points in DBSCAN, with a neighbor-count threshold of 2]

abbr. Density-Based Spatial Clustering of Applications with Noise, the classic density-based clustering algorithm. Unlike K-Means and BIRCH, it can recognize clusters of any shape. Its principle can be understood in an extremely simple way. First, all sample points are divided into three classes: core points, border points, and noise points. As shown in the figure above, the classification relies on a count threshold: a point with more neighbors than the threshold inside its hypersphere neighborhood is a core point, and one with fewer is a non-core point; among non-core points, those with at least one core-point neighbor are border points, and those with none are noise points. The threshold in the figure is 2. The connected group formed by adjacent core points together with all their nearby border points constitutes one cluster.


Algorithm DBSCAN(D, ε, Min_Pts):
    Input: Samples D = {x_i | i = 1, ..., N}, threshold of distance ε, min num of neighbors to recognize core points Min_Pts.
    Output: Cluster index list of samples I = {y_i | i = 1, ..., N}.
    # --- Search Core Points --- #
    initialize set of core points Ω
    for each sample vector do
        find the number of neighbors with relative distance smaller than ε
        if the number is larger than Min_Pts then
            add the vector into core points Ω
    # --- Clustering --- #
    initialize cluster index list I
    cluster core points into sets {C_1, ..., C_K} and label the corresponding items in I
    for each non-core point do
        if it has at least one core-point neighbor then
            label the sample with the cluster index C_k of the nearest core point
        else
            label the sample as a noise point
    return I
The advantages of DBSCAN are obvious: it can remove noise, and it applies to complex cluster structures such as spiral datasets. Its drawback is that the hyperparameter ε is hard to tune; and when the data is both non-convex in shape and non-dense in form, DBSCAN is likewise powerless.
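A minimal scikit-learn sketch on a non-convex toy dataset that K-Means would fail on (parameter values are illustrative):

    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    # Two interleaved half-moons: a classic non-convex clustering pattern.
    X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

    # eps plays the role of ε, min_samples that of Min_Pts above.
    db = DBSCAN(eps=0.2, min_samples=5).fit(X)
    print(set(db.labels_))  # cluster indices; -1 marks noise points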



Spectral Clustering

(to be added soon…)



Copyright notice: this is an original article by weixin_43269174, released under the CC 4.0 BY-SA license; please attach a link to the original source and this notice when reposting.