A Summary of Clustering Algorithms





Introduction

Clustering algorithms generate a fixed-length feature vector for each sample and, through mathematical operations, gather similar samples in the feature space into groups that satisfy the clustering criterion — what we call clusters. Since clustering is usually unsupervised learning and needs no sample labels, it is relatively cheap to apply and is widely used in similarity-oriented data mining; applied to recommendation, it can group users and products. Before introducing the common clustering algorithms, we first present the usual similarity criteria (a short NumPy sketch implementing them follows the list):


  • Cosine Similarity: the cosine of the angle between two vectors in space, measuring whether their directions agree;





    Cosine\_Similarity(x,y)=\frac{x\cdot y}{||x||\cdot ||y||}

  • Euclidean Distance: the shortest distance between two points, a combined measure of vector magnitude and direction;





    Euclidean(x,y)=||x-y||=\big(\sum_i|x_i-y_i|^2\big)^{\frac{1}{2}}

  • Manhattan Distance: the city-block ("chessboard") distance between two points, notably effective in certain scenarios;





    Manhattan(x,y)=\sum_i|x_i-y_i|

  • Minkowski Distance: a generalization of the Euclidean and Manhattan distances;





    Minkowski(x,y)=\big(\sum_i|x_i-y_i|^p\big)^{\frac{1}{p}}

  • Correlation: a common concept in statistics, measuring how the elements of two comparable vectors vary together;





    Correlation(x,y)=\frac{Cov(x,y)}{\sigma_x\cdot \sigma_y}=\frac{E(xy)-E(x)E(y)}{\sqrt{Var(x)\cdot Var(y)}}

  • KL Divergence (Kullback-Leibler Divergence): also known as relative entropy, an important concept in information theory that measures the one-directional similarity between two probability distributions;





    KL\_Divergence(p||q)=\sum_{x\in q}p(x)\log\frac{p(x)}{q(x)}

  • Jaccard Similarity: intersection size divided by union size, measuring the similarity of two sets or Boolean sequences;





    Jaccard\_Similarity(A,B)=\frac{|A\cap B|}{|A\cup B|}
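As mentioned above, here is a minimal NumPy sketch of these measures (the function names are my own, and the KL implementation assumes both distributions are strictly positive on the same support):

    import numpy as np

    def cosine_similarity(x, y):
        # Cosine of the angle between x and y.
        return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

    def minkowski(x, y, p=2):
        # p = 2 gives the Euclidean distance, p = 1 the Manhattan distance.
        return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

    def correlation(x, y):
        # Pearson correlation coefficient.
        return np.corrcoef(x, y)[0, 1]

    def kl_divergence(p, q):
        # One-directional: KL(p || q) != KL(q || p) in general.
        return np.sum(p * np.log(p / q))

    def jaccard_similarity(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b)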




























K-Means


The most common clustering algorithm: over multiple iterations, every point in the space is gathered into the nearest cluster, yielding K clusters.


Algorithm KMeans(D, K):
    Input: Samples D = {x_i | i = 1, ..., N}, number of clusters K.
    Output: Cluster index list of samples I = {y_i | i = 1, ..., N}.
    initialize K central points
    obtain I by clustering samples (compute the Euclidean distance between each sample and each central point)
    while not converged do
        update the K central points (compute the mean vector of each cluster)
        update I
        if I doesn't change sufficiently in this epoch then
            break the loop
    return I
The drawbacks of K-Means clustering are obvious: the value of K is hard to determine, and the algorithm cannot handle non-convex clusters.
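For a concrete feel, here is a minimal sketch using scikit-learn's KMeans on made-up two-blob data (data and parameter values are illustrative only):

    import numpy as np
    from sklearn.cluster import KMeans

    # Toy data: two Gaussian blobs in a 2-D feature space.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                   rng.normal(3.0, 0.5, size=(50, 2))])

    # K must be chosen up front; here we assume K = 2.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_[:5])        # cluster index of the first five samples
    print(kmeans.cluster_centers_)   # the K mean vectors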



BIRCH


abbr. Balanced Iterative Reducing and Clustering Using Hierarchies, one of the mainstream hierarchical clustering algorithms. It builds a Clustering Feature Tree (CF Tree) that routes each sample's feature vector from the root node along internal paths into one of the leaf-node clusters. Every clustering feature (CF) holds a triple: suppose the node covers N samples with M-dimensional feature vectors x_i=(x_{i1},…,x_{iM}), i=1,…,N; then the triple is

    <N, (\sum_i x_{i1},…,\sum_i x_{iM}), (\sum_i x^2_{i1},…,\sum_i x^2_{iM})>

This layout makes it easy to match a sample against a cluster's aggregate information and route it layer by layer into deeper nodes, and it also simplifies optimizing the tree once construction is finished.
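To make the CF triple concrete, here is a small NumPy sketch (the helper names are hypothetical, not from the original post); note that CF triples are additive, which is exactly what makes inserting samples and merging subclusters cheap:

    import numpy as np

    def clustering_feature(X):
        # CF triple <N, linear sum, squared sum> of the samples in X (N x M).
        return len(X), X.sum(axis=0), (X ** 2).sum(axis=0)

    def merge_cf(cf_a, cf_b):
        # Two CFs merge by element-wise addition of their triples.
        return cf_a[0] + cf_b[0], cf_a[1] + cf_b[1], cf_a[2] + cf_b[2]

    X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    print(clustering_feature(X))  # (3, array([ 9., 12.]), array([35., 56.]))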


Algorithm BIRCH(D, R, Max_Child):
    Input: Samples D = {x_i | i = 1, ..., N}, cluster radius R, max num of child nodes or samples in a CF Max_Child.
    Output: Clustering tree object T.
    initialize clustering tree object T
    for each sample vector do
        # --- Insert Operation --- #
        search the closest leaf cluster from the top node
        if the vector falls in no existing cluster (its distance to each candidate exceeds R) then
            create a new leaf
        else
            insert the vector into the matched cluster
        # --- Backward Adjustment --- #
        for each internal node on the path from leaf to top do
            if the number of child nodes exceeds Max_Child then
                create a new leaf in the parent node and accept the vector
    # --- (Optional) Outlier Exclusion --- #
    delete clusters with just one sample
    # --- (Optional) Cluster Merging --- #
    merge close clusters by backward searching (ignoring Max_Child)
    return T
Like K-Means, BIRCH cannot handle non-convex clusters, but setting the hypersphere radius R ultimately makes it convenient for the model to identify outliers.
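A minimal scikit-learn sketch along these lines; in sklearn's Birch the cluster radius appears as threshold and the fan-out limit as branching_factor (values here are illustrative):

    import numpy as np
    from sklearn.cluster import Birch

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
                   rng.normal(2.0, 0.3, size=(50, 2))])

    # threshold plays the role of R, branching_factor that of Max_Child above.
    birch = Birch(threshold=0.5, branching_factor=50, n_clusters=None).fit(X)
    print(birch.labels_[:5])               # cluster index of the first five samples
    print(len(birch.subcluster_centers_))  # number of leaf subclusters in the CF Tree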



DBSCAN

[Figure: core points, border points, and noise points in DBSCAN, with a neighbor-count threshold of 2]

abbr. Density-Based Spatial Clustering of Applications with Noise, the classic density-based clustering algorithm. Unlike K-Means and BIRCH, it can recognize clusters of any shape. Its principle can be understood in an extremely simple way. First, all sample points are divided into three classes: core points, border points, and noise points. As shown in the figure above, the classification relies on a count threshold: a point with more neighbors than the threshold inside its hypersphere neighborhood is a core point, and one with fewer is a non-core point; among non-core points, those with at least one core-point neighbor are border points, and those with none are noise points. The threshold in the figure is 2. The connected group formed by adjacent core points together with all their nearby border points constitutes one cluster.


Algorithm DBSCAN(D, ε, Min_Pts):
    Input: Samples D = {x_i | i = 1, ..., N}, threshold of distance ε, min num of neighbors to recognize core points Min_Pts.
    Output: Cluster index list of samples I = {y_i | i = 1, ..., N}.
    # --- Search Core Points --- #
    initialize set of core points Ω
    for each sample vector do
        find the number of neighbors with relative distance smaller than ε
        if the number is larger than Min_Pts then
            add the vector into core points Ω
    # --- Clustering --- #
    initialize cluster index list I
    cluster core points into sets {C_1, ..., C_K} and label the corresponding items in I
    for each non-core point do
        if it has at least one core-point neighbor then
            label the sample with the cluster index C_k of the nearest core point
        else
            label the sample as a noise point
    return I
The advantages of DBSCAN are obvious: it can remove noise, and it applies to complex cluster structures such as spiral datasets. Its drawback is that the hyperparameter ε is hard to tune; and when the data is both non-convex in shape and non-dense in form, DBSCAN is likewise powerless.
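A minimal scikit-learn sketch on a non-convex toy dataset that K-Means would fail on (parameter values are illustrative):

    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    # Two interleaved half-moons: a classic non-convex clustering pattern.
    X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

    # eps plays the role of ε, min_samples that of Min_Pts above.
    db = DBSCAN(eps=0.2, min_samples=5).fit(X)
    print(set(db.labels_))  # cluster indices; -1 marks noise points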



Spectral Clustering

(to be added soon…)



Copyright notice: this is an original article by weixin_43269174, released under the CC 4.0 BY-SA license; please attach a link to the original source and this notice when reposting.