Introduction
A clustering algorithm produces a fixed-length feature vector for every sample and, through mathematical operations, groups similar samples in that space into one class, i.e. what we call a cluster. Since clustering is usually unsupervised learning and requires no sample labels, it is cheap to apply and is widely used in similarity-oriented data mining; applied to recommendation, it can be used to group users and products. Before introducing the common clustering algorithms, we first present the usual similarity criteria (a short Python sketch of these measures follows the list):
- Cosine Similarity: the cosine of the angle between two vectors in the space, measuring whether the vectors point in the same direction;
  $$Cosine\_Similarity(x,y)=\frac{x\cdot y}{||x||\cdot ||y||}$$
- Euclidean Distance: the shortest distance between two points, a combined measure of both the magnitude and the direction of the vectors;
  $$Euclidean(x,y)=||x-y||=\Big(\sum_i|x_i-y_i|^2\Big)^{\frac{1}{2}}$$
- Manhattan Distance: the city-block (taxicab) distance between two points, noticeably effective in certain scenarios;
  $$Manhattan(x,y)=\sum_i|x_i-y_i|$$
- Minkowski Distance: a generalization of the Euclidean and Manhattan distances;
  $$Minkowski(x,y)=\Big(\sum_i|x_i-y_i|^p\Big)^{\frac{1}{p}}$$
- Correlation: a common statistical concept that measures how the elements of two comparable vectors vary together;
  $$Correlation(x,y)=\frac{Cov(x,y)}{\sigma_x\cdot \sigma_y}=\frac{E(xy)-E(x)E(y)}{\sqrt{Var(x)\cdot Var(y)}}$$
- Kullback-Leibler Divergence: also called relative entropy, an important concept in information theory that measures the similarity of one probability distribution to another in a single direction;
  $$KL\_Divergence(p||q)=\sum_{x\in q}p(x)\log\frac{p(x)}{q(x)}$$
- Jaccard Similarity: the size of the intersection divided by the size of the union, measuring the similarity of two sets or boolean sequences;
  $$Jaccard\_Similarity(A,B)=\frac{|A\cap B|}{|A\cup B|}$$
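For quick reference, here is a minimal NumPy sketch of the measures above; the function names, the `p` parameter of `minkowski`, and the assumption that inputs are 1-D arrays (or iterables turned into sets for Jaccard) are illustrative choices rather than part of the original text.

```python
import numpy as np

def cosine_similarity(x, y):
    # x·y / (||x|| · ||y||)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def minkowski(x, y, p=2):
    # p = 2 gives the Euclidean distance, p = 1 the Manhattan distance
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p)

def correlation(x, y):
    # Pearson correlation coefficient: Cov(x, y) / (σ_x · σ_y)
    return np.corrcoef(x, y)[0, 1]

def kl_divergence(p, q):
    # assumes p and q are strictly positive distributions over the same support
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

def jaccard_similarity(a, b):
    # |A ∩ B| / |A ∪ B| for two sets (or boolean sequences viewed as sets)
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)
```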
K-Means
The most common clustering algorithm. Over multiple rounds of iteration it pulls every point in the space into the nearest cluster, producing K clusters.
```
Algorithm KMeans(D, K):
    Input: Samples D = {x_i | i = 1, ..., N}, number of clusters K.
    Output: Cluster index list of samples I = {y_i | i = 1, ..., N}.
    initialize K central points
    obtain I by clustering samples (calculation of Euclidean distance between each sample and each central point)
    while not converged do
        update the K central points (calculation of the mean vector in each cluster)
        update I
        if I doesn't make sufficient difference in this epoch then
            break the loop
    return I
```
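As a concrete counterpart to the pseudocode, the following is a minimal NumPy sketch, assuming the K central points are initialized from K randomly chosen samples and that convergence means the assignment stops changing; the function name `kmeans` and its parameters are illustrative, not a reference implementation.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Cluster the rows of X into k groups; returns the index list I."""
    rng = np.random.default_rng(seed)
    # initialize K central points by sampling K distinct rows
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # assign every sample to the nearest center (Euclidean distance)
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        new_labels = dist.argmin(axis=1)
        # stop when the assignment no longer changes
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # update the K central points as the mean vector of each cluster
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels
```

For instance, `kmeans(np.random.rand(200, 2), k=3)` returns an array of 200 cluster indices.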
The drawbacks of K-Means clustering are obvious: the value of K is not easy to determine, and the algorithm cannot handle clusters of non-convex shape.
BIRCH
Abbreviation of Balanced Iterative Reducing and Clustering Using Hierarchies, one of the mainstream hierarchical clustering algorithms. It builds a Clustering Feature Tree that routes each sample's feature vector from the root node along internal paths into one of the leaf-node clusters. Every clustering feature (CF) holds a triple. Suppose the node covers $N$ samples, each an $M$-dimensional feature vector $x_i=(x_{i1},\dots,x_{iM}),\ i=1,\dots,N$; the triple can then be written as
$$\big\langle N,\ (\textstyle\sum_ix_{i1},\dots,\sum_ix_{iM}),\ (\sum_ix^2_{i1},\dots,\sum_ix^2_{iM})\big\rangle.$$
This makes it easy to match an incoming sample against a cluster's aggregate information and push it level by level into deeper nodes, and it also makes it easy to optimize the tree once construction is finished.
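The reason the triple is convenient is that it is additive: absorbing one more sample, or merging two CFs during backward adjustment, only adds up the components, while quantities such as the centroid and radius can still be recovered from it. Below is a minimal sketch of such a record; the class and method names are illustrative only.

```python
import numpy as np

class ClusteringFeature:
    """A minimal sketch of the CF triple <N, LS, SS> described above."""

    def __init__(self, x):
        x = np.asarray(x, dtype=float)
        self.n = 1            # N : number of samples absorbed so far
        self.ls = x.copy()    # LS: element-wise linear sum  (Σ_i x_i)
        self.ss = x ** 2      # SS: element-wise squared sum (Σ_i x_i^2)

    def absorb(self, x):
        """Add one more sample to this cluster feature."""
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.ls += x
        self.ss += x ** 2

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # root-mean-square distance of the member samples to the centroid,
        # computable from the triple alone: sqrt(ΣSS / N - ||centroid||^2)
        return np.sqrt(np.maximum(self.ss.sum() / self.n - (self.centroid() ** 2).sum(), 0.0))

    def merge(self, other):
        """CF triples are additive, which is what makes backward merging cheap."""
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss
```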
```
Algorithm BIRCH(D, R, Max_Child):
    Input: Samples D = {x_i | i = 1, ..., N}, cluster radius R,
           max num of child nodes or samples in a CF Max_Child.
    Output: Clustering tree object T.
    initialize clustering tree object T
    for each sample vector do
        # --- Insert Operation --- #
        search the closest leaf cluster from the top node
        if the vector falls in no existing cluster then
            create a new leaf
        else
            insert the vector into the matched cluster
        # --- Backward Adjustment --- #
        for each internal node on the path from leaf to top do
            if the number of child nodes exceeds Max_Child then
                create a new leaf in the parent node and accept the vector
    # --- (Optional) Outlier Exclusion --- #
    delete clusters with just one sample
    # --- (Optional) Cluster Merging --- #
    merge close clusters by backward searching (ignoring Max_Child)
    return T
```
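If an off-the-shelf implementation is preferred, scikit-learn ships `sklearn.cluster.Birch`, whose `threshold` and `branching_factor` parameters play roughly the roles of the cluster radius $R$ and $Max\_Child$ above; the toy data and parameter values below are arbitrary.

```python
import numpy as np
from sklearn.cluster import Birch

# toy data: three Gaussian blobs in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in ([0, 0], [3, 3], [0, 3])])

# threshold ~ cluster radius, branching_factor ~ max children per CF node
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)
print(labels[:10])
```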
Like K-Means, BIRCH cannot handle non-convex clusters, but the setting of the hypersphere radius $\epsilon$ (the cluster radius $R$ in the pseudocode) does, in the end, let the model pick out outliers conveniently.
DBSCAN
Abbreviation of Density-Based Spatial Clustering of Applications with Noise, the classic density-based clustering algorithm. Unlike K-Means and BIRCH, it can recognize clustering patterns of any shape. The idea behind the algorithm can be understood in a very simple way. All sample points are first divided into three kinds: core points, border points, and noise points. As shown in the figure above, the classification relies on a count threshold: a point whose hypersphere neighborhood contains more neighbors than the threshold is a core point, otherwise it is a non-core point; among the non-core points, those with at least one core-point neighbor are border points, and those with none are noise points. The threshold in the figure above is 2. The connected region formed by adjacent core points together with all nearby border points makes up one cluster.
```
Algorithm DBSCAN(D, ϵ, Min_Pts):
    Input: Samples D = {x_i | i = 1, ..., N}, distance threshold ϵ,
           min num of neighbors to recognize core points Min_Pts.
    Output: Cluster index list of samples I = {y_i | i = 1, ..., N}.
    # --- Search Core Points --- #
    initialize the set of core points Ω
    for each sample vector do
        find the number of neighbors with relative distance smaller than ϵ
        if the number is larger than Min_Pts then
            add the vector into the core points Ω
    # --- Clustering --- #
    initialize cluster index list I
    cluster the core points into sets {C_1, ..., C_K} and label the corresponding items in I
    for each non-core point do
        if it has at least one core-point neighbor then
            label the sample with the cluster index C_k of the nearest core point
        else
            label the sample as a noise point
    return I
```
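A minimal NumPy sketch of the procedure above; it grows each cluster by breadth-first expansion from an unvisited core point, which is equivalent to first clustering the core points and then attaching border points, and it labels noise points with -1. The function name and the -1 convention are my own choices.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Return the cluster index of every row of X; -1 marks noise points."""
    n = len(X)
    # pairwise Euclidean distances and ϵ-neighborhoods (excluding the point itself)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.where((dist[i] <= eps) & (np.arange(n) != i))[0] for i in range(n)]
    core = np.array([len(nb) >= min_pts for nb in neighbors])

    labels = np.full(n, -1)   # -1 = noise / not yet assigned
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # grow a new cluster from this unassigned core point
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster             # core or border point joins the cluster
                if core[j]:
                    queue.extend(neighbors[j])  # only core points keep expanding
        cluster += 1
    return labels
```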
The benefits of DBSCAN are obvious: it can remove noise, and it applies to complex clustering structures such as spiral datasets. The drawback is that the hyperparameter $\epsilon$ is not easy to tune, and when the data is both non-convex in shape and not dense, DBSCAN is powerless as well.
Spectral Clustering
(to be added in the near future…)