Clustering in R: Generating Datasets (Overlapping, Noisy, Ring-Shaped, Half-Ring-Shaped)



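For an assignment in a data mining course, the instructor asked us to use R to generate datasets with different distributions and characteristics, and then cluster them. This was my first time using R, so I was learning as I went, and parts of the code are a bit clumsy.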



Generating the datasets

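The two-dimensional datasets are generated from either a uniform or a normal distribution, and every point is assigned to one of two classes in advance. To probe how different clustering algorithms behave, I generate datasets that are overlapping, noisy, or of different shapes.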

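In R, rnorm generates normally distributed random numbers and runif generates uniformly distributed ones. The examples below use the normal distribution unless noted otherwise. To get a uniform distribution instead, replace rnorm with runif and swap the mean and sd arguments for min and max, since runif does not accept mean or sd.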
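A minimal sketch of that substitution. The half-width of 10 around the original mean is an assumed value chosen only for illustration; it does not come from the assignment.

```r
set.seed(42)  # fixed seed so the example is reproducible

# Normal version, as used in the examples in this post
x_norm <- round(rnorm(70, mean = 60, sd = 5))

# Uniform counterpart: runif() takes min/max, not mean/sd.
# The +/- 10 half-width is an assumption made for this sketch.
x_unif <- round(runif(70, min = 60 - 10, max = 60 + 10))

range(x_unif)  # always stays inside [50, 70]
```

Unlike the normal version, the uniform version has hard edges, so the resulting clusters look like filled squares rather than blobs that thin out toward the border.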

  • Overlapping
# Generate two overlapping clusters from normal distributions
library(ggplot2)  # provides ggplot(); load it once per session
x1 <- round(rnorm(70, mean = 60, sd = 5))
y1 <- round(rnorm(70, mean = 80, sd = 5))
x2 <- round(rnorm(70, mean = 65, sd = 5))
y2 <- round(rnorm(70, mean = 85, sd = 5))
cluster1 <- data.frame(x = x1, y = y1, label = '1')
cluster2 <- data.frame(x = x2, y = y2, label = '2')
# rbind() on data frames already returns a data frame
clusters <- rbind(cluster1, cluster2)
print(ggplot(data = clusters, mapping = aes(x = x, y = y, shape = label, color = label)) +
    geom_point() +
    labs(title = "Overlapping clusters in normal distribution"))

The result looks like this:

[Figure: scatter plot titled "Overlapping clusters in normal distribution"]
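One way to see how much the overlap matters is to cluster the points and compare the result with the generated labels. The sketch below assumes k-means from base R (stats::kmeans) as the algorithm; the post does not say which algorithms were used, so treat that choice as an assumption.

```r
set.seed(1)  # fixed seed so the example is reproducible

# Same construction as the overlapping example
cluster1 <- data.frame(x = round(rnorm(70, mean = 60, sd = 5)),
                       y = round(rnorm(70, mean = 80, sd = 5)), label = '1')
cluster2 <- data.frame(x = round(rnorm(70, mean = 65, sd = 5)),
                       y = round(rnorm(70, mean = 85, sd = 5)), label = '2')
clusters <- rbind(cluster1, cluster2)

# k-means on the coordinates only; centers = 2 because two classes were generated
fit <- kmeans(clusters[, c("x", "y")], centers = 2, nstart = 10)

# Cross-tabulate the generated labels against the assigned clusters
confusion <- table(generated = clusters$label, assigned = fit$cluster)
print(confusion)
```

The cluster numbers that k-means assigns are arbitrary, so read the table by looking for the dominant cell in each row rather than expecting the diagonal to match.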

  • Noise
# Generate two well-separated clusters from normal distributions, plus noise
x1 <- round(rnorm(70, mean = 60, sd = 5))
y1 <- round(rnorm(70, mean = 80, sd = 5))
x2 <- round(rnorm(70, mean = 90, sd = 5))
y2 <- round(rnorm(70, mean = 100, sd = 5))
# Six noise points, far from both clusters and more spread out
xnoise <- round(rnorm(6, mean = 30, sd = 9))
ynoise <- round(rnorm(6, mean = 70, sd = 9))
cluster1 <- data.frame(x = x1, y = y1, label = '1')
cluster2 <- data.frame(x = x2, y = y2, label = '2')
# The noise points are given label '1'
noise <- data.frame(x = xnoise, y = ynoise, label = '1')
clusters <- rbind(cluster1, cluster2, noise)
print(ggplot(data = clusters, mapping = aes(x = x, y = y, shape = label, color = label)) +
    geom_point() +
    labs(title = "Clusters with noises in normal distribution"))

The result looks like this:

[Figure: scatter plot titled "Clusters with noises in normal distribution"]
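Because the noise points share label '1' with the first cluster, they cannot be told apart afterwards. A possible variation, sketched here with a hypothetical helper make_cluster and an extra is_noise column (neither is in the original code), keeps the same labels but flags the noise rows so they can be filtered or counted separately.

```r
set.seed(7)  # fixed seed so the example is reproducible

# Hypothetical helper: n points around (xmean, ymean) with normal spread sd
make_cluster <- function(n, xmean, ymean, sd, label, is_noise = FALSE) {
    data.frame(x = round(rnorm(n, mean = xmean, sd = sd)),
               y = round(rnorm(n, mean = ymean, sd = sd)),
               label = label,
               is_noise = is_noise)
}

clusters <- rbind(
    make_cluster(70, 60, 80, 5, '1'),
    make_cluster(70, 90, 100, 5, '2'),
    make_cluster(6, 30, 70, 9, '1', is_noise = TRUE)  # same label as in the post
)

# Rows: label, columns: noise flag
table(label = clusters$label, is_noise = clusters$is_noise)
```

With the flag in place, clusters[!clusters$is_noise, ] recovers the clean data, which makes it easy to compare a clustering result with and without the noise.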

  • Ring-shaped data
# Draw num points around (xmean, ymean) with normal jitter
generatePointsByRnom <- function(xmean, ymean, sd, num, label) {
    x <- round(rnorm(num, mean = xmean, sd = sd))
    y <- round(rnorm(num, mean = ymean, sd = sd))
    data.frame(x, y, label)
}

# Place 60 centres on a circle of radius r (one every 6 degrees)
# and draw 2 jittered points around each centre
generateRingShapePointsbyRnom <- function(r, class) {
    cluster <- data.frame()
    for (i in 1:60) {
        # cos() and sin() expect radians, so convert from degrees
        angle <- i * 6 * pi / 180
        x <- round(r * cos(angle))
        y <- round(r * sin(angle))
        cluster <- rbind(cluster, generatePointsByRnom(x, y, 3, 2, class))
    }
    cluster
}

cluster1 <- generateRingShapePointsbyRnom(10, '1')
cluster2 <- generateRingShapePointsbyRnom(20, '2')
clusters <- rbind(cluster1, cluster2)
print(ggplot(data = clusters, mapping = aes(x = x, y = y, shape = label, color = label)) +
    geom_point() +
    labs(title = "Clusters with ring shape in normal distribution"))

The result looks like this:

[Figure: scatter plot titled "Clusters with ring shape in normal distribution"]
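The same ring can also be built without a loop, since cos, sin, and rnorm are all vectorized. The sketch below uses a hypothetical helper make_ring (not part of the original code) with the same settings: 60 centres spaced 6 degrees apart, 2 points per centre, and a standard deviation of 3. The one deliberate difference is that the jitter is added before rounding instead of after rounding the centre.

```r
set.seed(3)  # fixed seed so the example is reproducible

# Hypothetical helper: points scattered around a circle of radius r
make_ring <- function(r, label, per_angle = 2, sd = 3) {
    # 60 angles, 6 degrees apart, converted to radians for cos()/sin()
    angle <- rep(seq(6, 360, by = 6) * pi / 180, each = per_angle)
    n <- length(angle)
    data.frame(x = round(r * cos(angle) + rnorm(n, sd = sd)),
               y = round(r * sin(angle) + rnorm(n, sd = sd)),
               label = label)
}

rings <- rbind(make_ring(10, '1'), make_ring(20, '2'))

# Mean distance from the origin per ring; should sit near 10 and 20
ring_radius <- tapply(sqrt(rings$x^2 + rings$y^2), rings$label, mean)
print(ring_radius)
```

Checking the mean distance from the origin is a quick way to confirm that the two rings really are concentric with the intended radii before handing the data to a clustering algorithm.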

  • Half-ring-shaped data (this one uses the uniform distribution)
# Draw num points uniformly from a 6 x 6 square centred on (x, y)
generatePointsByUniform <- function(x, y, num, label) {
    x <- round(runif(num, x - 3, x + 3))
    y <- round(runif(num, y - 3, y + 3))
    data.frame(x, y, label)
}

# Place 60 centres on an arc of radius r, shifted by (xoffset, yoffset),
# and draw 3 points around each centre
generateRingShapePointsbyUniform <- function(r, xoffset, yoffset, divangle, class) {
    cluster <- data.frame()
    for (i in 1:60) {
        # Angle in radians; with divangle = 20 it runs from 0.05 to 3,
        # just under pi, so the centres trace half a ring.
        # A negative divangle traces the opposite half.
        angle <- i / divangle
        x <- round(r * cos(angle)) + xoffset
        y <- round(r * sin(angle)) + yoffset
        cluster <- rbind(cluster, generatePointsByUniform(x, y, 3, class))
    }
    cluster
}
cluster1 <- generateRingShapePointsbyUniform(10, 0, 0, 20, '1')
cluster2 <- generateRingShapePointsbyUniform(10, 10, 5, -20, '2')
clusters <- rbind(cluster1, cluster2)
print(ggplot(data = clusters, mapping = aes(x = x, y = y, shape = label, color = label)) +
    geom_point() +
    labs(title = "Clusters with half ring shape in uniform distribution"))

The result looks like this:

[Figure: scatter plot titled "Clusters with half ring shape in uniform distribution"]
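Why the divangle values of 20 and -20 produce two opposite half rings can be checked with a little arithmetic on the angles alone. This is only a worked check of the numbers used in the code; the variable names upper and lower are introduced here for illustration.

```r
i <- 1:60

# divangle = 20: angles sweep counter-clockwise through the upper half plane
upper <- i / 20
# divangle = -20: angles sweep clockwise through the lower half plane
lower <- i / -20

range(upper)         # 0.05 3.00
range(lower)         # -3.00 -0.05
max(upper) < pi      # TRUE: the arc stops just short of a full half turn
all(sin(upper) > 0)  # TRUE: every centre lies above its arc's horizontal axis
all(sin(lower) < 0)  # TRUE: every centre lies below its arc's horizontal axis
```

Since the second arc is also shifted by (10, 5), the two half rings interlock rather than mirror each other, which is the shape that is difficult for centroid-based clustering.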



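Copyright notice: This is an original article by bunnysxy, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.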