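For a data mining course assignment, our instructor asked us to use R to generate data sets with different distributions and characteristics and then cluster them. This was my first time using R and I was learning it as I went, so parts of the code are a little clumsy.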
Generating the data sets
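The two-dimensional data sets are drawn from either a uniform or a normal distribution and are split into two classes in advance. To probe how different clustering algorithms behave, I generated data sets that overlap, that contain noise, and that have different shapes.
In R, rnorm generates normally distributed random numbers and runif generates uniformly distributed ones. The examples below all use the normal distribution. To get a uniform distribution instead, replace rnorm with runif and replace the mean and sd arguments with min and max, because runif takes the bounds of an interval rather than a mean and a standard deviation.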
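To make the difference between the two calls concrete, here is a minimal sketch. The seed and the interval [50, 70] (roughly the mean plus or minus two standard deviations) are my own assumptions for illustration, not values from the assignment.

```r
# rnorm() takes a mean and a standard deviation;
# runif() takes the lower and upper bounds of the interval.
set.seed(1)  # assumption: fixed seed so the draws are repeatable
n <- 70

x_norm <- round(rnorm(n, mean = 60, sd = 5))

# Assumed interval: mean +/- 2 sd, i.e. [50, 70].
x_unif <- round(runif(n, min = 50, max = 70))

range(x_unif)  # always inside 50..70; rnorm values have no hard bounds
```

Passing mean and sd to runif fails with an unused-argument error, which is why the arguments have to change along with the function name.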
- Overlapping
# Generate data set with normal distribution and overlap
library(ggplot2)  # provides ggplot(), aes(), geom_point() and labs()
# The two cluster centres are only 5 units apart while sd = 5, so the clusters overlap
x1 <- round(rnorm(70, mean = 60, sd = 5))
y1 <- round(rnorm(70, mean = 80, sd = 5))
x2 <- round(rnorm(70, mean = 65, sd = 5))
y2 <- round(rnorm(70, mean = 85, sd = 5))
cluster1 <- data.frame(x = x1, y = y1, label = '1')
cluster2 <- data.frame(x = x2, y = y2, label = '2')
# rbind() on data frames already returns a data frame
clusters <- rbind(cluster1, cluster2)
print(ggplot(data = clusters, mapping = aes(x = x, y = y, shape = label, color = label)) +
  geom_point() +
  labs(title = "Overlapping clusters in normal distribution"))
The resulting plot shows the two clusters overlapping.
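Because rnorm draws new values on every run, the data set changes each time the script is executed. If a repeatable data set is wanted, one option is to fix the seed first. The sketch below assumes that goal; make_cluster and make_overlap are hypothetical helper names, not part of the original assignment code.

```r
# Hypothetical helper: one labelled, rounded 2-D normal cluster.
make_cluster <- function(n, xmean, ymean, sd, label) {
  data.frame(
    x = round(rnorm(n, mean = xmean, sd = sd)),
    y = round(rnorm(n, mean = ymean, sd = sd)),
    label = label
  )
}

# Hypothetical helper: the overlapping data set, seeded for repeatability.
make_overlap <- function(seed) {
  set.seed(seed)
  rbind(
    make_cluster(70, 60, 80, 5, "1"),
    make_cluster(70, 65, 85, 5, "2")
  )
}

a <- make_overlap(42)
b <- make_overlap(42)
identical(a, b)  # same seed, same data
table(a$label)   # 70 points per class
```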
- Noise
# Generate data set with normal distribution and noise
x1 <- round(rnorm(70, mean = 60, sd = 5))
y1 <- round(rnorm(70, mean = 80, sd = 5))
x2 <- round(rnorm(70, mean = 90, sd = 5))
y2 <- round(rnorm(70, mean = 100, sd = 5))
# Six widely scattered points (sd = 9) away from both clusters act as noise
xnoise <- round(rnorm(6, mean = 30, sd = 9))
ynoise <- round(rnorm(6, mean = 70, sd = 9))
cluster1 <- data.frame(x = x1, y = y1, label = '1')
cluster2 <- data.frame(x = x2, y = y2, label = '2')
# The noise points are assigned to class 1
noise <- data.frame(x = xnoise, y = ynoise, label = '1')
clusters <- rbind(cluster1, cluster2, noise)
print(ggplot(data = clusters, mapping = aes(x = x, y = y, shape = label, color = label)) +
  geom_point() +
  labs(title = "Clusters with noises in normal distribution"))
The resulting plot shows two well-separated clusters plus a few scattered noise points.
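In the listing above the six noise points carry the label '1', so they are drawn as part of the first class. If you would rather see them as their own group, one option is to give them a distinct label. This is only a sketch of that alternative, not what the assignment code did; the label "noise" and the seed are my own choices.

```r
set.seed(7)  # assumption: fixed seed for a repeatable example

cluster1 <- data.frame(x = round(rnorm(70, mean = 60, sd = 5)),
                       y = round(rnorm(70, mean = 80, sd = 5)),
                       label = "1")
cluster2 <- data.frame(x = round(rnorm(70, mean = 90, sd = 5)),
                       y = round(rnorm(70, mean = 100, sd = 5)),
                       label = "2")
# Hypothetical third label so the noise is visible as its own group.
noise <- data.frame(x = round(rnorm(6, mean = 30, sd = 9)),
                    y = round(rnorm(6, mean = 70, sd = 9)),
                    label = "noise")

clusters <- rbind(cluster1, cluster2, noise)
table(clusters$label)  # counts per label
```

With a separate label, the same ggplot call would give the noise points their own shape and color.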
- Ring-shaped data
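# Draw num points around (xmean, ymean) from a normal distribution
generatePointsByRnom <- function(xmean, ymean, sd, num, label) {
  x <- round(rnorm(num, mean = xmean, sd = sd))
  y <- round(rnorm(num, mean = ymean, sd = sd))
  data.frame(x, y, label)
}
# Place 60 centres on a circle of radius r, 6 degrees apart,
# and draw 2 points (sd = 3) around each centre
generateRingShapePointsbyRnom <- function(r, class) {
  cluster <- NULL
  for (i in 1:60) {
    angle <- i * 6 * pi / 180  # cos() and sin() expect radians, not degrees
    x <- round(r * cos(angle))
    y <- round(r * sin(angle))
    cluster <- rbind(cluster, generatePointsByRnom(x, y, 3, 2, class))
  }
  cluster
}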
# Inner ring with radius 10, outer ring with radius 20
cluster1 <- generateRingShapePointsbyRnom(10, '1')
cluster2 <- generateRingShapePointsbyRnom(20, '2')
clusters <- rbind(cluster1, cluster2)
print(ggplot(data = clusters, mapping = aes(x = x, y = y, shape = label, color = label)) +
  geom_point() +
  labs(title = "Clusters with ring shape in normal distribution"))
The resulting plot shows two concentric rings.
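The ring can also be built without a loop, since rnorm accepts a vector of means. The sketch below is one way to do that, assuming 60 steps of 6 degrees converted to radians (R's cos and sin work in radians). make_ring and its default arguments are hypothetical names of mine, not part of the assignment code.

```r
# Hypothetical vectorized helper for one noisy ring.
make_ring <- function(r, label, steps = 60, per_step = 2, sd = 3) {
  # steps angles that cover the full circle, converted from degrees to radians.
  angle <- seq_len(steps) * (360 / steps) * pi / 180
  # Repeat each centre so that per_step points are drawn around it.
  cx <- rep(round(r * cos(angle)), each = per_step)
  cy <- rep(round(r * sin(angle)), each = per_step)
  data.frame(
    x = round(rnorm(length(cx), mean = cx, sd = sd)),
    y = round(rnorm(length(cy), mean = cy, sd = sd)),
    label = label
  )
}

set.seed(3)  # assumption: fixed seed for a repeatable example
ring <- make_ring(10, "1")
nrow(ring)   # 60 centres x 2 points each
```

Growing a data frame with rbind inside a loop copies it on every iteration, so the vectorized form tends to scale better, although for 120 points the difference is negligible.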
- Half-ring data (using the uniform distribution here)
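# Draw num points uniformly from a 6 x 6 square centred on (x, y)
generatePointsByUniform <- function(x, y, num, label) {
  x <- round(runif(num, x - 3, x + 3))
  y <- round(runif(num, y - 3, y + 3))
  data.frame(x, y, label)
}
# Place 60 centres on an arc of radius r around (xoffset, yoffset).
# The angle is i / divangle radians, so divangle sets how far the arc extends
# and its sign sets the direction.
generateRingShapePointsbyUniform <- function(r, xoffset, yoffset, divangle, class) {
  cluster <- NULL
  for (i in 1:60) {
    angle <- i / divangle
    x <- round(r * cos(angle)) + xoffset
    y <- round(r * sin(angle)) + yoffset
    cluster <- rbind(cluster, generatePointsByUniform(x, y, 3, class))
  }
  cluster
}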
# Upper half ring around (0, 0) and lower half ring around (10, 5)
cluster1 <- generateRingShapePointsbyUniform(10, 0, 0, 20, '1')
cluster2 <- generateRingShapePointsbyUniform(10, 10, 5, -20, '2')
clusters <- rbind(cluster1, cluster2)
print(ggplot(data = clusters, mapping = aes(x = x, y = y, shape = label, color = label)) +
  geom_point() +
  labs(title = "Clusters with half ring shape in uniform distribution"))
The resulting plot shows two offset half rings.
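Why the two calls produce half rings follows from the angle arithmetic: with i running from 1 to 60 and divangle equal to 20 or -20, the angle i / divangle stays just short of pi radians in either direction. The short check below works that out; the variable names are mine.

```r
i <- 1:60
angle_pos <- i / 20    # divangle = 20
angle_neg <- i / -20   # divangle = -20

range(angle_pos)              # 0.05 to 3 radians
range(angle_pos) * 180 / pi   # roughly 3 to 172 degrees

# Positive angles below pi have sin > 0 (upper half of the circle);
# the negated angles have sin < 0 (lower half).
all(sin(angle_pos) > 0)
all(sin(angle_neg) < 0)
```

Because 3 is slightly less than pi, each arc stops a little short of a full half circle.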
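Copyright notice: this is an original article by bunnysxy, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.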