[Machine Learning] Testing KNN and SVM Classifiers on the MNIST Dataset

Post author: xfxia
Post published: September 30, 2023
Post category: Other

The previous blog post covered how to read the MNIST data.

This post uses the MNIST dataset to test classifiers.

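KNN nearest-neighbor classifier

KNN is a lazy learning method: its so-called training phase simply stores the training data. At test time, each test sample is placed into that data space, its nearest neighbors are found, and the neighbors' class labels are put to a vote to decide the sample's class.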
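The voting step described above can be sketched in a few lines of NumPy. This is only an illustration under simple assumptions (Euclidean distance, unweighted majority vote, a hypothetical helper named knn_predict, toy 2-D points), not how scikit-learn implements KNeighborsClassifier internally.

```python
from collections import Counter

import numpy as np


def knn_predict(train_x, train_y, query, k=3):
    """Classify one query point by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query to every stored training point
    dists = np.linalg.norm(train_x - query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Count the labels of those neighbors and return the most common one
    votes = Counter(train_y[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data: two well-separated clusters labeled 0 and 1
train_x = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
train_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(train_x, train_y, np.array([0.3, 0.3])))
print(knn_predict(train_x, train_y, np.array([4.8, 5.0])))
```

Note that there is no fitting step at all: the "model" is just the stored arrays, which is why prediction cost grows with the size of the training set.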

from sklearn.neighbors import KNeighborsClassifier
import MNIST  # MNIST reader module (see the previous post)


if __name__ == '__main__':

    # Load the training images (as vectors) and their labels
    im = MNIST.MNIST2vector('train-images.idx3-ubyte')
    label = MNIST.decode_idx1_ubyte('train-labels.idx1-ubyte')

    # "Training" only stores the data; 3 neighbors vote at prediction time
    neigh = KNeighborsClassifier(n_neighbors=3)
    neigh.fit(im, label)

    # Load the test images and labels
    test = MNIST.MNIST2vector('t10k-images.idx3-ubyte')
    test_label = MNIST.decode_idx1_ubyte('t10k-labels.idx1-ubyte')

    # res = neigh.predict(test)
    score = neigh.score(test, test_label)  # mean accuracy on the test set
    print(" score: {:.6f}".format(score))

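The training set has 60k samples and the test set has 10k; the classification accuracy is 97.05%.

With a large dataset, even simple models can achieve fairly good results. Good data actually matters much more than the model.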

SVM classifier

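SVM is one of the best-performing classifiers currently available. Its kernel function (the kernel trick) implicitly maps the data into a high-dimensional space, where the classes can be separated well.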
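To make the kernel trick concrete, here is a small sketch showing that a kernel value computed in the original space equals an inner product in a higher-dimensional feature space. It assumes the simplest case, a homogeneous degree-2 polynomial kernel on 2-D inputs; the function names are hypothetical, and this is not the exact polynomial kernel that SVC uses by default (which has degree 3 plus scaling and offset terms).

```python
import numpy as np


def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel: k(x, y) = (x . y)^2."""
    return float(np.dot(x, y) ** 2)


def poly2_feature_map(x):
    """Explicit feature map for 2-D input whose inner product equals poly2_kernel."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])


x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

# Kernel evaluated directly in the original 2-D space
k_implicit = poly2_kernel(x, y)
# Same quantity via the explicit 3-D feature map
k_explicit = float(np.dot(poly2_feature_map(x), poly2_feature_map(y)))

print(k_implicit, k_explicit)
```

The two numbers agree, but the kernel never builds the higher-dimensional vectors. For 784-pixel images and higher degrees the explicit map would be enormous, which is why working through the kernel is attractive.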

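However, I noticed a few drawbacks of SVM during testing:

1) It does not suit large-scale data. On small datasets SVM captures the nonlinear relationships between features and data well, but on large datasets neither its training speed nor its classification accuracy is entirely satisfactory.

2) Choosing a kernel is difficult, and tuning the parameters is tedious.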
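One common way to ease the kernel and parameter choice is a cross-validated grid search. The sketch below is an illustration only: it uses scikit-learn's small bundled 8x8 digits dataset as a stand-in for MNIST so that it runs quickly, and the grid values are arbitrary assumptions rather than recommended settings.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in data: scikit-learn's bundled 8x8 digits, not MNIST
X, y = load_digits(return_X_y=True)

# Hypothetical search grid; the values are illustrative only
param_grid = {"kernel": ["linear", "poly", "rbf"], "C": [1, 10]}

# 3-fold cross-validation over every kernel/C combination
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
print("cv accuracy: {:.4f}".format(search.best_score_))
```

Each grid point costs one SVM fit per fold, so on the full 60k MNIST training set this kind of search would be slow, which ties back to the first drawback.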

from sklearn.svm import SVC
import MNIST  # MNIST reader module (see the previous post)
import random


if __name__ == '__main__':

    # Load the training and test images (as vectors) and their labels
    im = MNIST.MNIST2vector('train-images.idx3-ubyte')
    label = MNIST.decode_idx1_ubyte('train-labels.idx1-ubyte')

    test = MNIST.MNIST2vector('t10k-images.idx3-ubyte')
    test_label = MNIST.decode_idx1_ubyte('t10k-labels.idx1-ubyte')

    # Training subset: the first 10,000 samples, in shuffled order
    train_idx = list(range(10000))
    random.shuffle(train_idx)
    im_sample = im[train_idx]
    label_sample = label[train_idx]

    # Test subset: the first 200 samples, in shuffled order
    test_idx = list(range(200))
    random.shuffle(test_idx)
    test_sample = test[test_idx]
    test_label_sample = test_label[test_idx]

    # Polynomial-kernel SVM with default parameters
    clf = SVC(kernel='poly')
    clf.fit(im_sample, label_sample)

    score = clf.score(test_sample, test_label_sample)  # mean accuracy
    print(" score: {:.6f}".format(score))

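Training on the full dataset took too long, since the time complexity of SVM training is quite high, so I trained on a sample instead: 10,000 training samples and 200 test samples.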
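The listing above shuffles the indices 0 to 9,999, so the subset is always the first 10,000 training images. If a random draw from the whole dataset is wanted instead, one possible sketch looks like this (the helper name sample_subset and the toy arrays are hypothetical; the real inputs would be the image and label arrays loaded above, assumed to be NumPy arrays):

```python
import numpy as np


def sample_subset(features, labels, n_samples, seed=0):
    """Draw n_samples distinct rows at random, keeping features and labels aligned."""
    rng = np.random.default_rng(seed)
    # Sample row indices from the whole dataset without replacement
    idx = rng.choice(len(features), size=n_samples, replace=False)
    return features[idx], labels[idx]


# Toy stand-in data: row i of features is [2*i, 2*i + 1], label is i
features = np.arange(20).reshape(10, 2)
labels = np.arange(10)

sub_x, sub_y = sample_subset(features, labels, 4)
print(sub_x)
print(sub_y)
```

Using the same index array for both features and labels keeps each image paired with its own label.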

With the polynomial kernel (poly), the accuracy is 97.5%.

The RBF kernel's accuracy is quite poor, only around 12%.
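The post does not say why the RBF kernel does so badly. One plausible cause, stated here as an assumption rather than a diagnosis of the author's run, is that raw pixel values in the range 0 to 255 are fed in together with gamma = 1 / n_features (the gamma='auto' setting, which was the default in older scikit-learn versions). The sketch below uses random stand-in vectors, not real MNIST images, to show how the RBF kernel value between two different samples collapses to zero at that scale.

```python
import numpy as np


def rbf_kernel_value(x, y, gamma):
    """RBF kernel: k(x, y) = exp(-gamma * ||x - y||^2)."""
    diff = x - y
    return float(np.exp(-gamma * np.dot(diff, diff)))


rng = np.random.default_rng(0)
# Two random 784-dimensional "images" with raw pixel values in 0..255
a_raw = rng.integers(0, 256, size=784).astype(np.float64)
b_raw = rng.integers(0, 256, size=784).astype(np.float64)

gamma = 1.0 / 784  # assumed setting: 1 / n_features

# Raw pixels: the squared distance is huge, so the kernel underflows toward 0
k_raw = rbf_kernel_value(a_raw, b_raw, gamma)
# Pixels scaled to [0, 1]: the kernel value stays informative
k_scaled = rbf_kernel_value(a_raw / 255.0, b_raw / 255.0, gamma)

print(k_raw, k_scaled)
```

When every pair of distinct samples has a kernel value near zero, the classifier effectively memorizes the training points and cannot generalize. If that is what happened here, scaling the pixels to [0, 1] or tuning gamma would be the first thing to try.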

Original link: https://blog.csdn.net/shwan_ma/article/details/77606218
