K-Nearest Neighbors (KNN): Implementation and Applications








Distance Metrics






Manhattan Distance


Manhattan distance, also known as city block (taxicab) distance, is one of the simplest ways to compute distance. The formula is:

$$d_{man}=\sum_{i=1}^{N}\left | X_{i}-Y_{i} \right |$$

where:

  • X, Y: the two data points
  • N: the number of features in each data point
  • X_i: the i-th feature of data point X


The formula takes the absolute difference between each pair of corresponding features of X and Y, then sums these differences, which gives the Manhattan distance.
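As a minimal NumPy sketch of the formula above (the function name `manhattan_distance` is ours, not from the original post):

```python
import numpy as np

def manhattan_distance(x, y):
    """Sum of absolute differences between corresponding features."""
    x, y = np.asarray(x), np.asarray(y)
    return np.sum(np.abs(x - y))

# |1 - 4| + |2 - 6| = 3 + 4 = 7
print(manhattan_distance([1, 2], [4, 6]))  # 7
```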







Euclidean Distance


Euclidean distance comes from the formula for the distance between two points in N-dimensional Euclidean space. The expression is:

$$d_{euc}= \sqrt{\sum_{i=1}^{N}(X_{i}-Y_{i})^{2}}$$

  • X, Y: the two data points
  • N: the number of features in each data point
  • X_i: the i-th feature of data point X


The formula takes the squared difference between each pair of corresponding features of X and Y, sums these squares, and finally takes the square root, which gives the Euclidean distance.
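The corresponding NumPy sketch (again, the function name `euclidean_distance` is ours):

```python
import numpy as np

def euclidean_distance(x, y):
    """Square root of the sum of squared feature differences."""
    x, y = np.asarray(x), np.asarray(y)
    return np.sqrt(np.sum((x - y) ** 2))

# sqrt((1-4)^2 + (2-6)^2) = sqrt(9 + 16) = 5
print(euclidean_distance([1, 2], [4, 6]))  # 5.0
```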






Nearest Neighbor Algorithm


Before introducing the K-nearest neighbors algorithm, let's first describe the nearest neighbor algorithm (Nearest Neighbor, NN). Given a data point x of unknown class, it finds the training sample y most similar to x in the training set, and assigns y's class to x, thereby performing classification.
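The NN rule can be sketched in a few lines of NumPy (a toy illustration with made-up data, not code from the original post):

```python
import numpy as np

def nearest_neighbor_classify(train_data, train_labels, x):
    """Assign x the class of its single closest training sample (Euclidean distance)."""
    distances = np.sqrt(np.sum((train_data - x) ** 2, axis=1))
    return train_labels[np.argmin(distances)]

train_data = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0]])
train_labels = np.array(["red", "red", "blue"])
print(nearest_neighbor_classify(train_data, train_labels, np.array([1.1, 0.9])))  # red
```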

[Figure: an unknown sample X_u surrounded by samples of known classes ω1, ω2, ω3]

As shown in the figure above, by computing the distances between X_u (the unknown sample) and the samples of the known classes ω1, ω2, ω3, we can judge how similar X_u is to each training class and thus decide X_u's class. Here, assigning the green unknown sample to the same class as the red known samples is clearly the most reasonable choice.






K-Nearest Neighbors Algorithm


K-nearest neighbors (K-Nearest Neighbors, KNN) is a generalization of the nearest neighbor (NN) algorithm and one of the simplest classification methods in machine learning. Its core idea resembles NN's: classify an unknown sample by finding similar known samples. But NN bases its decision on a single sample, which is too absolute and can yield poor classification results. To remedy this, KNN lets K neighboring samples jointly decide the unknown sample's class, which tolerates noise far better than NN and classifies more accurately.







Algorithm Steps

  1. Data preparation: clean and process the data, turning each record into a vector.
  2. Distance computation: compute the distance between each test sample and the training samples.
  3. Neighbor search: find the K training samples closest to the test sample.
  4. Classification: apply a decision rule to derive the test sample's class from its K neighbors.








Decision Rules


Once the similarity between each test sample and the training samples is known, ranking by similarity yields the K training samples nearest to each test sample. How do we then determine the final class from those K neighbors? The decision rule can be chosen to suit the data, and different rules produce different predictions. The most common rules are:

  • Majority voting: like an election, the class that appears most often among the K neighbors becomes the test sample's class.
  • Weighted voting: each neighbor's vote is weighted by distance, with closer neighbors receiving larger weights; the class with the largest total weight becomes the test sample's class.
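The two rules can be sketched as follows (the helper names are ours; the weighted rule uses 1/distance weights, one common choice):

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Pick the most common class among the K neighbors."""
    return Counter(neighbor_labels).most_common(1)[0][0]

def weighted_vote(neighbor_labels, neighbor_distances):
    """Weight each neighbor's vote by 1/distance (closer => larger weight)."""
    weights = {}
    for label, d in zip(neighbor_labels, neighbor_distances):
        weights[label] = weights.get(label, 0.0) + 1.0 / (d + 1e-12)
    return max(weights, key=weights.get)

labels = ["a", "a", "b"]
print(majority_vote(labels))                   # a
print(weighted_vote(labels, [3.0, 4.0, 0.5]))  # b: one close "b" outweighs two far "a"s
```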







KNN Algorithm Implementation

```python
import numpy as np

def knn_regression(train_data, train_labels, test_data, k):
    """
    Arguments:
    train_data -- training features, numpy.ndarray (2d)
    train_labels -- training targets, numpy.ndarray (1d)
    test_data -- test features, numpy.ndarray (2d)
    k -- the K value

    Returns:
    test_labels -- predicted test targets, numpy.ndarray (1d)
    """
    test_labels = np.array([])
    for j in test_data:
        distances = np.array([])
        for i in train_data:  # Euclidean distance to each training sample
            temp = np.sqrt(np.sum(np.square(j - i)))
            distances = np.append(distances, temp)
        sorted_index = distances.argsort()  # indices sorted by distance
        temp_label = train_labels[sorted_index[:k]]  # targets of the K nearest
        test_labels = np.append(test_labels, np.mean(temp_label))
    return test_labels
```

(Note: the original code computed `np.sqrt(np.square(sum(j - i)))`, which squares the *sum* of the differences rather than summing the squared differences; it is corrected above to the Euclidean distance defined earlier.)






Test Data

```python
import numpy as np

# Training sample features
train_data = np.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 5],
                       [6, 6], [7, 7], [8, 8], [9, 9], [10, 10]])
# Training sample targets
train_labels = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# Test sample features
test_data = np.array([[1.2, 1.3], [3.7, 3.5], [5.5, 6.2], [7.1, 7.9]])
# Predict the test sample targets
knn_regression(train_data, train_labels, test_data, k=3)
```


Output: array([2., 4., 6., 7.])






Lilac Classification






Loading the Dataset

```python
import pandas as pd
import matplotlib.pyplot as plt

lilac_data = pd.read_csv(
    'https://labfile.oss.aliyuncs.com/courses/1081/course-9-syringa.csv')
lilac_data.head()  # preview the first 5 rows
```






Train/Test Split

```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    train_data, train_target, test_size=0.4, random_state=0)
```


  • X_train, X_test, y_train, y_test: the feature training set, feature test set, label training set, and label test set after the split; the features and labels correspond one-to-one.

  • train_data, train_target: the feature set and label set to be split.

  • test_size: the proportion of samples placed in the test set.

  • random_state: the random seed; fixing it guarantees that repeated experiments with the same seed produce the same split.
```python
from sklearn.model_selection import train_test_split

# All feature columns of the lilac dataset: sepal_length, sepal_width, petal_length, petal_width
feature_data = lilac_data.iloc[:, :-1]
label_data = lilac_data["labels"]  # the label column of the lilac dataset

X_train, X_test, y_train, y_test = train_test_split(
    feature_data, label_data, test_size=0.3, random_state=2)

X_test  # inspect the test split
```






Training the Model


sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto')


  • n_neighbors: the K value, i.e. the number of neighbors; defaults to 5.

  • weights: the decision rule, majority or weighted voting; accepts 'uniform' or 'distance'.

  • algorithm: the neighbor-search algorithm ('auto', 'kd_tree', 'ball_tree'), covering brute-force search, kd-tree search, and ball-tree search.
```python
from sklearn.neighbors import KNeighborsClassifier

def sklearn_classify(train_data, label_data, test_data, k_num):
    # Build a KNN model with scikit-learn
    knn = KNeighborsClassifier(n_neighbors=k_num)
    # Fit on the training data
    knn.fit(train_data, label_data)
    # Predict
    predict_label = knn.predict(test_data)
    # Return the predictions
    return predict_label
```






Model Prediction

```python
# Predict on the test data
y_predict = sklearn_classify(X_train, y_train, X_test, 3)
y_predict
```






Accuracy Calculation

```python
def get_accuracy(test_labels, pred_labels):
    # Accuracy: fraction of predictions that match the true labels
    correct = np.sum(test_labels == pred_labels)  # number of correct predictions
    n = len(test_labels)  # total number of test samples
    accur = correct / n
    return accur

get_accuracy(y_test, y_predict)
```
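For a cross-check, scikit-learn ships the same metric as sklearn.metrics.accuracy_score (the labels below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array(["daphne", "syringa", "daphne", "willow"])
y_pred = np.array(["daphne", "daphne", "daphne", "willow"])

# 3 of the 4 predictions match the true labels
print(accuracy_score(y_true, y_pred))  # 0.75
```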






K Value Selection


With K set to 3, the accuracy is not high and the classification result is unsatisfactory. Choosing K has long been a hot topic, and no definitive solution exists; a common rule of thumb is that K should not exceed the square root of the number of samples. We can therefore search for a suitable K by iterating over candidates. Below we plot the accuracy for every K from 2 to 10 to find the best value.

```python
normal_accuracy = []  # empty list to collect accuracies
k_value = range(2, 11)
for k in k_value:
    y_predict = sklearn_classify(X_train, y_train, X_test, k)
    accuracy = get_accuracy(y_test, y_predict)
    normal_accuracy.append(accuracy)

plt.xlabel("k")
plt.ylabel("accuracy")
new_ticks = np.linspace(0.6, 0.9, 10)  # y-axis ticks from 0.6 to 0.9
plt.yticks(new_ticks)
plt.plot(k_value, normal_accuracy, c='r')
plt.grid(True)  # add a grid to the plot
```


[Figure: accuracy for each K value from 2 to 10]

The plot shows that the model accuracy is comparable at K=4 and K=6. When selecting the best model in machine learning we generally also weigh the model's generalization ability, so K=4 is chosen here.
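Selecting K from a single train/test split can be noisy; a common refinement (our sketch, not from the original post, run on scikit-learn's bundled iris data rather than the lilac CSV) is to pick K by cross-validation:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset

best_k, best_score = None, -1.0
for k in range(2, 11):
    # Mean accuracy over 5 cross-validation folds for this K
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    if scores.mean() > best_score:
        best_k, best_score = k, scores.mean()

print(best_k, round(best_score, 3))
```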

Source:

【机器学习】K-近邻算法实现与应用(KNN)_ccql’s Blog-CSDN博客_k近邻算法应用