Uses of Cross-Validation



The following content is reposted from two blog posts; the originals are easy to find with a web search.

1. Cross-validation for tuning model parameters

Accuracy is generally used to judge how good a classification model is.

from sklearn.datasets import load_iris                 # iris dataset
from sklearn.model_selection import train_test_split   # data-splitting utility
from sklearn.neighbors import KNeighborsClassifier     # k-nearest-neighbors (kNN) classifier
from sklearn.model_selection import cross_val_score    # k-fold cross-validation
import matplotlib.pyplot as plt                        # plotting

iris = load_iris()
X = iris.data
y = iris.target

# candidate parameter values
k_range = range(1, 31)
k_scores = []

# iterate over the candidates and record the mean cross-validated accuracy for each
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
    k_scores.append(scores.mean())

# plot the results
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
plt.show()
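Cross-validation is worth the extra fits because a single train/test split is noisy. The `train_test_split` import above goes otherwise unused, so here is a minimal sketch (my addition, not from the original posts) contrasting one fixed split with the 10-fold average:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# one fixed split: the score depends on which rows happen to land in the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("single-split accuracy:", knn.score(X_test, y_test))

# 10-fold CV averages over ten different splits, a more stable estimate
print("10-fold CV accuracy:", cross_val_score(knn, X, y, cv=10,
                                              scoring='accuracy').mean())
```

Changing `random_state` changes the single-split score noticeably, while the cross-validated mean barely moves.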

Mean squared error (MSE) is generally used to judge how good a regression model is. (The example below keeps the kNN classifier from above purely to illustrate switching the scoring metric; a real regression task would use a regressor.)

# reset the score list, then record the mean cross-validated MSE for each k
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    # neg_mean_squared_error is negated (higher is better), so flip the sign back
    loss = -cross_val_score(knn, X, y, cv=10, scoring='neg_mean_squared_error')
    k_scores.append(loss.mean())

plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated MSE')
plt.show()
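For an actual regression task, a sketch along the same lines (my addition, using the built-in diabetes dataset rather than anything from the original posts) would tune k for `KNeighborsRegressor`:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

X_reg, y_reg = load_diabetes(return_X_y=True)

# mean cross-validated MSE for each candidate k
mse_by_k = {}
for k in range(1, 31):
    reg = KNeighborsRegressor(n_neighbors=k)
    # flip the sign of neg_mean_squared_error to get a plain MSE
    mse_by_k[k] = -cross_val_score(reg, X_reg, y_reg, cv=10,
                                   scoring='neg_mean_squared_error').mean()

best_k = min(mse_by_k, key=mse_by_k.get)
print("best k:", best_k)
```

The same pattern as the classification loop applies: one CV score per candidate parameter value, then pick the value that minimizes the loss.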

2. Detecting overfitting

Check over what range the SVC parameter gamma gives a good model, and how overfitting relates to the value of gamma.

from sklearn.model_selection import validation_curve
from sklearn.datasets import load_digits
from sklearn.svm import SVC
import matplotlib.pyplot as plt
import numpy as np

# digits dataset
digits = load_digits()
X = digits.data
y = digits.target

# candidate parameter values
param_range = np.logspace(-6, -2.3, 5)

# validation_curve quickly shows how the parameter affects the model
train_loss, test_loss = validation_curve(
    SVC(), X, y, param_name='gamma', param_range=param_range,
    cv=10, scoring='neg_mean_squared_error')

# mean squared error averaged over the folds (sign flipped back)
train_loss_mean = -np.mean(train_loss, axis=1)
test_loss_mean = -np.mean(test_loss, axis=1)

# plot both curves
plt.plot(param_range, train_loss_mean, 'o-', color='r', label='Training')
plt.plot(param_range, test_loss_mean, 'o-', color='g', label='Cross-validation')
plt.xlabel('gamma')
plt.ylabel('Loss')
plt.legend(loc='best')
plt.show()
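Reading the best gamma off the plot by eye works, but the same arrays answer the question directly. A self-contained sketch (repeating the `validation_curve` call from above so it runs on its own):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
param_range = np.logspace(-6, -2.3, 5)

# same call as above; test_loss holds one negated MSE per (param, fold) pair
train_loss, test_loss = validation_curve(
    SVC(), X, y, param_name='gamma', param_range=param_range,
    cv=10, scoring='neg_mean_squared_error')

# the gamma with the lowest mean cross-validated loss is the best in the grid
test_loss_mean = -np.mean(test_loss, axis=1)
best_gamma = param_range[np.argmin(test_loss_mean)]
print("best gamma:", best_gamma)
```

Past that point on the plot, training loss keeps falling while cross-validation loss rises: the signature of overfitting.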

3. Choosing between models

Cross-validation can also help with model selection. The examples below use the iris data to compare a kNN model against a logistic regression model.

In [13]:

# 10-fold cross-validation with the best KNN model
knn = KNeighborsClassifier(n_neighbors=20)
print(cross_val_score(knn, X, y, cv=10, scoring='accuracy').mean())

0.98

In [14]:

# 10-fold cross-validation with logistic regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
print(cross_val_score(logreg, X, y, cv=10, scoring='accuracy').mean())

0.953333333333
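The two cells above can be folded into one comparison loop. A self-contained sketch (my addition; `max_iter=1000` is added so the newer default solver converges, and exact scores vary with the scikit-learn version):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# the same 10-fold protocol for every candidate keeps the comparison fair
candidates = {
    'knn (k=20)': KNeighborsClassifier(n_neighbors=20),
    'logistic regression': LogisticRegression(max_iter=1000),
}
mean_scores = {name: cross_val_score(est, X, y, cv=10,
                                     scoring='accuracy').mean()
               for name, est in candidates.items()}

best = max(mean_scores, key=mean_scores.get)
print("best model:", best)
```

The key point is that every candidate is scored with identical folds and metric; otherwise the comparison is not apples to apples.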

4. Feature selection

Next we use the Advertising data and cross-validation to select features, comparing how different feature combinations affect the model's predictions.

In [15]:

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

In [16]:

# read in the advertising dataset
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)

In [17]:

# create a Python list of three feature names
feature_cols = ['TV', 'Radio', 'Newspaper']

# use the list to select a subset of the DataFrame (X)
X = data[feature_cols]

# select the Sales column as the response (y)
y = data.Sales

In [18]:

# 10-fold cv with all features
lm = LinearRegression()
scores = cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')
print(scores)

[-3.56038438 -3.29767522 -2.08943356 -2.82474283 -1.3027754  -1.74163618
 -8.17338214 -2.11409746 -3.04273109 -2.45281793]

Note that all the scores above are negative. Why would mean squared error come out negative? Because mean squared error is a loss function, something the optimizer tries to minimize, whereas classification accuracy is a reward function to be maximized. scikit-learn negates loss-type metrics so that every scorer follows a single "higher is better" convention.
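A tiny sketch of this sign convention (the four data points here are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer

# four made-up points, deliberately not on a perfect line
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0.0, 1.0, 2.0, 2.0])
model = LinearRegression().fit(X_demo, y_demo)

# scorers are always "higher is better", so error metrics come back negated
scorer = get_scorer('neg_mean_squared_error')
score = scorer(model, X_demo, y_demo)
print("scorer output:", score)   # negative
print("actual MSE:", -score)     # flip the sign to recover the error
```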

In [19]:

# fix the sign of MSE scores
mse_scores = -scores
print(mse_scores)

[ 3.56038438  3.29767522  2.08943356  2.82474283  1.3027754   1.74163618
  8.17338214  2.11409746  3.04273109  2.45281793]

In [20]:

# convert from MSE to RMSE
rmse_scores = np.sqrt(mse_scores)
print(rmse_scores)

[ 1.88689808  1.81595022  1.44548731  1.68069713  1.14139187  1.31971064
  2.85891276  1.45399362  1.7443426   1.56614748]

In [21]:

# calculate the average RMSE
print(rmse_scores.mean())

1.69135317081

In [22]:

# 10-fold cross-validation with two features (excluding Newspaper)
feature_cols = ['TV', 'Radio']
X = data[feature_cols]
print(np.sqrt(-cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')).mean())

1.67967484191

Since dropping the Newspaper feature yields a smaller error (1.68 < 1.69), the model that excludes Newspaper is the better one.
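The same idea scales to scanning every subset of a feature pool. A sketch on the built-in diabetes dataset (a stand-in so it runs without downloading the Advertising CSV; the pool 'bmi', 'bp', 's5' is an arbitrary illustrative choice):

```python
import itertools
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

diab = load_diabetes(as_frame=True)
X_all, y = diab.data, diab.target
features = ['bmi', 'bp', 's5']   # small illustrative feature pool
lm = LinearRegression()

# 10-fold CV RMSE for every non-empty subset of the pool
results = {}
for r in range(1, len(features) + 1):
    for cols in itertools.combinations(features, r):
        mse = -cross_val_score(lm, X_all[list(cols)], y, cv=10,
                               scoring='neg_mean_squared_error')
        results[cols] = np.sqrt(mse).mean()

best = min(results, key=results.get)
print("best feature subset:", best)
```

Exhaustive subset scans grow as 2^n, so for more than a handful of features a stepwise or regularized approach is the usual substitute.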