1. Ensemble learning

Bagging + Decision Tree -> Random Forest

AdaBoost + Decision Tree -> Boosting Decision Tree (boosting tree)

Gradient Boosting + Decision Tree -> Gradient Boosting Decision Tree (GBDT)

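2. Boosting tree

1> Classification

For classification, the boosting tree uses binary classification trees as base classifiers and the exponential function as the loss function. AdaBoost also uses the exponential loss, so restricting AdaBoost's base classifier to binary classification trees gives the boosting tree algorithm.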
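
As a rough illustration of AdaBoost with tree base classifiers, the sketch below uses scikit-learn's `AdaBoostClassifier`, whose default base estimator is a depth-1 decision tree. The breast cancer dataset and the value `n_estimators=50` are assumptions chosen only for the example; they do not come from the original text.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Binary classification data (the dataset choice is an assumption for illustration).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# With no base estimator given, scikit-learn boosts depth-1 decision trees (stumps).
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

test_accuracy = clf.score(X_test, y_test)
print("testing score: %f" % test_accuracy)
```

Each fitted stump is available in `clf.estimators_`, which makes it easy to confirm that the ensemble really consists of binary trees.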

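2> Regression

For regression, the boosting tree uses binary regression trees as base learners and the squared error as the loss function: $L(y_i, f_{m-1}(x_i)+T(x_i;\theta_m))=(y_i-f_{m-1}(x_i)-T(x_i;\theta_m))^2=(r_{mi}-T(x_i;\theta_m))^2$

Here $T$ is the tree being fitted, $f_{m-1}$ is the current model, and $r_{mi}=y_i-f_{m-1}(x_i)$ is the residual. The loss is therefore minimized by making the new tree approximate the residual.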

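The procedure is as follows:

a. Initialize the base function, $f_0(x)=0$

b. Compute the residuals of the current model:

$r_{mi}=y_i-f_{m-1}(x_i),\quad i=1,2,\dots,n$

c. Fit a new tree $T(x;\theta_m)$ to the residuals. This step is the construction of a CART regression tree whose regression target is the residual.

d. Update the model

$f_m(x)=f_{m-1}(x)+T(x;\theta_m)$

e. Repeat b~d for $m=1,2,\dots,M$

f. Obtain the boosting tree model

$f(x)=f_M(x)=\sum_{m=1}^MT(x;\theta_m)$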
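
Steps a to f can be sketched in a few lines with scikit-learn's `DecisionTreeRegressor` as the CART learner. The helper names `fit_boosting_tree` and `predict_boosting_tree`, the tree depth and the synthetic sine data are all hypothetical choices for illustration, not part of the original post.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosting_tree(X, y, n_trees=50, max_depth=2):
    """Fit regression trees to successive residuals, starting from f_0(x) = 0."""
    trees = []
    prediction = np.zeros(len(y))            # step a: f_0(x) = 0
    for _ in range(n_trees):
        residual = y - prediction            # step b: r_mi = y_i - f_{m-1}(x_i)
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)                # step c: the tree regresses on the residual
        prediction += tree.predict(X)        # step d: f_m = f_{m-1} + T
        trees.append(tree)
    return trees


def predict_boosting_tree(trees, X):
    """Step f: the model is the sum of all fitted trees."""
    return sum(tree.predict(X) for tree in trees)


# Synthetic data (an assumption for the example): noisy sine curve.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

mse_history = []
for m in (1, 10, 50):
    trees = fit_boosting_tree(X, y, n_trees=m)
    mse_history.append(np.mean((y - predict_boosting_tree(trees, X)) ** 2))
print(mse_history)
```

The training error shrinks as more trees are added, because every new tree is a least-squares fit to whatever the previous trees left unexplained.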

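3. GBDT

GBDT builds on the boosting tree: following the steepest descent method, it uses the negative gradient of the loss function as an approximation of the residual and fits the new model to it. The steps are as follows: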

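a. Initialize the base function, typically the constant $f_0(x)=\arg\min_c\sum_i L(y_i,c)$

b. Compute the negative gradient of the current model and use it as the residual:

$r_{mi}=-\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f=f_{m-1}}$

c. Fit a new tree $T(x;\theta_m)$ to the residuals.

d. Update the model

$f_m(x)=f_{m-1}(x)+T(x;\theta_m)$

e. Repeat b~d for $m=1,2,\dots,M$

f. Obtain the boosting tree model

$f(x)=f_M(x)=f_0(x)+\sum_{m=1}^MT(x;\theta_m)$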
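
To make the role of the negative gradient concrete, here is a minimal sketch for the absolute loss $L(y,f)=|y-f|$, whose negative gradient is $\mathrm{sign}(y-f)$ rather than the plain residual. Two simplifications are assumptions of this sketch: the update uses a fixed `learning_rate` (shrinkage) in place of the per-leaf line search that full implementations perform, and the helper names and synthetic data are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def negative_gradient_abs(y, f):
    """Negative gradient of L(y, f) = |y - f| with respect to f."""
    return np.sign(y - f)


def fit_gbdt(X, y, n_trees=200, learning_rate=0.1, max_depth=2):
    f0 = np.median(y)                         # step a: constant minimizing absolute loss
    prediction = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        pseudo_residual = negative_gradient_abs(y, prediction)   # step b
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, pseudo_residual)                              # step c
        prediction += learning_rate * tree.predict(X)             # step d, with shrinkage
        trees.append(tree)
    return f0, trees


def predict_gbdt(f0, trees, X, learning_rate=0.1):
    """Step f: initial constant plus the (shrunken) sum of all trees."""
    return f0 + learning_rate * sum(tree.predict(X) for tree in trees)


# Synthetic data (an assumption for the example): noisy sine curve.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

f0, trees = fit_gbdt(X, y)
mae_before = np.mean(np.abs(y - f0))
mae_after = np.mean(np.abs(y - predict_gbdt(f0, trees, X)))
print(mae_before, mae_after)
```

Swapping `negative_gradient_abs` for `y - f` (up to a constant factor, the negative gradient of the squared error) turns this loop back into the residual-fitting boosting tree, which is why the boosting tree can be seen as a special case of GBDT.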

4. XGBoost

XGBoost (eXtreme Gradient Boosting) is an improved version of GBDT. It improves both the model itself and its computational efficiency.

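1> A regularization term, including an L2 penalty on the leaf weights, is added to the loss function

$Loss=\sum_i L(y_i, f_{m-1}(x_i)+T(x_i;\theta_m ))+\sum_k\Omega (f_k)$

$\Omega (f_k)=\gamma T+\frac{1}{2}\lambda ||w||^2$

In $\Omega$, $T$ denotes the number of leaf nodes of tree $f_k$ (not the tree $T(x;\theta_m)$), $w$ is the vector of leaf weights, and $\gamma$ and $\lambda$ are penalty coefficients. This regularization term limits the complexity of the base models.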

2> The loss is approximated with a second-order expansion

$Loss\approx\sum_i\left[L(y_i, f_{m-1}(x_i))+g_iT(x_i;\theta_m)+h_iT(x_i;\theta_m)^2\right]+\sum_k\Omega (f_k)$

where:

$g_i= \frac{\partial L(y_i, f_{m-1}(x_i))}{\partial f_{m-1}(x_i)}$

$h_i= \frac{1}{2}\frac{\partial^2 L(y_i, f_{m-1}(x_i))}{\partial f_{m-1}(x_i)^2}$

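XGBoost can be parallelized to some extent, trains quickly, handles sparse data, and its regularization strategy helps prevent the model from overfitting.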
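
The second-order objective has a closed-form solution per leaf. With the definitions above (where $h_i$ already contains the factor $\frac{1}{2}$), a leaf with weight $w$ contributes $Gw+Hw^2+\frac{1}{2}\lambda w^2$, with $G=\sum g_i$ and $H=\sum h_i$ over the samples in that leaf, so the minimizer is $w^*=-G/(2H+\lambda)$. The sketch below checks this numerically; the function names, the toy numbers and the choice of squared-error loss are assumptions for illustration, and this is not XGBoost's actual code.

```python
import numpy as np


def leaf_objective(w, g, h, lam):
    """Second-order objective of one leaf with weight w (constant terms dropped)."""
    return np.sum(g) * w + np.sum(h) * w ** 2 + 0.5 * lam * w ** 2


def optimal_leaf_weight(g, h, lam):
    """Closed-form minimizer w* = -G / (2H + lambda), where h includes the 1/2 factor."""
    return -np.sum(g) / (2.0 * np.sum(h) + lam)


# Squared-error loss L = (y - f)^2: g = -2 (y - f) and h = 1/2 * 2 = 1.
y = np.array([3.0, 2.5, 4.0, 3.5])
f_prev = np.array([2.0, 2.0, 3.0, 3.0])
g = -2.0 * (y - f_prev)
h = np.ones_like(y)
lam = 1.0

w_star = optimal_leaf_weight(g, h, lam)

# Brute-force check: evaluate the objective on a grid of candidate weights.
grid = np.linspace(-2.0, 2.0, 4001)
w_grid = grid[np.argmin([leaf_objective(w, g, h, lam) for w in grid])]
print(w_star, w_grid)
```

With `lam = 0` the optimal weight is simply the mean residual of the leaf; a positive `lam` shrinks it toward zero, which is how the L2 term restrains each tree.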

Implementation code:

import sklearn
import numpy as np
from sklearn import datasets, model_selection, ensemble
from matplotlib import pyplot as plt
"""
Classification and regression with gradient boosting trees
"""


""" 
    Regression: load the diabetes dataset
"""


def load_dataset_regression():
    diabetes = datasets.load_diabetes()
    return model_selection.train_test_split(diabetes.data, diabetes.target,
                                            test_size=0.25, random_state=0)


""" 
    Classification: load the handwritten digits dataset
"""


def load_dataset_classfication():
    digits = datasets.load_digits()
    return model_selection.train_test_split(digits.data, digits.target,
                                            test_size=0.25, random_state=0)


"""
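Gradient boosting decision trees for classification
(the signature below is from an older scikit-learn release; newer releases use
loss='log_loss' instead of 'deviance' and no longer accept presort or min_impurity_split)
def __init__(self, loss='deviance', learning_rate=0.1, n_estimators=100,
                 subsample=1.0, criterion='friedman_mse', min_samples_split=2,
                 min_samples_leaf=1, min_weight_fraction_leaf=0.,
                 max_depth=3, min_impurity_decrease=0.,
                 min_impurity_split=None, init=None,
                 random_state=None, max_features=None, verbose=0,
                 max_leaf_nodes=None, warm_start=False,
                 presort='auto', validation_fraction=0.1,
                 n_iter_no_change=None, tol=1e-4):
            loss: loss function; 'deviance' (default) or 'exponential' (exponential loss)
            learning_rate: learning rate
            n_estimators: integer, number of base decision trees
            subsample=1.0: float, fraction of the original training set used to fit each base tree
            warm_start=False: whether to reuse the result of the previous fit
        Methods:
            staged_predict(): returns a generator that yields the ensemble's prediction at the end of each boosting iteration.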
"""


def test_GradientBoostingClassifier(*args):
    X_train, X_test, y_train, y_test = args
    clf = ensemble.GradientBoostingClassifier()
    clf.fit(X_train, y_train)
    print("training score: %f" % clf.score(X_train, y_train))
    print("testing score: %f" % clf.score(X_test, y_test))


def test_GradientBoostingClassifier_maxdepth(*args):
    X_train, X_test, y_train, y_test = args
    max_depth = np.arange(1, 20)
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    test_score = []
    train_score = []
    for i in max_depth:
        clf = ensemble.GradientBoostingClassifier(max_depth=i, max_leaf_nodes=None)
        clf.fit(X_train, y_train)
        train_score.append(clf.score(X_train, y_train))
        test_score.append(clf.score(X_test, y_test))
    ax.plot(max_depth, train_score, label="Training score")
    ax.plot(max_depth, test_score, label="Testing score")
    ax.set_xlabel("max depth")
    ax.set_ylabel("score")
    ax.legend(loc="lower right")
    ax.set_ylim(0, 1.05)
    ax.set_title("Gradient Boosting Classifier")
    plt.show()


"""
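Gradient boosting trees for regression
(the signature below is from an older scikit-learn release; newer releases use
loss='squared_error' and 'absolute_error' instead of 'ls' and 'lad')
__init__(self, loss='ls', learning_rate=0.1, n_estimators=100,
                 subsample=1.0, criterion='friedman_mse', min_samples_split=2,
                 min_samples_leaf=1, min_weight_fraction_leaf=0.,
                 max_depth=3, min_impurity_decrease=0.,
                 min_impurity_split=None, init=None, random_state=None,
                 max_features=None, alpha=0.9, verbose=0, max_leaf_nodes=None,
                 warm_start=False, presort='auto', validation_fraction=0.1,
                 n_iter_no_change=None, tol=1e-4):
                 loss: loss function; 'ls' squared loss, 'lad' absolute loss, 'huber' a combination of the two
                 n_estimators: integer, number of base decision trees
                 max_depth: maximum depth of each individual regression tree
                 subsample: float, fraction of the original dataset used to fit each regression tree; values below 1.0 give stochastic gradient boosting
                 max_features: the max_features setting passed to the base decision trees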
"""


def test_GradientBoostingRegressor(*args):
    X_train, X_test, y_train, y_test = args
    regr = ensemble.GradientBoostingRegressor()
    regr.fit(X_train, y_train)
    print("training score:%f" % regr.score(X_train, y_train))
    print("testing score: %f" % regr.score(X_test, y_test))


if __name__ == '__main__':
    # Classification experiments use the digits data:
    # X_train, X_test, y_train, y_test = load_dataset_classfication()
    # test_GradientBoostingClassifier(X_train, X_test, y_train, y_test)
    # test_GradientBoostingClassifier_maxdepth(X_train, X_test, y_train, y_test)
    # The regression experiment uses the diabetes data:
    X_train, X_test, y_train, y_test = load_dataset_regression()
    test_GradientBoostingRegressor(X_train, X_test, y_train, y_test)
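
The `staged_predict()` method mentioned in the docstrings is handy for seeing how the error evolves as trees are added. The following standalone sketch tracks the test-set mean squared error after every boosting iteration on the diabetes data; the variable names and `random_state=0` are assumptions for the example, and picking the best iteration on the test split is done here only for illustration.

```python
import numpy as np
from sklearn import datasets, ensemble, model_selection

X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.25, random_state=0)

regr = ensemble.GradientBoostingRegressor(n_estimators=100, random_state=0)
regr.fit(X_train, y_train)

# staged_predict yields the ensemble prediction after each boosting iteration.
test_mse = [np.mean((y_test - y_pred) ** 2)
            for y_pred in regr.staged_predict(X_test)]
best_n = int(np.argmin(test_mse)) + 1
print("trees fitted: %d, lowest test MSE at %d trees" % (len(test_mse), best_n))
```

In practice a separate validation split would be used for this kind of selection, so that the test score stays an unbiased estimate.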

Original link: https://blog.csdn.net/chaichai1997/article/details/105600447
