GBDT and XGBoost


1. Ensemble learning

Bagging + Decision Tree -> Random Forest

AdaBoost + Decision Tree -> Boosting Decision Tree (boosting tree)

Gradient Boosting + Decision Tree -> Gradient Boosting Decision Tree (GBDT)

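2. Boosting trees

1> Classification

For classification, the base classifier is a binary classification tree and the loss function is the exponential loss. AdaBoost also minimizes the exponential loss, so restricting AdaBoost's base classifier to binary classification trees gives the boosting tree algorithm.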
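
As a rough illustration of the classification case, the sketch below runs AdaBoost with depth-1 classification trees and the usual exponential-loss reweighting. The function names, the number of rounds, and the {-1, +1} label encoding are assumptions made for this example, not part of the original text.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def adaboost_fit(X, y, n_rounds=20):
    """AdaBoost with depth-1 classification trees; y must take values in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                  # start from uniform sample weights
    trees, alphas = [], []
    for _ in range(n_rounds):
        tree = DecisionTreeClassifier(max_depth=1, random_state=0)
        tree.fit(X, y, sample_weight=w)
        pred = tree.predict(X)
        # Weighted training error (the weights sum to 1), clipped to avoid log(0).
        err = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        # Exponential-loss reweighting: misclassified samples gain weight.
        w = w * np.exp(-alpha * y * pred)
        w = w / w.sum()
        trees.append(tree)
        alphas.append(alpha)
    return trees, alphas


def adaboost_predict(trees, alphas, X):
    """Sign of the weighted vote of all trees."""
    score = sum(a * t.predict(X) for t, a in zip(trees, alphas))
    return np.where(score >= 0, 1, -1)
```

Each round trains a tree on the reweighted samples, so later trees concentrate on the points that earlier trees got wrong.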

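2> Regression

For regression, the base learner is a binary regression tree and the loss function is the squared error:

L(y_i, f_{m-1}(x_i)+T(x_i;\theta_m))=(y_i-f_{m-1}(x_i)-T(x_i;\theta_m))^2=(r_{mi}-T(x_i;\theta_m))^2

Here T is the tree fitted in round m, f_{m-1} is the current model, and r_{mi}=y_i-f_{m-1}(x_i) is called the residual. Minimizing the loss therefore means making the new tree approximate the residual r.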

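The procedure is as follows:

a. Initialize the base function, f_0(x)=0

b. Compute the residuals of the current model:

r_{mi}=y_i-f_{m-1}(x_i),       i=1,2,…,n

c. Fit a new model T(x;\theta_m) to the residuals. This step is simply the construction of a CART regression tree whose regression target is the residual.

d. Update the model:

f_m(x)=f_{m-1}(x)+T(x;\theta_m)

e. Repeat steps b to d for m=1,2,…,M

f. Obtain the boosting tree model:

f(x)=f_M(x)=\sum_{m=1}^MT(x;\theta_m)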
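
Steps a to f can be sketched in a few lines with scikit-learn's `DecisionTreeRegressor` as the CART learner. The function names, the tree depth, and the default M are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosting_tree(X, y, M=50, max_depth=2):
    """Boosting tree for squared loss: every tree is fit to the current residuals."""
    trees = []
    f = np.zeros(len(y))                           # a. f_0(x) = 0
    for _ in range(M):                             # e. m = 1, ..., M
        r = y - f                                  # b. residuals of the current model
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, r)                             # c. regression tree on the residuals
        f = f + tree.predict(X)                    # d. f_m = f_{m-1} + T
        trees.append(tree)
    return trees


def predict_boosting_tree(trees, X):
    """f. f_M(x) is the sum of all fitted trees."""
    return sum(tree.predict(X) for tree in trees)
```

On the training set, the squared error cannot increase from one round to the next, because each tree predicts the mean residual within its leaves.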

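3. GBDT

  GBDT builds on the boosting tree: following the steepest descent method, it uses the negative gradient of the loss function as an approximation of the residual and fits the new model to it. The steps are:

a. Initialize the base function f_0(x)

b. Compute the negative gradient of the loss at the current model and use it as the residual:

r_{mi}=-[\frac{\partial L(y_i, f_{m-1}(x_i))}{\partial f_{m-1}(x_i)}],       i=1,2,…,n

c. Fit a new model T(x;\theta_m) to the residuals

d. Update the model:

f_m(x)=f_{m-1}(x)+T(x;\theta_m)

e. Repeat steps b to d for m=1,2,…,M

f. Obtain the boosting tree model:

f(x)=f_M(x)=f_0(x)+\sum_{m=1}^MT(x;\theta_m)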
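
To show that only step b changes relative to the boosting tree, here is a minimal sketch that takes the negative gradient as a pluggable function and uses the logistic loss for binary labels as an example. The names, the choice f_0(x)=0, and the unit step size are assumptions of this sketch. Practical implementations usually add a learning rate and a per-leaf line search, which the steps above do not cover.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def neg_grad_log_loss(y, f):
    """Negative gradient of L(y, f) = log(1 + exp(f)) - y * f, for y in {0, 1}."""
    return y - sigmoid(f)


def fit_gbdt(X, y, neg_grad, M=50, max_depth=2):
    """Generic loop: each tree is fit to the negative gradient at the current model."""
    trees = []
    f = np.zeros(len(y))                           # a. f_0(x) = 0 (assumed here)
    for _ in range(M):
        r = neg_grad(y, f)                         # b. negative gradient as residual
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, r)                             # c. fit T(x; theta_m)
        f = f + tree.predict(X)                    # d. update the model
        trees.append(tree)
    return trees


def gbdt_raw_score(trees, X):
    """f_M(x); pass it through sigmoid() to get a probability for the log loss."""
    return sum(tree.predict(X) for tree in trees)
```

Passing `lambda y, f: y - f` as `neg_grad` recovers the residual-fitting boosting tree, since that is the negative gradient of the loss (y - f)^2 / 2.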

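4. XGBoost

  XGBoost (eXtreme Gradient Boosting) is an improved version of GBDT. The improvements concern both the model itself and its computational efficiency.

  1> A regularization term is added to the loss function

Loss=\sum_i L(y_i, f_{m-1}(x_i)+T(x_i;\theta_m))+\sum_k\Omega (f_k)

\Omega (f_k)=\gamma T+\frac{1}{2}\lambda ||w||^2

In \Omega, T (written without arguments) is the number of leaf nodes of tree f_k and w is the vector of its leaf weights, so the \lambda term is an L2 penalty on the leaf weights. This regularization term limits the complexity of the base models.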
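
For a single tree the penalty is easy to evaluate by hand. The helper below is a hypothetical illustration (the function name and the sample numbers are made up for this example); it takes the leaf weights of one tree and returns \Omega.

```python
import numpy as np


def tree_penalty(leaf_weights, gamma, lam):
    """Omega(f) = gamma * (number of leaves) + 0.5 * lam * ||w||^2 for one tree."""
    w = np.asarray(leaf_weights, dtype=float)
    return gamma * w.size + 0.5 * lam * np.sum(w ** 2)


# A tree with three leaves: 1.0 * 3 + 0.5 * 2.0 * (0.25 + 0.04 + 0.01), about 3.3
print(tree_penalty([0.5, -0.2, 0.1], gamma=1.0, lam=2.0))
```

A larger gamma makes every extra leaf more expensive, while a larger lambda shrinks the leaf weights toward zero.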

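 2> Second-order derivatives are used to fit the residual

The loss is approximated by its second-order Taylor expansion around f_{m-1}:

Loss\approx\sum_i[L(y_i, f_{m-1}(x_i))+g_iT(x_i;\theta_m)+h_iT(x_i;\theta_m)^2]+\sum_k\Omega (f_k)

where:

g_i= \frac{\partial L(y_i, f_{m-1}(x_i))}{\partial f_{m-1}(x_i)}

h_i= \frac{1}{2}\frac{\partial^2 L(y_i, f_{m-1}(x_i))}{\partial f_{m-1}(x_i)^2}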
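
As a small worked case, the sketch below evaluates g_i and h_i for the squared loss L(y, f)=(y-f)^2, keeping the convention used here in which h_i already contains the factor 1/2. It also computes the leaf weight that minimizes the expanded loss plus the \lambda penalty over one leaf, w = -\sum g_i / (2\sum h_i + \lambda), which follows from setting the derivative with respect to w to zero. The function names are assumptions for illustration.

```python
import numpy as np


def grad_hess_squared(y, f_prev):
    """g_i and h_i for L(y, f) = (y - f)^2, with h_i = 0.5 * d2L/df2."""
    g = -2.0 * (y - f_prev)
    h = np.ones_like(y, dtype=float)       # 0.5 * 2 = 1 for every sample
    return g, h


def optimal_leaf_weight(g, h, lam):
    """Minimizer of sum_i (g_i * w + h_i * w^2) + 0.5 * lam * w^2 for one leaf."""
    return -np.sum(g) / (2.0 * np.sum(h) + lam)


y = np.array([3.0, 5.0])
f_prev = np.array([1.0, 1.0])
g, h = grad_hess_squared(y, f_prev)
print(optimal_leaf_weight(g, h, lam=0.0))  # 3.0, the mean residual
print(optimal_leaf_weight(g, h, lam=2.0))  # 2.0, shrunk by the L2 penalty
```

For the squared loss the second-order expansion is exact, so with \lambda=0 the optimal leaf weight is just the mean residual, which matches the boosting tree for regression.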

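XGBoost can be parallelized to some extent, trains quickly, and can handle sparse data; its regularization helps prevent overfitting.

Implementation (the code below uses the gradient boosting estimators in scikit-learn, not the xgboost package):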

import sklearn
import numpy as np
from sklearn import datasets, model_selection, ensemble
from matplotlib import pyplot as plt
"""
Classification and regression with gradient boosted trees
"""


""" 
    Regression: load the diabetes dataset
"""


def load_dataset_regression():
    diabetes = datasets.load_diabetes()
    return model_selection.train_test_split(diabetes.data, diabetes.target,
                                            test_size=0.25, random_state=0)


""" 
    Classification: load the handwritten digits dataset
"""


def load_dataset_classfication():
    digits = datasets.load_digits()
    return model_selection.train_test_split(digits.data, digits.target,
                                            test_size=0.25, random_state=0)


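"""
Gradient boosted decision trees for classification
(signature as quoted from an older scikit-learn release; option names may differ in newer releases)
def __init__(self, loss='deviance', learning_rate=0.1, n_estimators=100,
                 subsample=1.0, criterion='friedman_mse', min_samples_split=2,
                 min_samples_leaf=1, min_weight_fraction_leaf=0.,
                 max_depth=3, min_impurity_decrease=0.,
                 min_impurity_split=None, init=None,
                 random_state=None, max_features=None, verbose=0,
                 max_leaf_nodes=None, warm_start=False,
                 presort='auto', validation_fraction=0.1,
                 n_iter_no_change=None, tol=1e-4):
            loss: loss function; 'deviance' (default) or 'exponential' (exponential loss)
            learning_rate: learning rate
            n_estimators: integer, number of base decision trees
            subsample=1.0: float, fraction of the original training set used to train each base tree
            warm_start=False: whether to reuse the result of the previous fit
        Methods:
            staged_predict(): returns a generator that yields the ensemble's predictions after each boosting round.
"""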


def test_GradientBoostingClassifier(*args):
    X_train, X_test, y_train, y_test = args
    clf = ensemble.GradientBoostingClassifier()
    clf.fit(X_train, y_train)
    print("training score:%f" %clf.score(X_train, y_train))
    print("testing score: %f" %clf.score(X_test, y_test))


def test_GradientBoostingClassifier_maxdepth(*args):
    X_train, X_test, y_train,  y_test = args
    max_depth = np.arange(1, 20)
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    test_score = []
    train_score = []
    for i in max_depth:
        clf = ensemble.GradientBoostingClassifier(max_depth=i, max_leaf_nodes=None)
        clf.fit(X_train, y_train)
        train_score.append(clf.score(X_train, y_train))
        test_score.append(clf.score(X_test, y_test))
    ax.plot(max_depth, train_score, label="Training score")
    ax.plot(max_depth, test_score, label="Testing score")
    ax.set_xlabel("max depth")
    ax.set_ylabel("score")
    ax.legend(loc="lower right")
    ax.set_ylim(0, 1.05)
    ax.set_title("Gradient Boosting Classifier")
    plt.show()


"""
Gradient boosted trees for regression
__init__(self, loss='ls', learning_rate=0.1, n_estimators=100,
                 subsample=1.0, criterion='friedman_mse', min_samples_split=2,
                 min_samples_leaf=1, min_weight_fraction_leaf=0.,
                 max_depth=3, min_impurity_decrease=0.,
                 min_impurity_split=None, init=None, random_state=None,
                 max_features=None, alpha=0.9, verbose=0, max_leaf_nodes=None,
                 warm_start=False, presort='auto', validation_fraction=0.1,
                 n_iter_no_change=None, tol=1e-4):
                 loss: loss function; 'ls' squared loss, 'lad' absolute loss, 'huber' a combination of the two
                 n_estimators: integer, number of base decision trees
                 max_depth: maximum depth of each regression tree
                 subsample: float, fraction of the original data set used to train each regression tree; values below 1.0 give stochastic gradient boosting
                 max_features: the max_features setting of each base decision tree
"""


def test_GradientBoostingRegressor(*args):
    X_train, X_test, y_train, y_test = args
    regr = ensemble.GradientBoostingRegressor()
    regr.fit(X_train, y_train)
    print("training score:%f" % regr.score(X_train, y_train))
    print("testing score: %f" % regr.score(X_test, y_test))


if __name__ == '__main__':
    # Classification on the digits dataset: uncomment these lines to run it.
    # X_train, X_test, y_train, y_test = load_dataset_classfication()
    # test_GradientBoostingClassifier(X_train, X_test, y_train, y_test)
    # test_GradientBoostingClassifier_maxdepth(X_train, X_test, y_train, y_test)

    # Regression on the diabetes dataset.
    X_train, X_test, y_train, y_test = load_dataset_regression()
    test_GradientBoostingRegressor(X_train, X_test, y_train, y_test)
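
The staged prediction method mentioned in the classifier notes is not exercised by the listing above. A minimal sketch of how it could be used to track test accuracy after each boosting round follows; the small `n_estimators=20` is an assumption chosen only to keep the run short.

```python
from sklearn import datasets, ensemble, model_selection
from sklearn.metrics import accuracy_score

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

clf = ensemble.GradientBoostingClassifier(n_estimators=20, random_state=0)
clf.fit(X_train, y_train)

# staged_predict yields the ensemble's predictions after each boosting round,
# so one fitted model gives the whole accuracy-versus-rounds curve.
staged_acc = [accuracy_score(y_test, y_pred)
              for y_pred in clf.staged_predict(X_test)]
for i, acc in enumerate(staged_acc, start=1):
    print("round %2d  test accuracy %.3f" % (i, acc))
```

Reading the curve this way avoids refitting the model once per candidate value of `n_estimators`.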

 


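Copyright notice: This is an original article by chaichai1997, licensed under CC 4.0 BY-SA. When reposting, include a link to the original source and this notice.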