Time Series Forecasting with XGBoost: A Code Walkthrough






Raw data

The data has two columns: a timestamp (Datetime) and the electricity consumption (PJME_MW):

Datetime,PJME_MW
2002-12-31 01:00:00,26498.0
2002-12-31 02:00:00,25147.0
2002-12-31 03:00:00,24574.0
2002-12-31 04:00:00,24393.0
2002-12-31 05:00:00,24860.0
2002-12-31 06:00:00,26222.0
2002-12-31 07:00:00,28702.0
2002-12-31 08:00:00,30698.0
...
2018-01-01 19:00:00,44343.0
2018-01-01 20:00:00,44284.0
2018-01-01 21:00:00,43751.0
2018-01-01 22:00:00,42402.0
2018-01-01 23:00:00,40164.0
2018-01-02 00:00:00,38608.0



Preparing the training and test sets

Split the data into a training set and a test set at 2015-01-01:

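import pandas as pd

# Load the hourly series; the first column is parsed into a DatetimeIndex.
pjme = pd.read_csv('PJME_hourly.csv', index_col=[0], parse_dates=[0])
split_date = '2015-01-01'
# Rows up to and including 2015-01-01 00:00:00 go to training, later rows to test.
pjme_train = pjme.loc[pjme.index <= split_date].copy()
pjme_test = pjme.loc[pjme.index > split_date].copy()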
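
To make the boundary behaviour of this split concrete, here is a minimal sketch on a toy hourly index (the `toy` frame and its `load` column are illustrative assumptions, not the PJME data). When a DatetimeIndex is compared with the string '2015-01-01', pandas reads the string as midnight of that day, so the midnight row lands in the training set and the test set starts one hour later.

```python
import pandas as pd

# Toy hourly index straddling the split date (illustrative values only).
idx = pd.to_datetime([
    "2014-12-31 22:00:00",
    "2014-12-31 23:00:00",
    "2015-01-01 00:00:00",
    "2015-01-01 01:00:00",
    "2015-01-01 02:00:00",
])
toy = pd.DataFrame({"load": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=idx)

split_date = "2015-01-01"  # interpreted as 2015-01-01 00:00:00 in the comparison
train = toy.loc[toy.index <= split_date]
test = toy.loc[toy.index > split_date]

# The two parts are disjoint and together cover every row.
print(train.index.max())
print(test.index.min())
```

Because the two masks are exact complements, no row is dropped or duplicated, and every test timestamp is later than every training timestamp.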

Build the features:

def create_features(df, label=None):
    """Add calendar features derived from the DatetimeIndex; return X, or (X, y) if label is given."""
    df['date'] = df.index  # copy the DatetimeIndex into a column so the .dt accessor can be used
    df['hour'] = df['date'].dt.hour
    df['day_of_week'] = df['date'].dt.dayofweek  # Monday=0 ... Sunday=6
    df['quarter'] = df['date'].dt.quarter
    df['month'] = df['date'].dt.month
    df['year'] = df['date'].dt.year
    df['day_of_year'] = df['date'].dt.dayofyear
    df['day_of_month'] = df['date'].dt.day
    # Series.dt.weekofyear was removed in pandas 2.0; isocalendar().week replaces it.
    df['week_of_year'] = df['date'].dt.isocalendar().week.astype(int)

    X = df[['hour', 'day_of_week', 'quarter', 'month', 'year', 'day_of_year', 'day_of_month', 'week_of_year']]
    if label:
        y = df[label]
        return X, y
    return X

# Training set
X_train, y_train = create_features(pjme_train, label='PJME_MW')
# Test set
X_test, y_test = create_features(pjme_test, label='PJME_MW')
X_train:

                     hour  day_of_week  quarter  month  year  day_of_year  day_of_month  week_of_year
Datetime                                                                                             
2002-12-31 01:00:00     1            1        4     12  2002          365            31             1
2002-12-31 02:00:00     2            1        4     12  2002          365            31             1
2002-12-31 03:00:00     3            1        4     12  2002          365            31             1
2002-12-31 04:00:00     4            1        4     12  2002          365            31             1
2002-12-31 05:00:00     5            1        4     12  2002          365            31             1
...
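
The first row of the table can be checked by hand. The sketch below recomputes the eight features for 2002-12-31 01:00:00 using only the standard library; `calendar_features` is a hypothetical helper written for this illustration and is not part of the original code.

```python
from datetime import datetime

def calendar_features(ts):
    """Return the eight calendar features for a single datetime (stdlib-only sketch)."""
    return {
        "hour": ts.hour,
        "day_of_week": ts.weekday(),          # Monday=0 ... Sunday=6
        "quarter": (ts.month - 1) // 3 + 1,   # months 1-3 -> 1, ..., 10-12 -> 4
        "month": ts.month,
        "year": ts.year,
        "day_of_year": ts.timetuple().tm_yday,
        "day_of_month": ts.day,
        "week_of_year": ts.isocalendar()[1],  # ISO 8601 week number
    }

first_row = calendar_features(datetime(2002, 12, 31, 1, 0, 0))
print(first_row)
```

Note that week_of_year is 1 even though the date is in December: under ISO 8601 numbering, 2002-12-31 (a Tuesday) already belongs to week 1 of 2003. A tree model sees this as a jump from week 52 to week 1 just before the year changes.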



Model -> training -> prediction

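import xgboost as xgb

# Model (XGBoost >= 2.0 takes early_stopping_rounds in the constructor;
# older releases accepted it as an argument to fit(), as the original post did)
reg = xgb.XGBRegressor(n_estimators=1000, early_stopping_rounds=50)
# Training: the last eval_set entry (here the test set) drives early stopping
reg.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)])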
[0]	validation_0-rmse:29710.4	validation_1-rmse:28762.5
Multiple eval metrics have been passed: 'validation_1-rmse' will be used for early stopping.

Will train until validation_1-rmse hasn't improved in 50 rounds.
[1]	validation_0-rmse:26822.6	validation_1-rmse:25892.2
[2]	validation_0-rmse:24211.2	validation_1-rmse:23286.6
[3]	validation_0-rmse:21885.1	validation_1-rmse:20967.5
[4]	validation_0-rmse:19780.3	validation_1-rmse:18868.5
...
[195]	validation_0-rmse:2844.33	validation_1-rmse:3754.45
[196]	validation_0-rmse:2842.94	validation_1-rmse:3754.73
[197]	validation_0-rmse:2840.57	validation_1-rmse:3754.88
[198]	validation_0-rmse:2838.73	validation_1-rmse:3754.71
[199]	validation_0-rmse:2837.81	validation_1-rmse:3753.66
Stopping. Best iteration:
[149]	validation_0-rmse:2923.17	validation_1-rmse:3712.2
# Prediction
y_pred = reg.predict(X_test)
[28804.365 27663.098 27125.912 ... 34988.7   32725.598 31440.66 ]



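Evaluation

RMSE: root mean square error. This is the metric reported as validation_0-rmse (training set) and validation_1-rmse (test set) in the training log; the best iteration, [149], reached 3712.2 on the test set.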
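
For n observations with actual values y_i and predictions p_i, RMSE = sqrt((1/n) * sum((y_i - p_i)^2)). The following is a minimal, dependency-free sketch of that formula; the function name `rmse` and the sample values are assumptions for illustration and do not come from the PJME results.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between two equal-length, non-empty sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must not be empty")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Illustrative values only.
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
print(rmse(actual, predicted))
```

Applied to the walkthrough's own variables, a call such as rmse(y_test, y_pred) should give a value close to the validation_1-rmse of the iteration used for prediction, since both are computed on the same test set.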