XGBoost can't predict a simple sinusoidal function

Question

I created a very simple function to test XGBoost.

X is an array of 1000 values evenly spaced between 0 and "7*np.pi" (one value per row).
Y is simply "1 + 0.5*np.sin(x)".

I split the dataset into 800 training and 200 testing rows. Shuffle MUST be False to simulate future occurrences, making sure the last 200 rows are reserved for testing.

import numpy as np
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
from sklearn.metrics import mean_squared_error as MSE
from xgboost import XGBRegressor

N = 1000                       # 1000 rows
x = np.linspace(0, 7*np.pi, N) # 1000 evenly spaced points on [0, 7*pi]
y = 1 + 0.5*np.sin(x)          # Target: y = 1 + 0.5*sin(x)

# Train-test split, intentionally use shuffle=False to simulate time series
X = x.reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=False)
### Interestingly, the model generalizes well if shuffle=True
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=True)

XGB_reg = XGBRegressor(random_state=42)
XGB_reg.fit(X_train,y_train)

# EVALUATE ON TRAIN DATA
yXGBPredicted = XGB_reg.predict(X_train)
rmse = np.sqrt(MSE(y_train, yXGBPredicted))
print("RMSE TRAIN XGB: % f" %(rmse))

# EVALUATE ON TEST DATA
yXGBPredicted = XGB_reg.predict(X_test)
# XGB test metrics
rmse = np.sqrt(MSE(y_test, yXGBPredicted))
print("RMSE TEST XGB: % f" %(rmse))

# Predict full dataset
yXGB = XGB_reg.predict(X)

# Plot and compare
plt.style.use('fivethirtyeight')
plt.rcParams.update({'font.size': 16})
fig, ax = plt.subplots(figsize=(10,5))
plt.plot(x, y)
plt.plot(x, yXGB)
plt.ylim(0,2)
plt.xlabel("x")
plt.ylabel("y")
plt.show()

I trained the model on the first 800 rows and then predicted the next 200 rows.

I was expecting the test RMSE to be low, comparable to the training RMSE, but it was not.

I was surprised to see that XGBoost simply repeated the last value of the training set for every row of the test predictions (see chart).

Any ideas why this doesn't work?
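
One way to make the flat-line behavior explicit is to count the distinct predicted values on the test split (a small check added for illustration, using the fitted XGB_reg from the snippet above):

# All 200 held-out rows lie beyond the training range, so every one of them
# should follow the same path through each tree; expect a single unique value.
test_preds = XGB_reg.predict(X_test)
print(np.unique(test_preds))               # one constant value
print(test_preds.min(), test_preds.max())  # min == max confirms the flat line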


Answer 1

Score: 3


You're asking your model to "extrapolate" - making predictions for x values greater than any x value in the training dataset. Extrapolation works with some model types (such as linear models), but it typically does not work with decision tree models and their ensembles (such as XGBoost).
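
As a quick contrast (a hypothetical side-by-side, not part of the original answer), a linear model extends its fitted trend to out-of-range inputs instead of returning a constant:

from sklearn.linear_model import LinearRegression

# A linear model extrapolates along its fitted line, so out-of-range
# inputs produce different, trend-following predictions.
lin_reg = LinearRegression().fit(X_train, y_train)
print(lin_reg.predict(np.array([[20.0], [30.0], [40.0]])))  # three distinct values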

If you switch from XGBoost to LightGBM, then you can train extrapolation-capable decision tree ensembles using the "linear tree" approach, as sketched below.
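
A minimal sketch of that idea, assuming a LightGBM version with linear_tree support (3.0+) and reusing the imports and data from the question; the predictions will extend each leaf's fitted trend rather than repeat a constant, though they still won't recover a sinusoid:

from lightgbm import LGBMRegressor

# linear_tree=True fits a linear model in each leaf instead of a constant,
# so inputs beyond the training range follow the last leaf's linear trend.
lgbm_reg = LGBMRegressor(linear_tree=True, random_state=42)
lgbm_reg.fit(X_train, y_train)
print(np.sqrt(MSE(y_test, lgbm_reg.predict(X_test))))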

> Any ideas why this doesn't work?

Your XGBRegressor is probably over-fitted (it uses the defaults n_estimators = 100 and max_depth = 6). If you decrease those parameter values, then the red line will appear more jagged, and it will be easier for you to see it "working".
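
For example (an illustrative smaller configuration, not from the original answer):

# Fewer, shallower trees under-fit the training sine wave, making the
# step-wise structure of the tree ensemble visible in the plot.
XGB_small = XGBRegressor(n_estimators=10, max_depth=2, random_state=42)
XGB_small.fit(X_train, y_train)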

Right now, if you ask your over-fitted XGBRegressor to extrapolate, it basically functions as a giant look-up table. When extrapolating towards +Inf, the "closest match" is at x = 17.5; when extrapolating towards -Inf, the "closest match" is at x = 0.0.
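
To make the look-up-table behavior concrete (an illustrative probe, assuming the fitted XGB_reg from the question):

# Any x above the training maximum (~17.6) exceeds every split threshold,
# so all such inputs land in the same rightmost leaf of every tree.
print(XGB_reg.predict(np.array([[18.0], [100.0], [1000.0]])))  # identical values

# Symmetrically, anything below the training minimum maps to the left edge.
print(XGB_reg.predict(np.array([[-1.0], [-1000.0]])))          # identical pair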
