Auto ARIMA in Python results in poor fitting prediction of trend
Question
I'm new to ARIMA and am attempting to model a dataset in Python using auto ARIMA.
I'm using auto-ARIMA because I believe it will be better at choosing the values of p, d and q; however, the results are poor and I need some guidance.
Please see my reproducible attempt below:
# DEPENDENCIES
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pmdarima as pm
from pmdarima.model_selection import train_test_split
from statsmodels.tsa.stattools import adfuller
from pmdarima.arima import ADFTest
from pmdarima import auto_arima
from sklearn.metrics import r2_score
# CREATE DATA
data_plot = pd.DataFrame(data removed)
# SET INDEX
data_plot['date_index'] = pd.to_datetime(data_plot['date'])
data_plot.set_index('date_index', inplace=True)
# CREATE ARIMA DATASET
arima_data = data_plot[['value']]
arima_data
# PLOT DATA
arima_data['value'].plot(figsize=(7,4))
The above steps result in a dataset that should look like this.
# Dickey-Fuller test for stationarity
adf_test = ADFTest(alpha = 0.05)
adf_test.should_diff(arima_data)
Result = 0.9867, indicating non-stationary data, which should be handled by an appropriate order of differencing later in the auto ARIMA process.
# Assign training and test subsets - 80:20 split
print('Dataset dimensions:', arima_data.shape)
train_data = arima_data[:-24]
test_data = arima_data[-24:]
print('Training data dimension:', train_data.shape, round((len(train_data)/len(arima_data)*100),2),'% of dataset')
print('Test data dimension:', test_data.shape, round((len(test_data)/len(arima_data)*100),2),'% of dataset')
# Plot training & test data
plt.plot(train_data)
plt.plot(test_data)
# Run auto arima
arima_model = auto_arima(train_data, start_p=0, d=1, start_q=0,
max_p=5, max_d=5, max_q=5,
start_P=0, D=1, start_Q=0, max_P=5, max_D=5,
max_Q=5, m=12, seasonal=True,
stationary=False,
error_action='warn', trace=True,
suppress_warnings=True, stepwise=True,
random_state=20, n_fits=50)
print(arima_model.aic())
The output suggests the best model is 'ARIMA(1,1,1)(0,1,0)[12]' with AIC 1725.35484.
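To make the selected orders concrete: d=1 and D=1 with m=12 mean the model works on a series that has been differenced once at lag 12 and once at lag 1. The sketch below illustrates that on made-up data; the series, its trend slope and its seasonal pattern are assumptions for illustration only, not the asker's dataset.

```python
import numpy as np

# Hypothetical monthly series: linear trend plus a repeating 12-month pattern.
months = np.arange(48)
one_year = np.array([5, 3, 0, -2, -4, -5, -4, -2, 0, 3, 5, 6], dtype=float)
series = 100.0 + 2.0 * months + np.tile(one_year, 4)

# D=1 with m=12: subtract the value from the same month one year earlier.
# The repeating pattern cancels; the trend leaves a constant 2.0 * 12 = 24.0.
seasonal_diff = series[12:] - series[:-12]

# d=1: subtract the previous value, which removes the remaining constant.
fully_differenced = np.diff(seasonal_diff)

print(seasonal_diff[:3])
print(fully_differenced[:3])
```

On real data the result is not exactly zero; what remains is the part that the AR(1) and MA(1) terms are left to describe.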
# Store predicted values and view the resulting df
prediction = pd.DataFrame(arima_model.predict(n_periods=len(test_data)), index=test_data.index)
prediction.columns = ['predicted_value']
prediction
# Plot prediction against test and training trends
plt.figure(figsize=(7,4))
plt.plot(train_data, label="Training")
plt.plot(test_data, label="Test")
plt.plot(prediction, label="Predicted")
plt.legend(loc='upper right')
plt.show()
# Finding r2 model score
test_data['predicted_value'] = prediction
r2_score(test_data['value'], test_data['predicted_value'])
Result: -6.985
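For context on the negative score: R² compares the squared error of the forecast with the squared error of always predicting the mean of the test values, so any forecast that does worse than that flat line scores below zero. A minimal sketch with made-up numbers (the helper `r_squared` and the arrays are hypothetical; the formula is the standard 1 - SS_res / SS_tot):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the forecast
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of predicting the mean
    return 1.0 - ss_res / ss_tot

actual = np.array([10.0, 12.0, 11.0, 13.0])
mean_forecast = np.full_like(actual, actual.mean())  # flat line at the test mean
biased_forecast = actual + 5.0                       # right shape, wrong level

print(r_squared(actual, mean_forecast))    # 0.0
print(r_squared(actual, biased_forecast))  # -19.0
```

The second forecast follows every movement of the data but sits at the wrong level, and that alone pushes R² far below zero. A score such as -6.985 therefore points to a level or trend mismatch over the test window rather than to random noise.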
Answer 1
Score: 2
ARIMA has assumptions that need to be checked before applying it to the data. One of them is that the data needs to be stationary, i.e. it should not have a trend or seasonality. You can check for a trend by plotting; your graph shows a visible upward trend.
- You can also check seasonality from the graph, or use the Dickey-Fuller test to check the stationarity hypothesis.
import statsmodels.tsa.stattools as ts
ts.adfuller(data.col)
See this answer, which explains well how to perform and read the ADF test:
https://stackoverflow.com/questions/47349422/how-to-interpret-adfuller-test-results
- Always check the ACF and PACF plots and note which lags fall outside the confidence limits; those lags show autocorrelation. Check whether stationarity holds.
As explained by Jose, differencing can be used to make the data stationary.
SARIMA also considers the seasonal components: the non-seasonal order (p,d,q) plus the seasonal order (P,D,Q,S). SARIMAX additionally accepts exogenous variables.
Answer 2
Score: 0
Is auto_arima a method you wrote yourself? It depends on how you difference the data and what you do there. Did you check the autocorrelation and partial autocorrelation to find out which repeating time lags you have?
Also, it seems you have a seasonal pattern every year; you could try a SARIMA model if you are not doing so already.
To try a SARIMA model you have to:
- Make the data stationary. In this case, differencing converts the moving mean into a stationary one.
data_stationarized = train_data.diff()[1:]
- Check the autocorrelation and partial autocorrelation to check the seasonality. You can use the library statsmodels for this.
import statsmodels.api as sm
sm.graphics.tsa.plot_acf(data_stationarized);
You can see that the most prominent spike is at lag 12; since the data is monthly, this indicates a prominent seasonal pattern every 12 months.
We can check the partial autocorrelation to confirm it:
sm.graphics.tsa.plot_pacf(data_stationarized);
Again, the most prominent spike is the twelfth one.
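To show what the lag-12 spike in those plots measures, here is a minimal hand-rolled sample autocorrelation on made-up monthly data. The helper `sample_acf` and the synthetic series are assumptions for illustration; the statsmodels plots compute the same quantity and add confidence bands.

```python
import numpy as np

def sample_acf(values, max_lag):
    """Sample autocorrelation for lags 0..max_lag (biased estimator)."""
    x = np.asarray(values, dtype=float)
    x = x - x.mean()
    denom = np.sum(x ** 2)
    acf = [1.0]
    for lag in range(1, max_lag + 1):
        acf.append(np.sum(x[:-lag] * x[lag:]) / denom)
    return np.array(acf)

# Hypothetical monthly series: upward trend plus a 12-month cycle.
months = np.arange(120)
series = 50.0 + 0.8 * months + 10.0 * np.sin(2 * np.pi * months / 12)

# First difference removes the trend, as in the step above.
differenced = np.diff(series)

acf_values = sample_acf(differenced, max_lag=24)
strongest_lag = int(np.argmax(acf_values[1:])) + 1  # skip lag 0, which is always 1
print("strongest positive autocorrelation at lag", strongest_lag)
```

With a clean 12-month cycle, the strongest positive autocorrelation falls at lag 12; real data is noisier, but the same peak is what suggests a seasonal period of 12.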
- Fit the model with a seasonal period of 12. There are more parameters that can be adjusted for better results, but explaining them would make this post very long.
model = sm.tsa.SARIMAX(endog=train_data, order=(2,0,0), seasonal_order=(2,0,0,12))
model_fit = model.fit()
- Evaluate the results
from sklearn.metrics import mean_squared_error
y_pred = model_fit.forecast(steps=24)
# Take the square root of the MSE to get the RMSE
# (the squared=False argument is deprecated in recent scikit-learn releases)
np.sqrt(mean_squared_error(y_true=test_data.values, y_pred=y_pred))
This outputs 12063.88, which you can use to compare different results more rigorously.
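An RMSE is easier to judge next to a simple reference forecast. The sketch below computes RMSE by hand and compares against a seasonal-naive baseline that repeats the last observed year. The baseline is an addition for illustration and not part of the answer's method; the helper names and the data are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the series."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def seasonal_naive_forecast(train, horizon, period=12):
    """Repeat the last observed season forward for `horizon` steps."""
    last_season = np.asarray(train, dtype=float)[-period:]
    repeats = int(np.ceil(horizon / period))
    return np.tile(last_season, repeats)[:horizon]

# Hypothetical monthly data: trend plus a 12-month cycle, last 24 months held out.
months = np.arange(72)
series = 50.0 + 0.5 * months + 10.0 * np.sin(2 * np.pi * months / 12)
train, test = series[:-24], series[-24:]

baseline = seasonal_naive_forecast(train, horizon=len(test))
baseline_rmse = rmse(test, baseline)
print("seasonal-naive RMSE:", round(baseline_rmse, 3))
```

If a fitted SARIMA model cannot beat this number on the same held-out window, the orders or the differencing probably need another look.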
For a graphical check:
prediction = pd.DataFrame(model_fit.forecast(steps=24), index=test_data.index)
prediction.columns = ['predicted_value']
prediction
# Plot prediction against test and training trends
plt.figure(figsize=(7,4))
plt.plot(train_data, label="Training")
plt.plot(test_data, label="Test")
plt.plot(prediction, label="Predicted")
plt.legend(loc='upper right')
plt.xticks([])
plt.yticks([])
plt.show();
Now you can see that the predictions get closer to the expected values.
You could continue fine-tuning the order and seasonal order to get even better results; I would advise checking the statsmodels documentation.
Another piece of advice is to analyze the autocorrelation and partial autocorrelation of the residuals to check whether your model is capturing all of the patterns. You have them in the model_fit object.