How to test for Granger Causality in Python (Toda & Yamamoto style)?
Question
I am trying to implement the process for Granger Causality testing outlined in this blogpost by Dave Giles, which I understand is a famous post about performing a Granger Causality test for non-stationary data, following the Toda-Yamamoto method. He works in EViews, but I would really like to do the same steps in Python.
In Giles' example, he uses two time-series variables. First, he determines the order of integration of both variables using both the ADF test and KPSS test, and then he sets up a VAR model on the levels of the data, using FPE, AIC, SC, and HQ information criteria to determine the maximum lag length included in the VAR model. These steps cover steps 1-4 in the blog post.
All of the above I can reproduce in Python using the statsmodels library's `.tsa.vector_ar.var_model.VAR()` and `.tsa.vector_ar.var_model.VAR.fit()`. I am not sure how to implement his step 5, where he "appl[ies] the LM test [i.e. the Lagrange multiplier test] for serial independence against the alternative of AR(k)/MA(k), for k = 1, ...., 12,..." and also checks whether the model is dynamically stable. I am also not certain how to perform step 6, Johansen's trace test and maximum-eigenvalue test for cointegration between the two time-series variables. I see there is a method for the Johansen cointegration test for a VECM, but how would I use it with a VAR?
I also can't seem to figure out how to include an exogenous variable in the VAR model corresponding to the maximum lag length plus the maximum order of integration, as he does at the end of the blog post. He does this to ensure the Wald test maintains an asymptotic chi-square null distribution.
I know that both `statsmodels.tsa.stattools.grangercausalitytests()` and `statsmodels.tsa.vector_ar.var_model.VARResults.test_causality()` exist, but I would like to correctly perform these tests on two time-series variables, which requires following the procedure Giles outlines (or one similar to it).
Any advice would be appreciated! Please let me know if I need to include some example code.
Answer 1
Score: 0
Step 5:
You want to check the ACF plot of the residuals. Assuming you have already found the best model from step 4:

```python
from statsmodels.tsa.api import VAR

best_model = VAR(data)
best_model_fitted = best_model.fit(best_lag)
```
You then want to get the residuals of this fitted model and draw an ACF plot for each series. Note that subtracting `fittedvalues` from `data` leaves NaN rows, because the fitted values are `best_lag` rows shorter than the data, so it is simpler to use the fitted model's `resid` attribute:

```python
from statsmodels.graphics.tsaplots import plot_acf

residuals = best_model_fitted.resid  # one column per endogenous variable
for col in residuals.columns:
    plot_acf(residuals[col], lags=20)
```
If there are no significant auto-correlations then your VAR is well-specified (i.e. the best_lag you had selected was correct).
Step 6:
I didn't need to perform the cointegration test because my time series had different orders of integration, but statsmodels does have a method for it.
Step 7:
You're right that `statsmodels.tsa.stattools.grangercausalitytests()` does not give you the flexibility to perform the test the way the blog post specifies. But fret not: the Granger causality test uses a Wald test under the hood. Since a VAR model is essentially a set of linear regressions of each variable on lags of itself and lags of the other time-series variables, we can fit an ordinary linear regression and then call `wald_test` on the resulting `RegressionResults` object.
For example, assume that you have p = 4 and d = 2, so the regression includes p + d = 6 lags of each series. The VAR equation for the first time series (with terms for lagged versions of itself and lagged versions of the other time series) then looks like:

    y1[t] = c + a1*y1[t-1] + ... + a6*y1[t-6] + b1*y2[t-1] + ... + b6*y2[t-6] + e[t]
So we need to build a dataset that has columns representing lagged values of y1 and y2.
```python
# assume df has two columns: y1 and y2
# create lagged columns for all p + d = 6 lags of both series
lags = [1, 2, 3, 4, 5, 6]
for col in ['y1', 'y2']:
    for lag in lags:
        df[f'{col}_lag{lag}'] = df[col].shift(lag)

df.drop(columns=['y2'], inplace=True)
df.dropna(inplace=True)

Y = df['y1']
X = df[['y1_lag1', 'y1_lag2', 'y1_lag3',
        'y1_lag4', 'y1_lag5', 'y1_lag6',
        'y2_lag1', 'y2_lag2', 'y2_lag3',
        'y2_lag4', 'y2_lag5', 'y2_lag6']]
```
Fit the model. Here, if you like, you can sanity-check that the parameters from this linear regression match the ones for the y1 equation of the fitted VAR object.

```python
import statsmodels.api as sm

X = sm.add_constant(X)
model = sm.OLS(Y, X)
results = model.fit()
results.params
```
Now we are ready to perform the Wald test as specified in the blog. In this example I said that p = 4 and d = 2, so we test whether the coefficients on the first 4 lags of y2 are jointly zero; the extra d lags are left unrestricted, which is the Toda-Yamamoto trick that keeps the test statistic asymptotically chi-square.
```python
# Wald test, H0: the first p = 4 lags of y2 have zero coefficients
hypothesis = 'y2_lag1 = y2_lag2 = y2_lag3 = y2_lag4 = 0'
wald_test = results.wald_test(hypothesis)
print(wald_test)
```
A p-value will be printed out; if it is greater than 0.05, you do not have enough evidence to reject the null hypothesis, meaning you cannot say that y2 "Granger-causes" y1.