Python: How to handle outliers in a regression Q-Q plot?
Question
I drew a Q-Q plot for my multiple regression and got the graph below. Can someone tell me why there are two points under the red line? And do these points have an effect on my model?
I used the code below to draw the graph.
```python
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
import matplotlib.pyplot as plt  # needed for plt.show()

# fit the regression and predict on the test set
reg = LinearRegression()
reg = reg.fit(x_train, y_train)
pred_reg_GS = reg.predict(x_test)

# residuals on the test set
diff = y_test - pred_reg_GS

# Q-Q plot of the standardized residuals against the 45-degree line
sm.qqplot(diff, fit=True, line='45')
plt.show()
```
Answer 1
Score: 2
Take a look at Understanding Q-Q Plots for a concise description of what a QQ plot is. In your case, this particular part is important:
> If both sets of quantiles came from the same distribution, we should
> see the points forming a line that’s roughly straight.
This theoretical one-to-one relationship is illustrated explicitly in your plot using the red line.
And regarding your question...
> do these points have an effect on my model?
... one or both points that occur far from that red line could be considered outliers. This means that whatever model you've tried to build here does not capture the properties of those two observations. If what we're looking at here is a QQ plot of the residuals from a regression model, you should take a closer look at those two observations. What is it about these two that makes them stand out from the rest of your sample? One way to "catch" these outliers is often to represent them with one or two dummy variables.
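As a first diagnostic step, here is a minimal sketch for locating those observations in your own setup, assuming `y_test` (and hence `diff` from your code) is a pandas Series:

```python
# The two observations with the largest absolute residuals,
# shown with their index labels in y_test
print(diff.abs().sort_values(ascending=False).head(2))
```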
Edit 1: Basic approach for outliers and dummy variables
Since you haven't explicitly labeled your question `sklearn`, I'm taking the liberty to illustrate this using `statsmodels`. And in lieu of a sample of your data, I'll just use the built-in `iris` dataset, where the last part of what we'll use looks like this:
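If you want to reproduce that view yourself, a minimal sketch:

```python
import seaborn as sns

# Load the iris data and keep only the setosa observations
df = sns.load_dataset('iris')
df = df[df['species'] == 'setosa']

# Inspect the last rows of the subset
print(df.tail())
```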
1. Linear regression of sepal_length on sepal_width
Plot 1:
Looks good! Nothing wrong here. But let's mix it up a bit by adding some extreme values to the dataset. You'll find a complete code snippet at the end.
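For reference, a minimal sketch of this first regression and its Q-Q plot, continuing from the snippet above (like the complete code at the end, it fits the model without a constant term):

```python
import statsmodels.api as sm
from matplotlib import pyplot as plt

# Regress sepal_length on sepal_width for the plain setosa subset
mod_fit = sm.OLS(df['sepal_length'], df[['sepal_width']]).fit()

# Q-Q plot of the residuals
sm.qqplot(mod_fit.resid)
plt.show()
```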
2. Introduce an outlier
Now, let's add a row to the dataframe where `sepal_width = 8` instead of `3`.
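A minimal sketch of that step, continuing from the snippet above (the other values in the appended row are arbitrary but plausible for setosa):

```python
# Append one made-up observation with an extreme sepal_width of 8
df.loc[len(df)] = [5.0, 8.0, 1.4, 0.3, 'setosa']

# Re-fit and redraw the Q-Q plot
mod_fit = sm.OLS(df['sepal_length'], df[['sepal_width']]).fit()
sm.qqplot(mod_fit.resid)
plt.show()
```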
This will give you the following qqplot with a very clear outlier:
And here's a part of the model summary:
```
===============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
sepal_width    1.8690      0.033     57.246      0.000       1.804       1.934
===============================================================================
Omnibus:                       18.144   Durbin-Watson:                   0.427
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                7.909
Skew:                          -0.338   Prob(JB):                       0.0192
Kurtosis:                       2.101   Cond. No.                         1.00
===============================================================================
```
So why is this an outlier? Because we messed with the dataset. The reason for the outliers in your dataset is impossible for me to determine. In our made-up example, the reasons for a setosa iris to have a sepal width of 8 could be many. Maybe the scientist labeled it wrong? Maybe it isn't a setosa at all? Or maybe it has been genetically modified? Now, instead of just discarding this observation from the sample, it's usually more informative to keep it where it is, accept that there is something special about this observation, and illustrate exactly that by including a dummy variable that is `1` for that observation and `0` for all others. Now the last part of your dataframe should look like this:
3. Identify the outlier using a dummy variable
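One way to set that up, continuing from the sketches above (the complete code at the end creates the dummy column before appending the row instead):

```python
# Dummy variable: 1 for the appended outlier row, 0 for every other observation
df['outlier_dummy'] = 0
df.loc[df.index[-1], 'outlier_dummy'] = 1

# Re-fit with the dummy included and redraw the Q-Q plot
mod_fit = sm.OLS(df['sepal_length'], df[['sepal_width', 'outlier_dummy']]).fit()
sm.qqplot(mod_fit.resid)
plt.show()
```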
Now, your qqplot will look like this:
And here's your model summary:
```
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
sepal_width       1.4512      0.015     94.613      0.000       1.420       1.482
outlier_dummy    -6.6097      0.394    -16.791      0.000      -7.401      -5.819
=================================================================================
Omnibus:                        1.917   Durbin-Watson:                   2.188
Prob(Omnibus):                  0.383   Jarque-Bera (JB):                1.066
Skew:                           0.218   Prob(JB):                        0.587
Kurtosis:                       3.558   Cond. No.                          27.0
=================================================================================
```
Notice that the inclusion of the dummy variable changes the coefficient estimate for `sepal_width`, and also the values for Skewness and Kurtosis. And that's the short version of the effects an outlier will have on your model.
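If you want to see that shift directly, here is a quick comparison sketch, assuming `df` still contains the appended row and the dummy column:

```python
# Without the dummy, the outlier drags the slope estimate;
# with it, the outlier's effect is absorbed by outlier_dummy
fit_plain = sm.OLS(df['sepal_length'], df[['sepal_width']]).fit()
fit_dummy = sm.OLS(df['sepal_length'], df[['sepal_width', 'outlier_dummy']]).fit()

print(fit_plain.params['sepal_width'])   # roughly 1.87, as in the first summary
print(fit_dummy.params['sepal_width'])   # roughly 1.45, as in the second
```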
Complete code:
```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from matplotlib import pyplot as plt
import seaborn as sns

# sample data
df = pd.DataFrame(sns.load_dataset('iris'))

# subset of sample data
df = df[df['species'] == 'setosa']

# add column for dummy variable
df['outlier_dummy'] = 0

# append a row with an extreme value for sepal width,
# as well as a dummy variable = 1 for that row
df.loc[len(df)] = [5, 8, 1.4, 0.3, 'setosa', 1]

# define independent variables
x = ['sepal_width', 'outlier_dummy']

# run regression (note: no constant term is added,
# so the model is fit without an intercept)
mod_fit = sm.OLS(df['sepal_length'], df[x]).fit()

# Q-Q plot of the residuals
res = mod_fit.resid
fig = sm.qqplot(res)
plt.show()

mod_fit.summary()
```