StatsModel Linear Regression: Initial vs Reduced Model - Is it better?
Question
I am running a linear regression on a dataset (granted, it is for school purposes and I was told it's fictitious information). First I chose my variables (from the larger dataset) and encoded them accordingly.
I ran the initial regression and got the results shown in the screenshot.
[Screenshot: Initial Regression Model Summary]
I then ran RFE (recursive feature elimination), set it to select 3 features, and reran the regression, which gave the following results.
[Screenshot: Reduced Regression Model]
The code used for the X/Y split in the initial model:
# Creating feature variables, where X = independent variables and Y=dependent variables
X_data = df2.drop('TotalCharge', axis=1)
Y_data = df2[['TotalCharge']]
print('The shape of the features is:',X_data.shape)
X_data.head()
print('The shape of the labels:',Y_data.shape)
Y_data.head()
The code used for the reduced model:
X_data2 = df2[['Age', 'Income', 'VitD_levels', 'Doc_visits', 'Gender_Male', 'Services_Intravenous', 'Overweight_Yes', 'Diabetes_Yes']]
Y_data2 = df2[['TotalCharge']]
print('The shape of the features is:',X_data2.shape)
X_data2.head()
print('The shape of the labels:',Y_data2.shape)
Y_data2.head()
My question is: is the reduced model better? I am not sure how to analyze this (I am still new to this).
I tried choosing new features, checking for multicollinearity, normalizing before running the regression, and even using scikit-learn instead of statsmodels. I am not sure how to analyze the results to see whether the model is better.
Answer 1
Score: 0
A couple of observations:
- The p-values for Complication_risk, Initial_admin_Emergency Admission, and Arthritis_Yes are reported as 0 (presumably rounded at the precision the summary displays). These variables are therefore significant at the 5% level, yet they were removed from the reduced model, which lowers its explanatory power.
- In any event, the R-Squared statistics for both models are quite low (0.021 and 0.001). Neither model does a good job of explaining the variation in the dependent variable, TotalCharge. An R-Squared of 1 means the model explains 100% of the variation, whereas an R-Squared of 0 means it explains none of it.
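To make the two R-Squared endpoints concrete, here is a minimal sketch in plain Python. The helper names `r_squared` and `adjusted_r_squared`, the toy values, and the sample size and predictor count in the last line are all assumptions for illustration; none of them come from the asker's dataset.

```python
def r_squared(y_true, y_pred):
    """Share of the variation in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_predictors):
    """R-squared penalized for the number of predictors (intercept excluded)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Toy values, made up for illustration only.
y_true = [10.0, 12.0, 14.0, 16.0]
perfect = [10.0, 12.0, 14.0, 16.0]    # predictions match every observation
mean_only = [13.0, 13.0, 13.0, 13.0]  # always predict the mean of y_true

print(r_squared(y_true, perfect))    # 1.0
print(r_squared(y_true, mean_only))  # 0.0

# Hypothetical sample size and predictor count, not taken from the question.
print(round(adjusted_r_squared(0.021, n_obs=1000, n_predictors=10), 4))  # 0.0111
```

A model that does no better than predicting the mean scores 0, which is roughly where both reported models sit. The adjusted variant is the more useful number when the two models use different numbers of predictors.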
The short answer to your question is that the reduced model is not better than the original, but the original model does not have much predictive power either.
A good next step might be to refit the model with only the significant variables, i.e. Complication_risk, Initial_admin_Emergency Admission, and Arthritis_Yes, and see how the fit compares. Plain R-Squared can only stay the same or fall when predictors are dropped from an OLS model, so adjusted R-Squared is the fairer measure for this comparison. If the fit is still poor, that is a good indication that the variation in the dependent variable cannot be adequately explained by the independent variables provided.