The meaning of an explanatory variable in computing accuracy of a prediction model (migrated to CrossValidated)
Question
I am following the predictive modeling example found here, chapter 10. This piece of the code has capital_gain as the explanatory variable, but removing it or replacing it with another variable doesn't change anything in the output. The estimate and the confusion matrix remain the same. So why is capital_gain there in that place?
library(dplyr)      # for %>%, select(), bind_cols(), rename()
library(yardstick)  # for accuracy() and conf_mat()

pred <- train %>%
  select(income, capital_gain) %>%  # why is 'capital_gain' here?
  bind_cols(
    predict(mod_null, new_data = train, type = "class")
  ) %>%
  rename(income_null = .pred_class)

accuracy(pred, income, income_null)

confusion_null <- pred %>%
  conf_mat(truth = income, estimate = income_null)
confusion_null
EDIT: path to the data used -> http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
EDIT: this post has been migrated to CrossValidated.
Answer 1

Score: 1
You do not explain what your data is (something from the yardstick package?), but the situation where omitting a predictor variable does not change the predictions of a linear model typically occurs when you have multi-collinearity in your data. In that case, any of the collinear variables can be omitted without losing prediction accuracy.
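To make that concrete, here is a minimal sketch using the built-in iris data rather than your train data (which is not shown); the names combo, fit_full, and fit_reduced are purely illustrative. lm() marks an exactly collinear column as aliased (its coefficient comes back NA) and effectively drops it, so the fitted values are unchanged:

# Add a column that is an exact linear combination of two existing predictors.
d <- iris
d$combo <- 2 * d$Sepal.Length + d$Sepal.Width

# Fit with and without the redundant column; lm() aliases 'combo',
# so both models produce identical predictions.
fit_full    <- lm(Petal.Length ~ Sepal.Length + Sepal.Width + combo, data = d)
fit_reduced <- lm(Petal.Length ~ Sepal.Length + Sepal.Width, data = d)
all.equal(fitted(fit_full), fitted(fit_reduced))   # TRUE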
You can test for exact multi-collinearity in your data by checking whether the rank of your predictor matrix is equal to the number of independent variables. If it is not, you have exact collinearity:
> A <- cbind(iris[,-5], 2*iris[,1]+iris[,2])  # last column is a linear combination of the first two
> ncol(A)
[1] 5
> qr(A)$rank
[1] 4
If there is only approximate collinearity, you can compute the "variance inflation factor" (VIF) of your variables. High values (say, greater than 5) indicate multi-collinearity. VIF implementations do not come with vanilla R, but there are many packages that implement one.
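For example, a minimal sketch using vif() from the car package (one common implementation; other packages provide equivalents), again on the iris data:

library(car)  # provides vif(); assumed installed

# Petal.Length and Petal.Width are strongly correlated in iris,
# so their VIFs come out high (well above 5).
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris)
vif(fit)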