The meaning of an explanatory variable in computing accuracy of a prediction model (migrated to CrossValidated)

Question

I am following the predictive modeling example found here, chapter 10.

This piece of code has capital_gain as the explanatory variable, but removing it or replacing it with another variable does not change the output: the estimate and the confusion matrix stay the same. So why is capital_gain used in that place?

library(yardstick)
pred <- train %>%
  select(income, capital_gain) %>% # why is 'capital_gain' here?
  bind_cols(
    predict(mod_null, new_data = train, type = "class")
  ) %>%
  rename(income_null = .pred_class)
accuracy(pred, income, income_null)


confusion_null <- pred %>%
  conf_mat(truth = income, estimate = income_null)
confusion_null
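
For context, mod_null is defined earlier in the linked chapter and is not shown above. A minimal sketch of one plausible definition, assuming tidymodels and an intercept-only logistic regression (the book's exact specification may differ):

library(tidymodels)

# Assumed reconstruction: an intercept-only ("null") model.
# With no predictors on the right-hand side of the formula, it
# predicts the majority class for every row, so the columns kept
# by select() above cannot affect accuracy() or conf_mat().
mod_null <- logistic_reg(mode = "classification") %>%
  set_engine("glm") %>%
  fit(income ~ 1, data = train)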

EDIT: path to the data used -> http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data

EDIT: this post is migrated to CrossValidated

Answer 1

Score: 1

You do not explain what your data is (something from the yardstick package?), but the situation where omitting a predictor variable does not change the predictions of a linear model typically occurs when there is multicollinearity in your data. In that case, any one of the collinear variables can be omitted without losing prediction accuracy.

You can test for exact multicollinearity in your data by checking whether the rank of your predictor matrix equals the number of independent variables. If it does not, you have exact collinearity:

> A <- cbind(iris[,-5], 2*iris[,1]+iris[,2])
> ncol(A)
[1] 5
> qr(A)$rank
[1] 4

Here the fifth column is an exact linear combination of the first two, so the rank (4) is below the number of columns (5).

If there is only approximate collinearity, you can compute the "variance inflation factor" (VIF) of your variables. High values (say, greater than 5) indicate multicollinearity. VIF implementations do not come with vanilla R, but many packages provide one.
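
For example, a minimal sketch using the car package (assumed here; any package providing a VIF function would do), with correlated predictors from iris:

library(car)  # install.packages("car") if needed

# Petal.Length and Petal.Width are strongly correlated in iris,
# so both should show clearly inflated VIF values.
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris)
vif(fit)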
