The meaning of an explanatory variable in computing accuracy of a prediction model (migrated to CrossValidated)

Question


I am following the predictive modeling example found here, chapter 10.

This piece of code uses capital_gain as the explanatory variable, but removing it or replacing it with another variable does not change the output at all: the accuracy estimate and the confusion matrix stay the same. So why is capital_gain there?

library(dplyr)      # %>%, select(), bind_cols(), rename()
library(parsnip)    # predict() method for the fitted mod_null
library(yardstick)  # accuracy(), conf_mat()
pred <- train %>%
  select(income, capital_gain) %>%  # why is 'capital_gain' here?
  bind_cols(
    predict(mod_null, new_data = train, type = "class")
  ) %>%
  rename(income_null = .pred_class)
accuracy(pred, income, income_null)

confusion_null <- pred %>%
  conf_mat(truth = income, estimate = income_null)
confusion_null
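
For context, train and mod_null are defined earlier in the book's chapter and are not shown in the snippet. As a rough, hypothetical stand-in (not necessarily the book's exact code), mod_null could be an intercept-only logistic regression, which by construction uses no predictors:

# Hypothetical stand-in for the book's setup -- not the original definition.
mod_null <- logistic_reg() %>%     # parsnip; classification mode by default
  set_engine("glm") %>%
  fit(income ~ 1, data = train)    # income ~ 1: intercept only, no predictors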

EDIT: path to the data used -> http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data

EDIT: this post is migrated to CrossValidated
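
Independent of the specific data, the yardstick calls used above can be tried on the package's built-in two_class_example data set, which is a quick way to see what accuracy() and conf_mat() return:

library(yardstick)
accuracy(two_class_example, truth = truth, estimate = predicted)
conf_mat(two_class_example, truth = truth, estimate = predicted)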

Answer 1

Score: 1


You do not explain what your data is (something from the yardstick package?), but the situation where omitting a predictor variable does not change the predictions of a linear model typically occurs when there is multicollinearity in your data. In that case, any one of the collinear variables can be omitted without losing prediction accuracy.

You can test for exact multicollinearity in your data by checking whether the rank of your predictor matrix equals the number of independent variables. If it does not, you have exact collinearity:

> A <- cbind(iris[,-5], 2*iris[,1]+iris[,2])  # append an exactly collinear column
> ncol(A)
[1] 5
> qr(A)$rank  # rank 4 < 5 columns, so the columns are linearly dependent
[1] 4

If there is only approximate collinearity, you can compute the "variance inflation factor" (VIF) of your variables. High values (say, greater than 5) indicate multicollinearity. VIF implementations do not come with vanilla R, but many packages provide one.
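
For instance, a minimal sketch with the car package (one of several packages that implement VIF; this assumes car is installed):

library(car)  # provides vif()
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris)
vif(fit)  # Petal.Length and Petal.Width are strongly correlated, so their VIFs come out high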
