2023-05-25 18:39:56


Automatically select best columns for best subset

Question

The leaps package identifies the best subsets of predictors for a regression and can return the results for each subset size. For example:

library(leaps)
mtcars.regsubsets <- regsubsets(mpg ~ ., data = mtcars)
summary(mtcars.regsubsets, nvmax = 10)

Selection Algorithm: exhaustive
         cyl disp hp  drat wt  qsec vs  am  gear carb
1  ( 1 ) " " " "  " " " "  "*" " "  " " " " " "  " " 
2  ( 1 ) "*" " "  " " " "  "*" " "  " " " " " "  " " 
3  ( 1 ) " " " "  " " " "  "*" "*"  " " "*" " "  " " 
4  ( 1 ) " " " "  "*" " "  "*" "*"  " " "*" " "  " " 
5  ( 1 ) " " "*"  "*" " "  "*" "*"  " " "*" " "  " " 
6  ( 1 ) " " "*"  "*" "*"  "*" "*"  " " "*" " "  " " 
7  ( 1 ) " " "*"  "*" "*"  "*" "*"  " " "*" "*"  " " 
8  ( 1 ) " " "*"  "*" "*"  "*" "*"  " " "*" "*"  "*"

In the summary, each row is the best model for that number of variables, and a "*" marks the variables it includes. It is also possible to use a specific measure, such as Mallows's Cp, to find the best result overall:

which.min(summary(mtcars.regsubsets)$cp)

This indicates that the 3-variable set has the lowest Mallows's Cp.

Is there a way to automatically select those columns (wt, qsec and am in this case) from the data set, so that it returns a new data set containing only those three columns?

mtcars[, c(6, 7, 9)]  # wt, qsec, am (mpg is column 1 of mtcars, so positions shift by one)

I have checked the help file for the leaps package, StackOverflow, and other sources, but found no solution.

Answer 1

Score: 2

Here is one way using dplyr:

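library(leaps)
library(dplyr)
mtcars.regsubsets <- regsubsets(mpg ~ ., data = mtcars)
cols_sel <- mtcars.regsubsets |>
  broom::tidy() |>                          # one row per model size
  filter(mallows_cp == min(mallows_cp)) |>  # keep the row with the lowest Cp
  select(where(~.x == TRUE)) |>             # keep the columns flagged TRUE in that row
  colnames()
# any_of() ignores names that are not columns of mtcars, such as "(Intercept)"
select(mtcars, any_of(cols_sel))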

Answer 2

Score: 0

You can use the "which" component of the summary: scan it row by row for TRUE values, retrieve the associated column (variable) names, and store them in a list. You can then pick the column names from this list by the desired number of variables.

With base R:

column_sets <- 
  apply(
    regsubsets(mpg ~ ., data = mtcars) |>
    summary() |>
    (`[[`)('which'),
    MARGIN = 1,
    FUN = \(x) names(x)[x][-1] ## drop "(Intercept)"
  )

mtcars[column_sets[['3']]]

+ mtcars[column_sets[['3']]]
                       wt  qsec am
Mazda RX4           2.620 16.46  1
Mazda RX4 Wag       2.875 17.02  1
Datsun 710          2.320 18.61  1
Hornet 4 Drive      3.215 19.44  0
Hornet Sportabout   3.440 17.02  0
Valiant             3.460 20.22  0
Duster 360          3.570 15.84  0
## ...
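
The size "3" above is typed in by hand. As a minimal sketch of how the same list could be indexed automatically, the position of the lowest Mallows's Cp can be used as the list index. The names fit, ss, best_size and best_data are illustrative and not part of the answer; the sketch assumes the leaps package is installed.

```r
library(leaps)

fit <- regsubsets(mpg ~ ., data = mtcars)
ss  <- summary(fit)

# One character vector of predictor names per model size.
# Column 1 of ss$which is "(Intercept)", so it is dropped before the lookup.
predictors  <- colnames(ss$which)[-1]
column_sets <- lapply(
  seq_len(nrow(ss$which)),
  function(i) predictors[ss$which[i, -1]]
)

best_size <- which.min(ss$cp)                   # position of the lowest Mallows's Cp
best_data <- mtcars[column_sets[[best_size]]]   # keep only the selected columns
head(best_data)
```

Because the list is built with lapply, it is indexed by position rather than by the character row names used in the answer.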

Answer 3

Score: 0

Get the summary, which contains a "which" matrix; take its column sums, excluding the intercept; then use tail to keep the columns with the most TRUE values:

ss <- summary(mtcars.regsubsets, nvmax = 10)
names(tail(sort(colSums(ss$which[, -1 ])), which.min(ss$cp)))
# [1] "qsec" "am"   "wt"
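
Counting how often each variable appears works for this data, but the best subsets are not necessarily nested (cyl is in the 2-variable model above and in no larger one), so the most frequent variables need not be the winning set in general. A possible alternative, sketched here under that caveat, reads the winning row of the "which" matrix directly. The helper best_subset_columns and its criterion argument are hypothetical names introduced for illustration; they are not part of leaps.

```r
library(leaps)

# Hypothetical helper: return the predictors of the model that scores best
# on one of the criteria stored in the summary of a regsubsets fit.
best_subset_columns <- function(fit, criterion = c("cp", "bic", "adjr2")) {
  criterion <- match.arg(criterion)
  ss <- summary(fit)
  scores <- ss[[criterion]]
  # Cp and BIC are minimised; adjusted R-squared is maximised.
  best <- if (criterion == "adjr2") which.max(scores) else which.min(scores)
  chosen <- ss$which[best, -1]   # logical row without the "(Intercept)" column
  names(chosen)[chosen]
}

fit  <- regsubsets(mpg ~ ., data = mtcars)
cols <- best_subset_columns(fit, "cp")
mtcars[, cols, drop = FALSE]     # new data set with only the selected columns
```

If the formula contains factor predictors, the names in the "which" matrix are dummy-variable names rather than data-frame columns, so they would need to be mapped back before subsetting.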
