Automatically select best columns for best subset

Question

The Leaps package allows identification of subsets, and can return results for the best subsets. For example:

library(leaps)
mtcars.regsubsets <- regsubsets(mpg ~ ., data = mtcars)
summary(mtcars.regsubsets, nvmax = 10)
Selection Algorithm: exhaustive
         cyl disp hp  drat wt  qsec vs  am  gear carb
1  ( 1 ) " " " "  " " " "  "*" " "  " " " " " "  " " 
2  ( 1 ) "*" " "  " " " "  "*" " "  " " " " " "  " " 
3  ( 1 ) " " " "  " " " "  "*" "*"  " " "*" " "  " " 
4  ( 1 ) " " " "  "*" " "  "*" "*"  " " "*" " "  " " 
5  ( 1 ) " " "*"  "*" " "  "*" "*"  " " "*" " "  " " 
6  ( 1 ) " " "*"  "*" "*"  "*" "*"  " " "*" " "  " " 
7  ( 1 ) " " "*"  "*" "*"  "*" "*"  " " "*" "*"  " " 
8  ( 1 ) " " "*"  "*" "*"  "*" "*"  " " "*" "*"  "*" 

The summary marks with "*" the variables in the best model for each number of predictors. It is also possible to use a specific measure, such as Mallows' Cp, to find the best result:

which.min(summary(mtcars.regsubsets)$cp)

3

This leads to the conclusion that the 3-variable set has the lowest Mallows' Cp.

Is there a way to automatically select those columns (wt, qsec and am in this case) from the data set, so that it returns a new data set with only those three columns?

mtcars[, c(6, 7, 9)]

I have checked the help file for the leaps package, StackOverflow, and other sources, but found no solution.

Answer 1

Score: 2

Here is one way using dplyr:

library(leaps)
library(dplyr)
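mtcars.regsubsets <- regsubsets(mpg ~ ., data = mtcars)

# tidy() (from broom) returns one row per subset size, with logical inclusion columns
cols_sel <- mtcars.regsubsets |>
  broom::tidy() |>
  filter(mallows_cp == min(mallows_cp)) |>  # keep the row with the lowest Cp
  select(where(~.x == TRUE)) |>             # keep columns flagged TRUE in that row
  colnames()

# any_of() ignores "(Intercept)", which is not a column of mtcars
select(mtcars, any_of(cols_sel))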

Answer 2

Score: 0

You can use the summary's "which" component and scan it row by row for TRUE values, then retrieve the associated column (= variable) names and store them in a list. You can then pick the column names from this list by the desired number of variables.

With base R:

column_sets <- 
  apply(
    regsubsets(mpg ~ ., data = mtcars) |>
    summary() |>
    (`[[`)('which'),
    MARGIN = 1,
    FUN = \(x) names(x)[x][-1] ## drop "(Intercept)"
  )
mtcars[column_sets[['3']]]
                       wt  qsec am
Mazda RX4           2.620 16.46  1
Mazda RX4 Wag       2.875 17.02  1
Datsun 710          2.320 18.61  1
Hornet 4 Drive      3.215 19.44  0
Hornet Sportabout   3.440 17.02  0
Valiant             3.460 20.22  0
Duster 360          3.570 15.84  0
## ...
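
The subset size does not have to be typed in by hand. As a minimal sketch, assuming the goal is the size with the lowest Mallows' Cp as in the question, the index from which.min() can be used to pick the entry from the list. The names fit_summary, best_size, best_cols and mtcars_best are only illustrative, and simplify = FALSE (available in R 4.1 and later) is added so that apply() always returns a list:

```r
library(leaps)

# Keep the summary so that `which` and `cp` come from the same fit
fit_summary <- summary(regsubsets(mpg ~ ., data = mtcars))

# One character vector of predictor names per subset size, intercept dropped
column_sets <- apply(
  fit_summary$which,
  MARGIN = 1,
  FUN = \(x) setdiff(names(x)[x], "(Intercept)"),
  simplify = FALSE
)

# Position of the subset size with the lowest Mallows' Cp
best_size <- which.min(fit_summary$cp)
best_cols <- column_sets[[best_size]]

# New data set containing only the selected columns
mtcars_best <- mtcars[best_cols]
head(mtcars_best)
```

One caveat: the column names of the "which" matrix come from the model matrix, so with factor predictors they are dummy-variable names and may not match the columns of the original data frame directly.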

Answer 3

Score: 0

Get the summary, which has a "which" matrix, take the column sums excluding the intercept, and use tail to get the columns with the most TRUE values. Note that this ranks variables by how often they are selected across all subset sizes; it matches the best 3-variable subset here, but that is not guaranteed in general.

ss <- summary(mtcars.regsubsets, nvmax = 10)

names(tail(sort(colSums(ss$which[, -1 ])), which.min(ss$cp)))
# [1] "qsec" "am"   "wt"
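
If the exact best subset at the Cp-minimising size is wanted rather than a ranking by selection frequency, one possible alternative is to read the variable names off the coefficients of that model. The sketch below assumes the coef() method that leaps provides for regsubsets objects, with its id argument selecting the model size; best_id and best_cols are illustrative names:

```r
library(leaps)

mtcars.regsubsets <- regsubsets(mpg ~ ., data = mtcars)
ss <- summary(mtcars.regsubsets)

# Size of the model with the lowest Mallows' Cp
best_id <- which.min(ss$cp)

# Names of the coefficients of that model, without the intercept
best_cols <- setdiff(names(coef(mtcars.regsubsets, id = best_id)), "(Intercept)")

# drop = FALSE keeps a data frame even if only one column is selected
mtcars[, best_cols, drop = FALSE]
```

For the fit in the question this should select wt, qsec and am, the same three columns as above.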

huangapple
  • Published on May 25, 2023, 18:39:56
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76331386.html