Automatically select best columns for best subset
Question
The leaps package identifies the best subsets of predictors and can report the best subset for each model size. For example:
library(leaps)
mtcars.regsubsets <- regsubsets(mpg ~ ., data = mtcars)
summary(mtcars.regsubsets, nvmax = 10)
Selection Algorithm: exhaustive
         cyl disp hp  drat wt  qsec vs  am  gear carb
1  ( 1 ) " " " "  " " " "  "*" " "  " " " " " "  " "
2  ( 1 ) "*" " "  " " " "  "*" " "  " " " " " "  " "
3  ( 1 ) " " " "  " " " "  "*" "*"  " " "*" " "  " "
4  ( 1 ) " " " "  "*" " "  "*" "*"  " " "*" " "  " "
5  ( 1 ) " " "*"  "*" " "  "*" "*"  " " "*" " "  " "
6  ( 1 ) " " "*"  "*" "*"  "*" "*"  " " "*" " "  " "
7  ( 1 ) " " "*"  "*" "*"  "*" "*"  " " "*" "*"  " "
8  ( 1 ) " " "*"  "*" "*"  "*" "*"  " " "*" "*"  "*"
The summary marks with "*" the variables in the best model of each size. It is also possible to use a specific criterion, such as Mallows' Cp, to find the best model overall:
which.min(summary(mtcars.regsubsets)$cp)
3
This shows that the 3-variable model has the lowest Mallows' Cp.
Is there a way to automatically select those columns (wt, qsec and am in this case) from the data set, so that it returns a new data set with only those three columns? That is, the equivalent of:
mtcars[, c(6, 7, 9)]
I have checked the help file for the leaps package, Stack Overflow and other sources, but found no solution.
Answer 1
Score: 2
Here is one way using dplyr and broom:
library(leaps)
library(dplyr)
mtcars.regsubsets <- regsubsets(mpg ~ ., data = mtcars)
cols_sel <- mtcars.regsubsets |>
  broom::tidy() |>
  filter(mallows_cp == min(mallows_cp)) |>
  select(where(~.x == TRUE)) |>
  colnames()
select(mtcars, any_of(cols_sel))
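A variant without dplyr or broom, as a minimal sketch: leaps provides a coef() method for regsubsets objects that returns the coefficients of the model with a given id, so the names of those coefficients (minus the intercept) give the selected columns. The names fit, best_size and best_cols are only illustrative, and the sketch assumes all predictors are numeric, so that coefficient names match column names.

```r
library(leaps)

fit <- regsubsets(mpg ~ ., data = mtcars)

# Index of the model with the lowest Mallows' Cp
best_size <- which.min(summary(fit)$cp)

# Coefficient names of that model, without the intercept
best_cols <- setdiff(names(coef(fit, id = best_size)), "(Intercept)")

# New data set containing only the selected predictors
mtcars_best <- mtcars[, best_cols, drop = FALSE]
head(mtcars_best)
```

With factor predictors the coefficient names would be dummy-variable names rather than column names, so this shortcut would need adapting.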
Answer 2
Score: 0
You can use the "which" component of the summary: scan it row by row for TRUE values, retrieve the associated column (variable) names, and store them in a list. You can then pick the column names from this list for the desired number of variables.
With base R:
library(leaps)
column_sets <-
  apply(
    regsubsets(mpg ~ ., data = mtcars) |>
      summary() |>
      (`[[`)('which'),
    MARGIN = 1,
    FUN = \(x) names(x)[x][-1] ## drop "(Intercept)"
  )
mtcars[column_sets[['3']]]
                     wt  qsec am
Mazda RX4         2.620 16.46  1
Mazda RX4 Wag     2.875 17.02  1
Datsun 710        2.320 18.61  1
Hornet 4 Drive    3.215 19.44  0
Hornet Sportabout 3.440 17.02  0
Valiant           3.460 20.22  0
Duster 360        3.570 15.84  0
## ...
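To avoid hard-coding '3', the list can be indexed with the position of the lowest Cp. The sketch below is one way to do that. It builds the list with lapply() rather than apply(), on the assumption that a guaranteed list is preferable, since apply() simplifies its result to a vector or matrix whenever every row yields the same number of names. The variable names are illustrative.

```r
library(leaps)

s <- summary(regsubsets(mpg ~ ., data = mtcars))

# One character vector of predictor names per model (one per row of s$which)
column_sets <- lapply(seq_len(nrow(s$which)), function(i) {
  in_model <- s$which[i, -1]   # drop the "(Intercept)" column
  names(in_model)[in_model]
})

# Position of the model with the lowest Mallows' Cp
best <- which.min(s$cp)

# Data set restricted to the predictors of that model
mtcars[column_sets[[best]]]
```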
Answer 3
Score: 0
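Get the summary, which has a "which" matrix; take its column sums excluding the intercept, and use tail to keep the columns with the most TRUE values. This picks the variables that appear most often across all model sizes, which matches the best 3-variable model here but is not guaranteed to in general: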
ss <- summary(mtcars.regsubsets, nvmax = 10)
names(tail(sort(colSums(ss$which[, -1 ])), which.min(ss$cp)))
# [1] "qsec" "am" "wt"
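The row of the "which" matrix for the chosen model can also be read directly, which does not depend on how often a variable appears in other models. The sketch below wraps that in a helper. select_best_columns is a hypothetical name, not a leaps function, and the sketch assumes the summary components cp, bic and adjr2, with lower values better for cp and bic and higher values better for adjr2.

```r
library(leaps)

# Hypothetical helper (not part of leaps): returns the predictor columns
# of the best model under the chosen criterion.
select_best_columns <- function(formula, data,
                                criterion = c("cp", "bic", "adjr2")) {
  criterion <- match.arg(criterion)
  s <- summary(regsubsets(formula, data = data))
  scores <- s[[criterion]]

  # Adjusted R^2 is maximised; Cp and BIC are minimised
  best <- if (criterion == "adjr2") which.max(scores) else which.min(scores)

  # Logical row of the chosen model, without the "(Intercept)" column
  keep <- s$which[best, -1]
  data[, names(keep)[keep], drop = FALSE]
}

head(select_best_columns(mpg ~ ., mtcars))
```

Passing "bic" or "adjr2" as the third argument switches the criterion; the selected columns may then differ from the Cp choice.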