Automatically select best columns for best subset

Question

The leaps package identifies the best subsets of predictors and can report the results. For example:

    library(leaps)
    mtcars.regsubsets <- regsubsets(mpg ~ ., data = mtcars)
    # nvmax is an argument of regsubsets(), not summary(), so it has no effect here
    summary(mtcars.regsubsets, nvmax = 10)

    Selection Algorithm: exhaustive
             cyl disp hp  drat wt  qsec vs  am  gear carb
    1  ( 1 ) " " " "  " " " "  "*" " "  " " " " " "  " "
    2  ( 1 ) "*" " "  " " " "  "*" " "  " " " " " "  " "
    3  ( 1 ) " " " "  " " " "  "*" "*"  " " "*" " "  " "
    4  ( 1 ) " " " "  "*" " "  "*" "*"  " " "*" " "  " "
    5  ( 1 ) " " "*"  "*" " "  "*" "*"  " " "*" " "  " "
    6  ( 1 ) " " "*"  "*" "*"  "*" "*"  " " "*" " "  " "
    7  ( 1 ) " " "*"  "*" "*"  "*" "*"  " " "*" "*"  " "
    8  ( 1 ) " " "*"  "*" "*"  "*" "*"  " " "*" "*"  "*"

The summary marks with "*" the variables in the best model of each size. It is also possible to use a specific measure, such as Mallows's Cp, to find the best size:

    which.min(summary(mtcars.regsubsets)$cp)
    # [1] 3

This shows that the 3-variable set has the lowest Mallows's Cp.

Is there a way to automatically select those columns (wt, qsec and am in this case) from the data set, so that it returns a new data set with only those three columns? Done by hand, that would be:

    mtcars[, c(6, 7, 9)]   # wt, qsec, am

I have checked the help file for the leaps package, StackOverflow, and other sources, but found no solution.

Answer 1

Score: 2

Here is one way using dplyr:

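    library(leaps)
    library(dplyr)

    mtcars.regsubsets <- regsubsets(mpg ~ ., data = mtcars)

    # broom::tidy() (needs the broom package) returns one row per model size,
    # with a logical column per term plus fit statistics such as mallows_cp
    cols_sel <- mtcars.regsubsets |>
      broom::tidy() |>
      filter(mallows_cp == min(mallows_cp)) |>   # row with the lowest Cp
      select(where(~ .x == TRUE)) |>             # columns flagged TRUE in that row
      colnames()

    # any_of() skips names that are not columns of mtcars, such as "(Intercept)"
    select(mtcars, any_of(cols_sel))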
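
For readers who prefer to stay within leaps and base R, the same selection can probably be reached through the coef() method for regsubsets objects, which returns the named coefficients of a given model. This is a different route from the dplyr pipeline above, not a restatement of it. It is a minimal sketch: the names fit, best, best_vars and mtcars_best are illustrative, and it assumes all predictors are numeric (with factor predictors the coefficient names would be dummy-variable names, not column names).

```r
library(leaps)

fit  <- regsubsets(mpg ~ ., data = mtcars)
best <- which.min(summary(fit)$cp)   # row of the model with the lowest Cp

# coef() returns the named coefficients of model `id`;
# "(Intercept)" is removed by name because it is not a column of mtcars
best_vars <- setdiff(names(coef(fit, id = best)), "(Intercept)")

mtcars_best <- mtcars[, best_vars, drop = FALSE]
head(mtcars_best)
```

Indexing by name rather than by position means the result does not depend on the order in which coef() lists the terms.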

Answer 2

Score: 0

You can use the "which" component of the summary: scan it row by row for TRUE values, retrieve the associated column (variable) names, and store them in a list. You can then pick the column names from this list by the desired number of variables.

With base R:

    library(leaps)

    column_sets <-
      apply(
        regsubsets(mpg ~ ., data = mtcars) |>
          summary() |>
          (`[[`)('which'),
        MARGIN = 1,
        FUN = \(x) names(x)[x][-1] ## drop "(Intercept)"
      )

    mtcars[column_sets[['3']]]

                         wt  qsec am
    Mazda RX4         2.620 16.46  1
    Mazda RX4 Wag     2.875 17.02  1
    Datsun 710        2.320 18.61  1
    Hornet 4 Drive    3.215 19.44  0
    Hornet Sportabout 3.440 17.02  0
    Valiant           3.460 20.22  0
    Duster 360        3.570 15.84  0
    ## ...
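
To tie this list back to the criterion in the question, the model size can be taken from which.min() on the Cp values instead of being typed by hand. The following is a sketch under two assumptions: the default nbest = 1, so each row of the "which" matrix is one model size, and lapply() swapped in for apply() (my choice, not the answer's) because it returns a list regardless of how many names each row yields. The names s, best and mtcars_best are illustrative.

```r
library(leaps)

s <- summary(regsubsets(mpg ~ ., data = mtcars))

# One character vector of predictor names per row of the "which" matrix;
# the intercept is dropped by name rather than by position
column_sets <- lapply(
  seq_len(nrow(s$which)),
  \(i) setdiff(colnames(s$which)[s$which[i, ]], "(Intercept)")
)

best <- which.min(s$cp)              # row with the lowest Mallows's Cp
mtcars_best <- mtcars[column_sets[[best]]]
head(mtcars_best)
```

Because best is a row index into the same matrix that column_sets was built from, the lookup stays consistent even if the rows are not labelled by size.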

Answer 3

Score: 0

Get the summary, which contains a "which" matrix, take the column sums excluding the intercept, and use tail() to get the columns with the most TRUE values:

    # mtcars.regsubsets is the fit from the question
    ss <- summary(mtcars.regsubsets, nvmax = 10)
    names(tail(sort(colSums(ss$which[, -1 ])), which.min(ss$cp)))
    # [1] "qsec" "am"   "wt"
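
Counting how often each variable appears gives the best subset only when the best models of successive sizes are nested. In the output shown in the question, the 2-variable model uses cyl, which no other size uses, so the count-based pick and the actual row can disagree. The sketch below compares the two on the same data; k = 2 is a size chosen here purely for illustration, and by_count and by_row are hypothetical names.

```r
library(leaps)

ss <- summary(regsubsets(mpg ~ ., data = mtcars))
k  <- 2   # illustrative size where the two approaches differ

# Count-based pick: the k variables that appear in the most models
by_count <- names(tail(sort(colSums(ss$which[, -1])), k))

# Direct lookup: the variables flagged TRUE in row k of the "which" matrix
by_row <- names(which(ss$which[k, -1]))

by_count   # "am"  "wt"
by_row     # "cyl" "wt"
```

For the Cp-optimal size of 3 both approaches return wt, qsec and am, so the count-based version works for this example; the direct lookup is the safer choice in general.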

huangapple
  • Posted on May 25, 2023, 18:39:56
  • Please keep the link to this article when reposting: https://go.coder-hub.com/76331386.html