Dynamically select multiple columns whose names are stored as variables

Question

I would like a function that accepts a tibble and a character vector naming a variable number of columns in that tibble, and then performs operations such as group_by on those columns.

Here is an example that does it for 0, 1, or 2 columns:

    library(tidyverse)

    # 3 x 3 grid of letter combinations, with rows numbered 1 to 9
    ex = crossing(abc=LETTERS[1:3], xyz=LETTERS[24:26]) %>%
      mutate(n = row_number())

    # One hard-coded branch per supported number of grouping columns (0, 1, or 2)
    group_flexibly = function(tbl, group_by_cols=character(0)) {
      if (length(group_by_cols)==0) {
        tbl %>%
          summarize(.groups='keep', mean_n = mean(n))
      } else if (length(group_by_cols)==1) {
        tbl %>%
          group_by(!!as.name(group_by_cols[1])) %>%
          summarize(.groups='keep', mean_n=mean(n))
      } else if (length(group_by_cols)==2) {
        tbl %>%
          group_by(!!as.name(group_by_cols[1]), !!as.name(group_by_cols[2])) %>%
          summarize(.groups='keep', mean_n=mean(n))
      }
    }

    group_flexibly(ex)
    group_flexibly(ex, 'abc')
    group_flexibly(ex, 'xyz')
    group_flexibly(ex, c('abc','xyz'))

Output is as desired:

    > group_flexibly(ex)
    # A tibble: 1 × 1
      mean_n
       <dbl>
    1      5
    > group_flexibly(ex, 'abc')
    # A tibble: 3 × 2
    # Groups:   abc [3]
      abc   mean_n
      <chr>  <dbl>
    1 A          2
    2 B          5
    3 C          8
    > group_flexibly(ex, 'xyz')
    # A tibble: 3 × 2
    # Groups:   xyz [3]
      xyz   mean_n
      <chr>  <dbl>
    1 X          4
    2 Y          5
    3 Z          6
    > group_flexibly(ex, c('abc','xyz'))
    # A tibble: 9 × 3
    # Groups:   abc, xyz [9]
      abc   xyz   mean_n
      <chr> <chr>  <dbl>
    1 A     X          1
    2 A     Y          2
    3 A     Z          3
    4 B     X          4
    5 B     Y          5
    6 B     Z          6
    7 C     X          7
    8 C     Y          8
    9 C     Z          9

So far so good. Now, how to write such a function that does this for a character vector of arbitrary length?

Here are two things that do not work:

    group_by_cols = c('abc','xyz')
    ex %>% group_by(!!as.name(group_by_cols)) %>% summarize(.groups='keep', mean_n=mean(n))
    ex %>% group_by({{group_by_cols}}) %>% summarize(.groups='keep', mean_n=mean(n))

Problems encountered so far:

  • !!as.name(group_by_cols) only uses group_by_cols[1] and ignores the rest of the vector.
  • {{group_by_cols}} throws an error if length(group_by_cols) != 1.
  • Popular StackOverflow discussions such as this do not address a need for the length of the vector of column names to be variable.
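
The first problem can be reproduced outside of group_by: as.name() always returns a single symbol, built from the first element of its argument. As a minimal sketch of one tidy-eval workaround, assuming the rlang package that is installed alongside the tidyverse, rlang::syms() builds one symbol per name and !!! splices the whole list into group_by(). The variable names first_only, all_syms, and res are only illustrative.

```r
library(dplyr)
library(tidyr)

ex <- crossing(abc = LETTERS[1:3], xyz = LETTERS[24:26]) %>%
  mutate(n = row_number())

group_by_cols <- c("abc", "xyz")

# as.name() yields ONE symbol, taken from the first element only
first_only <- as.name(group_by_cols)

# rlang::syms() yields a list holding one symbol per column name
all_syms <- rlang::syms(group_by_cols)

# !!! splices each symbol in the list into group_by() as its own argument
res <- ex %>%
  group_by(!!!all_syms) %>%
  summarize(.groups = "keep", mean_n = mean(n))
```

Because an empty character vector produces an empty list, splicing it leaves group_by() with no arguments, so this form should also cover the zero-column case.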

Answer 1

Score: 3

You're looking for across() and all_of():

    group_flexibly <- function(tbl, grp_cols = character(0)) {
      tbl |>
        group_by(across(all_of(grp_cols))) |>
        summarise(mean_n = mean(n), .groups = 'keep')
    }

The default value of character(0) handles the case of not providing any value to grp_cols.
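
To check that the single across(all_of()) call covers every length, the sketch below rebuilds the ex tibble from the question and calls the function with zero, one, and two column names. It assumes R 4.1 or later for the native pipe; the result names res0, res1, and res2 are only illustrative.

```r
library(dplyr)
library(tidyr)

ex <- crossing(abc = LETTERS[1:3], xyz = LETTERS[24:26]) |>
  mutate(n = row_number())

group_flexibly <- function(tbl, grp_cols = character(0)) {
  tbl |>
    group_by(across(all_of(grp_cols))) |>
    summarise(mean_n = mean(n), .groups = "keep")
}

res0 <- group_flexibly(ex)                   # no grouping columns
res1 <- group_flexibly(ex, "abc")            # one grouping column
res2 <- group_flexibly(ex, c("abc", "xyz"))  # two grouping columns
```

Note that all_of() is strict: a name in grp_cols that is not a column of tbl raises an error instead of being silently dropped.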

I recently learned that using pick() instead of across() is now somewhat preferred. The difference is that, if grp_cols is a named vector, across() will create new columns using those names, whereas pick(all_of(grp_cols)) or the .by argument suggested in a comment would both raise an error on a named vector.
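
A minimal sketch of the two variants mentioned above, assuming dplyr 1.1.0 or later (the release that introduced pick() and .by). The function names group_flexibly_pick and group_flexibly_by are hypothetical, and the unname() call is an added precaution that is not part of the original answer: it drops any names so that every variant sees a plain character vector.

```r
library(dplyr)
library(tidyr)

ex <- crossing(abc = LETTERS[1:3], xyz = LETTERS[24:26]) |>
  mutate(n = row_number())

# Hypothetical variant: pick() in place of across()
group_flexibly_pick <- function(tbl, grp_cols) {
  grp_cols <- unname(grp_cols)  # assumption: names on the vector are not wanted
  tbl |>
    group_by(pick(all_of(grp_cols))) |>
    summarise(mean_n = mean(n), .groups = "keep")
}

# Hypothetical variant: per-operation grouping via .by (result is ungrouped)
group_flexibly_by <- function(tbl, grp_cols) {
  grp_cols <- unname(grp_cols)
  tbl |>
    summarise(mean_n = mean(n), .by = all_of(grp_cols))
}

res_pick <- group_flexibly_pick(ex, c("abc", "xyz"))
res_by <- group_flexibly_by(ex, c(g = "abc"))  # the name "g" is stripped first
```

The .by form returns an ungrouped tibble, so it is not a drop-in replacement when later steps rely on the groups kept by .groups = "keep".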

huangapple
  • Published on May 25, 2023, 03:49:24
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76326954.html