Dynamically select multiple columns whose names are stored as variables

Question

I would like a function to be able to accept a tibble and a character vector indicating the column names of a variable number of columns in that tibble, and perform some operations such as group_by on it.

Here is an example that does it for 0, 1, or 2 columns:

library(tidyverse)

ex = crossing(abc=LETTERS[1:3], xyz=LETTERS[24:26]) %>% mutate(n = row_number())

group_flexibly = function(tbl, group_by_cols=character(0)) {
  if (length(group_by_cols)==0) {
    tbl %>%
      summarize(.groups='keep', mean_n = mean(n))
  } else if (length(group_by_cols)==1) {
    tbl %>%
      group_by(!!as.name(group_by_cols[1])) %>%
      summarize(.groups='keep', mean_n=mean(n))
  } else if (length(group_by_cols)==2) {
    tbl %>%
      group_by(!!as.name(group_by_cols[1]), !!as.name(group_by_cols[2])) %>%
      summarize(.groups='keep', mean_n=mean(n))
  }
}

group_flexibly(ex)
group_flexibly(ex, 'abc')
group_flexibly(ex, 'xyz')
group_flexibly(ex, c('abc','xyz'))

Output is as desired:

> group_flexibly(ex)
# A tibble: 1 × 1
  mean_n
   <dbl>
1      5
> group_flexibly(ex, 'abc')
# A tibble: 3 × 2
# Groups:   abc [3]
  abc   mean_n
  <chr>  <dbl>
1 A          2
2 B          5
3 C          8
> group_flexibly(ex, 'xyz')
# A tibble: 3 × 2
# Groups:   xyz [3]
  xyz   mean_n
  <chr>  <dbl>
1 X          4
2 Y          5
3 Z          6
> group_flexibly(ex, c('abc','xyz'))
# A tibble: 9 × 3
# Groups:   abc, xyz [9]
  abc   xyz   mean_n
  <chr> <chr>  <dbl>
1 A     X          1
2 A     Y          2
3 A     Z          3
4 B     X          4
5 B     Y          5
6 B     Z          6
7 C     X          7
8 C     Y          8
9 C     Z          9

So far so good. Now, how to write such a function that does this for a character vector of arbitrary length?

Here are two things that do not work:

group_by_cols = c('abc','xyz')
ex %>% group_by(!!as.name(group_by_cols)) %>% summarize(.groups='keep', mean_n=mean(n))
ex %>% group_by({{group_by_cols}}) %>% summarize(.groups='keep', mean_n=mean(n))

Problems encountered so far:

  • !!as.name(group_by_cols) only uses group_by_cols[1] and ignores the rest of the vector.
  • {{group_by_cols}} throws an error if length(group_by_cols) != 1.
  • Popular StackOverflow discussions such as this do not address a need for the length of the vector of column names to be variable.
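
The first problem can be reproduced in base R without any tidyverse code. The snippet below is only an illustrative sketch (the names sym and syms_list are made up for this example): as.name() takes just the first element of a longer character vector, so the remaining column names never reach group_by().

```r
group_by_cols <- c("abc", "xyz")

# as.name() keeps only the first element of a longer character vector
sym <- as.name(group_by_cols)
stopifnot(identical(sym, as.name("abc")))

# Converting each name separately yields one symbol per column
syms_list <- lapply(group_by_cols, as.name)
stopifnot(length(syms_list) == 2)
```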

Answer 1

Score: 3

You're looking for across() and all_of():

group_flexibly <- function(tbl, grp_cols = character(0)) {
  tbl |>
    group_by(across(all_of(grp_cols))) |>
    summarise(mean_n = mean(n), .groups = 'keep')
}

The default value of character(0) handles the case of not providing any value to grp_cols.
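
As a quick check, the function can be run against the example data from the question with zero, one, and two grouping columns. This is a sketch that assumes dplyr and tidyr are installed and R is at least 4.1 (for the native pipe); res0, res1 and res2 are names chosen only for this example.

```r
library(dplyr)
library(tidyr)

# Example data from the question: 9 rows, n = 1..9
ex <- crossing(abc = LETTERS[1:3], xyz = LETTERS[24:26]) |>
  mutate(n = row_number())

# The function from this answer
group_flexibly <- function(tbl, grp_cols = character(0)) {
  tbl |>
    group_by(across(all_of(grp_cols))) |>
    summarise(mean_n = mean(n), .groups = "keep")
}

res0 <- group_flexibly(ex)                   # no grouping columns
res1 <- group_flexibly(ex, "abc")            # one grouping column
res2 <- group_flexibly(ex, c("abc", "xyz"))  # two grouping columns

# The results should match the output shown in the question
stopifnot(
  nrow(res0) == 1, res0$mean_n == 5,
  nrow(res1) == 3, all(res1$mean_n == c(2, 5, 8)),
  nrow(res2) == 9, all(res2$mean_n == 1:9)
)
```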

I actually recently learned that a somewhat preferred version is to use pick() instead of across(). The difference is that, if grp_cols is a named vector, across() will create new columns using those names, whereas pick(all_of(grp_cols)) or the .by argument suggested in a comment would both error on a named vector.
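
The two alternatives mentioned here might look roughly like the following. This is a sketch rather than a definitive implementation: it assumes dplyr 1.1.0 or later (where pick() and .by are available), and the function names group_with_pick and summarise_by are made up for this example. Note that .by groups only for the single summarise() call, so its result comes back ungrouped.

```r
library(dplyr)
library(tidyr)

ex <- crossing(abc = LETTERS[1:3], xyz = LETTERS[24:26]) |>
  mutate(n = row_number())

# Variant 1 (hypothetical name): pick() in place of across()
group_with_pick <- function(tbl, grp_cols) {
  tbl |>
    group_by(pick(all_of(grp_cols))) |>
    summarise(mean_n = mean(n), .groups = "keep")
}

# Variant 2 (hypothetical name): per-operation grouping with .by
summarise_by <- function(tbl, grp_cols) {
  tbl |>
    summarise(mean_n = mean(n), .by = all_of(grp_cols))
}

picked <- group_with_pick(ex, c("abc", "xyz"))
by_abc <- summarise_by(ex, "abc")

stopifnot(
  nrow(picked) == 9,
  nrow(by_abc) == 3,
  !is_grouped_df(by_abc)  # .by does not leave the result grouped
)
```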

huangapple
  • Published on 2023-05-25 03:49:24
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76326954.html