May 25, 2023, 03:49:24

Dynamically select multiple columns whose names are stored as variables

Question

I would like a function that accepts a tibble and a character vector naming a variable number of columns in that tibble, and then performs operations such as group_by on those columns.

Here is an example that does it for 0, 1, or 2 columns:

library(tidyverse)
ex = crossing(abc=LETTERS[1:3], xyz=LETTERS[24:26]) %>% mutate(n = row_number())
group_flexibly = function(tbl, group_by_cols=character(0)) {
  if (length(group_by_cols)==0) {
    tbl %>%
      summarize(.groups='keep', mean_n = mean(n))
  } else if (length(group_by_cols)==1) {
    tbl %>%
      group_by(!!as.name(group_by_cols[1])) %>%
      summarize(.groups='keep', mean_n=mean(n))
  } else if (length(group_by_cols)==2) {
    tbl %>%
      group_by(!!as.name(group_by_cols[1]), !!as.name(group_by_cols[2])) %>%
      summarize(.groups='keep', mean_n=mean(n))
  }
}
group_flexibly(ex)
group_flexibly(ex, 'abc')
group_flexibly(ex, 'xyz')
group_flexibly(ex, c('abc','xyz'))

Output is as desired:

> group_flexibly(ex)
# A tibble: 1 × 1
  mean_n
   <dbl>
1      5
> group_flexibly(ex, 'abc')
# A tibble: 3 × 2
# Groups:   abc [3]
  abc   mean_n
  <chr>  <dbl>
1 A          2
2 B          5
3 C          8
> group_flexibly(ex, 'xyz')
# A tibble: 3 × 2
# Groups:   xyz [3]
  xyz   mean_n
  <chr>  <dbl>
1 X          4
2 Y          5
3 Z          6
> group_flexibly(ex, c('abc','xyz'))
# A tibble: 9 × 3
# Groups:   abc, xyz [9]
  abc   xyz   mean_n
  <chr> <chr>  <dbl>
1 A     X          1
2 A     Y          2
3 A     Z          3
4 B     X          4
5 B     Y          5
6 B     Z          6
7 C     X          7
8 C     Y          8
9 C     Z          9

So far so good. Now, how can I write a function that does this for a character vector of arbitrary length?

Here are two things that do not work:

group_by_cols = c('abc','xyz')
ex %>% group_by(!!as.name(group_by_cols)) %>% summarize(.groups='keep', mean_n=mean(n))
ex %>% group_by({{group_by_cols}}) %>% summarize(.groups='keep', mean_n=mean(n))

Problems encountered so far:

!!as.name(group_by_cols) only uses group_by_cols[1] and ignores the rest of the vector.
{{group_by_cols}} throws an error if length(group_by_cols) != 1.
Popular StackOverflow discussions such as this do not address a need for the length of the vector of column names to be variable.
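
The first problem can be reproduced without dplyr at all. A minimal base R sketch, reusing the group_by_cols vector from above (the names first_symbol and symbols_per_column are only illustrative): as.name() builds a single symbol from the first element of its input, so the second column name never reaches group_by.

```r
group_by_cols <- c("abc", "xyz")

# as.name() returns one symbol, built from the first element only
first_symbol <- as.name(group_by_cols)
print(first_symbol)

# Converting element by element gives one symbol per column name
symbols_per_column <- lapply(group_by_cols, as.name)
print(length(symbols_per_column))
```

So any approach based on a single !!as.name(...) call can refer to at most one column, whatever the length of the vector.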

Answer 1

Score: 3

You're looking for across() and all_of():

group_flexibly <- function(tbl, grp_cols = character(0)) {
  tbl |>
    group_by(across(all_of(grp_cols))) |>
    summarise(mean_n = mean(n), .groups = 'keep')
}

The default value of character(0) handles the case of not providing any value to grp_cols.
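
As a quick check, here is a self-contained sketch that rebuilds the ex tibble from the question and calls the function with zero, one, and two column names. It assumes R 4.1 or later for the |> pipe; the result names no_groups, one_group and two_groups are only illustrative.

```r
library(dplyr)
library(tidyr)

# Same example data as in the question: 9 rows, n runs from 1 to 9
ex <- crossing(abc = LETTERS[1:3], xyz = LETTERS[24:26]) |>
  mutate(n = row_number())

# The function from the answer, unchanged in behaviour
group_flexibly <- function(tbl, grp_cols = character(0)) {
  tbl |>
    group_by(across(all_of(grp_cols))) |>
    summarise(mean_n = mean(n), .groups = "keep")
}

no_groups  <- group_flexibly(ex)                   # no grouping columns
one_group  <- group_flexibly(ex, "abc")            # one grouping column
two_groups <- group_flexibly(ex, c("abc", "xyz"))  # two grouping columns

print(no_groups)
print(one_group)
print(two_groups)
```

The three results should match the corresponding outputs shown in the question.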

I actually recently learned that a somewhat preferred version is to use pick() instead of across(). The difference is that, with across(), a named grp_cols vector creates new columns using those names, whereas pick(all_of(grp_cols)) or the .by argument suggested in a comment would both error on a named vector.
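
For comparison, a rough sketch of the two variants mentioned above, assuming dplyr 1.1.0 or later (where pick() and .by are available). The function names group_with_pick and summarise_by are hypothetical, and the sketch is only exercised with non-empty, unnamed vectors.

```r
library(dplyr)  # pick() and .by are assumed to need dplyr >= 1.1.0
library(tidyr)

ex <- crossing(abc = LETTERS[1:3], xyz = LETTERS[24:26]) |>
  mutate(n = row_number())

# Variant 1: pick() in place of across(); the result stays grouped
group_with_pick <- function(tbl, grp_cols) {
  tbl |>
    group_by(pick(all_of(grp_cols))) |>
    summarise(mean_n = mean(n), .groups = "keep")
}

# Variant 2: per-operation grouping via .by; the result is ungrouped,
# and .groups is not used together with .by
summarise_by <- function(tbl, grp_cols) {
  tbl |>
    summarise(mean_n = mean(n), .by = all_of(grp_cols))
}

picked <- group_with_pick(ex, c("abc", "xyz"))
by_abc <- summarise_by(ex, "abc")

print(picked)
print(by_abc)
```

Note that the .by variant returns an ungrouped tibble, so it is not a drop-in replacement if later steps rely on the grouping kept by .groups = 'keep'.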
