R function to summarise using dplyr group_by with flexible groups, including no grouping at all

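# Question

I want to write an R function using dplyr to summarise a data set, where the function accepts a varying number of grouping variables for the `group_by` call, including no grouping at all. I have found answers to similar questions that use `group_by_`, but this has been deprecated (the dplyr version at the time of writing is 1.1.2).

I have tried different ways of passing vectors to `group_by` using tidy evaluation, but none has worked as expected, and all of them failed to return an answer when no grouping is required.

Here is the basis for a reproducible example using the starwars dataset. The function should be able to return summary tables of the Body-Mass Indexes (BMI) of the various creatures.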

```r
library(dplyr)

star_wars_BMI <- function(group_vec) {
  df_out <- starwars %>%
    mutate(BMI = height/mass^2) %>%
    group_by(group_vec) %>%
    summarise(height_mean = mean(height, na.rm = TRUE),
              mass_mean = mean(mass, na.rm = TRUE),
              BMI_mean = mean(BMI, na.rm = TRUE))
  return(df_out)
}

group_vector0 <- c()  # i.e. summarise for the whole galaxy
group_vector1 <- c("homeworld")  # summarise by homeworld planet
group_vector2 <- c("homeworld", "species")  # summarise by species on each homeworld

galaxy_BMI <- star_wars_BMI(group_vec = group_vector0)
homeworld_BMI <- star_wars_BMI(group_vec = group_vector1)
```

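I know it is relatively simple to write separate functions for the grouped and ungrouped cases, but I would like to see whether this can be done with just one function.

An explanation of the tidy evaluation rationale would be very much appreciated, as would an example that goes on to plot the summaries.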




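# Answer 1
**Score**: 3

Here is another option, which uses the ellipsis (`...`) to pass the grouping column names through to `group_by`. Instead of a single vector, the column names are now passed as separate arguments.

`rlang::ensyms(...)` captures the column names as symbols (it accepts quoted strings as well as bare names), and `!!!` then unquotes and splices them into the `group_by` call. With no arguments the list of symbols is empty, so no grouping is applied: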

```R
library(dplyr)

star_wars_BMI <- function(...) {

  # Capture the arguments (bare names or strings) as a list of symbols
  group_vec <- rlang::ensyms(...)

  df_out <- starwars %>%
    mutate(BMI = height/mass^2) %>%
    # Splice the symbols in as grouping columns; an empty list means no grouping
    group_by(!!!group_vec) %>%
    summarise(height_mean = mean(height, na.rm = TRUE),
              mass_mean = mean(mass, na.rm = TRUE),
              BMI_mean = mean(BMI, na.rm = TRUE))

  return(df_out)
}
```

`star_wars_BMI()` returns:

```
  height_mean mass_mean BMI_mean
        <dbl>     <dbl>    <dbl>
1        174.      97.3   0.0481
```

`star_wars_BMI("homeworld")` returns:

```
# A tibble: 49 × 4
   homeworld      height_mean mass_mean BMI_mean
   <chr>                <dbl>     <dbl>    <dbl>
 1 Alderaan              176.      64     0.0463
 2 Aleen Minor            79       15     0.351 
 3 Bespin                175       79     0.0280
 4 Bestine IV            180      110     0.0149
 5 Cato Neimoidia        191       90     0.0236
 6 Cerea                 198       82     0.0294
 7 Champala              196      NaN   NaN     
 8 Chandrila             150      NaN   NaN     
 9 Concord Dawn          183       79     0.0293
10 Corellia              175       78.5   0.0284
# ... with 39 more rows
# ℹ Use `print(n = ...)` to see more rows
```

`star_wars_BMI("homeworld", "species")` returns:

```
`summarise()` has grouped output by 'homeworld'. You can override using the
`.groups` argument.
# A tibble: 58 × 5
# Groups:   homeworld [49]
   homeworld      species   height_mean mass_mean BMI_mean
   <chr>          <chr>           <dbl>     <dbl>    <dbl>
 1 Alderaan       Human            176.      64     0.0463
 2 Aleen Minor    Aleena            79       15     0.351 
 3 Bespin         Human            175       79     0.0280
 4 Bestine IV     Human            180      110     0.0149
 5 Cato Neimoidia Neimodian        191       90     0.0236
 6 Cerea          Cerean           198       82     0.0294
 7 Champala       Chagrian         196      NaN   NaN     
 8 Chandrila      Human            150      NaN   NaN     
 9 Concord Dawn   Human            183       79     0.0293
10 Corellia       Human            175       78.5   0.0284
# ... with 48 more rows
# ℹ Use `print(n = ...)` to see more rows
```
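On the tidy evaluation rationale the question asks about, here is a small sketch of why the original `group_by(group_vec)` does not do what was intended. `group_by()` uses data masking: it looks for a column called `group_vec`, does not find one, and falls back to the variable in the calling environment, whose value then becomes a new constant column. Converting the strings to symbols and splicing them with `!!!` makes `group_by()` see real column names instead. The sketch uses `rlang::syms()`, the character-vector counterpart of `ensyms()`; it is not part of the answer above and is only here for illustration.

```r
library(dplyr)

group_vec <- "homeworld"

# Data masking: `group_vec` is not a column of starwars, so its value
# ("homeworld") is recycled into a NEW column that is named group_vec
wrong <- starwars %>% group_by(group_vec)
group_vars(wrong)   # "group_vec"
n_groups(wrong)     # 1: every row has the same constant value

# Turning the strings into symbols and splicing them in with `!!!`
# is equivalent to typing group_by(homeworld) by hand
right <- starwars %>% group_by(!!!rlang::syms(group_vec))
group_vars(right)   # "homeworld"

# An empty character vector gives an empty list of symbols,
# so the call reduces to group_by() with no grouping at all
none <- starwars %>% group_by(!!!rlang::syms(character(0)))
group_vars(none)    # character(0)
```

The same reasoning explains why the function in the answer handles the no-argument case: splicing an empty list leaves `group_by()` with nothing to group by.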

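# Answer 2
**Score**: 2

Hope you are doing well.

I believe you can use `across()` inside `group_by()`. Wrapping the vector in `all_of()` tells tidyselect that `group_vec` is an external character vector rather than a column name; recent tidyselect versions emit a deprecation warning when it is omitted.

For example:

```r
library(dplyr)

star_wars_BMI <- function(group_vec) {
  df_out <- starwars %>%
    mutate(BMI = height/mass^2) %>%
    # Select the grouping columns by name; an empty vector selects none
    group_by(across(all_of(group_vec))) %>%
    summarise(height_mean = mean(height, na.rm = TRUE),
              mass_mean = mean(mass, na.rm = TRUE),
              BMI_mean = mean(BMI, na.rm = TRUE))
  return(df_out)
}

group_vector0 <- c()  # i.e. summarise for the whole galaxy
group_vector1 <- c("homeworld")  # summarise by homeworld planet
group_vector2 <- c("homeworld", "species")  # summarise by species on each homeworld

star_wars_BMI(group_vec = group_vector0)
star_wars_BMI(group_vec = group_vector1)
star_wars_BMI(group_vec = group_vector2)
```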
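The question also asks for an example that goes on to plot the summaries. The following is a minimal sketch rather than anything from the answers: `summarise_bmi()` is a hypothetical helper name, the use of ggplot2 and a bar chart is an assumption, and `.groups = "drop"` is added so the result comes back ungrouped and without the `summarise()` message shown earlier. The `BMI` formula is kept exactly as posted in the question so the numbers match the outputs above (conventional BMI would be `mass / (height / 100)^2`, since `height` is in centimetres).

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: same summary as in the answers, grouped by a
# character vector of column names (empty vector = no grouping)
summarise_bmi <- function(group_vec = character(0)) {
  starwars %>%
    mutate(BMI = height / mass^2) %>%  # formula as posted in the question
    group_by(across(all_of(group_vec))) %>%
    summarise(
      height_mean = mean(height, na.rm = TRUE),
      mass_mean = mean(mass, na.rm = TRUE),
      BMI_mean = mean(BMI, na.rm = TRUE),
      .groups = "drop"  # return an ungrouped tibble
    )
}

homeworld_BMI <- summarise_bmi("homeworld")

# Drop homeworlds with no usable value, then draw one bar per homeworld
bmi_plot <- homeworld_BMI %>%
  filter(!is.na(homeworld), !is.na(BMI_mean)) %>%
  ggplot(aes(x = BMI_mean, y = reorder(homeworld, BMI_mean))) +
  geom_col() +
  labs(x = "Mean of height / mass^2", y = "Homeworld")

print(bmi_plot)
```

Calling `summarise_bmi()` with no argument gives the single-row summary for the whole galaxy, and `summarise_bmi(c("homeworld", "species"))` gives one row per homeworld and species combination, so the same plotting code can be adapted by changing the `aes()` mapping.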

huangapple
  • Published on 24 May 2023, 21:43:46
  • Please keep the link to this article when reposting: https://go.coder-hub.com/76324205.html