How to detect if a data.frame is grouped by dplyr from a subfunction?

Question

I have an R package in which some functions are designed to be called inside the dplyr functions `mutate` or `summarize`.

    newdata <- dplyr::mutate(group_by(olddata, col1), newcol = myfunc(col1))

However, users sometimes forget to group their data before passing it to the `mutate` or `summarize` call.

    newdata <- dplyr::mutate(olddata, newcol = myfunc(col1))

When the data frame is not grouped first, the package functions produce largely nonsensical results. No error or warning is raised, though, which can leave users unsure of the cause.

I'd like `myfunc` to issue a `warning()` when it detects that its input does not come from a grouped `data.frame`. However, I can't figure out how `myfunc` could detect this. It appears that `mutate` passes only a vector to `myfunc`, so both `dplyr::is.grouped_df(x)` and `inherits(x, "grouped_df")` return `FALSE`.

What I would like:

    myfunc <- function(x) {
      if (comes.from.grouped.df) {
        print("grouped")
      } else {
        print("ungrouped")
      }
    }

    mutate(olddata, newcol = myfunc(col1))
    'ungrouped'

    mutate(group_by(olddata, col1), newcol = myfunc(col1))
    'grouped'
    'grouped'
    'grouped'
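
To make the obstacle concrete, here is a minimal sketch of what a function called inside a grouped verb actually receives. The helpers `probe_class` and `probe_grouped` are hypothetical names introduced only for this illustration, and `iris` stands in for `olddata`.

```r
library(dplyr)

# Hypothetical probes: report what the function is handed by dplyr.
probe_class <- function(x) class(x)[1]
probe_grouped <- function(x) inherits(x, "grouped_df")

out <- iris %>%
  group_by(Species) %>%
  summarise(
    cls = probe_class(Sepal.Length),          # class of the argument
    is_grouped = probe_grouped(Sepal.Length), # does it carry the grouped_df class?
    n_values = length(Sepal.Length)           # number of values seen per call
  )

out
```

Each call sees only the plain numeric vector for one group, so the argument itself carries no grouping information; any check has to look at the calling context instead.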


# Answer 1
**Score**: 5

If you want your function to be used only within a specific context, and to emit a warning if the data frame is not grouped, then you can do:

``` r
library(tidyverse)

myfunc <- function(x) {
  # Heuristic: when dplyr evaluates the call, the caller's frame holds only `~`
  if(all(ls(envir = parent.frame()) == "~")) {
    ss <- sys.status()
    # Name of every function currently on the call stack
    funcs <- sapply(ss$sys.calls, function(x) deparse(as.list(x)[[1]]))
    wf <- which(funcs == "mutate")
    if(length(wf) == 0) stop("`myfunc` must be called from inside `mutate`")
    # Use the innermost `mutate` call and read its `.data` argument
    wf <- max(wf)
    data <- eval(substitute(.data), ss$sys.frames[[wf]])
    if(!inherits(data, "grouped_df")) {
      warning("`myfunc` called on an ungrouped data frame / tibble.")
    }
    return(x^2)
  }
  stop("`myfunc` must be called from inside `mutate`")
}
```

Used outside `mutate`, we get an error:

``` r
myfunc(1:10)
#> Error in myfunc(1:10): `myfunc` must be called from inside `mutate`
```

With an ungrouped data frame or tibble we get a warning:

``` r
tibble(iris) %>% 
  mutate(x = myfunc(Sepal.Length))
#> Warning in myfunc(Sepal.Length): `myfunc` called on an ungrouped data frame /
#> tibble.
#> # A tibble: 150 x 6
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species     x
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>   <dbl>
#>  1          5.1         3.5          1.4         0.2 setosa   26.0
#>  2          4.9         3            1.4         0.2 setosa   24.0
#>  3          4.7         3.2          1.3         0.2 setosa   22.1
#>  4          4.6         3.1          1.5         0.2 setosa   21.2
#>  5          5           3.6          1.4         0.2 setosa   25  
#>  6          5.4         3.9          1.7         0.4 setosa   29.2
#>  7          4.6         3.4          1.4         0.3 setosa   21.2
#>  8          5           3.4          1.5         0.2 setosa   25  
#>  9          4.4         2.9          1.4         0.2 setosa   19.4
#> 10          4.9         3.1          1.5         0.1 setosa   24.0
#> # ... with 140 more rows
```

And it runs without complaint if the tibble is grouped:

``` r
tibble(iris) %>% 
  group_by(Species) %>%
  mutate(x = myfunc(Sepal.Length))
#> # A tibble: 150 x 6
#> # Groups:   Species [3]
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species     x
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>   <dbl>
#>  1          5.1         3.5          1.4         0.2 setosa   26.0
#>  2          4.9         3            1.4         0.2 setosa   24.0
#>  3          4.7         3.2          1.3         0.2 setosa   22.1
#>  4          4.6         3.1          1.5         0.2 setosa   21.2
#>  5          5           3.6          1.4         0.2 setosa   25  
#>  6          5.4         3.9          1.7         0.4 setosa   29.2
#>  7          4.6         3.4          1.4         0.3 setosa   21.2
#>  8          5           3.4          1.5         0.2 setosa   25  
#>  9          4.4         2.9          1.4         0.2 setosa   19.4
#> 10          4.9         3.1          1.5         0.1 setosa   24.0
#> # ... with 140 more rows
```

<sup>Created on 2023-02-15 with reprex v2.0.2</sup>
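
The answer above inspects the call stack. A possible alternative, offered only as a sketch, is to ask dplyr about the current group with its context helper `cur_group()`. This assumes dplyr 1.0.0 or later, where `cur_group()` returns a one-row tibble with one column per grouping variable and raises an error when called outside a dplyr verb. The name `myfunc_ctx` is hypothetical, and `x^2` is reused from the answer as a stand-in for the real computation.

```r
library(dplyr)

# Sketch of a context-based check; `myfunc_ctx` is a hypothetical name.
# Assumes dplyr >= 1.0.0, which provides `cur_group()`.
myfunc_ctx <- function(x) {
  # `cur_group()` errors outside a dplyr verb, so an error means "wrong context".
  keys <- tryCatch(dplyr::cur_group(), error = function(e) NULL)
  if (is.null(keys)) {
    stop("`myfunc_ctx` must be called from inside a dplyr verb such as `mutate`")
  }
  # One column per grouping variable: zero columns means no grouping variables.
  if (ncol(keys) == 0) {
    warning("`myfunc_ctx` called on an ungrouped data frame / tibble.")
  }
  x^2
}

# Ungrouped input: the check is expected to raise the warning.
ungrouped <- mutate(iris, x = myfunc_ctx(Sepal.Length))

# Grouped input: the check is expected to pass quietly.
grouped <- iris %>%
  group_by(Species) %>%
  mutate(x = myfunc_ctx(Sepal.Length))
```

Because the check does not match on the function name `mutate`, it should apply equally inside `summarise` and when the verb is called as `dplyr::mutate(...)`. A row-wise data frame with no grouping variables would also have zero key columns, so it would trigger the same warning.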

huangapple
  • Published on 2023-02-16 04:08:05
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75464969.html