dplyr & groups: what is the difference between keep and ungroup or directly drop?

Question

I need to sum the observations that refer to the same individual, without having a unique identification code or row for each of them.

This is a sample of the dataset

> head(dataset, 20)
   nquest nord tpens
1     173    1  1800
2     633    1   300
3     633    1   600
4     923    1   500
5    2886    1  1211
6    2886    2  2100
7    5416    1   700
8    7886    1  1800
9    7886    1   200
10  20297    1  1200
11  20711    2  2000
12  22169    1   600
13  22169    1   280
14  22173    2  1000
15  22276    1  1200
16  22286    1   850
17  22286    2   650
18  22657    1  1400
19  22657    2  1500
20  23490    1  1400

The variables are:

  1. nquest = the code of the family to which the individual belongs
  2. nord = the position of the individual in the family (1 = husband, 2 = wife, 3 = son, etc.)
  3. tpens = the wage that each individual earns

I need to sum the wage values that refer to the same individual. For example, when several rows share both nquest and nord (such as rows 2 and 3 in the sample above), their tpens values refer to the same individual, because not only nquest (the family code) is the same, but also nord.

I've tried to do it in two ways (following some suggestions).

First way

new_dataset <- dataset %>%
  replace(is.na(.), 0) %>%
  group_by(nquest, nord) %>%
  summarize(tpens = sum(tpens), .groups = 'drop')

Second way

new_dataset <- dataset %>%
  replace(is.na(.), 0) %>%
  group_by(nquest, nord) %>%
  summarize(tpens = sum(tpens), .groups = 'keep') %>%
  ungroup()

Are they right?
Can anyone explain the difference between computing the sum with the groups kept and then ungrouping, versus dropping the groups directly?

I'm a bit confused because I don't understand this: if I sum the values that correspond to each individual, I should not have groups at the end of the process, just one individual per row (am I wrong?). If I merge this dataset with another one, matching by nquest and nord (hence for each person), I get instead # A tibble: 6 x 41 # Groups: nquest, nord [6].

How is that possible?

Answer 1

Score: 0

The difference between .groups = 'keep' and .groups = 'drop' lies in the state of the tibble after summarize() returns. If you use .groups = 'keep', the result stays grouped by the same variables until you run ungroup(). If you use .groups = 'drop', the result is no longer grouped once summarize() has run. To learn more, see the "Verbs" section of the dplyr documentation.

Take this example:

data("iris")
library(dplyr)

## Let's try "keep"
grouped <- iris %>%
  group_by(Species) %>%
  summarise(count = n(), .groups = "keep")
grouped

#> # A tibble: 3 × 2
#> # Groups:   Species [3]
#>   Species    count
#>   <fct>      <int>
#> 1 setosa        50
#> 2 versicolor    50
#> 3 virginica     50

grouped %>% group_data()
#> # A tibble: 3 × 2
#>   Species          .rows
#>   <fct>      <list<int>>
#> 1 setosa             [1]
#> 2 versicolor         [1]
#> 3 virginica          [1]

## Now, let's try "drop"
ungrouped <- iris %>%
  group_by(Species) %>%
  summarise(count = n(), .groups = "drop")
ungrouped

#> # A tibble: 3 × 2
#>   Species    count
#>   <fct>      <int>
#> 1 setosa        50
#> 2 versicolor    50
#> 3 virginica     50

ungrouped %>% group_data()
#> # A tibble: 1 × 1
#>         .rows
#>   <list<int>>
#> 1         [3]

The key difference between these outputs is the grouping: if we do not call ungroup() or use .groups = 'drop', the output remains grouped. Any later operation will then treat the tibble as grouped, which can have unintended consequences.
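As a minimal sketch of such an unintended consequence (the `share` column is a hypothetical addition, not part of the original example), computing each species' share of the total behaves differently depending on whether the summary is still grouped:

```r
library(dplyr)

# Summary that is still grouped by Species (one row per group)
kept <- iris %>%
  group_by(Species) %>%
  summarise(count = n(), .groups = "keep")

# Same summary with the grouping removed
dropped <- iris %>%
  group_by(Species) %>%
  summarise(count = n(), .groups = "drop")

# sum(count) is evaluated within each one-row group, so every share is 1
kept_share <- kept %>% mutate(share = count / sum(count))

# sum(count) is evaluated over all three rows, so every share is 50 / 150
dropped_share <- dropped %>% mutate(share = count / sum(count))

kept_share$share
dropped_share$share
```

The two pipelines look identical after the summarise() step, which is why a leftover grouping is easy to miss.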

If you only need grouping for a single function, try the .by argument (available since dplyr 1.1.0; see ?dplyr_by). That way, instead of having to remember to use .groups = 'drop' or ungroup(), you can just write:

iris %>%
  summarise(count = n(), .by = Species)

#> # A tibble: 3 × 2
#>   Species    count
#>   <fct>      <int>
#> 1 setosa        50
#> 2 versicolor    50
#> 3 virginica     50
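Applied to the question, all three approaches should give the same per-person totals. The sketch below rebuilds only the first six rows of the sample shown in the question (the object names `by_drop`, `by_keep`, and `by_arg` are made up for illustration) and assumes dplyr 1.1.0 or later for `.by`:

```r
library(dplyr)  # .by needs dplyr 1.1.0 or later

# First six rows of the sample shown in the question
dataset <- data.frame(
  nquest = c(173, 633, 633, 923, 2886, 2886),
  nord   = c(1, 1, 1, 1, 1, 2),
  tpens  = c(1800, 300, 600, 500, 1211, 2100)
)

# First way from the question: drop the groups inside summarise()
by_drop <- dataset %>%
  group_by(nquest, nord) %>%
  summarise(tpens = sum(tpens), .groups = "drop")

# Second way from the question: keep the groups, then ungroup()
by_keep <- dataset %>%
  group_by(nquest, nord) %>%
  summarise(tpens = sum(tpens), .groups = "keep") %>%
  ungroup()

# Per-operation grouping: the result is never grouped
by_arg <- dataset %>%
  summarise(tpens = sum(tpens), .by = c(nquest, nord))

# Compare the two ways from the question, and check the grouping state
all.equal(as.data.frame(by_drop), as.data.frame(by_keep))
is_grouped_df(by_drop)
is_grouped_df(by_keep)
is_grouped_df(by_arg)
```

One difference worth knowing: group_by() sorts the result by the grouping variables, while .by keeps the groups in order of first appearance, so the row order can differ on unsorted data.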

Learn more about grouped data in the dplyr grouping vignette, vignette("grouping").
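On the last point in the question: dplyr joins take their grouping from the left-hand table, so a `# Groups: nquest, nord` header after a merge would be consistent with the summarised table still being grouped when it was joined. A small sketch, assuming the merge was done with a dplyr join; the `other` table and its `age` column are invented purely for illustration:

```r
library(dplyr)

# A few rows from the sample shown in the question
dataset <- data.frame(
  nquest = c(633, 633, 2886, 2886),
  nord   = c(1, 1, 1, 2),
  tpens  = c(300, 600, 1211, 2100)
)

# Hypothetical second table keyed by person; `age` is made up
other <- data.frame(
  nquest = c(633, 2886, 2886),
  nord   = c(1, 1, 2),
  age    = c(70, 68, 66)
)

# Summary that keeps its grouping
still_grouped <- dataset %>%
  group_by(nquest, nord) %>%
  summarise(tpens = sum(tpens), .groups = "keep")

# The join keeps the grouping of its left-hand table
merged <- left_join(still_grouped, other, by = c("nquest", "nord"))
group_vars(merged)

# Ungrouping first (or using .groups = "drop") gives an ungrouped result
merged_plain <- left_join(ungroup(still_grouped), other, by = c("nquest", "nord"))
group_vars(merged_plain)
```

Here group_vars() returns the names of the grouping columns, so an empty character vector means the table is not grouped.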

huangapple
  • Posted on 2023-03-09 22:38:58
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75686053.html