dplyr & groups: what is the difference between keep followed by ungroup, and dropping the groups directly?
Question
I need to sum the observations that refer to the same individual, without having a unique identification code or row for each individual.
This is a sample of the dataset:
> head(dataset, 20)
   nquest nord tpens
1     173    1  1800
2     633    1   300
3     633    1   600
4     923    1   500
5    2886    1  1211
6    2886    2  2100
7    5416    1   700
8    7886    1  1800
9    7886    1   200
10  20297    1  1200
11  20711    2  2000
12  22169    1   600
13  22169    1   280
14  22173    2  1000
15  22276    1  1200
16  22286    1   850
17  22286    2   650
18  22657    1  1400
19  22657    2  1500
20  23490    1  1400
The variables are:
nquest = the code of the family to which the individual belongs
nord = the position of the individual in the family (1 = husband, 2 = wife, 3 = son, etc.)
tpens = the wage that each individual earns
I need to sum the wage values that refer to the same individual. Values of tpens refer to the same individual when not only nquest (the family code) is the same, but also nord.
I've tried to do it in two ways (following some suggestions).
First way:
new_dataset <- dataset %>%
  replace(is.na(.), 0) %>%
  group_by(nquest, nord) %>%
  summarize(tpens = sum(tpens), .groups = 'drop')
Second way:
new_dataset <- dataset %>%
  replace(is.na(.), 0) %>%
  group_by(nquest, nord) %>%
  summarize(tpens = sum(tpens), .groups = 'keep') %>%
  ungroup()
Are they right?
Can anyone explain the difference between computing the sum with the groups kept ('keep') and then calling ungroup(), and dropping the groups directly ('drop')?
I'm a bit confused because I do not understand this: if I sum the values that correspond to each individual, I should not have groups at the end of the process, just one individual per row (am I wrong?). Yet if I merge this dataset with another one, matching by nquest and nord (hence for each person), I get # A tibble: 6 x 41 # Groups: nquest, nord [6].
How is that possible?
Answer 1
Score: 0
The difference between using .groups = 'keep' and .groups = 'drop' lies in the state of the tibble after summarize() runs. With .groups = 'keep', the tibble stays grouped until you call ungroup(). With .groups = 'drop', the tibble is no longer grouped once summarize() returns. To learn more, check out the "Verbs" section of the dplyr documentation on grouped data.
Take this example:
data("iris")
library(dplyr)
## Let's try "keep"
grouped <- iris %>%
  group_by(Species) %>%
  summarise(count = n(), .groups = "keep")
grouped
#> # A tibble: 3 × 2
#> # Groups:   Species [3]
#>   Species    count
#>   <fct>      <int>
#> 1 setosa        50
#> 2 versicolor    50
#> 3 virginica     50
grouped %>% group_data()
#> # A tibble: 3 × 2
#>   Species          .rows
#>   <fct>      <list<int>>
#> 1 setosa             [1]
#> 2 versicolor         [1]
#> 3 virginica          [1]
## Now, let's try "drop"
ungrouped <- iris %>%
  group_by(Species) %>%
  summarise(count = n(), .groups = "drop")
ungrouped
#> # A tibble: 3 × 2
#>   Species    count
#>   <fct>      <int>
#> 1 setosa        50
#> 2 versicolor    50
#> 3 virginica     50
ungrouped %>% group_data()
#> # A tibble: 1 × 1
#>         .rows
#>   <list<int>>
#> 1         [3]
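The key difference between these outputs is the grouping: if we neither call ungroup() nor use .groups = 'drop', the output remains grouped. Future operations will then treat this tibble as grouped, which can have unintended consequences. In the question, both approaches end with the same ungrouped tibble, because the second one calls ungroup() right after summarize().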
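As a rough illustration of such an unintended consequence, the sketch below runs the same mutate() on a kept-grouped and on a dropped result. The names kept, dropped, kept_share, dropped_share, and the share column are made up for this example and do not come from the question.

```r
library(dplyr)

# Same summary, two different grouping states afterwards
kept <- iris %>%
  group_by(Species) %>%
  summarise(count = n(), .groups = "keep")

dropped <- iris %>%
  group_by(Species) %>%
  summarise(count = n(), .groups = "drop")

# Still grouped by Species: sum(count) is computed inside each one-row group
kept_share <- kept %>% mutate(share = count / sum(count))

# Not grouped: sum(count) is computed over all three rows
dropped_share <- dropped %>% mutate(share = count / sum(count))

kept_share$share
#> [1] 1 1 1
dropped_share$share
#> [1] 0.3333333 0.3333333 0.3333333
```

The code is identical in both cases; only the leftover grouping changes what sum() sees.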
If you only need grouping for a single function call, try the .by argument (available in dplyr 1.1.0 and later). Instead of having to remember .groups = 'drop' or ungroup(), you can just write:
iris %>%
  summarise(count = n(), .by = Species)
#> # A tibble: 3 × 2
#>   Species    count
#>   <fct>      <int>
#> 1 setosa        50
#> 2 versicolor    50
#> 3 virginica     50
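Applied to the question, a minimal sketch might look like the following. It assumes a small tibble built from the first six rows of the sample shown in the question; the names way_drop, way_keep, and way_by are made up for illustration. The replace(is.na(.), 0) step is left out because these rows contain no missing values.

```r
library(dplyr)

# First six rows of the sample from the question
dataset <- tibble(
  nquest = c(173, 633, 633, 923, 2886, 2886),
  nord   = c(1, 1, 1, 1, 1, 2),
  tpens  = c(1800, 300, 600, 500, 1211, 2100)
)

# First way: drop the groups inside summarise()
way_drop <- dataset %>%
  group_by(nquest, nord) %>%
  summarise(tpens = sum(tpens), .groups = "drop")

# Second way: keep the groups, then remove them explicitly
way_keep <- dataset %>%
  group_by(nquest, nord) %>%
  summarise(tpens = sum(tpens), .groups = "keep") %>%
  ungroup()

# Per-operation grouping with .by (dplyr >= 1.1.0)
way_by <- dataset %>%
  summarise(tpens = sum(tpens), .by = c(nquest, nord))

# None of the three results is grouped
c(is_grouped_df(way_drop), is_grouped_df(way_keep), is_grouped_df(way_by))
#> [1] FALSE FALSE FALSE

way_drop$tpens
#> [1] 1800  900  500 1211 2100
```

All three results have one row per (nquest, nord) pair. One difference to keep in mind: group_by() sorts the output by the grouping keys, while .by keeps the keys in order of first appearance; with this sample the two orders coincide.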
Learn more about grouped data in the dplyr documentation.