2023-03-09 22:38:58

dplyr & groups: what is the difference between keep and ungroup or directly drop?

Question

I need to sum the observations that refer to the same individual, without having a unique identification code/row for them.

This is a sample of the dataset

> head(dataset, 20)
   nquest nord tpens
1     173    1  1800
2     633    1   300
3     633    1   600
4     923    1   500
5    2886    1  1211
6    2886    2  2100
7    5416    1   700
8    7886    1  1800
9    7886    1   200
10  20297    1  1200
11  20711    2  2000
12  22169    1   600
13  22169    1   280
14  22173    2  1000
15  22276    1  1200
16  22286    1   850
17  22286    2   650
18  22657    1  1400
19  22657    2  1500
20  23490    1  1400

The variables are:

nquest = the code of the family to which the individual belongs
nord = the position of the individual in the family (1 = husband, 2 = wife, 3 = son, etc.)
tpens = the wage that each individual earns

I need to sum the wage values that refer to the same individual. For example, rows 2 and 3 of the sample above both have nquest = 633 and nord = 1.

These tpens values refer to the same individual, because not only is nquest (the family code) the same, but so is nord.

I've tried to do it in two ways (following some suggestions).

First way

new_dataset <- dataset %>%
  replace(is.na(.), 0) %>%
  group_by(nquest, nord) %>%
  summarize(tpens = sum(tpens), .groups = 'drop')

Second way

new_dataset <- dataset %>%
  replace(is.na(.), 0) %>%
  group_by(nquest, nord) %>%
  summarize(tpens = sum(tpens), .groups = 'keep') %>%
  ungroup()

Are they right?
Can anyone explain the difference between computing the sum with the groups kept and then ungrouping, versus dropping the groups directly?

I'm a bit confused because I don't understand this: if I sum the values that correspond to each individual, I should not have groups at the end of the process, just one individual per row (am I wrong?). Yet if I merge this dataset with another one, matching by nquest and nord (hence for each person), I get # A tibble: 6 x 41 # Groups: nquest, nord [6].

How is that possible?

Answer 1

Score: 0

The difference between .groups = 'keep' and .groups = 'drop' lies in the state of the tibble after summarize() runs. With .groups = 'keep', the result keeps the same grouping as the input and stays grouped until you run ungroup(). With .groups = 'drop', the result is no longer grouped. So both of your approaches end up with the same ungrouped tibble. To learn more, see the "Verbs" section of the documentation.

Take this example:

data("iris")
library(dplyr)
## Let's try "keep"
grouped <- iris %>%
  group_by(Species) %>%
  summarise(count = n(), .groups = "keep")
grouped
#> # A tibble: 3 × 2
#> # Groups:   Species [3]
#>   Species    count
#>   <fct>      <int>
#> 1 setosa        50
#> 2 versicolor    50
#> 3 virginica     50
grouped %>% group_data()
#> # A tibble: 3 × 2
#>   Species          .rows
#>   <fct>      <list<int>>
#> 1 setosa             [1]
#> 2 versicolor         [1]
#> 3 virginica          [1]
## Now, let's try "drop"
ungrouped <- iris %>%
  group_by(Species) %>%
  summarise(count = n(), .groups = "drop")
ungrouped
#> # A tibble: 3 × 2
#>   Species    count
#>   <fct>      <int>
#> 1 setosa        50
#> 2 versicolor    50
#> 3 virginica     50
ungrouped %>% group_data()
#> # A tibble: 1 × 1
#>         .rows
#>   <list<int>>
#> 1         [3]

The key difference in these outputs is the grouping: if we do not call ungroup() or use .groups = 'drop', the output remains grouped. This means that later operations will treat this tibble as grouped, which can have unintended consequences.
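
As a minimal sketch of one such unintended consequence (the share column is a hypothetical addition, not part of the example above), computing each species' share of all rows gives different results depending on whether the summary is still grouped:

```r
library(dplyr)

kept <- iris %>%
  group_by(Species) %>%
  summarise(count = n(), .groups = "keep")

dropped <- iris %>%
  group_by(Species) %>%
  summarise(count = n(), .groups = "drop")

# Still grouped by Species: sum(count) is computed within each one-row group,
# so every share comes out as 1.
kept_share <- kept %>% mutate(share = count / sum(count))

# Ungrouped: sum(count) covers all three rows, so each share is 50 / 150.
dropped_share <- dropped %>% mutate(share = count / sum(count))

kept_share$share
dropped_share$share
```

The two pipelines differ only in the .groups value, yet the same mutate() call answers a different question in each case.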

If you only need grouping for a single function call, try the .by argument (available in dplyr 1.1.0 and later). This way, instead of having to remember .groups = 'drop' or ungroup(), you can just write:

iris %>%
  summarise(count = n(), .by = Species)
#> # A tibble: 3 × 2
#>   Species    count
#>   <fct>      <int>
#> 1 setosa        50
#> 2 versicolor    50
#> 3 virginica     50
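
To tie this back to the question, here is a rough sketch using the first six rows of the question's sample. The other table and its region column are hypothetical, made up only to illustrate the join. It checks that both approaches from the question give the same ungrouped result, and that a join takes its grouping from the left-hand table:

```r
library(dplyr)

# First six rows of the sample shown in the question
dataset <- tibble(
  nquest = c(173, 633, 633, 923, 2886, 2886),
  nord   = c(1, 1, 1, 1, 1, 2),
  tpens  = c(1800, 300, 600, 500, 1211, 2100)
)

# First way: drop the groups inside summarise()
way1 <- dataset %>%
  group_by(nquest, nord) %>%
  summarise(tpens = sum(tpens), .groups = "drop")

# Second way: keep the groups, then remove them explicitly
way2 <- dataset %>%
  group_by(nquest, nord) %>%
  summarise(tpens = sum(tpens), .groups = "keep") %>%
  ungroup()

all.equal(way1, way2)

# A summary that was left grouped (no ungroup() call)
still_grouped <- dataset %>%
  group_by(nquest, nord) %>%
  summarise(tpens = sum(tpens), .groups = "keep")

# Hypothetical second table keyed by the same two columns
other <- tibble(
  nquest = c(633, 2886),
  nord   = c(1, 2),
  region = c("north", "south")
)

# Joins take their grouping from the left-hand table
joined_grouped <- left_join(still_grouped, other, by = c("nquest", "nord"))
joined_plain   <- left_join(way1, other, by = c("nquest", "nord"))

is_grouped_df(joined_grouped)
is_grouped_df(joined_plain)
```

If a merged result still prints # Groups: nquest, nord, one likely cause is that the left-hand table was still grouped when it was joined. Calling ungroup() before or after the join removes the grouping.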

Learn more about grouped data in the dplyr documentation.
