2023-07-13 19:35:37

Cumulative sum per group in tidyverse R

# Question

I have panel data, i.e. repeated observations per household. A unit (household) is measured over time and exhibits a characteristic (e.g. `variable`). I can calculate a group sum per year with `group_by(id, year)`. How can I get a cumulative sum over time, as in the `goal` column? I need the result to keep all 10 rows in this example, i.e. not collapse the data to one row per year. How can I pick just one group sum per year per unit to add up?

    set.seed(1234)
    data <- data.frame(id = rep(100, 10),
                       year = c(rep(2022, 5), rep(2023, 5)),
                       variable = rbinom(10, 1, 0.5))
    library(tidyverse)
    data <- data %>%
      group_by(id, year) %>%
      mutate(group_sum_per_year = sum(variable))
    data$goal <- c(4,4,4,4,4,7,7,7,7,7)
    data
    # A tibble: 10 × 5
    # Groups:   id, year [2]
          id  year variable group_sum_per_year  goal
       <dbl> <dbl>    <int>              <int> <dbl>
     1   100  2022        0                  4     4
     2   100  2022        1                  4     4
     3   100  2022        1                  4     4
     4   100  2022        1                  4     4
     5   100  2022        1                  4     4
     6   100  2023        1                  3     7
     7   100  2023        0                  3     7
     8   100  2023        0                  3     7
     9   100  2023        1                  3     7
    10   100  2023        1                  3     7

# Answer 1

**Score**: 1

You could first create a temporary column `hlp` that equals `group_sum_per_year` only for the first row of each (`id`, `year`) group and is 0 everywhere else.

Then you could group by `id` and take the cumulative sum of `hlp`:

    data %>%
      group_by(id, year) %>%
      mutate(group_sum_per_year = sum(variable)) %>%
      mutate(hlp = if_else(1:n() == 1, group_sum_per_year, 0)) %>%
      group_by(id) %>%
      mutate(goal = cumsum(hlp))

    # A tibble: 10 × 6
    # Groups:   id [1]
          id  year variable group_sum_per_year   hlp  goal
       <dbl> <dbl>    <int>              <int> <dbl> <dbl>
     1   100  2022        0                  4     4     4
     2   100  2022        1                  4     0     4
     3   100  2022        1                  4     0     4
     4   100  2022        1                  4     0     4
     5   100  2022        1                  4     0     4
     6   100  2023        1                  3     3     7
     7   100  2023        0                  3     0     7
     8   100  2023        0                  3     0     7
     9   100  2023        1                  3     0     7
    10   100  2023        1                  3     0     7


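The helper column can also be avoided. The following is a minimal sketch, not part of the original answer, assuming the rows may be sorted by `id` and `year`: it computes a row-level running total per household and then takes the last value within each year. The names `panel`, `running_total`, and `result` are illustrative, and the `variable` values are copied from the question's printed output rather than regenerated with `set.seed()`.

```r
library(dplyr)

# Example data; `variable` is copied from the question's printed tibble.
panel <- data.frame(
  id = rep(100, 10),
  year = c(rep(2022, 5), rep(2023, 5)),
  variable = c(0L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L)
)

result <- panel %>%
  arrange(id, year) %>%                         # assumption: sorting by year is acceptable
  group_by(id) %>%
  mutate(running_total = cumsum(variable)) %>%  # running total across all rows of a household
  group_by(id, year) %>%
  mutate(goal = last(running_total)) %>%        # running total at the end of each year
  ungroup() %>%
  select(-running_total)

result$goal
```

Because the running total at the end of a year already includes every earlier year, no per-year sum has to be singled out, and all 10 rows are kept.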
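# Answer 2

**Score**: 1

A more intuitive approach is to:

- use `summarise` instead of `mutate`
- merge the result back into the original data frame with `right_join`

```r
data %>%
  group_by(id, year) %>%
  summarise(group_sum_per_year = sum(variable)) %>% # note the use of `summarise` here
  group_by(id) %>%
  mutate(goal = cumsum(group_sum_per_year)) %>%
  right_join(data) # joins on all columns the two tables share
```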

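As a hedged variant of the summarise-and-join idea, the sketch below starts from raw data with only `id`, `year`, and `variable`, names the join keys explicitly, and uses `left_join` from the original rows (rather than `right_join`) so the row order of the input is preserved. The second household (`id = 200`) and the names `panel`, `yearly`, and `result` are made up for illustration; they are not in the original question.

```r
library(dplyr)

# Hypothetical two-household panel, two rows per household and year.
panel <- data.frame(
  id = rep(c(100, 200), each = 4),
  year = rep(c(2022, 2022, 2023, 2023), times = 2),
  variable = c(1L, 0L, 1L, 1L, 0L, 0L, 1L, 0L)
)

# One row per household and year, with the cumulative sum over years.
yearly <- panel %>%
  group_by(id, year) %>%
  summarise(group_sum_per_year = sum(variable), .groups = "drop") %>%
  arrange(id, year) %>%                      # make the year order explicit
  group_by(id) %>%
  mutate(goal = cumsum(group_sum_per_year)) %>%
  ungroup()

# Attach the yearly values to every original row.
result <- panel %>%
  left_join(yearly, by = c("id", "year"))

result$goal
```

Spelling out `by = c("id", "year")` avoids an accidental join on other shared columns, which can happen if the data frame already carries `group_sum_per_year` or `goal` from an earlier step.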
