Cumulative sum per group in tidyverse R

Question

I have panel data, i.e. repeated observations per household. A unit (household) is observed over time and exhibits a characteristic (e.g. `variable`). I can calculate a group sum per year with `group_by(id, year)`. How can I get a cumulative sum over time, as in the `goal` column? I need the result to keep all 10 rows in this example, i.e. not collapse the data to one row per year. How can I pick just one group sum per year per unit to add up?

    library(tidyverse)

    set.seed(1234)
    data <- data.frame(id = rep(100, 10),
                       year = c(rep(2022, 5), rep(2023, 5)),
                       variable = rbinom(10, 1, 0.5))

    data <- data %>%
      group_by(id, year) %>%
      mutate(group_sum_per_year = sum(variable))

    data$goal <- c(4, 4, 4, 4, 4, 7, 7, 7, 7, 7)
    data
    # A tibble: 10 × 5
    # Groups:   id, year [2]
          id  year variable group_sum_per_year  goal
       <dbl> <dbl>    <int>              <int> <dbl>
     1   100  2022        0                  4     4
     2   100  2022        1                  4     4
     3   100  2022        1                  4     4
     4   100  2022        1                  4     4
     5   100  2022        1                  4     4
     6   100  2023        1                  3     7
     7   100  2023        0                  3     7
     8   100  2023        0                  3     7
     9   100  2023        1                  3     7
    10   100  2023        1                  3     7

Answer 1

Score: 1

You could first create a temporary column `hlp` that equals `group_sum_per_year` on the first row of each `id`/`year` group and 0 on every other row.

Then you could group by `id` and use `cumsum` on `hlp`. This relies on the rows being ordered by year within each `id`, as they are in the example:

    data %>%
      group_by(id, year) %>%
      mutate(group_sum_per_year = sum(variable)) %>%
      # keep the yearly sum on the first row of each id-year group only
      mutate(hlp = if_else(row_number() == 1, group_sum_per_year, 0)) %>%
      group_by(id) %>%
      # running total per household
      mutate(goal = cumsum(hlp))

    # A tibble: 10 × 6
    # Groups:   id [1]
          id  year variable group_sum_per_year   hlp  goal
       <dbl> <dbl>    <int>              <int> <dbl> <dbl>
     1   100  2022        0                  4     4     4
     2   100  2022        1                  4     0     4
     3   100  2022        1                  4     0     4
     4   100  2022        1                  4     0     4
     5   100  2022        1                  4     0     4
     6   100  2023        1                  3     3     7
     7   100  2023        0                  3     0     7
     8   100  2023        0                  3     0     7
     9   100  2023        1                  3     0     7
    10   100  2023        1                  3     0     7
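To see how the helper-column idea behaves with more than one household, here is a minimal sketch. The toy data frame, including the second household (id 200) and all of its values, is invented for illustration and is not part of the question's data.

```r
library(dplyr)

# Hypothetical toy panel: two households, two years each (values invented)
toy <- data.frame(
  id       = c(rep(100, 4), rep(200, 4)),
  year     = rep(c(2022, 2022, 2023, 2023), 2),
  variable = c(1, 0, 1, 1, 0, 0, 1, 0)
)

result <- toy %>%
  group_by(id, year) %>%
  mutate(group_sum_per_year = sum(variable),
         # yearly sum on the first row of each id-year group, 0 elsewhere
         hlp = if_else(row_number() == 1, group_sum_per_year, 0)) %>%
  group_by(id) %>%
  # running total per household; assumes rows are ordered by year within id
  mutate(goal = cumsum(hlp)) %>%
  ungroup()

result$goal
# 1 1 3 3 0 0 1 1
```

The running total restarts for id 200 because `cumsum` is evaluated within each `id` group. If the rows were not sorted by year, an `arrange(id, year)` before the pipeline would probably be needed.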
Answer 2

Score: 1

A more intuitive approach is to:

  • use `summarise` instead of `mutate`
  • merge with the original data frame using `right_join`

```r
data %>%
  group_by(id, year) %>%
  summarise(group_sum_per_year = sum(variable)) %>%  # note the use of `summarise` here
  group_by(id) %>%
  mutate(goal = cumsum(group_sum_per_year)) %>%
  right_join(data)  # with no `by`, joins on every column the two tables share
```
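If `data` already carries extra columns such as `group_sum_per_year` or `goal` from earlier steps, a join without `by` will match on those as well. The following is a minimal sketch that starts from the three raw columns and names the join keys explicitly. The names `raw` and `yearly` are only illustrative, and `left_join` from the original rows is swapped in for `right_join` so that the original row order is kept.

```r
library(dplyr)

set.seed(1234)
raw <- data.frame(id = rep(100, 10),
                  year = c(rep(2022, 5), rep(2023, 5)),
                  variable = rbinom(10, 1, 0.5))

# One row per id-year, then a running total within each household
yearly <- raw %>%
  group_by(id, year) %>%
  summarise(group_sum_per_year = sum(variable), .groups = "drop") %>%
  arrange(id, year) %>%
  group_by(id) %>%
  mutate(goal = cumsum(group_sum_per_year)) %>%
  ungroup()

# Attach the yearly totals back to every original row (10 rows in, 10 rows out)
result <- left_join(raw, yearly, by = c("id", "year"))
result
```

With the question's data this should give 4 for the 2022 rows and 7 for the 2023 rows, matching the `goal` column.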

huangapple
  • Posted on 13 July 2023, 19:35:37
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76678932.html