英文:
Cumulative sum per group in tidyverse R
问题
我有面板数据,即每个家庭的重复观察。一个单位(家庭)会随着时间变化并展示一个特征(例如 variable
)。我可以使用 group_by(id, year)
计算每年的组总和。如何得到类似 goal
列中的累积总和?在这个示例中,我需要结果保留10行,即不将数据合并到年份。如何选择每个单位每年只计算一个组总和来相加?
英文:
I have paneldata, i.e. repeated observations per household. A unit (household) is measured through time and exhibits a characteric (e.g. variable
). I can calculate a group sum per year with group_by(id, year)
. How can I have a cummulative sum over time as in the goal
column? I need the result to preserve 10 rows in this example, i.e. not collapse data to the years. How can I pick just one group-sum per year per unit to sum up?
set.seed(1234)
data <- data.frame(id = rep(100, 10),
year = c(rep(2022, 5), rep(2023, 5)),
variable = rbinom(10, 1, 0.5))
library(tidyverse)
data <- data %>%
group_by(id, year) %>%
mutate(group_sum_per_year = sum(variable))
data$goal <- c(4,4,4,4,4,7,7,7,7,7)
data
# A tibble: 10 × 5
# Groups: id, year [2]
id year variable group_sum_per_year goal
<dbl> <dbl> <int> <int> <dbl>
1 100 2022 0 4 4
2 100 2022 1 4 4
3 100 2022 1 4 4
4 100 2022 1 4 4
5 100 2022 1 4 4
6 100 2023 1 3 7
7 100 2023 0 3 7
8 100 2023 0 3 7
9 100 2023 1 3 7
10 100 2023 1 3 7
答案1
得分: 1
你可以首先创建一个临时列 hlp
,该列仅对每个组的第一个条目等于 group_sum_per_year
。
然后,你可以按 id
分组并对 hlp
使用累积和:
data %>%
group_by(id, year) %>%
mutate(group_sum_per_year = sum(variable)) %>%
mutate(hlp = if_else(1:n() == 1, group_sum_per_year, 0)) %>%
group_by(id) %>%
mutate(goal = cumsum(hlp))
一个 tibble: 10 × 6
组别: id [1]
id year variable group_sum_per_year hlp goal
1 100 2022 0 4 4 4
2 100 2022 1 4 0 4
3 100 2022 1 4 0 4
4 100 2022 1 4 0 4
5 100 2022 1 4 0 4
6 100 2023 1 3 3 7
7 100 2023 0 3 0 7
8 100 2023 0 3 0 7
9 100 2023 1 3 0 7
10 100 2023 1 3 0 7
<details>
<summary>英文:</summary>
You could first create a tempory column `hlp` which is equal to `group_sum_per_year` for only the first entry per group.
Then you could group by `id` and use cumsum on `hlp`:
data %>%
group_by(id, year) %>%
mutate(group_sum_per_year = sum(variable)) %>%
mutate(hlp = if_else(1:n() == 1, group_sum_per_year, 0)) %>%
group_by(id) %>%
mutate(goal = cumsum(hlp))
# A tibble: 10 × 6
# Groups: id [1]
id year variable group_sum_per_year hlp goal
<dbl> <dbl> <int> <int> <dbl> <dbl>
1 100 2022 0 4 4 4
2 100 2022 1 4 0 4
3 100 2022 1 4 0 4
4 100 2022 1 4 0 4
5 100 2022 1 4 0 4
6 100 2023 1 3 3 7
7 100 2023 0 3 0 7
8 100 2023 0 3 0 7
9 100 2023 1 3 0 7
10 100 2023 1 3 0 7
</details>
# 答案2
**得分**: 1
更直观的方法是:
- 使用 `summarise` 而不是 `mutate`
- 使用 `right_join` 与原始数据框合并
```r
data %>%
group_by(id, year) %>%
summarise(group_sum_per_year = sum(variable)) %>% # 注意这里使用了 `summarise`
group_by(id) %>%
mutate(goal = cumsum(group_sum_per_year)) %>%
right_join(data)
英文:
A more intuitive approach is to:
- use
summarise
instead ofmutate
- merge with the original dataframe using
right_join
data %>%
group_by(id, year) %>%
summarise(group_sum_per_year = sum(variable)) %>% # Note the use of `summarise` here
group_by(id) %>%
mutate(goal = cumsum(group_sum_per_year)) %>%
right_join(data)
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论