Rolling mean per group in tidyverse

Question

I aggregate data per group and calculate group means to ease visualization. Unfortunately, some of my groups are very large and some are nearly empty. I would like a rolling mean calculation to smooth the picture further. Here is similar data:

    # load packages
    library(haven)
    library(dplyr)    # provides %>%, group_by(), summarise()
    library(ggplot2)  # provides ggplot(), geom_point(), labs()

    # read dta file from GitHub
    soep <- read_dta("https://github.com/MarcoKuehne/marcokuehne.github.io/blob/main/data/SOEP/soep_lebensz_en/soep_lebensz_en.dta?raw=true")

    # mean satisfaction and group size per education level and sex, then plot
    soep %>%
      group_by(education, sex) %>%
      summarise(across(satisf_org, mean, na.rm = TRUE),
                n = n()) %>%
      ggplot(aes(x = education, y = satisf_org, col = as.factor(sex))) +
      geom_point() +
      labs(title = "Mean Satisfaction per Education Level by Gender",
           x = "Education", y = "Mean Satisfaction", color = "Gender")

[Figure: Mean Satisfaction per Education Level by Gender (scatter plot of the group means, colored by gender)]

The mean satisfaction at education 8.5 for females looks like an outlier. I assume that people at neighboring education levels are similar enough to be pooled: for example, calculate the mean satisfaction of all people at education 7, 8.5 and 9 (grouped by sex) and store it as the rolling mean at 8.5 (grouped by sex).

Starting from standard grouped means:

    soep %>%
      group_by(education, sex) %>%
      summarise(across(satisf_org, mean, na.rm = TRUE),
                n = n())

    # A tibble: 28 × 4
    # Groups:   education [14]
       education sex        satisf_org     n
           <dbl> <dbl+lbl>       <dbl> <int>
     1       7   0 [male]         6.16    73
     2       7   1 [female]       6.59   113
     3       8.5 0 [male]         7.16    37
     4       8.5 1 [female]       8.56    18
     5       9   0 [male]         6.88   430
     6       9   1 [female]       7.00   633
     7      10   0 [male]         7.19   144
     8      10   1 [female]       7.36   221
     9      10.5 0 [male]         6.96  1538
    10      10.5 1 [female]       7.02  1493
    # … with 18 more rows
    # ℹ Use `print(n = ...)` to see more rows

Here are the numbers that I expect:

    soep %>%
      filter(sex == 1) %>%                        # only look at females
      filter(education %in% c(7, 8.5, 9)) %>%     # take education level before and after
      summarise(mean(satisf_org))                 # calculate the "rolling mean" per group

    # A tibble: 1 × 1
      `mean(satisf_org)`
                   <dbl>
    1               6.97

This is the kind of rolling mean per group that I expect per value, i.e. 6.97 instead of 8.56.

PS: In my real data, I investigate age in years and I usually have at least some people at all ages. So the rolling window can be -1 to +1 (numeric) instead of lead / lag neighbors.
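
One way the numeric window from the PS might be sketched is with `slider::slide_index_dbl()`, which builds each window from the values of an index column rather than from row positions. The data frame `people` and its columns `age` and `satisf` are made up for illustration and are not part of the SOEP data. The sketch works on individual-level rows, so each window mean pools all people of the same sex whose age lies within one year of the current age.

```r
library(dplyr)
library(slider)

# Hypothetical individual-level data (illustration only)
people <- data.frame(
  sex    = c(0, 0, 0, 0, 1, 1, 1, 1, 1),
  age    = c(20, 20, 21, 23, 20, 21, 21, 22, 24),
  satisf = c(6, 8, 7, 5, 7, 9, 8, 6, 4)
)

rolling_by_age <- people %>%
  arrange(sex, age) %>%            # the index must be ascending within each group
  group_by(sex) %>%
  mutate(
    # window = all rows with age in [age - 1, age + 1]; duplicated ages are allowed
    rolling_mean = slide_index_dbl(
      satisf, age,
      function(x) mean(x, na.rm = TRUE),
      .before = 1, .after = 1
    )
  ) %>%
  distinct(sex, age, rolling_mean) %>%   # one row per sex and age
  ungroup()

print(rolling_by_age)
```

Because the window is defined on values, a gap in the index simply shrinks the window: age 24 for `sex == 1` has no neighbors within one year, so its rolling mean is just its own value. Applied to the education variable above, a window of ±1 around 8.5 would cover 9 but not 7, so it is not the same as taking the previous and next observed level.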


Answer 1

Score: 2

You can group_by sex and do a rolling average there:

    library(dplyr)
    library(slider)

    soep %>%
      group_by(education, sex) %>%
      summarise(across(satisf_org, mean, na.rm = TRUE),
                n = n()) %>%
      group_by(sex) %>%
      mutate(rolling_mean = slide_dbl(satisf_org, mean, .before = 1, .after = 1))

Output:

    # A tibble: 28 × 5
    # Groups:   sex [2]
       education sex        satisf_org     n rolling_mean
           <dbl> <dbl+lbl>       <dbl> <int>        <dbl>
     1       7   0 [male]         6.16    73         6.66
     2       7   1 [female]       6.59   113         7.57
     3       8.5 0 [male]         7.16    37         6.73
     4       8.5 1 [female]       8.56    18         7.38
     5       9   0 [male]         6.88   430         7.08
     6       9   1 [female]       7.00   633         7.64
     7      10   0 [male]         7.19   144         7.01
     8      10   1 [female]       7.36   221         7.13
     9      10.5 0 [male]         6.96  1538         7.14
    10      10.5 1 [female]       7.02  1493         7.20
    # … with 18 more rows
    # ℹ Use `print(n = ...)` to see more rows
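
Note that this rolling_mean averages the neighboring group means with equal weight, which is why females at education 8.5 get 7.38 rather than the 6.97 expected in the question. A possible adjustment, sketched below, weights each group mean by its count `n`, so that the result matches the pooled mean over individuals. The data frame `group_means` is typed in from the ten rounded rows printed in the question, so the results are approximate and the window at the last education level shown is truncated. The sketch also assumes that `n` counts only non-missing `satisf_org` values; if there are missing values, counting with `sum(!is.na(satisf_org))` instead of `n()` would be safer.

```r
library(dplyr)
library(slider)

# Group means and counts copied from the question's printed table (rounded)
group_means <- data.frame(
  education  = c(7, 7, 8.5, 8.5, 9, 9, 10, 10, 10.5, 10.5),
  sex        = c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1),
  satisf_org = c(6.16, 6.59, 7.16, 8.56, 6.88, 7.00, 7.19, 7.36, 6.96, 7.02),
  n          = c(73, 113, 37, 18, 430, 633, 144, 221, 1538, 1493)
)

weighted <- group_means %>%
  arrange(sex, education) %>%      # neighbors are defined by row order within sex
  group_by(sex) %>%
  mutate(
    # rolling sum of (mean * n) divided by rolling sum of n = count-weighted mean
    rolling_mean = slide_dbl(satisf_org * n, sum, .before = 1, .after = 1) /
      slide_dbl(n, sum, .before = 1, .after = 1)
  ) %>%
  ungroup()

# Females at education 8.5: about 6.976, in line with the 6.97 from the question
weighted %>% filter(sex == 1, education == 8.5)
```

The small group at 8.5 (n = 18) now contributes little to its own window, which is the smoothing effect the question asks for.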


huangapple
  • Posted on March 10, 2023, 00:35:15
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75687514.html