按照定义的间隔对一列进行分组和汇总。

huangapple go评论100阅读模式
英文:

group by and summarise a column into defined breaks

问题

以下是翻译的代码部分:

  1. 我有以下数据:
  2. df <- data.frame(
  3. random_str = rep("ok", 30),
  4. value = rnorm(30, mean=50, sd=25)
  5. ))
  6. 我想运行一些摘要统计信息,并可以这样做:
  7. df %>%
  8. group_by(random_str) %>%
  9. summarise(max = max(value),
  10. min = min(value)
  11. # 等等......
  12. )
  13. 但我还想将“value”列分成区间(例如0-33, 34-66, 67-100)。
  14. 我可以手动执行此操作:
  15. df %>%
  16. group_by(random_str) %>%
  17. summarise(max = max(value),
  18. min = min(value),
  19. # 等等......
  20. # .. 更多内容
  21. count1_33 = sum(value < 34),
  22. count34_66 = sum(value < 67 & value > 33),
  23. count67_100 = sum(value > 66)
  24. )
  25. **但是否有一种方法可以自动执行这最后三个摘要,实际上我有数百万行数据,希望有数百个等距区间。**
英文:

I have the following data:

  1. (df &lt;- data.frame(
  2. random_str = rep(&quot;ok&quot;, 30),
  3. value = rnorm(30, mean=50, sd=25)
  4. ))

I want to run a few summary statistics, and can do so:

  1. df %&gt;%
  2. group_by(random_str) %&gt;%
  3. summarise(max = max(value),
  4. min = min(value)
  5. # and so on.....
  6. )

But the value column I also want to split into breaks (e.g. 0-33, 34-66, 67-100).

I can do this manually:

  1. df %&gt;%
  2. group_by(random_str) %&gt;%
  3. summarise(max = max(value),
  4. min = min(value),
  5. # and so on.....
  6. # .. more stuff
  7. # ..
  8. count1_33 = sum(value &lt; 34),
  9. count34_66 = sum(value &lt; 67 &amp; value &gt; 33),
  10. count67_100 = sum(value &gt; 66)
  11. )

But is there a way to do these last three summaries automatically, in reality I have millions of rows of data and want 100s of equally space breaks.

答案1

得分: 1

以下是代码的翻译部分:

  1. set.seed(100)
  2. df <- data.frame(
  3. random_str = rep("ok", 30),
  4. value = rnorm(30, mean=50, sd=25))
  5. library(dplyr)
  6. library(tidyr)
  7. df %>%
  8. group_by(random_str) %>%
  9. summarise(max = max(value),
  10. min = min(value)) %>%
  11. bind_cols(.,
  12. df %>%
  13. group_by(random_str,
  14. range = cut(value,
  15. breaks = seq(1, by = 33,
  16. length.out = ceiling(max(value)/33) + 1))) %>%
  17. summarise( n = n(), .groups = "drop") %>%
  18. pivot_wider(-random_str, names_from = "range", values_from = "n"))
  19. #> # A tibble: 1 x 7
  20. #> random_str max min `(1,34]` `(34,67]` `(67,100]` `(100,133]`
  21. #> <chr> <dbl> <dbl> <int> <int> <int> <int>
  22. #> 1 ok 108. 21.1 5 19 5 1

更新部分的翻译:

  1. set.seed(100)
  2. df <- data.frame(
  3. random_str = c(rep("ok", 15), rep("not", 15)),
  4. value = rnorm(30, mean=500, sd=250))
  5. library(dplyr)
  6. library(purrr)
  7. mseq <- seq(0, 900, 100)
  8. split(df, as.factor(df$random_str)) %>%
  9. lapply(., function(mdf) {
  10. map2_dbl(.x = mseq, .y = mseq + 100,
  11. .f = ~nrow(subset(mdf, value > .x & value <= .y))) %>%
  12. set_names(., map2_chr(.x = mseq, .y = mseq + 100,
  13. .f = ~paste0("(",.x, ",", .y, "]")))}) %>%
  14. bind_rows(., .id = "random_str") %>%
  15. right_join({df %>%
  16. group_by(random_str) %>%
  17. summarise(max = max(value),
  18. min = min(value))}, .)
  19. #> Joining, by = "random_str"
  20. #> # A tibble: 2 x 13
  21. #> random_str max min `(0,100]` `(100,200]` `(200,300]` `(300,400]`
  22. #> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
  23. #> 1 not 1078. 211. 0 0 3 3
  24. #> 2 ok 722. 294. 0 0 1 2
  25. #> # ... with 6 more variables: (400,500] <dbl>, (500,600] <dbl>, (600,700] <dbl>,
  26. #> # (700,800] <dbl>, (800,900] <dbl>, (900,1000] <dbl>

希望这些翻译对你有帮助。

英文:
  1. set.seed(100)
  2. df &lt;- data.frame(
  3. random_str = rep(&quot;ok&quot;, 30),
  4. value = rnorm(30, mean=50, sd=25))
  5. library(dplyr)
  6. library(tidyr)
  7. df %&gt;%
  8. group_by(random_str) %&gt;%
  9. summarise(max = max(value),
  10. min = min(value)) %&gt;%
  11. bind_cols(.,
  12. df %&gt;%
  13. group_by(random_str,
  14. range = cut(value,
  15. breaks = seq(1, by = 33,
  16. length.out = ceiling(max(value)/33) + 1))) %&gt;%
  17. summarise( n = n(), .groups = &quot;drop&quot;) %&gt;%
  18. pivot_wider(-random_str, names_from = &quot;range&quot;, values_from = &quot;n&quot;))
  19. #&gt; # A tibble: 1 x 7
  20. #&gt; random_str max min `(1,34]` `(34,67]` `(67,100]` `(100,133]`
  21. #&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
  22. #&gt; 1 ok 108. 21.1 5 19 5 1

Update:

I am not sure if we can get cut to include empty ranges. For this, I used purrr::map_2 to provide the ranges "manually" and set their names (looped over each random_str). Then, I bind_rows the resulted dataframes (for each group), and finally will join it with summary stats.

  1. set.seed(100)
  2. df &lt;- data.frame(
  3. random_str = c(rep(&quot;ok&quot;, 15), rep(&quot;not&quot;, 15)),
  4. value = rnorm(30, mean=500, sd=250))
  5. library(dplyr)
  6. library(purrr)
  7. mseq &lt;- seq(0, 900, 100)
  8. split(df, as.factor(df$random_str)) %&gt;%
  9. lapply(., function(mdf) {
  10. map2_dbl(.x = mseq, .y = mseq + 100,
  11. .f = ~nrow(subset(mdf, value &gt; .x &amp; value &lt;= .y))) %&gt;%
  12. set_names(., map2_chr(.x = mseq, .y = mseq + 100,
  13. .f = ~paste0(&quot;(&quot;,.x, &quot;,&quot;, .y, &quot;]&quot;)))}) %&gt;%
  14. bind_rows(., .id = &quot;random_str&quot;) %&gt;%
  15. right_join({df %&gt;%
  16. group_by(random_str) %&gt;%
  17. summarise(max = max(value),
  18. min = min(value))}, .)
  19. #&gt; Joining, by = &quot;random_str&quot;
  20. #&gt; # A tibble: 2 x 13
  21. #&gt; random_str max min `(0,100]` `(100,200]` `(200,300]` `(300,400]`
  22. #&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
  23. #&gt; 1 not 1078. 211. 0 0 3 3
  24. #&gt; 2 ok 722. 294. 0 0 1 2
  25. #&gt; # ... with 6 more variables: (400,500] &lt;dbl&gt;, (500,600] &lt;dbl&gt;, (600,700] &lt;dbl&gt;,
  26. #&gt; # (700,800] &lt;dbl&gt;, (800,900] &lt;dbl&gt;, (900,1000] &lt;dbl&gt;

<sup>Created on 2023-04-07 by the reprex package (v2.0.1)</sup>

huangapple
  • 本文由 发表于 2023年4月7日 00:49:52
  • 转载请务必保留本文链接:https://go.coder-hub.com/75951930.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定