R code to find a cumulative sum of a vector with a criteria set on another vector of a dataframe

Question

    p <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
    q <- c("a", "a", "a", "b", "b", "c", "c", "c", "c", "c")
    r <- c(0, 1, 3, 0, 4, 0, 6, 13, 21, 30)
    t <- data.frame(p, q, r)

In the above data I am trying to compute vector r. Vector r is the cumulative sum of vector p, lagged by one position, calculated within each group defined by vector q. For example, in group "c" the p-values are 6, 7, 8, 9, 10; lagging gives 0, 6, 7, 8, 9, whose cumulative sum is 0, 6, 13, 21, 30, matching r. Please assume the data frame is already sorted alphabetically on vector q.

I have tried creating multiple lagged vectors and then summing them, but that approach isn't ideal. A minimal sketch of the target logic follows.
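
A minimal base R sketch of the target computation (my illustration, not part of the original question), assuming t is already sorted by q:

    # Within each group of q, lag p by one position (filling with 0),
    # then take the cumulative sum of the lagged values.
    lagged_cumsum <- function(x) cumsum(c(0, head(x, -1)))
    unlist(lapply(split(p, q), lagged_cumsum), use.names = FALSE)
    ## [1]  0  1  3  0  4  0  6 13 21 30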

Answer 1

Score: 4

The most efficient method is to use the collapse package.

First create a GRP (grouping) object:

    library(collapse)
    library(data.table)
    setDT(t)
    g <- GRP(t, by = "q")
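
For context, a GRP object precomputes the grouping once (group ids, sizes, ordering) so subsequent collapse calls such as flag() and fcumsum() can reuse it without re-hashing the keys. A quick look (my addition; element names as documented for collapse's GRP class):

    g$N.groups     # number of groups, here 3
    g$group.sizes  # observations per group: 3 2 5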

Then do a grouped lag and cumulative sum using that GRP object:

    t[, result := fcumsum(flag(p, g = g, fill = 0), g = g)][]
         p q  r result
     1:  1 a  0      0
     2:  2 a  1      1
     3:  3 a  3      3
     4:  4 b  0      0
     5:  5 b  4      4
     6:  6 c  0      0
     7:  7 c  6      6
     8:  8 c 13     13
     9:  9 c 21     21
    10: 10 c 30     30

collapse vs data.table benchmark

3 groups:

    library(bench)  # provides mark()
    mark(
      e1 = {
        g <- GRP(t, by = "q")
        t[, result := fcumsum(flag(p, g = g, fill = 0), g = g)][]
      },
      e2 = {
        t[, result := cumsum(shift(p, type = "lag", fill = 0)), by = q][]
      }
    )
    # A tibble: 2 x 13
      expression    min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc
      <bch:expr> <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>
    1 e1          434us  484us     1905.    36.7KB     0      953     0      500ms <dt>   <Rprofmem> <bench_tm> <tibble>
    2 e2          511us  548us     1646.    32.5KB     2.27   725     1      440ms <dt>   <Rprofmem> <bench_tm> <tibble>

10^7 rows, ~10^6 groups:

    t <- t[sample.int(.N, 10^7, TRUE)]
    t[, q := sample.int(10^6, 10^7, TRUE)]
    mark(
      e1 = {
        g <- GRP(t, by = "q")
        t[, result := fcumsum(flag(p, g = g, fill = 0), g = g)][]
      },
      e2 = {
        t[, result := cumsum(shift(p, type = "lag", fill = 0)), by = q][]
      }
    )
    # A tibble: 2 x 13
      expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time
      <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>
    1 e1          694.9ms  694.9ms    1.44       244MB     0        1     0    694.9ms <dt>   <Rprofmem> <bench_tm [1]>
    2 e2            21.3s    21.3s    0.0469      71MB     1.50     1    32      21.3s <dt>   <Rprofmem> <bench_tm [1]>

Answer 2

Score: 3

1) This is a base R solution using ave and cumsum:

    ave(p, q, FUN = cumsum) - p
    ## [1]  0  1  3  0  4  0  6 13 21 30

2) This variation also works:

    ave(p, q, FUN = function(x) c(0, head(cumsum(x), -1)))

3) The same idea can be used with the collapse package:

    library(collapse)
    fcumsum(p, q) - p
    ## [1]  0  1  3  0  4  0  6 13 21 30
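
Why subtracting p works: within a group, cumsum(x)[i] includes x[i] itself, so cumsum(x) - x leaves the sum of all earlier elements, which is exactly the one-step-lagged cumulative sum. A quick check (my addition, not part of the original answer):

    # Variant 1 reproduces r exactly on the example data.
    identical(ave(p, q, FUN = cumsum) - p, r)
    ## [1] TRUE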

Answer 3

Score: 2

This will be quite fast:

    library(dplyr)
    t |>
      mutate(result = cumsum(lag(p, default = 0)), .by = q)
    #     p q  r result
    # 1   1 a  0      0
    # 2   2 a  1      1
    # 3   3 a  3      3
    # 4   4 b  0      0
    # 5   5 b  4      4
    # 6   6 c  0      0
    # 7   7 c  6      6
    # 8   8 c 13     13
    # 9   9 c 21     21
    # 10 10 c 30     30

This will be even faster:

    library(data.table)
    setDT(t)[, result := cumsum(shift(p, type = "lag", fill = 0)), by = q]

Answer 4

Score: 1

If we need good performance while staying within dplyr, we can use dtplyr, a data.table frontend for dplyr. It translates most dplyr code into data.table operations. Not all dplyr verbs and operations can be translated, but for simple use cases like this one it works just fine.

    library(dtplyr)
    t <- lazy_dt(t)
    t |>
      group_by(q) |>
      mutate(result = cumsum(lag(p, default = 0))) |>
      as_tibble()
    # A tibble: 10 × 4
           p q         r result
       <dbl> <chr> <dbl>  <dbl>
     1     1 a         0      0
     2     2 a         1      1
     3     3 a         3      3
     4     4 b         0      0
     5     5 b         4      4
     6     6 c         0      0
     7     7 c         6      6
     8     8 c        13     13
     9     9 c        21     21
    10    10 c        30     30
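
To inspect the data.table code dtplyr generates for this pipeline, you can call dtplyr's show_query() in place of as_tibble() (a usage sketch, my addition; the printed translation should closely match the data.table one-liner from Answer 3):

    t |>
      group_by(q) |>
      mutate(result = cumsum(lag(p, default = 0))) |>
      show_query()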
Answer 5

Score: 0

With data.table we can try:

    setDT(t)[, rr := shift(cumsum(p), fill = 0), q][]
         p q  r rr
     1:  1 a  0  0
     2:  2 a  1  1
     3:  3 a  3  3
     4:  4 b  0  0
     5:  5 b  4  4
     6:  6 c  0  0
     7:  7 c  6  6
     8:  8 c 13 13
     9:  9 c 21 21
    10: 10 c 30 30
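
Note that this lags after summing (shift(cumsum(p))), whereas Answer 3 sums after lagging (cumsum(shift(p))); within a group the two orderings give the same result. A small check of that identity (my addition, using data.table's shift()):

    x <- c(6, 7, 8, 9, 10)                  # the p-values of group "c"
    identical(shift(cumsum(x), fill = 0),   # lag the running total
              cumsum(shift(x, fill = 0)))   # running total of the lagged values
    ## [1] TRUE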

Answer 6

Score: 0

Since the question states that the vectors can be assumed to be already sorted, you can take the indices where q changes and subtract the cumulative sum at those positions.

    n <- length(q)
    i <- 1 + which(q[-1] != q[-n])
    r <- cumsum(p) - p
    r <- r - rep(c(0, r[i]), diff(c(1, i, n + 1)))
    r
    # [1]  0  1  3  0  4  0  6 13 21 30

One caveat: because the cumulative sum is built over the whole vector, floating-point rounding errors can accumulate across groups, so the result may be slightly less accurate than when the vector is split into groups first. Hence another variant that subsets the vector per group:

    n <- length(q)
    i <- which(q[-1] != q[-n])
    unlist(Map(\(i, j) {. <- p[i:j]; cumsum(.) - .}, c(1, i + 1), c(i, n)), FALSE, FALSE)
    # [1]  0  1  3  0  4  0  6 13 21 30

Benchmark:

    set.seed(42)
    q <- rep(letters, sample(1e5:1e6, length(letters), TRUE))
    length(q)
    # [1] 13535111
    p <- sample(as.numeric(1:10), length(q), TRUE)
    library(collapse)
    library(data.table)
    t <- data.frame(p, q)
    bench::mark(min_iterations = 7L,
      collapse = {g <- GRP(t, by = "q")
        as.data.table(t)[, result := fcumsum(flag(p, g = g, fill = 0), g = g)][]$result},
      Map = {n <- length(q)
        i <- which(q[-1] != q[-n])
        unlist(Map(\(i, j) {. <- p[i:j]; cumsum(.) - .}, c(1, i + 1), c(i, n)), FALSE, FALSE)},
      rep = {n <- length(q)
        i <- 1 + which(q[-1] != q[-n])
        r <- cumsum(p) - p
        r - rep(c(0, r[i]), diff(c(1, i, n + 1)))},
      ave = {ave(p, q, FUN = cumsum) - p})  # @G. Grothendieck

Result

  1. expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
  2. &lt;bch:expr&gt; &lt;bch:tm&gt; &lt;bch:tm&gt; &lt;dbl&gt; &lt;bch:byt&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt;
  3. 1 collapse 196.59ms 248.53ms 4.17 518MB 5.36 7 9
  4. 2 Map 539.4ms 568ms 1.78 981MB 3.80 7 15
  5. 3 rep 442.83ms 481.89ms 2.03 826MB 3.77 7 13
  6. 4 ave 1.18s 1.23s 0.801 1006MB 1.37 7 12

In this case, using collapse with data.table is about twice as fast as the base Map and rep approaches, and about 5 times faster than ave.
