R tidyverse create duplicate rows based on a fixed variable

Question

I have an R dataframe that has start and end measurements for each id. The least possible value for end - start is min_sz = 2 (known difference, but may not actually occur in the data). I wish to create "chunks" based on a fixed value and create duplicate rows for each id, based on the number of chunks that start and end overlap.

If I were to use min_sz = 2 as my chunk size, my calculation and result would look something like this:

    library(tidyverse)

    min_sz <- 2

    df <- tibble(
      id = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j"),
      start = c(0, 0, 0, 4, 4, 8, 10, 10, 16, 32),
      end = c(4, 6, 16, 10, 22, 18, 36, 56, 42, 84)
    )

    # Repeat each row once per min_sz-wide chunk, then number the copies
    # within each id to derive the chunk boundaries.
    df %>%
      mutate(dur = end - start) %>%
      mutate(n_min_chunks = dur / min_sz) %>%
      uncount(n_min_chunks) %>%
      group_by(id) %>%
      mutate(row = row_number()) %>%
      mutate(
        chunk_start = start + (min_sz * (row - 1)),
        chunk_end = start + (min_sz * row)
      )

    #   id start end dur row chunk_start chunk_end
    1   a      0   4   4   1           0         2
    2   a      0   4   4   2           2         4
    3   b      0   6   6   1           0         2
    4   b      0   6   6   2           2         4
    5   b      0   6   6   3           4         6
    6   c      0  16  16   1           0         2
    7   c      0  16  16   2           2         4
    8   c      0  16  16   3           4         6
    ..  ..   ... ... ... ...         ...       ...

I wish to apply a similar operation for any req_sz that is a multiple of min_sz, and to include partial overlaps. With req_sz = 20, for example, my output should look something like this (note rows 5 and 6, which overlap their chunks only partially: "4-20" and "20-22"):

    #   id start end dur chunk_start chunk_end
    1   a      0   4   4           0        20
    2   b      0   6   6           0        20
    3   c      0  16  16           0        20
    4   d      4  10   6           0        20
    5   e      4  22  18           0        20
    6   e      4  22  18          20        40
    ..  ..   ... ... ...         ...       ...

but I've been unable to come up with a mathematical solution that allows me to scale this "chunking" operation.

Any help would be greatly appreciated!
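
The arithmetic being asked for can be sketched with integer division: a row touches every chunk from index `start %/% req_sz` through `ceiling(end / req_sz) - 1`. The helper name `n_chunks` below is hypothetical, and the sketch assumes chunks are aligned to multiples of `req_sz` counted from 0 and that `end > start`.

```r
# Hypothetical helper (not part of the original post): number of
# req_sz-wide, zero-aligned chunks that the interval [start, end) touches.
# Assumes end > start and req_sz > 0.
n_chunks <- function(start, end, req_sz) {
  first <- start %/% req_sz           # index of the chunk containing `start`
  last  <- ceiling(end / req_sz) - 1  # index of the last chunk reached before `end`
  last - first + 1
}

# Rows a (0-4), e (4-22) and j (32-84) from the example data, req_sz = 20
n_chunks(start = c(0, 4, 32), end = c(4, 22, 84), req_sz = 20)
# → 1 2 4
```

This count is the value that `uncount()` would need in place of `dur / min_sz`.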

Answer 1

Score: 1

In base R, you could write a function to accomplish the same:

    explode <- function(data, by) {
      # Build the chunk rows for one input row. `by` is found in the enclosing
      # function, so it is not passed through Map() (which would add a `by` column).
      f <- function(...) {
        s <- list(...)[c('start', 'end')]
        # Grid-aligned breakpoints covering the row; embed() pairs each
        # breakpoint (column 1) with its predecessor (column 2).
        mat <- embed(seq((s$start %/% by) * by, (s$end %/% by + 1) * by, by), 2)
        data.frame(..., dur = s$end - s$start,
                   chunk_start = mat[, 2], chunk_end = mat[, 1])
      }
      do.call(rbind, do.call(Map, c(f, data)))
    }

    explode(df, by = 20)

        id start end dur chunk_start chunk_end
    a    a     0   4   4           0        20
    b    b     0   6   6           0        20
    c    c     0  16  16           0        20
    d    d     4  10   6           0        20
    e.1  e     4  22  18           0        20
    e.2  e     4  22  18          20        40
    f    f     8  18  10           0        20
    g.1  g    10  36  26           0        20
    g.2  g    10  36  26          20        40
    h.1  h    10  56  46           0        20
    h.2  h    10  56  46          20        40
    h.3  h    10  56  46          40        60
    i.1  i    16  42  26           0        20
    i.2  i    16  42  26          20        40
    i.3  i    16  42  26          40        60
    j.1  j    32  84  52          20        40
    j.2  j    32  84  52          40        60
    j.3  j    32  84  52          60        80
    j.4  j    32  84  52          80       100
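
Since the question asks for a tidyverse approach, the same idea can be sketched with `uncount()`. This is a minimal sketch and not part of the original answer. It assumes each `id` appears on exactly one row of `df` and that chunks are aligned to multiples of `req_sz` counted from 0. The columns `first`, `n_chunks`, `overlap_start` and `overlap_end` are names introduced here for illustration.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  id    = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j"),
  start = c(0, 0, 0, 4, 4, 8, 10, 10, 16, 32),
  end   = c(4, 6, 16, 10, 22, 18, 36, 56, 42, 84)
)

req_sz <- 20

chunked <- df %>%
  mutate(
    dur      = end - start,
    first    = start %/% req_sz,               # index of the first chunk touched
    n_chunks = ceiling(end / req_sz) - first   # how many chunks the row spans
  ) %>%
  uncount(n_chunks) %>%                        # one copy of the row per chunk
  group_by(id) %>%                             # assumes one input row per id
  mutate(
    chunk_start   = (first + row_number() - 1) * req_sz,
    chunk_end     = chunk_start + req_sz,
    overlap_start = pmax(start, chunk_start),  # part of the chunk the row covers
    overlap_end   = pmin(end, chunk_end)
  ) %>%
  ungroup() %>%
  select(-first)

chunked
```

For `id == "e"` this gives two rows whose overlaps are 4-20 and 20-22, the partial overlaps described in the question. Using `ceiling(end / req_sz)` means a row whose `end` falls exactly on a chunk boundary does not get an extra chunk that starts at `end`. The base R version above would add one in that case, although no row in the example data triggers it.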

huangapple
  • Published on July 11, 2023 at 07:34:29
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76657902.html