R tidyverse create duplicate rows based on a fixed variable
Question
I have an R data frame that has start and end measurements for each id. The smallest possible value of end - start is min_sz = 2 (a known minimum difference, though it may not actually occur in the data). I want to create "chunks" of a fixed size and duplicate the row for each id once per chunk that its start-end interval overlaps.
If I use min_sz = 2 as my chunk size, my calculation and result look something like this:
library(tidyverse)

min_sz <- 2
df <- tibble(
  id = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j"),
  start = c(0, 0, 0, 4, 4, 8, 10, 10, 16, 32),
  end = c(4, 6, 16, 10, 22, 18, 36, 56, 42, 84)
)
df %>%
  mutate(dur = end - start) %>%
  mutate(n_min_chunks = dur / min_sz) %>%  # number of min_sz-wide chunks per id
  uncount(n_min_chunks) %>%                # one row per chunk
  group_by(id) %>%
  mutate(row = row_number()) %>%           # chunk counter within each id
  mutate(
    chunk_start = start + (min_sz * (row - 1)),
    chunk_end = start + (min_sz * row)
  )
# | id | start | end | dur | row | chunk_start | chunk_end |
---|---|---|---|---|---|---|---|
1 | a | 0 | 4 | 4 | 1 | 0 | 2 |
2 | a | 0 | 4 | 4 | 2 | 2 | 4 |
3 | b | 0 | 6 | 6 | 1 | 0 | 2 |
4 | b | 0 | 6 | 6 | 2 | 2 | 4 |
5 | b | 0 | 6 | 6 | 3 | 4 | 6 |
6 | c | 0 | 16 | 16 | 1 | 0 | 2 |
7 | c | 0 | 16 | 16 | 2 | 2 | 4 |
8 | c | 0 | 16 | 16 | 3 | 4 | 6 |
.. | .. | ... | ... | ... | ... | ... | ... |
I want to apply a similar operation for any req_sz that is a multiple of min_sz, including partial overlaps. With req_sz = 20, for example, my output should look something like this (note rows 5 and 6, which only partially overlap their chunks: "4-20" and "20-22"):
# | id | start | end | dur | chunk_start | chunk_end |
---|---|---|---|---|---|---|
1 | a | 0 | 4 | 4 | 0 | 20 |
2 | b | 0 | 6 | 6 | 0 | 20 |
3 | c | 0 | 16 | 16 | 0 | 20 |
4 | d | 4 | 10 | 6 | 0 | 20 |
5 | e | 4 | 22 | 18 | 0 | 20 |
6 | e | 4 | 22 | 18 | 20 | 40 |
.. | .. | ... | ... | ... | ... | ... |
However, I have been unable to come up with a formula that lets me generalize this "chunking" operation to an arbitrary chunk size.
Any help would be greatly appreciated!
Answer 1
Score: 1
In base R, you could write a function to accomplish the same:
explode <- function(data, by) {
  f <- function(...) {
    s <- list(...)[c("start", "end")]
    # Chunk boundaries on a grid of multiples of `by` covering [start, end].
    # ceiling() avoids an extra zero-overlap chunk when end is a multiple of `by`.
    brk <- seq((s$start %/% by) * by, ceiling(s$end / by) * by, by)
    # Each row of embed(brk, 2) is (upper, lower) for one chunk.
    mat <- embed(brk, 2)
    data.frame(..., dur = s$end - s$start, chunk_start = mat[, 2], chunk_end = mat[, 1])
  }
  # Apply f row-wise over the columns of data, then stack the pieces.
  do.call(rbind, do.call(Map, c(f, data)))
}
explode(df, by = 20)
    id start end dur chunk_start chunk_end
a    a     0   4   4           0        20
b    b     0   6   6           0        20
c    c     0  16  16           0        20
d    d     4  10   6           0        20
e.1  e     4  22  18           0        20
e.2  e     4  22  18          20        40
f    f     8  18  10           0        20
g.1  g    10  36  26           0        20
g.2  g    10  36  26          20        40
h.1  h    10  56  46           0        20
h.2  h    10  56  46          20        40
h.3  h    10  56  46          40        60
i.1  i    16  42  26           0        20
i.2  i    16  42  26          20        40
i.3  i    16  42  26          40        60
j.1  j    32  84  52          20        40
j.2  j    32  84  52          40        60
j.3  j    32  84  52          60        80
j.4  j    32  84  52          80       100
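Since the question asks for a tidyverse approach, here is a minimal sketch of the same idea with dplyr and tidyr. The helper name chunk_rows and the extra overlap column are my own additions for illustration and are not part of the question or the function above. It assumes chunks lie on a grid of multiples of req_sz starting at 0, as in the expected output, and that end > start for every row.

```r
library(tibble)
library(dplyr)
library(tidyr)

df <- tibble(
  id    = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j"),
  start = c(0, 0, 0, 4, 4, 8, 10, 10, 16, 32),
  end   = c(4, 6, 16, 10, 22, 18, 36, 56, 42, 84)
)

# Hypothetical helper: one output row per (id, chunk) pair.
chunk_rows <- function(data, req_sz) {
  data %>%
    mutate(
      dur = end - start,
      first_chunk = start %/% req_sz,                 # index of the chunk containing start
      n_chunks = ceiling(end / req_sz) - first_chunk  # chunks with non-zero overlap
    ) %>%
    uncount(n_chunks, .id = "row") %>%                # row = 1, 2, ... within each id
    mutate(
      chunk_start = (first_chunk + row - 1) * req_sz,
      chunk_end = chunk_start + req_sz,
      # length of the part of [start, end] that falls inside this chunk
      overlap = pmin(end, chunk_end) - pmax(start, chunk_start)
    ) %>%
    select(-first_chunk)
}

res <- chunk_rows(df, req_sz = 20)
```

For id e (start 4, end 22) this should give two rows, for chunks 0-20 and 20-40, with overlaps of 16 and 2. The overlap column makes the partial overlaps explicit, and its values sum to dur within each id.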