2023-07-11 07:34:29

R tidyverse create duplicate rows based on a fixed variable

Question

I have an R dataframe that has start and end measurements for each id. The least possible value for end - start is min_sz = 2 (known difference, but may not actually occur in the data). I wish to create "chunks" based on a fixed value and create duplicate rows for each id, based on the number of chunks that start and end overlap.

If I were to use min_sz = 2 as my chunk size, my calculation and result would look something like this:

library(tidyverse)

min_sz = 2
df = tibble(
  id = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j"),
  start = c(0, 0, 0, 4, 4, 8, 10, 10, 16, 32),
  end = c(4, 6, 16, 10, 22, 18, 36, 56, 42, 84),
)
df %>% 
  mutate(dur = end - start) %>% 
  mutate(n_min_chunks = dur / min_sz) %>%   # number of min_sz-wide chunks per id
  uncount(n_min_chunks) %>%                 # one row per chunk
  group_by(id) %>% 
  mutate(row = row_number()) %>%            # chunk index within each id
  mutate(
    chunk_start = start + (min_sz * (row - 1)),
    chunk_end = start + (min_sz * row),
  )

#	id	start	end	dur	row	chunk_start	chunk_end
1	a	0	4	4	1	0	2
2	a	0	4	4	2	2	4
3	b	0	6	6	1	0	2
4	b	0	6	6	2	2	4
5	b	0	6	6	3	4	6
6	c	0	16	16	1	0	2
7	c	0	16	16	2	2	4
8	c	0	16	16	3	4	6
..	..	...	...	...	...	...	...

I wish to apply a similar operation, but for potentially any req_sz that's a multiple of min_sz and includes partial overlaps. If I were to use req_sz = 20 for example, my output should look something like this (note #5 & #6 with partial overlaps "4-20" & "20-22"):

#	id	start	end	dur	chunk_start	chunk_end
1	a	0	4	4	0	20
2	b	0	6	6	0	20
3	c	0	16	16	0	20
4	d	4	10	6	0	20
5	e	4	22	18	0	20
6	e	4	22	18	20	40
..	..	...	...	...	...	...

but I've been unable to come up with a mathematical solution that allows me to scale this "chunking" operation.

Any help would be greatly appreciated!

Answer 1

Score: 1

In base R, you could write a function to accomplish the same:

explode <- function(data, by) {
  # Build the chunk rows for one input row; `by` comes from the enclosing call.
  f <- function(...) {
    s <- list(...)[c('start', 'end')]
    # Chunk boundaries covering start..end. embed() pairs consecutive boundaries:
    # column 1 holds the later boundary, column 2 the earlier one.
    # ceiling() avoids an extra empty chunk when `end` is an exact multiple of `by`.
    mat <- embed(seq((s$start %/% by) * by, ceiling(s$end / by) * by, by), 2)
    data.frame(..., dur = s$end - s$start, chunk_start = mat[, 2], chunk_end = mat[, 1])
  }
  # Call f once per row (columns are passed as named arguments), then stack the results.
  do.call(rbind, do.call(Map, c(f, data)))
}
explode(df, by = 20)
    id start end dur chunk_start chunk_end
a    a     0   4   4           0        20
b    b     0   6   6           0        20
c    c     0  16  16           0        20
d    d     4  10   6           0        20
e.1  e     4  22  18           0        20
e.2  e     4  22  18          20        40
f    f     8  18  10           0        20
g.1  g    10  36  26           0        20
g.2  g    10  36  26          20        40
h.1  h    10  56  46           0        20
h.2  h    10  56  46          20        40
h.3  h    10  56  46          40        60
i.1  i    16  42  26           0        20
i.2  i    16  42  26          20        40
i.3  i    16  42  26          40        60
j.1  j    32  84  52          20        40
j.2  j    32  84  52          40        60
j.3  j    32  84  52          60        80
j.4  j    32  84  52          80       100
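
For readers who would rather stay in the tidyverse, the same chunk arithmetic can be sketched with uncount(), as in the question. This is only a sketch and not part of the answer above: req_sz, first_chunk, last_chunk, n_chunks and chunked are names assumed here for illustration. It treats each interval as half-open, [start, end), so an end that falls exactly on a chunk boundary does not open a new chunk.

```r
library(dplyr)
library(tidyr)
library(tibble)

req_sz <- 20  # assumed chunk size; any positive multiple of min_sz

df <- tibble(
  id = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j"),
  start = c(0, 0, 0, 4, 4, 8, 10, 10, 16, 32),
  end = c(4, 6, 16, 10, 22, 18, 36, 56, 42, 84)
)

chunked <- df %>%
  mutate(
    dur = end - start,
    # zero-based index of the first and last req_sz-wide chunk that [start, end) touches
    first_chunk = start %/% req_sz,
    last_chunk = ceiling(end / req_sz) - 1,
    n_chunks = last_chunk - first_chunk + 1
  ) %>%
  # repeat each row once per overlapped chunk; `row` counts 1..n_chunks within each id
  uncount(n_chunks, .id = "row") %>%
  mutate(
    chunk_start = (first_chunk + row - 1) * req_sz,
    chunk_end = chunk_start + req_sz
  ) %>%
  select(id, start, end, dur, chunk_start, chunk_end)

print(chunked, n = Inf)
```

On the sample data this should give the same 19 rows as explode(df, by = 20). Because the chunk grid is anchored at 0 rather than at each row's start, partial overlaps such as id e (4 to 22) fall into both 0-20 and 20-40.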
