Padding or filling a dataframe in R if I know the range

Question

I'm looking for something similar to [bedtools subtract](https://bedtools.readthedocs.io/en/latest/content/tools/subtract.html), but for data frames.

For example, say I have this range as a data frame:

```
Start End Value
0     100 P
```

And I have another data frame, which is sorted:

```
Start End Value
10    25  A
50    63  B
```

Is there a way to fill it in like so?

```
Start End Value
0     9   P1
10    25  A
26    49  P2
50    63  B
64    100 P3
```

P1, P2 and P3 are labels filled in to pad the second data frame so that the entire range is covered.

I tried using dplyr's `lag()` function and adding the padding values manually, but because the range can change with the length of the genomic feature (including the start and end coordinates), I want this range filling to be automatic.

Thank you!

For example, this is a small subset of the data:

```r
data_range <- data.frame(start = 0, end = 100, value = "P")
tofill_range <- data.frame(start = c(15, 51, 70), end = c(39, 62, 79), value = c("A", "B", "C"))
```

Answer 1

Score: 2

Here is one way to calculate the ranges of a data frame using only dplyr. For your second example I renamed the columns. With some more work it could be made to handle any column names.

```r
library(dplyr)

calc_range <- function(df1, df2) {
  # Boundaries adjacent to each filled interval: End + 1 and Start - 1
  df3 <- df2 %>%
    transmute(start = End + 1,
              End = Start - 1) %>%
    rename(Start = start)

  start_df <- bind_rows(df1, df2, df3)

  start_df %>%
    select(!Value) %>%
    unlist %>%
    sort %>%
    # Consecutive sorted boundaries form the (Start, End) pairs
    matrix(ncol = 2, byrow = TRUE) %>%
    data.frame() %>%
    rename(Start = X1, End = X2) %>%
    left_join(start_df, by = c("Start", "End")) %>%
    # Rows without a label are gaps: number them P1, P2, ...
    mutate(Value = ifelse(is.na(Value) | Value == "P",
                          paste0("P", cumsum(is.na(Value) | Value == "P")),
                          Value)) %>%
    arrange(Start)
}

# Test 1
dfa <- tribble(
  ~Start, ~End, ~Value,
  0, 100, "P"
)

dfb <- tribble(~Start, ~End, ~Value,
               10, 25, "A",
               50, 63, "B")

calc_range(dfa, dfb)
#>   Start End Value
#> 1     0   9    P1
#> 2    10  25     A
#> 3    26  49    P2
#> 4    50  63     B
#> 5    64 100    P3

# Test 2
data_range <- data.frame(Start = 0, End = 100, Value = "P")
tofill_range <- data.frame(Start = c(15, 51, 70),
                           End = c(39, 62, 79),
                           Value = c("A", "B", "C"))

calc_range(data_range, tofill_range)
#>   Start End Value
#> 1     0  14    P1
#> 2    15  39     A
#> 3    40  50    P2
#> 4    51  62     B
#> 5    63  69    P3
#> 6    70  79     C
#> 7    80 100    P4
```

<sup>Created on 2023-02-23 with reprex v2.0.2</sup>
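Answer 1 notes that more work would be needed to support arbitrary column names. The following is a rough base R sketch of one way to do that, not the answerer's code. It assumes the columns are identified by position (1 = start, 2 = end, 3 = label), that the range data frame has a single row, and that the intervals to keep do not overlap. The name `fill_gaps` is made up for illustration.

```r
# Hypothetical helper: columns are used by position, so any names work.
# Assumes range_df has one row and filled_df has non-overlapping intervals.
fill_gaps <- function(range_df, filled_df) {
  filled_df <- filled_df[order(filled_df[[1]]), ]

  # A gap may start at the range start or right after each interval,
  # and may end right before each interval or at the range end
  gap_start <- c(range_df[[1]], filled_df[[2]] + 1)
  gap_end   <- c(filled_df[[1]] - 1, range_df[[2]])

  # Drop empty gaps (e.g. when an interval touches the range boundary)
  keep   <- gap_start <= gap_end
  n_gaps <- sum(keep)

  gaps <- data.frame(
    gap_start[keep],
    gap_end[keep],
    paste0(rep(range_df[[3]], n_gaps), seq_len(n_gaps))
  )
  names(gaps) <- names(filled_df)

  out <- rbind(filled_df, gaps)
  out <- out[order(out[[1]]), ]
  rownames(out) <- NULL
  out
}

data_range   <- data.frame(start = 0, end = 100, value = "P")
tofill_range <- data.frame(start = c(15, 51, 70),
                           end   = c(39, 62, 79),
                           value = c("A", "B", "C"))

result <- fill_gaps(data_range, tofill_range)
result
```

With the question's sample data, this should return the same seven rows as Test 2, labelled P1 to P4 for the gaps. Overlapping or nested intervals are not handled by this sketch.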

Answer 2

Score: 2

Using dplyr (>= v1.1.0, for `consecutive_id()`).

Get the missing ranges with `between()`:

```r
library(dplyr)

# For each position in data_range, count how many tofill_range intervals cover it
ranges <- rowSums(apply(tofill_range[, 1:2], 1, function(x)
  between(seq(data_range$start, data_range$end), x[1], x[2])))

as_tibble(cbind(ranges, grp = consecutive_id(ranges),
                val = seq(data_range[, 1], data_range[, 2]))) %>%
  group_by(grp) %>%
  filter(ranges == 0) %>%  # keep only the uncovered positions
  summarize(start = first(val),
            end = last(val),
            value = paste0(data_range$value, cur_group_id())) %>%
  select(-grp) %>%
  bind_rows(., tofill_range) %>%
  arrange(start)
```

```
# A tibble: 7 × 3
  start   end value
  <dbl> <dbl> <chr>
1     0    14 P1
2    15    39 A
3    40    50 P2
4    51    62 B
5    63    69 P3
6    70    79 C
7    80   100 P4
```
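`consecutive_id()` is only available from dplyr 1.1.0 onward. As a small sketch, assuming an older dplyr version, the same kind of run numbering can be reproduced in base R by counting where the value changes. The `covered` vector below is made-up illustration data, not taken from the answer.

```r
# Hypothetical coverage indicator: 0 = position not covered, 1 = covered
covered <- c(0, 0, 1, 1, 0, 1)

# A new run starts at the first element and wherever the value changes,
# so a cumulative count of the changes gives one id per run
run_id <- cumsum(c(TRUE, diff(covered) != 0))
run_id
#> [1] 1 1 2 2 3 4
```

Each run of equal values gets its own id, which is what the `grp` column in the answer relies on to separate one gap from the next.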

Answer 3

Score: 2

In base R:

```r
all_ranges <- function(df1, df2){
  # All boundaries: the outer range, the filled intervals,
  # and their neighbours (start - 1, end + 1)
  a <- sort(c(t(df1[-3]), t(df2[-3]), t(df2[-3]) + c(-1,1)))
  # Consecutive sorted boundaries form the (start, end) pairs
  b <- data.frame(t(matrix(a,2)))
  d <- merge(df2, setNames(b, names(df1)[-3]), all = TRUE)
  # Rows without a label are gaps: label them P1, P2, ...
  replace(d, is.na(d), paste0(df1[,3], seq(sum(is.na(d)))))
}

data_range <- data.frame(start = 0, end = 100, value = "P")
tofill_range <- data.frame(start = c(15, 51, 70), end = c(39, 62, 79), value = c("A", "B", "C"))
all_ranges(data_range, tofill_range)
#>   start end value
#> 1     0  14    P1
#> 2    15  39     A
#> 3    40  50    P2
#> 4    51  62     B
#> 5    63  69    P3
#> 6    70  79     C
#> 7    80 100    P4
```

<sup>Created on 2023-02-23 with reprex v2.0.2</sup>
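The first two lines of `all_ranges()` are dense, so here is a step-by-step sketch of what they appear to do, using the question's sample data. The intermediate names `m`, `shifted`, `bounds` and `pairs` are introduced here for illustration and do not appear in the answer.

```r
tofill_range <- data.frame(start = c(15, 51, 70),
                           end   = c(39, 62, 79),
                           value = c("A", "B", "C"))

# t() gives a 2 x n matrix: row "start" holds the starts, row "end" the ends
m <- t(tofill_range[-3])

# c(-1, 1) is recycled down each column, giving start - 1 and end + 1
shifted <- m + c(-1, 1)

# Sort every boundary (outer range 0..100 included), then read the sorted
# values off two at a time as (start, end) pairs
bounds <- sort(c(0, 100, m, shifted))
pairs  <- t(matrix(bounds, nrow = 2))
pairs
```

The resulting `pairs` matrix has one row per output interval. The later `merge()` step in the answer then attaches the known labels, leaving `NA` for the gaps.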

Answer 4

Score: 0

A package well suited to this task is IRanges:

```r
library(IRanges)

r1 = IRanges(start = 0, end = 100, names = "P")
r2 = IRanges(start = c(10, 50), end = c(25, 63), names = c("A", "B"))

# find gaps
dif = setdiff(r1, r2)
names(dif) = sprintf("%s%d", names(r1), seq_len(length(dif)))

# merge and sort
ans = sort(c(r2, dif))

as.data.frame(ans)
#  start end width names
#1     0   9    10    P1
#2    10  25    16     A
#3    26  49    24    P2
#4    50  63    14     B
#5    64 100    37    P3
```

huangapple
  • Published on 2023-02-24 05:29:08
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75550494.html