Find overlapping ranges between two data frames after grouping in R

Question

I have two large data frames that look like this:

    library(tibble)

    df1 <- tibble(chrom = c(1, 1, 1, 2, 2, 2),
                  start = c(100, 200, 300, 100, 200, 300),
                  end   = c(150, 250, 350, 120, 220, 320))
    df2 <- tibble(chrom  = c(1, 1, 1, 2, 2, 2),
                  start2 = c(100, 50, 280, 100, 10, 200),
                  end2   = c(125, 100, 320, 115, 15, 350))
    df1
    #> # A tibble: 6 × 3
    #>   chrom start   end
    #>   <dbl> <dbl> <dbl>
    #> 1     1   100   150
    #> 2     1   200   250
    #> 3     1   300   350
    #> 4     2   100   120
    #> 5     2   200   220
    #> 6     2   300   320
    df2
    #> # A tibble: 6 × 3
    #>   chrom start2  end2
    #>   <dbl>  <dbl> <dbl>
    #> 1     1    100   125
    #> 2     1     50   100
    #> 3     1    280   320
    #> 4     2    100   115
    #> 5     2     10    15
    #> 6     2    200   350

Created on 2023-01-09 with reprex v2.0.2

I want to find which ranges [start2, end2] of df2 overlap with the ranges [start, end] of df1.
An ideal output would look something like this, although the exact layout is not essential. Mostly I want the coordinates of the overlapping ranges.

    #> # A tibble: 6 × 8
    #>   chrom start   end start2  end2 overlap overlap_start overlap_end
    #>   <dbl> <dbl> <dbl>  <dbl> <dbl> <chr>   <chr>         <chr>
    #> 1     1   100   150    100   125 yes     100           125
    #> 2     1   200   250     50   100 no      <NA>          <NA>
    #> 3     1   300   350    280   320 yes     300           320
    #> 4     2   100   120    100   115 yes     100           115
    #> 5     2   200   220     10    15 no      <NA>          <NA>
    #> 6     2   300   320    200   350 yes     200,220       300,320

Note that in the last row, the df2 range 200-350 overlaps two ranges from df1 [200-220, 300-320].
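
For reference, the overlap condition described above can be sketched in base R. This is only a rough illustration, not a solution that scales: it assumes closed (inclusive) intervals, rebuilds the toy data as plain data frames, and pairs every df1 range with every df2 range on the same chrom. The names `pairs`, `is_hit` and `hits` are made up for the example.

```r
# Toy data from the question, as plain data frames (no packages needed).
df1 <- data.frame(chrom = c(1, 1, 1, 2, 2, 2),
                  start = c(100, 200, 300, 100, 200, 300),
                  end   = c(150, 250, 350, 120, 220, 320))
df2 <- data.frame(chrom  = c(1, 1, 1, 2, 2, 2),
                  start2 = c(100, 50, 280, 100, 10, 200),
                  end2   = c(125, 100, 320, 115, 15, 350))

# All df1 x df2 combinations within the same chrom.
pairs <- merge(df1, df2, by = "chrom")

# Assumption: closed intervals, so ranges sharing a single coordinate overlap.
is_hit <- pairs$start <= pairs$end2 & pairs$start2 <= pairs$end
hits <- pairs[is_hit, ]

# An overlap starts at the larger start and ends at the smaller end.
hits$overlap_start <- pmax(hits$start, hits$start2)
hits$overlap_end   <- pmin(hits$end, hits$end2)
hits
```

The intermediate table grows with the product of the group sizes, so for large data frames a join- or interval-tree-based approach is likely to be preferable.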

Answer 1

Score: 2


I believe you are looking for something like this?

I see no need to summarise here, so you'll get two results for the df2-range 200-350.

    library(data.table)
    library(matrixStats)
    # set to data.table format
    setDT(df1); setDT(df2)
    # perform a non-equi join; df2 ranges without a match are kept as NA rows
    ans <- df1[df2, .(chrom,
                      start = x.start, end = x.end,
                      start2 = i.start2, end2 = i.end2),
               on = .(chrom, start < end2, end > start2),
               nomatch = NA]
    # calculate new columns: the overlap runs from the larger start to the smaller end
    ans[, overlap_start := rowMaxs(as.matrix(.SD)), .SDcols = c("start", "start2")]
    ans[, overlap_end := rowMins(as.matrix(.SD)), .SDcols = c("end", "end2")]
    #    chrom start end start2 end2 overlap_start overlap_end
    # 1:     1   100 150    100  125           100         125
    # 2:     1    NA  NA     50  100            NA          NA
    # 3:     1   300 350    280  320           300         320
    # 4:     2   100 120    100  115           100         115
    # 5:     2    NA  NA     10   15            NA          NA
    # 6:     2   200 220    200  350           200         220
    # 7:     2   300 320    200  350           300         320
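
If you would rather not depend on matrixStats, the two new columns can probably be computed with base R's `pmax()` and `pmin()` instead. A minimal sketch on a hand-made data frame shaped like a few rows of the join result (the NA row stands for a df2 range without a match):

```r
# A few rows shaped like the join result; built by hand for illustration.
ans <- data.frame(start  = c(100, NA, 300),
                  end    = c(150, NA, 350),
                  start2 = c(100, 50, 280),
                  end2   = c(125, 100, 320))

# Element-wise maximum of the starts and minimum of the ends.
# With the default na.rm = FALSE, a missing df1 range gives NA.
ans$overlap_start <- pmax(ans$start, ans$start2)
ans$overlap_end   <- pmin(ans$end, ans$end2)
ans
```

Inside a data.table the same idea would read `ans[, overlap_start := pmax(start, start2)]`.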

Answer 2

Score: 2


My advice is to use the Bioconductor package GenomicRanges, which can use optimal data structures for finding interval overlaps.

    library(GenomicRanges)
    library(tibble)

    df1 <- tibble(chrom = c(1, 1, 1, 2, 2, 2),
                  start = c(100, 200, 300, 100, 200, 300),
                  end   = c(150, 250, 350, 120, 220, 320))
    df2 <- tibble(chrom  = c(1, 1, 1, 2, 2, 2),
                  start2 = c(100, 50, 280, 100, 10, 200),
                  end2   = c(125, 100, 320, 115, 15, 350))

    overlaps <- findOverlapPairs(makeGRangesFromDataFrame(df1),
                                 makeGRangesFromDataFrame(df2,
                                                          end.field = "end2",
                                                          start.field = "start2"))
    overlaps
    #> Pairs object with 6 pairs and 0 metadata columns:
    #>           first    second
    #>       <GRanges> <GRanges>
    #>   [1] 1:100-150  1:50-100
    #>   [2] 1:100-150 1:100-125
    #>   [3] 1:300-350 1:280-320
    #>   [4] 2:100-120 2:100-115
    #>   [5] 2:200-220 2:200-350
    #>   [6] 2:300-320 2:200-350

    # convert both sides of each pair to data frames and bind them column-wise
    mapply(as.data.frame,
           list(S4Vectors::first(overlaps),
                S4Vectors::second(overlaps)),
           SIMPLIFY = FALSE) |>
      do.call(what = `cbind`)
    #>   seqnames start end width strand seqnames start end width strand
    #> 1        1   100 150    51      *        1    50 100    51      *
    #> 2        1   100 150    51      *        1   100 125    26      *
    #> 3        1   300 350    51      *        1   280 320    41      *
    #> 4        2   100 120    21      *        2   100 115    16      *
    #> 5        2   200 220    21      *        2   200 350   151      *
    #> 6        2   300 320    21      *        2   200 350   151      *
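
Note that the first pair (1:100-150 with 1:50-100) is reported because GRanges coordinates are closed, i.e. both end points are included, so two ranges that share only position 100 count as overlapping. A join using strict inequalities such as `start < end2, end > start2` would not report that pair. A small base R sketch of the difference (the variable names are only illustrative):

```r
# Two ranges that share only the single coordinate 100.
s1 <- 100; e1 <- 150   # range from df1
s2 <- 50;  e2 <- 100   # range from df2

# Closed-interval test: touching end points count as an overlap.
closed_overlap <- s1 <= e2 & s2 <= e1

# Strict test: the ranges must share more than a boundary coordinate.
strict_overlap <- s1 < e2 & e1 > s2

c(closed = closed_overlap, strict = strict_overlap)
```

Which of the two conventions is appropriate depends on how the coordinates in your data are defined.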

Answer 3

Score: 0


A lengthier "tidy-style" version:

    library(dplyr)

    df1 |>
      left_join(df2, by = 'chrom') |>
      rowwise() |>
      mutate(range1 = list(start:end),
             range2 = list(start2:end2),
             intersect = list(intersect(start:end, start2:end2)),
             overlap = c('no', 'yes')[1 + sign(length(intersect))],
             overlap_start = ifelse(length(intersect), min(intersect), NA),
             overlap_end = ifelse(length(intersect), max(intersect), NA),
             ) |>
      group_by(paste(start2, end2)) |>
      summarise(across(chrom : end2),
                overlap,
                across(starts_with('overlap_'),
                       ~ paste(na.omit(.x), collapse = ','))
                ) |>
      ungroup() |>
      select(chrom:overlap_end)
    #> # A tibble: 18 x 8
    #>    chrom start   end start2  end2 overlap overlap_start overlap_end
    #>    <dbl> <dbl> <dbl>  <dbl> <dbl> <chr>   <chr>         <chr>
    #>  1     2   100   120     10    15 no      ""            ""
    #>  2     2   200   220     10    15 no      ""            ""
    #>  3     2   300   320     10    15 no      ""            ""
    #>  4     2   100   120    100   115 yes     "100"         "115"
    #>  5     2   200   220    100   115 no      "100"         "115"
    #>  6     2   300   320    100   115 no      "100"         "115"
    #>  7     1   100   150    100   125 yes     "100"         "125"
    #>  8     1   200   250    100   125 no      "100"         "125"
    #>  9     1   300   350    100   125 no      "100"         "125"
    #> 10     2   100   120    200   350 no      "200,300"     "220,320"
    #> # ...

To obtain numeric vectors instead of comma-separated strings for multiple overlaps, summarise with the following fragment instead:

    ## ...
    across(starts_with('overlap_'),
           ~ list(c(na.omit(.x)))
    )
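
To see what the two variants produce, here is a small base R sketch on a hypothetical vector `x` of overlap starts for one df2 range, where NA marks a df1 range without an overlap:

```r
# Hypothetical overlap starts for one df2 range; NA means "no overlap".
x <- c(NA, 200, 300)

# Comma-separated string, as produced by the paste() variant.
as_string <- paste(na.omit(x), collapse = ",")

# Plain numeric vector; c() drops the attributes that na.omit() attaches.
as_vector <- c(na.omit(x))

as_string
as_vector
```

Wrapping the vector in `list()`, as in the fragment, is what lets it be stored as a single cell of a list-column.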

huangapple
  • Posted on 2023-01-09 18:46:37
  • Please keep the link to this article when reposting: https://go.coder-hub.com/75056152.html