The usage of the keywords 'within' and 'overlaps' in join_by


Question


The R/dplyr help file for join_by contains the code below, which uses within and overlaps. How should these two keywords be understood? Thanks!

    library(dplyr)

    segments <- tibble(
      segment_id = 1:4,
      chromosome = c("chr1", "chr2", "chr2", "chr1"),
      start = c(140, 210, 380, 230),
      end = c(150, 240, 415, 280)
    )

    reference <- tibble(
      reference_id = 1:4,
      chromosome = c("chr1", "chr1", "chr2", "chr2"),
      start = c(100, 200, 300, 415),
      end = c(150, 250, 399, 450)
    )

Sample 1: within

    by <- join_by(chromosome, within(x$start, x$end, y$start, y$end))
    inner_join(segments, reference, by)

Sample 2: overlaps

    by <- join_by(chromosome, overlaps(x$start, x$end, y$start, y$end))
    full_join(segments, reference, by)

Answer 1

Score: 3


within only captures rows where the range in x lies entirely inside the range in y.

overlaps captures rows where the range in x overlaps the range in y in any way. That includes ranges that are entirely within: any pair that satisfies within (x_lower >= y_lower & x_upper <= y_upper) also satisfies overlaps.

It might be easier to understand like this (note this uses the default bounds of overlaps, "[]"):

Example:

    x_lower <- c(1, 10, 5, 10)
    x_upper <- c(4, 25, 6, 15)
    y_lower <- c(0, 15, 10, 3)
    y_upper <- c(10, 16, 20, 30)
    df <- data.frame(x_lower, x_upper, y_lower, y_upper)

    # Same inequalities as in the join_by() documentation
    transform(df,
              is_within  = x_lower >= y_lower & x_upper <= y_upper,
              is_overlap = x_lower <= y_upper & x_upper >= y_lower)
    #   x_lower x_upper y_lower y_upper is_within is_overlap
    # 1       1       4       0      10      TRUE       TRUE
    # 2      10      25      15      16     FALSE       TRUE
    # 3       5       6      10      20     FALSE      FALSE
    # 4      10      15       3      30      TRUE       TRUE

From the documentation:

> within(x_lower, x_upper, y_lower, y_upper)
>
> For each range in [x_lower, x_upper], this finds everywhere that
> range falls completely within [y_lower, y_upper]. Equivalent to
> x_lower >= y_lower, x_upper <= y_upper.

And

> overlaps(x_lower, x_upper, y_lower, y_upper, ..., bounds = "[]")
>
> For each range in [x_lower, x_upper], this finds everywhere that
> range overlaps [y_lower, y_upper] in any capacity. Equivalent to
> x_lower <= y_upper, x_upper >= y_lower by default.
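To see what the bounds argument changes, the overlap condition can be written out by hand in base R. This is only a sketch: overlaps_manual is a hypothetical helper made up for illustration (it is not part of dplyr), and it assumes, as the join_by documentation states, that "[]" uses <= and >= while the other three options use < and >.

```r
# Hypothetical helper, not part of dplyr: evaluates the overlap condition
# for ranges [x_lower, x_upper] and [y_lower, y_upper].
overlaps_manual <- function(x_lower, x_upper, y_lower, y_upper, bounds = "[]") {
  bounds <- match.arg(bounds, c("[]", "[)", "(]", "()"))
  if (bounds == "[]") {
    # Closed ranges: touching endpoints count as an overlap
    x_lower <= y_upper & x_upper >= y_lower
  } else {
    # Any open side: strict inequalities, touching endpoints do not count
    x_lower < y_upper & x_upper > y_lower
  }
}

# Segment 3 (380-415) and reference 4 (415-450) share only the endpoint 415.
touching_closed <- overlaps_manual(380, 415, 415, 450, bounds = "[]")
touching_open   <- overlaps_manual(380, 415, 415, 450, bounds = "()")

# Segment 3 and reference 3 (300-399) share a whole stretch,
# so the choice of bounds makes no difference.
shared_closed <- overlaps_manual(380, 415, 300, 399, bounds = "[]")
shared_open   <- overlaps_manual(380, 415, 300, 399, bounds = "()")

print(c(touching_closed, touching_open, shared_closed, shared_open))
# [1]  TRUE FALSE  TRUE  TRUE
```

So with the default bounds two ranges that merely touch are still joined, and switching to any of the open variants drops exactly those pairs.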

Answer 2

Score: 1


The documentation of join_by actually covers these two helper functions.

For within:

> For each range in [x_lower, x_upper], this finds everywhere that range falls completely within [y_lower, y_upper]. Equivalent to x_lower >= y_lower, x_upper <= y_upper.
>
> The inequalities used to build within() are the same regardless of the inclusiveness of the supplied ranges.

    library(dplyr)
    full_join(segments, reference, by = "chromosome")
    # A tibble: 8 × 7
      segment_id chromosome start.x end.x reference_id start.y end.y
           <int> <chr>        <dbl> <dbl>        <int>   <dbl> <dbl>
    1          1 chr1           140   150            1     100   150 # yes
    2          1 chr1           140   150            2     200   250 # both x smaller than y
    3          2 chr2           210   240            3     300   399 # both x smaller than y
    4          2 chr2           210   240            4     415   450 # both x smaller than y
    5          3 chr2           380   415            3     300   399 # x$end (415) outside range
    6          3 chr2           380   415            4     415   450 # x$start (380) outside range
    7          4 chr1           230   280            1     100   150 # both x greater than y
    8          4 chr1           230   280            2     200   250 # x$end (280) outside range

Therefore, join_by(within()) gives:

    by <- join_by(chromosome, within(x$start, x$end, y$start, y$end))
    inner_join(segments, reference, by)
    # A tibble: 1 × 7
      segment_id chromosome start.x end.x reference_id start.y end.y
           <int> <chr>        <dbl> <dbl>        <int>   <dbl> <dbl>
    1          1 chr1           140   150            1     100   150
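As a rough check of the equivalence quoted from the documentation, within() can be compared with the same two inequalities written out by hand inside join_by(). This is a sketch under a few assumptions: dplyr 1.1.0 or later is installed (the version that introduced join_by()), the question's tibbles are rebuilt so the snippet runs on its own, and the names with_helper and with_inequalities are made up for illustration.

```r
library(dplyr)

segments <- tibble(
  segment_id = 1:4,
  chromosome = c("chr1", "chr2", "chr2", "chr1"),
  start = c(140, 210, 380, 230),
  end = c(150, 240, 415, 280)
)

reference <- tibble(
  reference_id = 1:4,
  chromosome = c("chr1", "chr1", "chr2", "chr2"),
  start = c(100, 200, 300, 415),
  end = c(150, 250, 399, 450)
)

# Using the within() helper
with_helper <- inner_join(
  segments, reference,
  by = join_by(chromosome, within(x$start, x$end, y$start, y$end))
)

# Spelling out the inequalities that within() stands for
with_inequalities <- inner_join(
  segments, reference,
  by = join_by(chromosome, x$start >= y$start, x$end <= y$end)
)

print(with_helper)
print(with_inequalities)
```

Both joins should return the single row pairing segment 1 with reference 1, since that is the only segment whose start and end both lie inside a reference range on the same chromosome.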

<hr>

For overlaps:

> For each range in [x_lower, x_upper], this finds everywhere that range overlaps [y_lower, y_upper] in any capacity. Equivalent to x_lower <= y_upper, x_upper >= y_lower by default.
>
> bounds can be one of "[]", "[)", "(]", or "()" to alter the inclusiveness of the lower and upper bounds. "[]" uses <= and >=, but the 3 other options use < and > and generate the exact same inequalities.

    # A tibble: 8 × 7
      segment_id chromosome start.x end.x reference_id start.y end.y
           <int> <chr>        <dbl> <dbl>        <int>   <dbl> <dbl>
    1          1 chr1           140   150            1     100   150 # yes
    2          1 chr1           140   150            2     200   250 # x$end (150) smaller than y$start (200)
    3          2 chr2           210   240            3     300   399 # x$end (240) smaller than y$start (300)
    4          2 chr2           210   240            4     415   450 # x$end (240) smaller than y$start (415)
    5          3 chr2           380   415            3     300   399 # yes
    6          3 chr2           380   415            4     415   450 # yes
    7          4 chr1           230   280            1     100   150 # x$start (230) greater than y$end (150)
    8          4 chr1           230   280            2     200   250 # yes

Therefore, join_by(overlaps()) gives the following. Because this is a full_join, segment 2, which overlaps no reference range, is kept with NA in the reference columns:

    by <- join_by(chromosome, overlaps(x$start, x$end, y$start, y$end))
    full_join(segments, reference, by)
    # A tibble: 5 × 7
      segment_id chromosome start.x end.x reference_id start.y end.y
           <int> <chr>        <dbl> <dbl>        <int>   <dbl> <dbl>
    1          1 chr1           140   150            1     100   150
    2          2 chr2           210   240           NA      NA    NA
    3          3 chr2           380   415            3     300   399
    4          3 chr2           380   415            4     415   450
    5          4 chr1           230   280            2     200   250
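The same kind of check works for overlaps(): with the default bounds it should behave like the two inequalities from the documentation written directly in join_by(). Again this is only a sketch, assuming dplyr 1.1.0 or later; the tibbles are rebuilt from the question so the snippet is self-contained, and with_helper and with_inequalities are illustrative names.

```r
library(dplyr)

segments <- tibble(
  segment_id = 1:4,
  chromosome = c("chr1", "chr2", "chr2", "chr1"),
  start = c(140, 210, 380, 230),
  end = c(150, 240, 415, 280)
)

reference <- tibble(
  reference_id = 1:4,
  chromosome = c("chr1", "chr1", "chr2", "chr2"),
  start = c(100, 200, 300, 415),
  end = c(150, 250, 399, 450)
)

# Using the overlaps() helper with its default bounds = "[]"
with_helper <- full_join(
  segments, reference,
  by = join_by(chromosome, overlaps(x$start, x$end, y$start, y$end))
)

# Spelling out the inequalities that overlaps() stands for by default
with_inequalities <- full_join(
  segments, reference,
  by = join_by(chromosome, x$start <= y$end, x$end >= y$start)
)

print(with_helper)
print(with_inequalities)
```

Each result should have five rows: four overlapping segment/reference pairs plus segment 2, which the full join keeps even though it matches nothing.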

huangapple
  • Published on June 5, 2023, 16:36:57
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76404727.html