Usage of the keywords 'within' and 'overlaps' in join_by

Question

The dplyr help file for join_by contains the code below, which uses within and overlaps. How should these two keywords be understood? Thanks!

library(dplyr)

segments <- tibble(
  segment_id = 1:4,
  chromosome = c("chr1", "chr2", "chr2", "chr1"),
  start = c(140, 210, 380, 230),
  end = c(150, 240, 415, 280)
)

reference <- tibble(
  reference_id = 1:4,
  chromosome = c("chr1", "chr1", "chr2", "chr2"),
  start = c(100, 200, 300, 415),
  end = c(150, 250, 399, 450)
)

Sample 1: within

by <- join_by(chromosome, within(x$start, x$end, y$start, y$end))
inner_join(segments, reference, by)

Sample 2: overlaps

by <- join_by(chromosome, overlaps(x$start, x$end, y$start, y$end))
full_join(segments, reference, by)

Answer 1

Score: 3

within only captures rows where the range in x lies entirely inside the range in y.

overlaps captures rows where there is any kind of overlap between the range in x and the range in y. That includes ranges that are entirely within, because a contained range also satisfies x_lower <= y_upper & x_upper >= y_lower.

It might be easier to understand like this (note this uses the default bounds of overlaps: "[]"):

Example:

x_lower = c(1, 10, 5, 10)
x_upper = c(4, 25, 6, 15)

y_lower = c(0, 15, 10, 3)
y_upper = c(10, 16, 20, 30)

df <- data.frame(x_lower, x_upper, y_lower, y_upper)
transform(df, 
          is_within = x_lower >= y_lower & x_upper <= y_upper,
          is_overlap = x_lower <= y_upper & x_upper >= y_lower)

#   x_lower x_upper y_lower y_upper is_within is_overlap
# 1       1       4       0      10      TRUE       TRUE
# 2      10      25      15      16     FALSE       TRUE
# 3       5       6      10      20     FALSE      FALSE
# 4      10      15       3      30      TRUE       TRUE

From the documentation:

> within(x_lower, x_upper, y_lower, y_upper)
>
> For each range in [x_lower, x_upper], this finds everywhere that
> range falls completely within [y_lower, y_upper]. Equivalent to
> x_lower >= y_lower, x_upper <= y_upper.

And

> overlaps(x_lower, x_upper, y_lower, y_upper, ..., bounds = "[]")
>
> For each range in [x_lower, x_upper], this finds everywhere that
> range overlaps [y_lower, y_upper] in any capacity. Equivalent to
> x_lower <= y_upper, x_upper >= y_lower by default.
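
The two documented conditions can be tried directly on single ranges. The following is a minimal base R sketch: `range_within` and `range_overlaps` are hypothetical helper names (they are not dplyr functions), and they assume the default "[]" bounds. The sample values are segment/reference pairs from the question.

```r
# Hypothetical helpers (not part of dplyr): the documented inequalities
# written as vectorised base R predicates, assuming the default "[]" bounds.
range_within <- function(x_lower, x_upper, y_lower, y_upper) {
  x_lower >= y_lower & x_upper <= y_upper
}
range_overlaps <- function(x_lower, x_upper, y_lower, y_upper) {
  x_lower <= y_upper & x_upper >= y_lower
}

# Segment 1 (140-150) against reference 1 (100-150): contained.
range_within(140, 150, 100, 150)
# A contained range also satisfies the overlap condition.
range_overlaps(140, 150, 100, 150)

# Segment 3 (380-415) against reference 4 (415-450): not contained ...
range_within(380, 415, 415, 450)
# ... but the closed ranges touch at 415, so they overlap.
range_overlaps(380, 415, 415, 450)

# Segment 2 (210-240) against reference 3 (300-399): disjoint ranges.
range_overlaps(210, 240, 300, 399)
```

For any valid range (lower <= upper), passing the within condition implies passing the overlaps condition, so within matches are always a subset of overlaps matches.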

Answer 2

Score: 1

The documentation of join_by actually covers these two helper functions.

For within (my bold):

> For each range in [x_lower, x_upper], this finds everywhere that range falls completely within [y_lower, y_upper]. Equivalent to x_lower >= y_lower, x_upper <= y_upper.
>
> The inequalities used to build within() are the same regardless of the inclusiveness of the supplied ranges.

Joining on chromosome alone lists every candidate pair; the comments mark whether each pair satisfies the within condition:

library(dplyr)

full_join(segments, reference, by = "chromosome")
# A tibble: 8 × 7
  segment_id chromosome start.x end.x reference_id start.y end.y
       <int> <chr>        <dbl> <dbl>        <int>   <dbl> <dbl>
1          1 chr1           140   150            1     100   150 # yes
2          1 chr1           140   150            2     200   250 # both x smaller than y
3          2 chr2           210   240            3     300   399 # both x smaller than y
4          2 chr2           210   240            4     415   450 # both x smaller than y
5          3 chr2           380   415            3     300   399 # x$end (415) outside range
6          3 chr2           380   415            4     415   450 # x$start (380) outside range
7          4 chr1           230   280            1     100   150 # both x greater than y
8          4 chr1           230   280            2     200   250 # x$end (280) outside range

Therefore, join_by(within()) gives:

by <- join_by(chromosome, within(x$start, x$end, y$start, y$end))
inner_join(segments, reference, by)

# A tibble: 1 × 7
  segment_id chromosome start.x end.x reference_id start.y end.y
       <int> <chr>        <dbl> <dbl>        <int>   <dbl> <dbl>
1          1 chr1           140   150            1     100   150
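
The same result can be reproduced by hand, which may make the helper less mysterious. This is a sketch, assuming dplyr 1.1.1 or later (for the `relationship` argument); `manual` and `helper` are illustrative variable names. It pairs rows on chromosome, applies the documented within inequalities as a filter, and compares against the join_by(within()) join.

```r
library(dplyr)

segments <- tibble(
  segment_id = 1:4,
  chromosome = c("chr1", "chr2", "chr2", "chr1"),
  start = c(140, 210, 380, 230),
  end = c(150, 240, 415, 280)
)

reference <- tibble(
  reference_id = 1:4,
  chromosome = c("chr1", "chr1", "chr2", "chr2"),
  start = c(100, 200, 300, 415),
  end = c(150, 250, 399, 450)
)

# Step 1: pair every segment with every reference on the same chromosome.
# relationship = "many-to-many" states that duplicate keys on both sides
# are expected here, which avoids the many-to-many warning.
pairs <- inner_join(segments, reference, by = "chromosome",
                    relationship = "many-to-many")

# Step 2: keep only pairs where the segment lies inside the reference range.
manual <- filter(pairs, start.x >= start.y, end.x <= end.y)

# The helper expresses both steps in a single join specification.
helper <- inner_join(
  segments, reference,
  join_by(chromosome, within(x$start, x$end, y$start, y$end))
)

manual
helper
```

Both approaches keep only segment 1 paired with reference 1. The helper form avoids materialising all candidate pairs first, which matters more as the tables grow.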

For overlaps (my bold):

> For each range in [x_lower, x_upper], this finds everywhere that range overlaps [y_lower, y_upper] in any capacity. Equivalent to x_lower <= y_upper, x_upper >= y_lower by default.
>
> bounds can be one of "[]", "[)", "(]", or "()" to alter the inclusiveness of the lower and upper bounds. "[]" uses <= and >=, but the 3 other options use < and > and generate the exact same inequalities.

Checking the same chromosome-only pairs against the overlaps condition:

# A tibble: 8 × 7
  segment_id chromosome start.x end.x reference_id start.y end.y
       <int> <chr>        <dbl> <dbl>        <int>   <dbl> <dbl>
1          1 chr1           140   150            1     100   150 # yes
2          1 chr1           140   150            2     200   250 # x$end (150) smaller than y$start (200)
3          2 chr2           210   240            3     300   399 # x$end (240) smaller than y$start (300)
4          2 chr2           210   240            4     415   450 # x$end (240) smaller than y$start (415)
5          3 chr2           380   415            3     300   399 # yes
6          3 chr2           380   415            4     415   450 # yes
7          4 chr1           230   280            1     100   150 # x$start (230) > y$end (150)
8          4 chr1           230   280            2     200   250 # yes

Therefore, join_by(overlaps()) gives:

# A tibble: 5 × 7
  segment_id chromosome start.x end.x reference_id start.y end.y
       <int> <chr>        <dbl> <dbl>        <int>   <dbl> <dbl>
1          1 chr1           140   150            1     100   150
2          2 chr2           210   240           NA      NA    NA
3          3 chr2           380   415            3     300   399
4          3 chr2           380   415            4     415   450
5          4 chr1           230   280            2     200   250
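
Two details of this output may be worth checking in code: where the NA row comes from, and what the bounds argument changes. The sketch below assumes dplyr 1.1.0 or later; the variable names (`by_closed`, `by_open`, and so on) are illustrative only. It uses the full_join call from the question, then contrasts it with an inner_join and with bounds = "()".

```r
library(dplyr)

segments <- tibble(
  segment_id = 1:4,
  chromosome = c("chr1", "chr2", "chr2", "chr1"),
  start = c(140, 210, 380, 230),
  end = c(150, 240, 415, 280)
)

reference <- tibble(
  reference_id = 1:4,
  chromosome = c("chr1", "chr1", "chr2", "chr2"),
  start = c(100, 200, 300, 415),
  end = c(150, 250, 399, 450)
)

# Default bounds "[]": endpoints count, so ranges that only touch still match.
by_closed <- join_by(chromosome, overlaps(x$start, x$end, y$start, y$end))

# bounds = "()": strict inequalities, so ranges that only touch do not match.
by_open <- join_by(
  chromosome,
  overlaps(x$start, x$end, y$start, y$end, bounds = "()")
)

# full_join keeps segment 2, which overlaps nothing, and fills the
# reference columns with NA.
full_closed <- full_join(segments, reference, by_closed)

# inner_join drops the unmatched segment 2.
inner_closed <- inner_join(segments, reference, by_closed)

# With open bounds, segment 3 (380-415) no longer matches reference 4
# (415-450), because the two ranges share only the endpoint 415.
inner_open <- inner_join(segments, reference, by_open)

full_closed
inner_closed
inner_open
```

The NA row in the output above is therefore a property of the full join, not of overlaps itself: the helper only decides which pairs match, and the join type decides what happens to rows without a match.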

huangapple
  • Posted on June 5, 2023 at 16:36:57
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76404727.html