June 5, 2023, 16:36:57

Usage of the keywords 'within' and 'overlaps' in join_by

Question

The R/dplyr help file contains the code below, which uses within and overlaps. How should these two keywords be understood? Thanks!

library(dplyr)
segments <- tibble(
  segment_id = 1:4,
  chromosome = c("chr1", "chr2", "chr2", "chr1"),
  start = c(140, 210, 380, 230),
  end = c(150, 240, 415, 280)
)
reference <- tibble(
  reference_id = 1:4,
  chromosome = c("chr1", "chr1", "chr2", "chr2"),
  start = c(100, 200, 300, 415),
  end = c(150, 250, 399, 450)
)

sample 1: within

by <- join_by(chromosome, within(x$start, x$end, y$start, y$end))
inner_join(segments, reference, by)

sample 2: overlaps

by <- join_by(chromosome, overlaps(x$start, x$end, y$start, y$end))
full_join(segments, reference, by)

Answer 1

Score: 3

within only captures rows if the range in x is entirely in the range of y.

overlaps captures rows if there is any overlap between the range in x and the range in y. That includes ranges that lie entirely within, so every row matched by within is also matched by overlaps.

It might be easier to understand like this (note that this uses overlaps' default bounds, "[]"):

Example:

x_lower = c(1, 10, 5, 10)
x_upper = c(4, 25, 6, 15)
y_lower = c(0, 15, 10, 3)
y_upper = c(10, 16, 20, 30)
df <- data.frame(x_lower, x_upper, y_lower, y_upper)
transform(df, 
          is_within = x_lower >= y_lower & x_upper <= y_upper,
          is_overlap = x_lower <= y_upper & x_upper >= y_lower)
#   x_lower x_upper y_lower y_upper is_within is_overlap
# 1       1       4       0      10      TRUE       TRUE
# 2      10      25      15      16     FALSE       TRUE
# 3       5       6      10      20     FALSE      FALSE
# 4      10      15       3      30      TRUE       TRUE

From the documentation:
> within(x_lower, x_upper, y_lower, y_upper)
>
> For each range in [x_lower, x_upper], this finds everywhere that
> range falls completely within [y_lower, y_upper]. Equivalent to
> x_lower >= y_lower, x_upper <= y_upper.

And

> overlaps(x_lower, x_upper, y_lower, y_upper, ..., bounds = "[]")
>
> For each range in [x_lower, x_upper], this finds everywhere that
> range overlaps [y_lower, y_upper] in any capacity. Equivalent to
> x_lower <= y_upper, x_upper >= y_lower by default.

Answer 2

Score: 1

The documentation of join_by actually covers these two helper functions.

For within (my bold):

> For each range in [x_lower, x_upper], this finds everywhere that range falls completely within [y_lower, y_upper]. Equivalent to x_lower >= y_lower, x_upper <= y_upper.
>
> The inequalities used to build within() are the same regardless of the inclusiveness of the supplied ranges.

library(dplyr)
full_join(segments, reference, by = "chromosome")
# A tibble: 8 × 7
  segment_id chromosome start.x end.x reference_id start.y end.y
       <int> <chr>        <dbl> <dbl>        <int>   <dbl> <dbl>
1          1 chr1           140   150            1     100   150 # yes
2          1 chr1           140   150            2     200   250 # both x smaller than y
3          2 chr2           210   240            3     300   399 # both x smaller than y
4          2 chr2           210   240            4     415   450 # both x smaller than y
5          3 chr2           380   415            3     300   399 # x$end (415) outside range
6          3 chr2           380   415            4     415   450 # x$start (380) outside range
7          4 chr1           230   280            1     100   150 # both x greater than y
8          4 chr1           230   280            2     200   250 # x$end (280) outside range

Therefore, join_by(within()) gives:

by <- join_by(chromosome, within(x$start, x$end, y$start, y$end))
inner_join(segments, reference, by)
# A tibble: 1 × 7
  segment_id chromosome start.x end.x reference_id start.y end.y
       <int> <chr>        <dbl> <dbl>        <int>   <dbl> <dbl>
1          1 chr1           140   150            1     100   150
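
As a rough check of the equivalence quoted from the documentation, within() can be written out as two explicit inequality conditions inside join_by(). This is only a sketch, assuming dplyr 1.1.0 or later and the segments and reference tibbles from the question; the names by_within, by_explicit, res_within and res_explicit are illustrative and not part of the original post.

```r
library(dplyr)

segments <- tibble(
  segment_id = 1:4,
  chromosome = c("chr1", "chr2", "chr2", "chr1"),
  start = c(140, 210, 380, 230),
  end = c(150, 240, 415, 280)
)
reference <- tibble(
  reference_id = 1:4,
  chromosome = c("chr1", "chr1", "chr2", "chr2"),
  start = c(100, 200, 300, 415),
  end = c(150, 250, 399, 450)
)

# Helper form, as used in the question
by_within <- join_by(chromosome, within(x$start, x$end, y$start, y$end))

# The same two conditions spelled out by hand (illustrative name)
by_explicit <- join_by(chromosome, x$start >= y$start, x$end <= y$end)

res_within <- inner_join(segments, reference, by_within)
res_explicit <- inner_join(segments, reference, by_explicit)

# Compare the two results
all.equal(res_within, res_explicit)
```

Both joins should keep only the pair where the segment lies inside the reference range (segment 1 with reference 1), which matches the one-row result shown above.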

For overlaps (my bold):

> For each range in [x_lower, x_upper], this finds everywhere that range overlaps [y_lower, y_upper] in any capacity. Equivalent to x_lower <= y_upper, x_upper >= y_lower by default.
>
> bounds can be one of "[]", "[)", "(]", or "()" to alter the inclusiveness of the lower and upper bounds. "[]" uses <= and >=, but the 3 other options use < and > and generate the exact same inequalities.

# A tibble: 8 × 7
  segment_id chromosome start.x end.x reference_id start.y end.y
       <int> <chr>        <dbl> <dbl>        <int>   <dbl> <dbl>
1          1 chr1           140   150            1     100   150 # yes
2          1 chr1           140   150            2     200   250 # x$end (150) smaller than y$start (200)
3          2 chr2           210   240            3     300   399 # x$end (240) smaller than y$start (300)
4          2 chr2           210   240            4     415   450 # x$end (240) smaller than y$start (415)
5          3 chr2           380   415            3     300   399 # yes
6          3 chr2           380   415            4     415   450 # yes
7          4 chr1           230   280            1     100   150 # x$start (230) greater than y$end (150)
8          4 chr1           230   280            2     200   250 # yes

Therefore, join_by(overlaps()) gives:

by <- join_by(chromosome, overlaps(x$start, x$end, y$start, y$end))
full_join(segments, reference, by)
# A tibble: 5 × 7
  segment_id chromosome start.x end.x reference_id start.y end.y
       <int> <chr>        <dbl> <dbl>        <int>   <dbl> <dbl>
1          1 chr1           140   150            1     100   150
2          2 chr2           210   240           NA      NA    NA
3          3 chr2           380   415            3     300   399
4          3 chr2           380   415            4     415   450
5          4 chr1           230   280            2     200   250
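
To see what the bounds argument changes, the sketch below runs the same overlaps() join once with the default "[]" and once with "()". It assumes dplyr 1.1.0 or later and the tibbles from the question; by_closed, by_open, closed and open are illustrative names, and inner_join is used instead of full_join only to keep the comparison to matched pairs.

```r
library(dplyr)

segments <- tibble(
  segment_id = 1:4,
  chromosome = c("chr1", "chr2", "chr2", "chr1"),
  start = c(140, 210, 380, 230),
  end = c(150, 240, 415, 280)
)
reference <- tibble(
  reference_id = 1:4,
  chromosome = c("chr1", "chr1", "chr2", "chr2"),
  start = c(100, 200, 300, 415),
  end = c(150, 250, 399, 450)
)

# Default closed bounds: x_lower <= y_upper, x_upper >= y_lower
by_closed <- join_by(chromosome, overlaps(x$start, x$end, y$start, y$end))

# Open bounds: x_lower < y_upper, x_upper > y_lower
by_open <- join_by(
  chromosome,
  overlaps(x$start, x$end, y$start, y$end, bounds = "()")
)

closed <- inner_join(segments, reference, by_closed)
open <- inner_join(segments, reference, by_open)

# Pairs that match only when the endpoints are inclusive
anti_join(closed, open, by = c("segment_id", "reference_id"))
```

The only pair that differs is segment 3 (380 to 415) with reference 4 (415 to 450): the two ranges merely touch at 415, so they count as overlapping under "[]" but not under "()".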
