Usage of the keywords 'within' and 'overlaps' in join_by
Question
The R/dplyr help file contains the code below, which uses within() and overlaps(). How should these two helpers be understood? Thanks!
library(dplyr)
segments <- tibble(
  segment_id = 1:4,
  chromosome = c("chr1", "chr2", "chr2", "chr1"),
  start = c(140, 210, 380, 230),
  end = c(150, 240, 415, 280)
)
reference <- tibble(
  reference_id = 1:4,
  chromosome = c("chr1", "chr1", "chr2", "chr2"),
  start = c(100, 200, 300, 415),
  end = c(150, 250, 399, 450)
)
# Sample 1: within
by <- join_by(chromosome, within(x$start, x$end, y$start, y$end))
inner_join(segments, reference, by)
# Sample 2: overlaps
by <- join_by(chromosome, overlaps(x$start, x$end, y$start, y$end))
full_join(segments, reference, by)
Answer 1
Score: 3
within() only captures rows where the range in x lies entirely inside the range in y.
overlaps() captures rows where there is any kind of overlap between the ranges of x and y. That includes ranges that lie entirely within, so every row matched by within() is also matched by overlaps().
It might be easier to understand like this (note that this uses the default bounds of overlaps(), "[]"):
Example:
x_lower <- c(1, 10, 5, 10)
x_upper <- c(4, 25, 6, 15)
y_lower <- c(0, 15, 10, 3)
y_upper <- c(10, 16, 20, 30)
df <- data.frame(x_lower, x_upper, y_lower, y_upper)
transform(df,
  is_within = x_lower >= y_lower & x_upper <= y_upper,
  is_overlap = x_lower <= y_upper & x_upper >= y_lower)
#   x_lower x_upper y_lower y_upper is_within is_overlap
# 1       1       4       0      10      TRUE       TRUE
# 2      10      25      15      16     FALSE       TRUE
# 3       5       6      10      20     FALSE      FALSE
# 4      10      15       3      30      TRUE       TRUE
From the documentation:
> within(x_lower, x_upper, y_lower, y_upper)
>
> For each range in [x_lower, x_upper], this finds everywhere that
> range falls completely within [y_lower, y_upper]. Equivalent to
> x_lower >= y_lower, x_upper <= y_upper.
And
> overlaps(x_lower, x_upper, y_lower, y_upper, ..., bounds = "[]")
>
> For each range in [x_lower, x_upper], this finds everywhere that
> range overlaps [y_lower, y_upper] in any capacity. Equivalent to
> x_lower <= y_upper, x_upper >= y_lower by default.
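The equivalence stated in the documentation can be checked on the data from the question. The following is a minimal sketch, assuming dplyr 1.1.0 or later (where join_by() is available); the names with_helper and with_inequalities are only illustrative. It runs the same inner join once with within() and once with the two inequalities written out, then compares the matched id pairs.

```r
library(dplyr)

# Data copied from the question
segments <- tibble(
  segment_id = 1:4,
  chromosome = c("chr1", "chr2", "chr2", "chr1"),
  start = c(140, 210, 380, 230),
  end = c(150, 240, 415, 280)
)
reference <- tibble(
  reference_id = 1:4,
  chromosome = c("chr1", "chr1", "chr2", "chr2"),
  start = c(100, 200, 300, 415),
  end = c(150, 250, 399, 450)
)

# Join using the within() helper
with_helper <- inner_join(
  segments, reference,
  join_by(chromosome, within(x$start, x$end, y$start, y$end))
)

# Join using the inequalities the documentation describes as equivalent
with_inequalities <- inner_join(
  segments, reference,
  join_by(chromosome, x$start >= y$start, x$end <= y$end)
)

# Compare the matched (segment_id, reference_id) pairs from both joins
all.equal(
  select(with_helper, segment_id, reference_id),
  select(with_inequalities, segment_id, reference_id)
)
```

On this data both joins should match only segment 1 with reference 1, since that is the only x range lying inside a y range on the same chromosome.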
Answer 2
Score: 1
The documentation of join_by actually covers these two helper functions.
For within (my bold):
> For each range in [x_lower, x_upper], this finds everywhere that range falls completely within [y_lower, y_upper]. Equivalent to x_lower >= y_lower, x_upper <= y_upper.
>
> The inequalities used to build within() are the same regardless of the inclusiveness of the supplied ranges.
library(dplyr)
full_join(segments, reference, by = "chromosome")
# A tibble: 8 × 7
segment_id chromosome start.x end.x reference_id start.y end.y
<int> <chr> <dbl> <dbl> <int> <dbl> <dbl>
1 1 chr1 140 150 1 100 150 # yes
2 1 chr1 140 150 2 200 250 # both x smaller than y
3 2 chr2 210 240 3 300 399 # both x smaller than y
4 2 chr2 210 240 4 415 450 # both x smaller than y
5 3 chr2 380 415 3 300 399 # x$end (415) outside range
6 3 chr2 380 415 4 415 450 # x$start (380) outside range
7 4 chr1 230 280 1 100 150 # both x greater than y
8 4 chr1 230 280 2 200 250 # x$end (280) outside range
Therefore, join_by(within()) gives:
by <- join_by(chromosome, within(x$start, x$end, y$start, y$end))
inner_join(segments, reference, by)
# A tibble: 1 × 7
segment_id chromosome start.x end.x reference_id start.y end.y
<int> <chr> <dbl> <dbl> <int> <dbl> <dbl>
1 1 chr1 140 150 1 100 150
<hr>
For overlaps (my bold):
> For each range in [x_lower, x_upper], this finds everywhere that range overlaps [y_lower, y_upper] in any capacity. Equivalent to x_lower <= y_upper, x_upper >= y_lower by default.
>
> bounds can be one of "[]", "[)", "(]", or "()" to alter the inclusiveness of the lower and upper bounds. "[]" uses <= and >=, but the 3 other options use < and > and generate the exact same inequalities.
# A tibble: 8 × 7
segment_id chromosome start.x end.x reference_id start.y end.y
<int> <chr> <dbl> <dbl> <int> <dbl> <dbl>
1 1 chr1 140 150 1 100 150 # yes
2 1 chr1 140 150 2 200 250 # x$end (150) smaller than y$start (200)
3 2 chr2 210 240 3 300 399 # x$end (240) smaller than y$start (300)
4 2 chr2 210 240 4 415 450 # x$end (240) smaller than y$start (415)
5 3 chr2 380 415 3 300 399 # yes
6 3 chr2 380 415 4 415 450 # yes
7 4 chr1 230 280 1 100 150 # x$start (230) > y$end (150)
8 4 chr1 230 280 2 200 250 # yes
Therefore, join_by(overlaps()) gives:
# A tibble: 5 × 7
segment_id chromosome start.x end.x reference_id start.y end.y
<int> <chr> <dbl> <dbl> <int> <dbl> <dbl>
1 1 chr1 140 150 1 100 150
2 2 chr2 210 240 NA NA NA
3 3 chr2 380 415 3 300 399
4 3 chr2 380 415 4 415 450
5 4 chr1 230 280 2 200 250
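The bounds argument quoted above matters for ranges that only touch at an endpoint, such as segment 3 (ending at 415) and reference 4 (starting at 415). The following is a minimal sketch, assuming dplyr 1.1.0 or later; the names closed, open_bounds and only_closed are only illustrative. It compares the default bounds = "[]" with bounds = "()" using an inner join, so that only matched pairs are counted.

```r
library(dplyr)

# Data copied from the question
segments <- tibble(
  segment_id = 1:4,
  chromosome = c("chr1", "chr2", "chr2", "chr1"),
  start = c(140, 210, 380, 230),
  end = c(150, 240, 415, 280)
)
reference <- tibble(
  reference_id = 1:4,
  chromosome = c("chr1", "chr1", "chr2", "chr2"),
  start = c(100, 200, 300, 415),
  end = c(150, 250, 399, 450)
)

# Default bounds "[]": endpoints count, so touching ranges overlap
closed <- inner_join(
  segments, reference,
  join_by(chromosome, overlaps(x$start, x$end, y$start, y$end))
)

# Bounds "()": strict < and >, so ranges that only touch do not overlap
open_bounds <- inner_join(
  segments, reference,
  join_by(chromosome, overlaps(x$start, x$end, y$start, y$end, bounds = "()"))
)

# Pairs matched with closed bounds but not with open bounds
only_closed <- anti_join(closed, open_bounds, by = c("segment_id", "reference_id"))
only_closed
```

With this data the closed-bounds join should return four matched pairs and the open-bounds join three; the pair that drops out is segment 3 with reference 4, because 415 > 415 is false.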