How to add a column to data.table based on interval matching in R?

Question


I have two data.tables, A and B. A has two columns, "chrom" and "pos", while B holds a series of intervals read from a BED file. I want to add a new column called "select_status" to A. If a row's "pos" falls within any interval in B, the corresponding value in "select_status" should be TRUE; otherwise it should be FALSE.

Here is an example to illustrate the data structures:

```R
library(data.table)

A <- data.table(chrom = c("chr1", "chr2", "chr3", "chr3", "chr3"),
                pos = c(100, 200, 300, 391, 399))
B <- data.table(chrom = c("chr1", "chr2", "chr2", "chr3", "chr3", "chr3"),
                start = c(150, 180, 250, 280, 390, 600),
                end = c(200, 220, 300, 320, 393, 900))

# I need to add a column select_status to A and set it to TRUE if pos falls in B.
# I want something like this, but it is wrong:

A[, select_status := any(pos >= B$start & pos <= B$end & chrom == B$chrom)]
```

or

```R
A[, select_status := sapply(.SD, function(x) any(x >= B$start & x <= B$end)), .SDcols = c("pos"), by = .(chrom)]

A[is.na(select_status), select_status := FALSE]
```

My solution does not work because it does not compare each pos against the regions in B row by row, so pos `chr3 399` is also set to `TRUE`.

I know that I can use `apply` to walk through A row by row and use each row as a filter on B to get a similar result, but this is slow when the data has many rows. Is there another, more concise method?

The result I expect:

```
A
   chrom pos select_status
1:  chr1 100         FALSE
2:  chr2 200          TRUE
3:  chr3 300          TRUE
4:  chr3 391          TRUE
5:  chr3 399         FALSE
```
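The failure described in the question can be reproduced in isolation. The sketch below is an illustration only, not part of the original post; the names `collapsed`, `in_any_interval` and `row_wise` are hypothetical. It shows that `any()` over the whole table yields a single value, and contrasts this with a per-row check. That per-row check is the slow baseline the question wants to avoid.

```r
library(data.table)

A <- data.table(chrom = c("chr1", "chr2", "chr3", "chr3", "chr3"),
                pos = c(100, 200, 300, 391, 399))
B <- data.table(chrom = c("chr1", "chr2", "chr2", "chr3", "chr3", "chr3"),
                start = c(150, 180, 250, 280, 390, 600),
                end = c(200, 220, 300, 320, 393, 900))

# Without `by`, j is evaluated once for the whole table, so any() returns a
# single logical that := then recycles into every row of A.
# A$pos (length 5) is also recycled against B$start (length 6), which pairs
# positions with unrelated intervals and raises a warning (suppressed here).
collapsed <- suppressWarnings(
  any(A$pos >= B$start & A$pos <= B$end & A$chrom == B$chrom)
)
length(collapsed)  # 1

# Hypothetical helper: test one position against all intervals on its chrom.
in_any_interval <- function(chrom_i, pos_i) {
  any(B$chrom == chrom_i & B$start <= pos_i & B$end >= pos_i)
}

# Row-by-row evaluation gives one logical per row of A.
row_wise <- mapply(in_any_interval, A$chrom, A$pos, USE.NAMES = FALSE)
row_wise  # FALSE TRUE TRUE TRUE FALSE
```

The row-wise result matches the expected output, which suggests the missing piece is a vectorised way to pair each row of A with its matching intervals.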


# Answer 1
**Score**: 1

Here is one approach to consider:

```R
library(data.table)

A <- data.table(chrom = c("chr1", "chr2", "chr3", "chr3", "chr3"),
                pos = c(100, 200, 300, 391, 399))

B <- data.table(chrom = c("chr1", "chr2", "chr2", "chr3", "chr3", "chr3"),
                start = c(150, 180, 250, 280, 390, 600),
                end = c(200, 220, 300, 320, 393, 900))

# Expand every interval in B into its integer positions and pool them into
# one vector, i.e. c(150:200, 180:220, ...).
X_Val <- eval(parse(text = paste0("c(",  paste0(paste0(B$start, ":", B$end), collapse = ","), ")")))
# Flag positions found in the pooled vector. Note that chrom is ignored here,
# so each position is matched against the intervals of every chromosome.
A[["select_status"]] <- ifelse(A$pos %in% X_Val, TRUE, FALSE)

A
```

```
   chrom pos select_status
1:  chr1 100         FALSE
2:  chr2 200          TRUE
3:  chr3 300          TRUE
4:  chr3 391          TRUE
5:  chr3 399         FALSE
```
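The pooled-vector approach above does not look at `chrom`, so a position can match an interval from a different chromosome, and it only covers integer positions. A minimal sketch of a chrom-aware alternative follows, assuming a data.table version that supports non-equi joins. It is not the answerer's method, and the table `A2` is a hypothetical extra case added for illustration.

```r
library(data.table)

A <- data.table(chrom = c("chr1", "chr2", "chr3", "chr3", "chr3"),
                pos = c(100, 200, 300, 391, 399))
B <- data.table(chrom = c("chr1", "chr2", "chr2", "chr3", "chr3", "chr3"),
                start = c(150, 180, 250, 280, 390, 600),
                end = c(200, 220, 300, 320, 393, 900))

# Start with FALSE everywhere, then flip the rows of A that fall inside at
# least one interval of B.
A[, select_status := FALSE]

# Non-equi update join: same chrom, and start <= pos <= end.
# In `on`, the left-hand names are columns of A, the right-hand names columns of B.
A[B, on = .(chrom, pos >= start, pos <= end), select_status := TRUE]

print(A)  # select_status: FALSE TRUE TRUE TRUE FALSE

# Hypothetical extra case: 210 lies inside the chr2 interval 180-220 but
# outside the only chr1 interval (150-200), so it should stay FALSE.
A2 <- data.table(chrom = "chr1", pos = 210)
A2[, select_status := FALSE]
A2[B, on = .(chrom, pos >= start, pos <= end), select_status := TRUE]

print(A2)  # select_status: FALSE
```

The join updates A by reference and avoids both the row-by-row loop and the expansion of every interval into individual positions, which may matter when intervals are long.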

huangapple
  • This article was published on 2023-05-22 11:19:43
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76302850.html