How to add a column to data.table based on interval matching in R?
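Question
I have two data.tables, `A` and `B`. `A` has two columns, "chrom" and "pos", while `B` holds a series of intervals read from a BED file. I want to add a new column called "select_status" to `A`. If a row's "pos" falls within any interval in `B`, the corresponding value in "select_status" should be TRUE; otherwise it should be FALSE.
Here is an example to illustrate the data structures: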
```R
library(data.table)
A <- data.table(chrom = c("chr1", "chr2", "chr3", "chr3", "chr3"),
                pos = c(100, 200, 300, 391, 399))
B <- data.table(chrom = c("chr1", "chr2", "chr2", "chr3", "chr3", "chr3"),
                start = c(150, 180, 250, 280, 390, 600),
                end = c(200, 220, 300, 320, 393, 900))
```
I need to add a column `select_status` to `A` and set it to TRUE if `pos` is in `B`. I want something like the following, but it is wrong:
```R
A[, select_status := any(pos >= B$start & pos <= B$end & chrom == B$chrom)]
```
or
```R
A[, select_status := sapply(.SD, function(x) any(x >= B$start & x <= B$end)), .SDcols = c("pos"), by = .(chrom)]
A[is.na(select_status), select_status := FALSE]
```
My solution does not work because it does not compare each row's `pos` against the matching regions in `B` row by row, so `chr3 399` is also set to `TRUE`.
I know I can use `apply` to walk through `A` row by row and use each row to filter `B`, but that is slow when the data has many rows. Is there another, more concise method?
The result I expect:
```R
A
#    chrom pos select_status
# 1:  chr1 100         FALSE
# 2:  chr2 200          TRUE
# 3:  chr3 300          TRUE
# 4:  chr3 391          TRUE
# 5:  chr3 399         FALSE
```
# Answer 1
**Score**: 1
Here is one approach to consider:
```R
library(data.table)
A <- data.table(chrom = c("chr1", "chr2", "chr3", "chr3", "chr3"),
                pos = c(100, 200, 300, 391, 399))
B <- data.table(chrom = c("chr1", "chr2", "chr2", "chr3", "chr3", "chr3"),
                start = c(150, 180, 250, 280, 390, 600),
                end = c(200, 220, 300, 320, 393, 900))
# Expand every interval into the full set of integer positions it covers.
# Caveat: positions from all chromosomes are pooled, so `chrom` is ignored.
X_Val <- unlist(Map(seq, B$start, B$end))
# TRUE when pos is one of the covered positions
A[, select_status := pos %in% X_Val]
A
#    chrom pos select_status
# 1:  chr1 100         FALSE
# 2:  chr2 200          TRUE
# 3:  chr3 300          TRUE
# 4:  chr3 391          TRUE
# 5:  chr3 399         FALSE
```
Hope this helps.
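The snippet above matches on `pos` alone, so a position on one chromosome can be flagged because of an interval on another. As a possible alternative (not the answerer's method), here is a minimal sketch using a data.table non-equi update join, which compares each row of `A` only with intervals in `B` on the same `chrom`. It assumes intervals are closed on both ends, as in the question's attempt; `probe` is a hypothetical extra table added only for illustration.

```R
library(data.table)

A <- data.table(chrom = c("chr1", "chr2", "chr3", "chr3", "chr3"),
                pos = c(100, 200, 300, 391, 399))
B <- data.table(chrom = c("chr1", "chr2", "chr2", "chr3", "chr3", "chr3"),
                start = c(150, 180, 250, 280, 390, 600),
                end = c(200, 220, 300, 320, 393, 900))

# Default every row to FALSE, then flip to TRUE the rows of A that fall
# inside at least one interval of B on the same chromosome.
A[, select_status := FALSE]
A[B, on = .(chrom, pos >= start, pos <= end), select_status := TRUE]

# Hypothetical check: pos 200 lies inside chr1 and chr2 intervals,
# but not inside any chr3 interval, so it should stay FALSE.
probe <- data.table(chrom = "chr3", pos = 200)
probe[, select_status := FALSE]
probe[B, on = .(chrom, pos >= start, pos <= end), select_status := TRUE]
```

Because the join is done inside data.table rather than row by row in R, this avoids both the `apply` loop and the need to expand every interval into individual positions, which may matter for long BED intervals.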