2023-05-22 11:19:43

How to add a column to data.table based on interval matching in R?

Question

I have two data.tables, A and B. A has two columns, "chrom" and "pos", while B holds a series of intervals read from a BED file. I want to add a new column named "select_status" to A. If a row's "pos" falls within any interval in B, the corresponding value in "select_status" should be TRUE; otherwise it should be FALSE.

Here is an example that illustrates the data structures:

```R
library(data.table)
A <- data.table(chrom = c("chr1", "chr2", "chr3", "chr3", "chr3"),
                pos = c(100, 200, 300, 391, 399))
B <- data.table(chrom = c("chr1", "chr2", "chr2", "chr3", "chr3", "chr3"),
                start = c(150, 180, 250, 280, 390, 600),
                end = c(200, 220, 300, 320, 393, 900))

# I need to add a column select_status to A and set it to TRUE if pos lies in B.
# I want something like this, but it is wrong:
A[, select_status := any(pos >= B$start & pos <= B$end & chrom == B$chrom)]

# or

A[, select_status := sapply(.SD, function(x) any(x >= B$start & x <= B$end)), .SDcols = c("pos"), by = .(chrom)]
A[is.na(select_status), select_status := FALSE]
```

My solution does not work because it does not compare pos with the regions in B row by row, so pos `chr3 399` is also set to `TRUE`.

I know that I can use `apply` to walk through A line by line and use each row as a filter on B to get a similar result, but that is slow when the data has many rows. I wonder whether there is another, more concise method.

The result I expect:

```
A
   chrom pos select_status
1:  chr1 100         FALSE
2:  chr2 200          TRUE
3:  chr3 300          TRUE
4:  chr3 391          TRUE
5:  chr3 399         FALSE
```


# Answer 1
**Score**: 1
Here is one approach to consider:
```R
library(data.table)
A <- data.table(chrom = c("chr1", "chr2", "chr3", "chr3", "chr3"),
                pos = c(100, 200, 300, 391, 399))
B <- data.table(chrom = c("chr1", "chr2", "chr2", "chr3", "chr3", "chr3"),
                start = c(150, 180, 250, 280, 390, 600),
                end = c(200, 220, 300, 320, 393, 900))
# Collect every integer position covered by any interval in B.
# Note: this pools the intervals of all chromosomes, so chrom is not checked.
X_Val <- unlist(Map(seq, B$start, B$end))
# Flag the rows of A whose pos is one of the covered positions.
A[, select_status := pos %in% X_Val]
A
#    chrom pos select_status
# 1:  chr1 100         FALSE
# 2:  chr2 200          TRUE
# 3:  chr3 300          TRUE
# 4:  chr3 391          TRUE
# 5:  chr3 399         FALSE
```
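The lookup above matches on pos alone, so a position on one chromosome could be flagged by an interval that belongs to another. As a possible alternative (not part of the original answer), here is a minimal sketch that uses a data.table non-equi update join, assuming the intervals in B are closed on both ends as in the question:

```R
library(data.table)

A <- data.table(chrom = c("chr1", "chr2", "chr3", "chr3", "chr3"),
                pos = c(100, 200, 300, 391, 399))
B <- data.table(chrom = c("chr1", "chr2", "chr2", "chr3", "chr3", "chr3"),
                start = c(150, 180, 250, 280, 390, 600),
                end = c(200, 220, 300, 320, 393, 900))

# Start with FALSE everywhere.
A[, select_status := FALSE]

# Non-equi update join: for each interval in B, find the rows of A with the
# same chrom and start <= pos <= end, and set their flag to TRUE by reference.
A[B, on = .(chrom, pos >= start, pos <= end), select_status := TRUE]

print(A)
```

Because the join is restricted to rows with the same chrom, `chr3 399` stays FALSE here, and no row-by-row loop over A is needed.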
