How to add a column to data.table based on interval matching in R?

Question

I have two data.tables, A and B. A has two columns, "chrom" and "pos", while B holds a series of intervals read from a BED file. I want to add a new column called "select_status" to A. If a row's "pos" falls within any interval in B, the corresponding value of "select_status" should be TRUE; otherwise it should be FALSE.

Here is an example to illustrate the data structures:

```R
library(data.table)

A <- data.table(chrom = c("chr1", "chr2", "chr3", "chr3", "chr3"),
                pos = c(100, 200, 300, 391, 399))
B <- data.table(chrom = c("chr1", "chr2", "chr2", "chr3", "chr3", "chr3"),
                start = c(150, 180, 250, 280, 390, 600),
                end = c(200, 220, 300, 320, 393, 900))

# I need to add a column select_status to A and set it to TRUE if pos falls in B.
# I want something like this, but it is wrong:
A[, select_status := any(pos >= B$start & pos <= B$end & chrom == B$chrom)]
```

or

```R
A[, select_status := sapply(.SD, function(x) any(x >= B$start & x <= B$end)), .SDcols = c("pos"), by = .(chrom)]
A[is.na(select_status), select_status := FALSE]
```

My solution does not work because it does not compare each pos against the regions in B row by row; the position `chr3 399` also gets set to `TRUE`.

I know that I can use `apply` to walk through A line by line and use each row as a filter on B to get a similar result, but this is slow when the data has many rows. I wonder whether there is another, more concise method.

The result I expect is:

```
A
   chrom pos select_status
1:  chr1 100         FALSE
2:  chr2 200          TRUE
3:  chr3 300          TRUE
4:  chr3 391          TRUE
5:  chr3 399         FALSE
```
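The row-by-row check described in the question can serve as a reference for what the result should look like. The following is a minimal sketch, assuming the A and B tables from the example and inclusive interval endpoints; the helper name `in_any_interval` is hypothetical and not part of data.table.

```R
library(data.table)

A <- data.table(chrom = c("chr1", "chr2", "chr3", "chr3", "chr3"),
                pos = c(100, 200, 300, 391, 399))
B <- data.table(chrom = c("chr1", "chr2", "chr2", "chr3", "chr3", "chr3"),
                start = c(150, 180, 250, 280, 390, 600),
                end = c(200, 220, 300, 320, 393, 900))

# Hypothetical helper: TRUE if (chrom, pos) lies inside at least one interval
# of B on the same chromosome (both endpoints inclusive).
in_any_interval <- function(chrom, pos) {
  any(B$chrom == chrom & B$start <= pos & pos <= B$end)
}

# Call the helper once per row of A. Simple to read, but it scans all of B
# for every row, so it is slow for large tables.
A[, select_status := mapply(in_any_interval, chrom, pos, USE.NAMES = FALSE)]
print(A)
```

With the example data, `select_status` comes out as FALSE, TRUE, TRUE, TRUE, FALSE, which matches the expected result.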
# Answer 1

**Score**: 1

Here is an approach that can be considered:

```R
library(data.table)

A <- data.table(chrom = c("chr1", "chr2", "chr3", "chr3", "chr3"),
                pos = c(100, 200, 300, 391, 399))
B <- data.table(chrom = c("chr1", "chr2", "chr2", "chr3", "chr3", "chr3"),
                start = c(150, 180, 250, 280, 390, 600),
                end = c(200, 220, 300, 320, 393, 900))

# Build the expression c(150:200, 180:220, ...) as text and evaluate it, which
# expands every interval into its integer positions in one pooled vector.
# Note: chrom is not used here, so intervals from all chromosomes are pooled.
X_Val <- eval(parse(text = paste0("c(", paste0(paste0(B$start, ":", B$end), collapse = ","), ")")))

# A position is selected if it appears anywhere in the pooled vector.
A[["select_status"]] <- ifelse(A$pos %in% X_Val, TRUE, FALSE)
A
#    chrom pos select_status
# 1:  chr1 100         FALSE
# 2:  chr2 200          TRUE
# 3:  chr3 300          TRUE
# 4:  chr3 391          TRUE
# 5:  chr3 399         FALSE
```
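An alternative that also takes `chrom` into account is a non-equi update join, which data.table supports through the `on` argument. The following is a minimal sketch rather than a definitive solution, assuming the A and B tables from the question and inclusive interval endpoints:

```R
library(data.table)

A <- data.table(chrom = c("chr1", "chr2", "chr3", "chr3", "chr3"),
                pos = c(100, 200, 300, 391, 399))
B <- data.table(chrom = c("chr1", "chr2", "chr2", "chr3", "chr3", "chr3"),
                start = c(150, 180, 250, 280, 390, 600),
                end = c(200, 220, 300, 320, 393, 900))

# Start with FALSE everywhere, then flip the matching rows to TRUE.
A[, select_status := FALSE]

# Non-equi update join: for each interval in B, the rows of A on the same
# chrom with start <= pos <= end are updated by reference.
A[B, on = .(chrom, pos >= start, pos <= end), select_status := TRUE]
print(A)
```

With the example data, `select_status` comes out as FALSE, TRUE, TRUE, TRUE, FALSE. Because the join is evaluated inside data.table instead of looping over rows in R, it is likely to scale better than a row-by-row check, and it does not require the positions to be integers.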


huangapple
  • Posted on May 22, 2023, 11:19:43
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76302850.html