在R中通过匹配范围创建查找表

huangapple go评论84阅读模式
英文:

Lookup table in R by matching ranges

问题

以下是您的代码的翻译部分,不包括代码本身:

  1. 我有一个表格:
  2. CHR POS
  3. 10 4342
  4. 20 100
  5. 22 5422
  6. 我有另一个查找表:
  7. CHR start end Gene
  8. 10 4000 5999 ABC1
  9. 20 50 200 JHT
  10. 22 5000 6000 KLO
  11. 期望的输出:
  12. CHR POS
  13. 10 4342 ABC1
  14. 20 100 JHT
  15. 22 5422 KLO
  16. 实际上,表1中有约 700,000 条记录,表2中大约有 60,000 个基因。我需要按染色体匹配,然后使表1POS位于表2start-end之间,以添加一个新列,其中包含基因名称。
  17. 我尝试过:
  18. library(dplyr)
  19. # 创建示例数据
  20. df1 <- data.frame(chromosome = c("chr1", "chr1", "chr2", "chr3"), position = c(100, 200, 300, 400))
  21. df2 <- data.frame(chromosome = c("chr1", "chr2", "chr3"), start = c(50, 250, 350), end = c(150, 350, 450), gene = c("geneA", "geneB", "geneC"))
  22. # 执行左连接
  23. joined_df <- left_join(df1, df2, by = "chromosome")
  24. # 创建新列,指示每行是否位于基因内
  25. result_df <- joined_df %>%
  26. mutate(in_gene = if_else(position >= start & position <= end, gene, NA_character_))
  27. # 查看结果
  28. result_df
  29. 但矢量太大而无法存储。
英文:

I have a table:

  1. CHR POS
  2. 10 4342
  3. 20 100
  4. 22 5422

I have another lookup:

  1. CHR start end Gene
  2. 10 4000 5999 ABC1
  3. 20 50 200 JHT
  4. 22 5000 6000 KLO

Desired output:

  1. CHR POS
  2. 10 4342 ABC1
  3. 20 100 JHT
  4. 22 5422 KLO

In reality there are 700,000 entries in table 1 and roughly 60000 genes. I need to match on chromsome and then get the POS to be between start-end of table 2 to add a new column with the gene name.

I tried :

  1. library(dplyr)
  2. # create sample data
  3. df1 &lt;- data.frame(chromosome = c(&quot;chr1&quot;, &quot;chr1&quot;, &quot;chr2&quot;, &quot;chr3&quot;), position = c(100, 200, 300, 400))
  4. df2 &lt;- data.frame(chromosome = c(&quot;chr1&quot;, &quot;chr2&quot;, &quot;chr3&quot;), start = c(50, 250, 350), end = c(150, 350, 450), gene = c(&quot;geneA&quot;, &quot;geneB&quot;, &quot;geneC&quot;))
  5. # perform left join
  6. joined_df &lt;- left_join(df1, df2, by = &quot;chromosome&quot;)
  7. # create new column indicating if each row lies within a gene
  8. result_df &lt;- joined_df %&gt;%
  9. mutate(in_gene = if_else(position &gt;= start &amp; position &lt;= end, gene, NA_character_))
  10. # view result
  11. result_df

But the vector was too large to store.

答案1

得分: 4

你可以使用 GenomicRanges 来执行类似的操作。请参考下面的注释部分以获取安装代码。

GRanges 类是一个用于存储基因组位置和相关注释的容器。

makeGRangesFromDataFrame 函数将接受一个数据框作为输入,并自动查找描述基因组范围的列(默认为 startendstop)。

以下使用提供的额外示例数据。

  1. # 如果没有安装 "BiocManager",请取消注释以下代码以安装:
  2. # if (!require("BiocManager", quietly = TRUE))
  3. # install.packages("BiocManager")
  4. # 安装 "GenomicRanges":
  5. # BiocManager::install("GenomicRanges")
  6. # 加载 "GenomicRanges" 库:
  7. library(GenomicRanges)
  8. gr1 <- GRanges(seqnames = df1$chromosome,
  9. IRanges(start = df1$position, width = 1))
  10. gr2 <- makeGRangesFromDataFrame(df2, keep.extra.columns = TRUE)
  11. df1$gene <- NA
  12. ovlp <- findOverlaps(gr1, gr2)
  13. df1$gene[queryHits(ovlp)] <- gr2$gene[subjectHits(ovlp)]
  14. df1

输出

  1. chromosome position gene
  2. 1 chr1 100 geneA
  3. 2 chr1 200 <NA>
  4. 3 chr2 300 geneB
  5. 4 chr3 400 geneC
英文:

You can use GenomicRanges for something like this. See commented out code at the beginning below for installing.

The GRanges class is a container for genomic locations and associated annotations.

The function makeGRangesFromDataFrame will take a data.frame as input and automatically find the columns that describe genomic ranges (default is start and end or stop).

Below uses the additional sample data provided.

  1. # if (!require(&quot;BiocManager&quot;, quietly = TRUE))
  2. # install.packages(&quot;BiocManager&quot;)
  3. # BiocManager::install(&quot;GenomicRanges&quot;)
  4. library(GenomicRanges)
  5. gr1 &lt;- GRanges(seqnames = df1$chromosome,
  6. IRanges(start = df1$position, width = 1))
  7. gr2 &lt;- makeGRangesFromDataFrame(df2, keep.extra.columns = TRUE)
  8. df1$gene &lt;- NA
  9. ovlp &lt;- findOverlaps(gr1, gr2)
  10. df1$gene[queryHits(ovlp)] &lt;- gr2$gene[subjectHits(ovlp)]
  11. df1

Output

  1. chromosome position gene
  2. 1 chr1 100 geneA
  3. 2 chr1 200 &lt;NA&gt;
  4. 3 chr2 300 geneB
  5. 4 chr3 400 geneC

答案2

得分: 3

使用 dplyr 1.1.0,我们可以使用 join_by 进行非等值连接

  1. library(dplyr)
  2. left_join(df1, df2, by = join_by(CHR,
  3. closest(POS >= start), closest(POS <= end))) %>%
  4. select(-start, -end)

-output

  1. CHR POS Gene
  2. 1 10 4342 ABC1
  3. 2 20 100 JHT
  4. 3 22 5422 KLO

或者使用 data.table

  1. library(data.table)
  2. setDT(df1)[df2, Gene := i.Gene, on = .(CHR, POS >= start, POS <= end)]

-output

  1. > df1
  2. CHR POS Gene
  3. 1: 10 4342 ABC1
  4. 2: 20 100 JHT
  5. 3: 22 5422 KLO

数据

  1. df1 <- structure(list(CHR = c(10L, 20L, 22L), POS = c(4342L, 100L, 5422L)), class = "data.frame", row names = c(NA, -3L))
  2. df2 <- structure(list(CHR = c(10L, 20L, 22L), start = c(4000L, 50L, 5000L), end = c(5999L, 200L, 6000L), Gene = c("ABC1", "JHT", "KLO")), class = "data.frame", row names = c(NA, -3L))
英文:

With dplyr 1.1.0, we can use join_by for non-equi joins

  1. library(dplyr)
  2. left_join(df1, df2, by = join_by(CHR,
  3. closest(POS &gt;= start), closest(POS &lt;= end))) %&gt;%
  4. select(-start, -end)

-output

  1. CHR POS Gene
  2. 1 10 4342 ABC1
  3. 2 20 100 JHT
  4. 3 22 5422 KLO

Or with data.table

  1. library(data.table)
  2. setDT(df1)[df2, Gene := i.Gene, on = .(CHR, POS &gt;= start, POS &lt;= end)]

-output

  1. &gt; df1
  2. CHR POS Gene
  3. 1: 10 4342 ABC1
  4. 2: 20 100 JHT
  5. 3: 22 5422 KLO

data

  1. df1 &lt;- structure(list(CHR = c(10L, 20L, 22L), POS = c(4342L, 100L, 5422L
  2. )), class = &quot;data.frame&quot;, row.names = c(NA, -3L))
  3. df2 &lt;- structure(list(CHR = c(10L, 20L, 22L), start = c(4000L, 50L,
  4. 5000L), end = c(5999L, 200L, 6000L), Gene = c(&quot;ABC1&quot;, &quot;JHT&quot;,
  5. &quot;KLO&quot;)), class = &quot;data.frame&quot;, row.names = c(NA, -3L))

答案3

得分: 3

这是 left_join 的一种变体:

  1. library(dplyr)
  2. df1 %>%
  3. left_join(df2, by="CHR") %>%
  4. filter(between(POS, start, end)) %>%
  5. select(-c(start, end))
  6. CHR POS Gene
  7. 1 10 4342 ABC1
  8. 2 20 100 JHT
  9. 3 22 5422 KLO

这段代码执行 left_join 操作,并在之后进行了筛选和选择列的操作。

英文:

Here is a variation of left_join:

  1. library(dplyr)
  2. df1 %&gt;%
  3. left_join(df2, by=&quot;CHR&quot;) %&gt;%
  4. filter(between(POS, start, end)) %&gt;%
  5. select(-c(start, end))
  6. CHR POS Gene
  7. 1 10 4342 ABC1
  8. 2 20 100 JHT
  9. 3 22 5422 KLO

答案4

得分: 2

你可以执行以下操作:

  1. merge(m1, m2) |> {\(.) subset(., data.table::between(.$POS, .$start, .$end))}()
  2. # CHR POS start end Gene
  3. # 1 10 4342 4000 5999 ABC1
  4. # 2 20 100 50 200 JHT
  5. # 3 22 5422 5000 6000 KLO

数据:

  1. m1 <- structure(list(CHR = c(10L, 20L, 22L, 11L), POS = c(4342L, 100L,
  2. 5422L, 10L)), class = "data.frame", row names = c(NA, -4L))
  3. m2 <- structure(list(CHR = c(10L, 20L, 22L, 11L), start = c(4000L,
  4. 50L, 5000L, 5000L), end = c(5999L, 200L, 6000L, 6000L), Gene = c("ABC1",
  5. "JHT", "KLO", "KLO")), class = "data.frame", row names = c(NA,
  6. -4L))
英文:

Considering these tho guys,

  1. m1
  2. # CHR POS
  3. # 1 10 4342
  4. # 2 20 100
  5. # 3 22 5422
  6. # 4 11 10
  7. m2
  8. # CHR start end Gene
  9. # 1 10 4000 5999 ABC1
  10. # 2 20 50 200 JHT
  11. # 3 22 5000 6000 KLO
  12. # 4 11 5000 6000 KLO

you can do:

  1. merge(m1, m2) |&gt; {\(.) subset(., data.table::between(.$POS, .$start, .$end))}()
  2. # CHR POS start end Gene
  3. # 1 10 4342 4000 5999 ABC1
  4. # 2 20 100 50 200 JHT
  5. # 3 22 5422 5000 6000 KLO

Data:

  1. m1 &lt;- structure(list(CHR = c(10L, 20L, 22L, 11L), POS = c(4342L, 100L,
  2. 5422L, 10L)), class = &quot;data.frame&quot;, row.names = c(NA, -4L))
  3. m2 &lt;- structure(list(CHR = c(10L, 20L, 22L, 11L), start = c(4000L,
  4. 50L, 5000L, 5000L), end = c(5999L, 200L, 6000L, 6000L), Gene = c(&quot;ABC1&quot;,
  5. &quot;JHT&quot;, &quot;KLO&quot;, &quot;KLO&quot;)), class = &quot;data.frame&quot;, row.names = c(NA,
  6. -4L))

huangapple
  • 本文由 发表于 2023年3月4日 01:07:36
  • 转载请务必保留本文链接:https://go.coder-hub.com/75629990.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定