Lookup table in R by matching ranges

Question

I have a table:

    CHR  POS
    10   4342
    20   100
    22   5422

I have another lookup:

    CHR  start  end   Gene
    10   4000   5999  ABC1
    20   50     200   JHT
    22   5000   6000  KLO

Desired output:

    CHR  POS   Gene
    10   4342  ABC1
    20   100   JHT
    22   5422  KLO

In reality, table 1 has about 700,000 entries and table 2 has roughly 60,000 genes. I need to match on chromosome and then find the row of table 2 whose start-end range contains POS, so that I can add a new column with the gene name.

I tried:

    library(dplyr)

    # create sample data
    df1 <- data.frame(chromosome = c("chr1", "chr1", "chr2", "chr3"), position = c(100, 200, 300, 400))
    df2 <- data.frame(chromosome = c("chr1", "chr2", "chr3"), start = c(50, 250, 350), end = c(150, 350, 450), gene = c("geneA", "geneB", "geneC"))

    # perform left join
    joined_df <- left_join(df1, df2, by = "chromosome")

    # create new column indicating if each row lies within a gene
    result_df <- joined_df %>%
      mutate(in_gene = if_else(position >= start & position <= end, gene, NA_character_))

    # view result
    result_df

But the vector was too large to store.
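
To see why the chromosome-only join runs out of memory, it can help to estimate how many rows it produces before running it. The sketch below is a rough illustration in base R using the sample df1 and df2 from the attempt; the names n1, n2, shared and matched_rows are made up for this example.

```r
# Sample data from the attempt above
df1 <- data.frame(chromosome = c("chr1", "chr1", "chr2", "chr3"),
                  position = c(100, 200, 300, 400))
df2 <- data.frame(chromosome = c("chr1", "chr2", "chr3"),
                  start = c(50, 250, 350),
                  end = c(150, 350, 450),
                  gene = c("geneA", "geneB", "geneC"))

# Number of rows per chromosome in each table
n1 <- table(df1$chromosome)
n2 <- table(df2$chromosome)

# A join on chromosome alone pairs every position with every gene
# on the same chromosome, so the row counts multiply
shared <- intersect(names(n1), names(n2))
matched_rows <- sum(as.numeric(n1[shared]) * as.numeric(n2[shared]))

matched_rows
```

On the real data, assuming purely for illustration that the 700,000 positions and 60,000 genes are spread evenly over 23 chromosomes, the same calculation gives roughly 700,000 × 60,000 / 23 ≈ 1.8 billion intermediate rows, which is why filtering after the join does not scale.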

Answer 1

Score: 4

You can use the Bioconductor package GenomicRanges for this. The commented-out lines at the top of the code below show how to install it.

The GRanges class is a container for genomic locations and associated annotations.

The function makeGRangesFromDataFrame will take a data.frame as input and automatically find the columns that describe genomic ranges (default is start and end or stop).

The code below uses the sample data (df1 and df2) from the question's attempt.

    # Install GenomicRanges from Bioconductor if needed (uncomment to run):
    # if (!require("BiocManager", quietly = TRUE))
    #   install.packages("BiocManager")
    # BiocManager::install("GenomicRanges")

    library(GenomicRanges)

    # Each position becomes a range of width 1
    gr1 <- GRanges(seqnames = df1$chromosome,
                   IRanges(start = df1$position, width = 1))
    # Build the gene ranges, keeping the gene column as metadata
    gr2 <- makeGRangesFromDataFrame(df2, keep.extra.columns = TRUE)

    # Fill in the gene name wherever a position overlaps a gene
    df1$gene <- NA
    ovlp <- findOverlaps(gr1, gr2)
    df1$gene[queryHits(ovlp)] <- gr2$gene[subjectHits(ovlp)]

    df1

Output:

      chromosome position  gene
    1       chr1      100 geneA
    2       chr1      200  <NA>
    3       chr2      300 geneB
    4       chr3      400 geneC
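
If a position can fall inside more than one gene, the indexed assignment in the code above keeps only one of the hits for that row. One possible variation, sketched here under the assumption that taking the first overlapping gene is acceptable, is to pass select = "first" to findOverlaps, which returns one subject index per query (NA where nothing overlaps). The sample data is repeated so the block runs on its own, and gr2 is built directly with GRanges rather than makeGRangesFromDataFrame.

```r
library(GenomicRanges)

# Sample data from the question's attempt
df1 <- data.frame(chromosome = c("chr1", "chr1", "chr2", "chr3"),
                  position = c(100, 200, 300, 400))
df2 <- data.frame(chromosome = c("chr1", "chr2", "chr3"),
                  start = c(50, 250, 350),
                  end = c(150, 350, 450),
                  gene = c("geneA", "geneB", "geneC"))

# Positions as width-1 ranges, genes as start-end ranges with a gene column
gr1 <- GRanges(seqnames = df1$chromosome,
               ranges = IRanges(start = df1$position, width = 1))
gr2 <- GRanges(seqnames = df2$chromosome,
               ranges = IRanges(start = df2$start, end = df2$end),
               gene = df2$gene)

# One index into gr2 per position; NA when the position is in no gene
idx <- findOverlaps(gr1, gr2, select = "first")
df1$gene <- gr2$gene[idx]

df1
```

Because idx has exactly one entry per row of df1, the gene column can be filled in a single indexing step without initialising it first.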

Answer 2

Score: 3

With dplyr 1.1.0 or later, we can use join_by for non-equi joins:

    library(dplyr)
    left_join(df1, df2, by = join_by(CHR,
        closest(POS >= start), closest(POS <= end))) %>%
      select(-start, -end)

Output:

      CHR  POS Gene
    1  10 4342 ABC1
    2  20  100  JHT
    3  22 5422  KLO

Or with data.table:

    library(data.table)
    setDT(df1)[df2, Gene := i.Gene, on = .(CHR, POS >= start, POS <= end)]

Output:

    > df1
       CHR  POS Gene
    1:  10 4342 ABC1
    2:  20  100  JHT
    3:  22 5422  KLO

Data:

    df1 <- structure(list(CHR = c(10L, 20L, 22L), POS = c(4342L, 100L, 5422L)),
                     class = "data.frame", row.names = c(NA, -3L))

    df2 <- structure(list(CHR = c(10L, 20L, 22L), start = c(4000L, 50L, 5000L),
                          end = c(5999L, 200L, 6000L), Gene = c("ABC1", "JHT", "KLO")),
                     class = "data.frame", row.names = c(NA, -3L))
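
dplyr 1.1.0 also provides a between() helper inside join_by() for this kind of range lookup. The following is a minimal sketch of that alternative on the same data, not the answer's original code; the result name res is made up for this example. Note that a position lying inside several overlapping genes would produce one output row per matching gene.

```r
library(dplyr)

# Same sample data as in this answer
df1 <- data.frame(CHR = c(10L, 20L, 22L),
                  POS = c(4342L, 100L, 5422L))
df2 <- data.frame(CHR = c(10L, 20L, 22L),
                  start = c(4000L, 50L, 5000L),
                  end = c(5999L, 200L, 6000L),
                  Gene = c("ABC1", "JHT", "KLO"))

# Match on CHR, then require start <= POS <= end (inclusive bounds by default)
res <- df1 %>%
  left_join(df2, by = join_by(CHR, between(POS, start, end))) %>%
  select(-start, -end)

res
```

Because this is a left join, positions that fall in no gene are kept with Gene set to NA.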

Answer 3

Score: 3

Here is a variation that uses left_join followed by a filter:

    library(dplyr)

    df1 %>%
      left_join(df2, by = "CHR") %>%
      filter(between(POS, start, end)) %>%
      select(-c(start, end))

      CHR  POS Gene
    1  10 4342 ABC1
    2  20  100  JHT
    3  22 5422  KLO

Answer 4

Score: 2

Given these two data frames,

    m1
    #   CHR  POS
    # 1  10 4342
    # 2  20  100
    # 3  22 5422
    # 4  11   10

    m2
    #   CHR start  end Gene
    # 1  10  4000 5999 ABC1
    # 2  20    50  200  JHT
    # 3  22  5000 6000  KLO
    # 4  11  5000 6000  KLO

you can do:

    merge(m1, m2) |> {\(.) subset(., data.table::between(.$POS, .$start, .$end))}()
    #   CHR  POS start  end Gene
    # 1  10 4342  4000 5999 ABC1
    # 2  20  100    50  200  JHT
    # 3  22 5422  5000 6000  KLO

Data:

    m1 <- structure(list(CHR = c(10L, 20L, 22L, 11L), POS = c(4342L, 100L,
    5422L, 10L)), class = "data.frame", row.names = c(NA, -4L))

    m2 <- structure(list(CHR = c(10L, 20L, 22L, 11L), start = c(4000L,
    50L, 5000L, 5000L), end = c(5999L, 200L, 6000L, 6000L), Gene = c("ABC1",
    "JHT", "KLO", "KLO")), class = "data.frame", row.names = c(NA,
    -4L))
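
For data of the size described in the question, a merge followed by a subset still builds every position/gene pair within each chromosome. A possible base R alternative, sketched below, looks up each position with findInterval one chromosome at a time. It is not part of the original answer, it assumes that genes on the same chromosome do not overlap, and the helper name lookup_gene is made up for this illustration.

```r
m1 <- data.frame(CHR = c(10L, 20L, 22L, 11L),
                 POS = c(4342L, 100L, 5422L, 10L))
m2 <- data.frame(CHR = c(10L, 20L, 22L, 11L),
                 start = c(4000L, 50L, 5000L, 5000L),
                 end = c(5999L, 200L, 6000L, 6000L),
                 Gene = c("ABC1", "JHT", "KLO", "KLO"))

# Hypothetical helper: gene name for each position on ONE chromosome.
# Assumes the genes passed in do not overlap each other.
lookup_gene <- function(pos, genes) {
  genes <- genes[order(genes$start), ]
  # Index of the last gene whose start is <= pos (0 if there is none)
  idx <- findInterval(pos, genes$start)
  idx[idx == 0L] <- NA_integer_
  # The candidate gene only counts if pos is also <= its end
  hit <- !is.na(idx) & pos <= genes$end[idx]
  ifelse(hit, genes$Gene[idx], NA_character_)
}

# Process one chromosome at a time so no large intermediate table is built
m1$Gene <- NA_character_
for (chr in unique(m1$CHR)) {
  rows <- m1$CHR == chr
  m1$Gene[rows] <- lookup_gene(m1$POS[rows], m2[m2$CHR == chr, ])
}

m1
```

Unlike the subset in the answer, this keeps positions without a matching gene (row 4 here) and marks them NA.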

huangapple
  • Posted on March 4, 2023 at 01:07:36
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75629990.html