March 4, 2023 01:07:36

Lookup table in R by matching ranges

Question

I have a table:
    CHR POS
    10  4342
    20  100
    22  5422
I have another lookup table:
    CHR start end  Gene
    10  4000  5999 ABC1
    20  50    200  JHT
    22  5000  6000 KLO
Desired output:
    CHR POS   Gene
    10  4342  ABC1
    20  100   JHT
    22  5422  KLO
In reality there are about 700,000 entries in table 1 and roughly 60,000 genes in table 2. I need to match on chromosome and then require that POS from table 1 falls between start and end in table 2, adding a new column with the gene name.
I tried:
    library(dplyr)
    
    # create sample data
    df1 <- data.frame(chromosome = c("chr1", "chr1", "chr2", "chr3"), position = c(100, 200, 300, 400))
    df2 <- data.frame(chromosome = c("chr1", "chr2", "chr3"), start = c(50, 250, 350), end = c(150, 350, 450), gene = c("geneA", "geneB", "geneC"))
    
    # perform left join
    joined_df <- left_join(df1, df2, by = "chromosome")
    
    # create new column indicating if each row lies within a gene
    result_df <- joined_df %>%
                  mutate(in_gene = if_else(position >= start & position <= end, gene, NA_character_))
    
    # view result
    result_df
But the vector was too large to store.

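Answer 1

Score: 4

You can use GenomicRanges for something like this. See the commented-out code at the top of the block below for installation.

The GRanges class is a container for genomic locations and their associated annotations.

The function makeGRangesFromDataFrame takes a data.frame as input and automatically finds the columns that describe genomic ranges (by default, start and end or stop).

The code below uses the additional sample data provided in the question.

# If "BiocManager" is not installed, uncomment the following lines to install it:
# if (!require("BiocManager", quietly = TRUE))
#   install.packages("BiocManager")
# Install "GenomicRanges":
# BiocManager::install("GenomicRanges")
# Load the "GenomicRanges" package: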
library(GenomicRanges)
gr1 <- GRanges(seqnames = df1$chromosome,
               IRanges(start = df1$position, width = 1))
gr2 <- makeGRangesFromDataFrame(df2, keep.extra.columns = TRUE)
df1$gene <- NA
ovlp <- findOverlaps(gr1, gr2)
df1$gene[queryHits(ovlp)] <- gr2$gene[subjectHits(ovlp)]
df1

Output

      chromosome position  gene
    1       chr1      100 geneA
    2       chr1      200  <NA>
    3       chr2      300 geneB
    4       chr3      400 geneC

Answer 2

Score: 3

With dplyr 1.1.0 or later, we can use join_by for non-equi joins:

library(dplyr)
left_join(df1, df2, by = join_by(CHR, 
    closest(POS >= start), closest(POS <= end))) %>%
   select(-start, -end)

-output

 CHR  POS Gene
1  10 4342 ABC1
2  20  100  JHT
3  22 5422  KLO
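
The same range condition can also be written with the between() helper that join_by() provides for overlap joins. The following is a minimal sketch, assuming dplyr 1.1.0 or later and the small CHR/POS example from the question; the name res is only illustrative. Unlike closest(), between() keeps every gene whose range contains POS, so a position inside two overlapping genes would produce two rows.

```r
library(dplyr)  # assumes dplyr >= 1.1.0, where join_by() was introduced

# Example data from the question
df1 <- data.frame(CHR = c(10L, 20L, 22L), POS = c(4342L, 100L, 5422L))
df2 <- data.frame(CHR = c(10L, 20L, 22L),
                  start = c(4000L, 50L, 5000L),
                  end = c(5999L, 200L, 6000L),
                  Gene = c("ABC1", "JHT", "KLO"))

# Match on CHR, and require start <= POS <= end (inclusive bounds by default)
res <- left_join(df1, df2, by = join_by(CHR, between(POS, start, end))) %>%
  select(-start, -end)
res
```

Positions that fall in no gene are kept by the left join with Gene set to NA.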

Or with data.table:

library(data.table)
setDT(df1)[df2, Gene := i.Gene, on = .(CHR, POS >= start, POS <= end)]

-output

> df1
   CHR  POS Gene
1:  10 4342 ABC1
2:  20  100  JHT
3:  22 5422  KLO

Data

df1 <- structure(list(CHR = c(10L, 20L, 22L), POS = c(4342L, 100L, 5422L)), class = "data.frame", row.names = c(NA, -3L))
df2 <- structure(list(CHR = c(10L, 20L, 22L), start = c(4000L, 50L, 5000L), end = c(5999L, 200L, 6000L), Gene = c("ABC1", "JHT", "KLO")), class = "data.frame", row.names = c(NA, -3L))

Answer 3

Score: 3

Here is a variation using left_join:

library(dplyr)
df1 %>%
  left_join(df2, by="CHR") %>%
  filter(between(POS, start, end)) %>%
  select(-c(start, end))

Output

  CHR  POS Gene
1  10 4342 ABC1
2  20  100  JHT
3  22 5422  KLO

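This code performs a left_join on CHR, then filters the rows to those where POS lies between start and end, and finally drops the start and end columns.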
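
Joining on CHR and filtering afterwards builds every position/gene pair per chromosome before discarding most of them, which is the step that ran out of memory in the question. As a rough alternative, here is a base R sketch that avoids that intermediate table by looking up each position with findInterval(). It assumes that genes on the same chromosome do not overlap, which the question does not state; lookup_gene is a hypothetical helper name, and the extra row at POS 7000 is added only to show an unmatched position.

```r
# Sketch: per-chromosome interval lookup with base R's findInterval().
# Assumption (not from the question): genes on one chromosome do not overlap.
lookup_gene <- function(pos_df, gene_df) {
  out <- rep(NA_character_, nrow(pos_df))
  for (chr in intersect(unique(pos_df$CHR), unique(gene_df$CHR))) {
    rows  <- which(pos_df$CHR == chr)
    genes <- gene_df[gene_df$CHR == chr, , drop = FALSE]
    genes <- genes[order(genes$start), , drop = FALSE]
    pos   <- pos_df$POS[rows]
    # Index of the last gene whose start is <= POS (0 if there is none)
    idx <- findInterval(pos, genes$start)
    hit <- idx > 0
    # Keep a candidate only if POS is also at or before that gene's end
    hit[hit] <- pos[hit] <= genes$end[idx[hit]]
    out[rows[hit]] <- as.character(genes$Gene[idx[hit]])
  }
  out
}

df1 <- data.frame(CHR = c(10L, 20L, 22L, 22L),
                  POS = c(4342L, 100L, 5422L, 7000L))
df2 <- data.frame(CHR = c(10L, 20L, 22L),
                  start = c(4000L, 50L, 5000L),
                  end = c(5999L, 200L, 6000L),
                  Gene = c("ABC1", "JHT", "KLO"))

df1$Gene <- lookup_gene(df1, df2)
df1
```

Memory use here grows with the size of the two inputs instead of with the number of position/gene pairs per chromosome. If genes can overlap, this returns only the gene with the latest start at or before POS, so an overlap-aware method such as the GenomicRanges or data.table answers is the safer choice.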

Answer 4

Score: 2

Considering these two data frames,

m1
#   CHR  POS
# 1  10 4342
# 2  20  100
# 3  22 5422
# 4  11   10
m2
#   CHR start  end Gene
# 1  10  4000 5999 ABC1
# 2  20    50  200  JHT
# 3  22  5000 6000  KLO
# 4  11  5000 6000  KLO

you can do:

merge(m1, m2) |> {\(.) subset(., data.table::between(.$POS, .$start, .$end))}()
#   CHR  POS start  end Gene
# 1  10 4342  4000 5999 ABC1
# 2  20  100    50  200  JHT
# 3  22 5422  5000 6000  KLO

Data:

m1 <- structure(list(CHR = c(10L, 20L, 22L, 11L), POS = c(4342L, 100L, 
5422L, 10L)), class = "data.frame", row.names = c(NA, -4L))
m2 <- structure(list(CHR = c(10L, 20L, 22L, 11L), start = c(4000L, 
50L, 5000L, 5000L), end = c(5999L, 200L, 6000L, 6000L), Gene = c("ABC1", 
"JHT", "KLO", "KLO")), class = "data.frame", row.names = c(NA, 
-4L))
