Lookup table in R by matching ranges
Question
I have a table:
CHR POS
10 4342
20 100
22 5422
I have another lookup:
CHR start end Gene
10 4000 5999 ABC1
20 50 200 JHT
22 5000 6000 KLO
Desired output:
CHR POS Gene
10 4342 ABC1
20 100 JHT
22 5422 KLO
In reality there are 700,000 entries in table 1 and roughly 60,000 genes in table 2. I need to match on chromosome and then check that POS falls between start and end in table 2, so that I can add a new column with the gene name.
I tried:
library(dplyr)
# create sample data
df1 <- data.frame(chromosome = c("chr1", "chr1", "chr2", "chr3"), position = c(100, 200, 300, 400))
df2 <- data.frame(chromosome = c("chr1", "chr2", "chr3"), start = c(50, 250, 350), end = c(150, 350, 450), gene = c("geneA", "geneB", "geneC"))
# perform left join
joined_df <- left_join(df1, df2, by = "chromosome")
# create new column indicating if each row lies within a gene
result_df <- joined_df %>%
mutate(in_gene = if_else(position >= start & position <= end, gene, NA_character_))
# view result
result_df
But the vector was too large to store.
Answer 1
Score: 4
You can use GenomicRanges for something like this. See the commented-out code at the beginning of the block below for installation.
The GRanges class is a container for genomic locations and associated annotations.
The function makeGRangesFromDataFrame will take a data.frame as input and automatically find the columns that describe genomic ranges (default is start and end or stop).
The code below uses the additional sample data (df1 and df2) provided in the question.
# if (!require("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
# BiocManager::install("GenomicRanges")
library(GenomicRanges)
gr1 <- GRanges(seqnames = df1$chromosome,
IRanges(start = df1$position, width = 1))
gr2 <- makeGRangesFromDataFrame(df2, keep.extra.columns = TRUE)
df1$gene <- NA
ovlp <- findOverlaps(gr1, gr2)
df1$gene[queryHits(ovlp)] <- gr2$gene[subjectHits(ovlp)]
df1
Output
chromosome position gene
1 chr1 100 geneA
2 chr1 200 <NA>
3 chr2 300 geneB
4 chr3 400 geneC
Answer 2
Score: 3
With dplyr 1.1.0 or later, we can use join_by for non-equi joins:
library(dplyr)
left_join(df1, df2, by = join_by(CHR,
closest(POS >= start), closest(POS <= end))) %>%
select(-start, -end)
Output
CHR POS Gene
1 10 4342 ABC1
2 20 100 JHT
3 22 5422 KLO
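As a hedged alternative to the two closest() conditions, join_by also has a between() helper that expresses "POS lies within [start, end]" as a single condition. The sketch below assumes dplyr 1.1.0 or later and rebuilds the question's tables inline so it runs on its own. Unlike closest(), it keeps every gene whose range contains the position, so a position inside two overlapping genes yields two rows.

```r
library(dplyr)

# Sample tables from the question
df1 <- data.frame(CHR = c(10L, 20L, 22L), POS = c(4342L, 100L, 5422L))
df2 <- data.frame(CHR   = c(10L, 20L, 22L),
                  start = c(4000L, 50L, 5000L),
                  end   = c(5999L, 200L, 6000L),
                  Gene  = c("ABC1", "JHT", "KLO"))

# Equality on CHR plus an inclusive range condition on POS;
# positions with no matching gene are kept with Gene = NA
res <- left_join(df1, df2, by = join_by(CHR, between(POS, start, end))) %>%
  select(-start, -end)
res
```

Because the range condition is part of the join itself, the full per-chromosome cross product is not returned and filtered afterwards.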
Or with data.table
library(data.table)
setDT(df1)[df2, Gene := i.Gene, on = .(CHR, POS >= start, POS <= end)]
Output
> df1
CHR POS Gene
1: 10 4342 ABC1
2: 20 100 JHT
3: 22 5422 KLO
Data
df1 <- structure(list(CHR = c(10L, 20L, 22L), POS = c(4342L, 100L, 5422L
)), class = "data.frame", row.names = c(NA, -3L))
df2 <- structure(list(CHR = c(10L, 20L, 22L), start = c(4000L, 50L,
5000L), end = c(5999L, 200L, 6000L), Gene = c("ABC1", "JHT",
"KLO")), class = "data.frame", row.names = c(NA, -3L))
Answer 3
Score: 3
Here is a variation of left_join:
library(dplyr)
df1 %>%
left_join(df2, by="CHR") %>%
filter(between(POS, start, end)) %>%
select(-c(start, end))
CHR POS Gene
1 10 4342 ABC1
2 20 100 JHT
3 22 5422 KLO
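Note that joining on CHR alone first builds every position-gene pair within a chromosome and only then filters, which is the same intermediate table the question ran out of memory on. The filter() step also drops positions that fall in no gene. As a rough base R sketch that avoids the large intermediate, the lookup can be done one chromosome at a time with findInterval. The helper name lookup_gene is hypothetical (it is not from any of the answers), and the sketch assumes that genes on the same chromosome do not overlap.

```r
# Hypothetical helper: per-chromosome range lookup in base R.
# Assumption: genes on the same chromosome do not overlap.
lookup_gene <- function(pos_df, gene_df) {
  out <- rep(NA_character_, nrow(pos_df))
  for (chr in unique(pos_df$CHR)) {
    rows  <- which(pos_df$CHR == chr)
    genes <- gene_df[gene_df$CHR == chr, , drop = FALSE]
    if (nrow(genes) == 0L) next
    genes <- genes[order(genes$start), , drop = FALSE]
    pos <- pos_df$POS[rows]
    # Index of the last gene whose start is <= POS (0 if there is none)
    idx <- findInterval(pos, genes$start)
    hit <- idx > 0L
    # Keep a candidate only if POS is also at or before that gene's end
    hit[hit] <- pos[hit] <= genes$end[idx[hit]]
    out[rows[hit]] <- genes$Gene[idx[hit]]
  }
  out
}

# Small example; the chromosome 11 position lies outside its gene's range
m1 <- data.frame(CHR = c(10L, 20L, 22L, 11L),
                 POS = c(4342L, 100L, 5422L, 10L))
m2 <- data.frame(CHR   = c(10L, 20L, 22L, 11L),
                 start = c(4000L, 50L, 5000L, 5000L),
                 end   = c(5999L, 200L, 6000L, 6000L),
                 Gene  = c("ABC1", "JHT", "KLO", "KLO"))

m1$Gene <- lookup_gene(m1, m2)
m1
```

Unmatched positions keep their row and get NA in Gene, so the result always has the same number of rows as the input table.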
Answer 4
Score: 2
Considering these two data frames,
m1
# CHR POS
# 1 10 4342
# 2 20 100
# 3 22 5422
# 4 11 10
m2
# CHR start end Gene
# 1 10 4000 5999 ABC1
# 2 20 50 200 JHT
# 3 22 5000 6000 KLO
# 4 11 5000 6000 KLO
you can do:
merge(m1, m2) |> {\(.) subset(., data.table::between(.$POS, .$start, .$end))}()
# CHR POS start end Gene
# 1 10 4342 4000 5999 ABC1
# 2 20 100 50 200 JHT
# 3 22 5422 5000 6000 KLO
Data:
m1 <- structure(list(CHR = c(10L, 20L, 22L, 11L), POS = c(4342L, 100L,
5422L, 10L)), class = "data.frame", row.names = c(NA, -4L))
m2 <- structure(list(CHR = c(10L, 20L, 22L, 11L), start = c(4000L,
50L, 5000L, 5000L), end = c(5999L, 200L, 6000L, 6000L), Gene = c("ABC1",
"JHT", "KLO", "KLO")), class = "data.frame", row.names = c(NA,
-4L))