Find elements of a vector in a list and collect information in data frame
Question
From a vector of gene IDs, I want to identify which sublists contain one or more of the IDs, along with the associated p-value.
The result should be a list of data frames, each with one column of gene IDs, a second column indicating as a boolean whether the ID appears, and a third column with the p-value of that gene ID.
Thanks to Paul's question, I realized that a gene ID might be in several sublists, and the p-value would indicate its specificity. Accordingly, the expected result should be a list with one data frame per sublist, each including a boolean vector indicating whether the gene ID is present. Sorry for the edit.
Example data:
the_list <- list(
  a = data.frame(ids = c("ABB", "SDG", "SHD", "DUR"), pval = c(0.01, 0.03, 0.05, 0.05)),
  b = data.frame(ids = c("DYR", "LRH", "FPR", "FUR", "DCTWE", "IRN", "DRB"), pval = c(0.01, 0.03, 0.05, 0.05, 4, 5, 6)),
  c = data.frame(ids = c("SYR", "SDT", "DFN", "FRQ", "DFRR", "SDR"), pval = c(0.01, 0.03, 0.05, 0.05, 5, 2))
)
the_vector <- c("ABB", "FUR", "DFN")
expected_result <- list(
  a = data.frame(
    gene_ids = c("ABB", "FUR", "DFN"),
    in_list = c(TRUE, FALSE, FALSE),
    pval = c(0.01, NA, NA)),
  b = data.frame(
    gene_ids = c("ABB", "FUR", "DFN"),
    in_list = c(FALSE, TRUE, FALSE),
    pval = c(NA, 0.05, NA)),
  c = data.frame(
    gene_ids = c("ABB", "FUR", "DFN"),
    in_list = c(FALSE, FALSE, TRUE),
    pval = c(NA, NA, 0.05))
)
Answer 1
Score: 4
Base R solution:
# Build one data.frame per list element:
# res_list => list of data.frames
res_list <- lapply(
  the_list,
  function(x) {
    # Position of each gene ID in this sublist (NA if absent):
    # idx => integer vector
    idx <- match(the_vector, x$ids)
    # Assemble the result for this sublist:
    data.frame(
      gene_ids = the_vector,
      in_list = !is.na(idx),
      pval = x$pval[idx],  # named pval to match expected_result
      stringsAsFactors = FALSE,
      row.names = NULL
    )
  }
)
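To see why the match() approach works, here is a minimal sketch on sublist b alone. The variable names ids_b, pval_b, and pvals_for_vector are only for illustration and are not part of the answer. match() returns, for each element of the_vector, its position in the sublist's IDs, or NA when it is absent; indexing with NA then yields NA, which fills the p-value column for missing genes.

```r
# Sublist b from the question, reduced to plain vectors (illustrative names)
ids_b  <- c("DYR", "LRH", "FPR", "FUR", "DCTWE", "IRN", "DRB")
pval_b <- c(0.01, 0.03, 0.05, 0.05, 4, 5, 6)
the_vector <- c("ABB", "FUR", "DFN")

# Position of each requested ID in ids_b; NA where the ID is absent
idx <- match(the_vector, ids_b)
print(idx)               # NA  4 NA

# An ID is present exactly when its position is not NA
in_list <- !is.na(idx)
print(in_list)           # FALSE  TRUE FALSE

# Indexing with NA propagates NA, so absent IDs get an NA p-value
pvals_for_vector <- pval_b[idx]
print(pvals_for_vector)  # NA 0.05 NA
```

Note that match() only reports the first occurrence, so this relies on each ID appearing at most once within a sublist.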
Answer 2
Score: 2
Loop through the list of data.frames and merge:
lapply(the_list, function(i) {
  res <- merge(data.frame(ids = the_vector), i, all.x = TRUE)
  res$in_list <- !is.na(res$pval)
  res
})
# $a
# ids pval in_list
# 1 ABB 0.01 TRUE
# 2 DFN NA FALSE
# 3 FUR NA FALSE
#
# $b
# ids pval in_list
# 1 ABB NA FALSE
# 2 DFN NA FALSE
# 3 FUR 0.05 TRUE
#
# $c
# ids pval in_list
# 1 ABB NA FALSE
# 2 DFN 0.05 TRUE
# 3 FUR NA FALSE
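Two details of the merge() output are worth noting: the rows come back sorted by ids rather than in the order of the_vector, and in_list is inferred from a missing p-value, which would misreport a gene that is present but has an NA p-value. A possible variant, offered as a sketch rather than as the answerer's method, restores the original order and tests membership directly. It assumes each ID occurs at most once per sublist.

```r
the_list <- list(
  a = data.frame(ids = c("ABB", "SDG", "SHD", "DUR"),
                 pval = c(0.01, 0.03, 0.05, 0.05), stringsAsFactors = FALSE),
  b = data.frame(ids = c("DYR", "LRH", "FPR", "FUR", "DCTWE", "IRN", "DRB"),
                 pval = c(0.01, 0.03, 0.05, 0.05, 4, 5, 6), stringsAsFactors = FALSE),
  c = data.frame(ids = c("SYR", "SDT", "DFN", "FRQ", "DFRR", "SDR"),
                 pval = c(0.01, 0.03, 0.05, 0.05, 5, 2), stringsAsFactors = FALSE)
)
the_vector <- c("ABB", "FUR", "DFN")

res <- lapply(the_list, function(df) {
  # Left join: keep every requested ID, with NA p-values where unmatched
  out <- merge(data.frame(ids = the_vector, stringsAsFactors = FALSE),
               df, by = "ids", all.x = TRUE)
  # Restore the order of the_vector (merge() sorts by the key column)
  out <- out[match(the_vector, out$ids), ]
  rownames(out) <- NULL
  # Test membership directly instead of relying on a missing p-value
  out$in_list <- out$ids %in% df$ids
  out
})

print(res$b)
```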
Answer 3
Score: 1
library(dplyr)  # provides %>%, bind_rows() and filter()
bind_rows(the_list, .id = "in_list") %>%
  filter(ids %in% the_vector)
which gives:
in_list ids pval
1 a ABB 0.01
2 b FUR 0.05
3 c DFN 0.05
Edit: this works with multiple matches too. It returns one row per match, even if the same ID has already matched in another sublist. You can then order by in_list or ids as appropriate.
This no longer conforms to the desired output since the question was edited, but I will leave it here as it's simple and contains the same information.
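The multi-match behaviour can be illustrated with a base R sketch of the same long-format idea. This is not the dplyr code above but an equivalent written without package dependencies, and the_list2 is hypothetical data invented here so that FUR occurs in two sublists.

```r
# Hypothetical data: FUR appears in both sublist a and sublist b
the_list2 <- list(
  a = data.frame(ids = c("ABB", "FUR"), pval = c(0.01, 0.20), stringsAsFactors = FALSE),
  b = data.frame(ids = c("DYR", "FUR"), pval = c(0.01, 0.05), stringsAsFactors = FALSE),
  c = data.frame(ids = c("SYR", "DFN"), pval = c(0.01, 0.05), stringsAsFactors = FALSE)
)
the_vector <- c("ABB", "FUR", "DFN")

# Stack the sublists into one long data frame, tagging rows with the sublist name
long <- do.call(rbind, Map(function(df, nm) {
  data.frame(in_list = nm, df, stringsAsFactors = FALSE)
}, the_list2, names(the_list2)))

# Keep only the requested IDs; one row per (sublist, ID) match
hits <- long[long$ids %in% the_vector, ]

# Order by ID, then by sublist, so repeated IDs sit next to each other
hits <- hits[order(hits$ids, hits$in_list), ]
rownames(hits) <- NULL
print(hits)
```

Here FUR yields two rows, one for a and one for b, so the p-values can be compared across sublists directly.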