Find elements of a vector in a list and collect information in a data frame

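Question

From a vector of gene IDs, I want to identify which sublists contain one or more of the IDs, together with the associated p-value. The result should be a list of data frames, each with one column of gene IDs, a second column indicating as a boolean whether the ID appears in that sublist, and a third column with the p-value of that gene ID.

Thanks to Paul's question, I realized that a gene ID might appear in several sublists, and the p-value would indicate its specificity. Accordingly, the expected result should be a list with one data frame per sublist, including a boolean vector indicating whether each gene ID is present. Sorry for the edit.

# Example data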

the_list <- list(
  a = data.frame(ids = c("ABB", 'SDG', 'SHD', 'DUR'), pval = c(0.01, 0.03, 0.05, 0.05)),
  b = data.frame(ids =c('DYR' ,'LRH' ,'FPR', 'FUR', 'DCTWE', 'IRN', 'DRB'), pval = c(0.01, 0.03, 0.05, 0.05, 4, 5, 6)),
  c = data.frame(ids =c('SYR' ,'SDT', 'DFN' ,'FRQ' ,'DFRR', 'SDR'), pval = c(0.01, 0.03, 0.05, 0.05, 5, 2))
)

the_vector <- c("ABB", 'FUR', 'DFN')

expected_result <- list(
  'a' = data.frame(
    gene_ids = c("ABB", 'FUR', 'DFN'),
    in_list = c(TRUE, FALSE, FALSE),
    pval = c(0.01, NA, NA)
  ),
  'b' = data.frame(
    gene_ids = c("ABB", 'FUR', 'DFN'),
    in_list = c(FALSE, TRUE, FALSE),
    pval = c(NA, 0.05, NA)
  ),
  'c' = data.frame(
    gene_ids = c("ABB", 'FUR', 'DFN'),
    in_list = c(FALSE, FALSE, TRUE),
    pval = c(NA, NA, 0.05)
  )
)

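Answer 1

Score: 4

Base R solution:

# For each list element, build one data.frame:
# res_list => list of data.frames
res_list <- lapply(
  the_list,
  function(x) {
    # Position of each requested ID in this sublist (NA if absent):
    # idx => integer vector
    idx <- match(the_vector, x$ids)
    # Assemble the result for this sublist,
    # one row per element of the_vector:
    data.frame(
      gene_ids = the_vector,
      in_list = !is.na(idx),
      pval = x$pval[idx],
      stringsAsFactors = FALSE,
      row.names = NULL
    )
  }
)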
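
To see why `match()` is enough here, the following is a minimal sketch that runs the same idea on the question's example data. The helper name `find_ids` is hypothetical and only used for illustration; it is not part of the answer above.

```r
# Example data copied from the question.
the_list <- list(
  a = data.frame(ids = c("ABB", "SDG", "SHD", "DUR"),
                 pval = c(0.01, 0.03, 0.05, 0.05)),
  b = data.frame(ids = c("DYR", "LRH", "FPR", "FUR", "DCTWE", "IRN", "DRB"),
                 pval = c(0.01, 0.03, 0.05, 0.05, 4, 5, 6)),
  c = data.frame(ids = c("SYR", "SDT", "DFN", "FRQ", "DFRR", "SDR"),
                 pval = c(0.01, 0.03, 0.05, 0.05, 5, 2))
)
the_vector <- c("ABB", "FUR", "DFN")

# find_ids() is a hypothetical wrapper around the match() approach.
find_ids <- function(df, ids) {
  idx <- match(ids, df$ids)   # position of each id in df$ids, NA if absent
  data.frame(
    gene_ids = ids,
    in_list = !is.na(idx),
    pval = df$pval[idx],      # indexing with NA yields NA
    stringsAsFactors = FALSE
  )
}

res_list <- lapply(the_list, find_ids, ids = the_vector)
res_list$b
```

Note that `match()` returns only the first position of each ID, so if an ID could be repeated within a single sublist, only its first p-value would be reported.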

Answer 2

Score: 2

Loop through the list of data frames and merge:

lapply(the_list, function(i){
  res <- merge(data.frame(ids = the_vector), i, all.x = TRUE)
  res$in_list <- !is.na(res$pval)
  res
})
# $a
#   ids pval in_list
# 1 ABB 0.01    TRUE
# 2 DFN   NA   FALSE
# 3 FUR   NA   FALSE
# 
# $b
#   ids pval in_list
# 1 ABB   NA   FALSE
# 2 DFN   NA   FALSE
# 3 FUR 0.05    TRUE
# 
# $c
#   ids pval in_list
# 1 ABB   NA   FALSE
# 2 DFN 0.05    TRUE
# 3 FUR   NA   FALSE
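
The output above is sorted alphabetically by `ids` because `merge()` sorts on the key by default, and the column is called `ids` rather than `gene_ids`. If the row order of `the_vector` and the column layout from the question matter, a possible follow-up step is sketched below for a single sublist. The shortened sublist `sub_b` is an assumption made to keep the example small.

```r
the_vector <- c("ABB", "FUR", "DFN")
# Shortened stand-in for sublist b (assumption for illustration).
sub_b <- data.frame(ids = c("DYR", "LRH", "FPR", "FUR"),
                    pval = c(0.01, 0.03, 0.05, 0.05),
                    stringsAsFactors = FALSE)

res <- merge(data.frame(ids = the_vector, stringsAsFactors = FALSE),
             sub_b, all.x = TRUE)
res$in_list <- !is.na(res$pval)

# merge() sorted the rows by ids; restore the order of the_vector.
res <- res[match(the_vector, res$ids), ]
# Rename and reorder the columns to match the question's layout.
names(res)[names(res) == "ids"] <- "gene_ids"
res <- res[, c("gene_ids", "in_list", "pval")]
rownames(res) <- NULL
res
```

One caveat of deriving `in_list` from `is.na(res$pval)`: an ID that is present in the sublist but has a missing p-value would be reported as absent.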

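Answer 3

Score: 1

library(dplyr)

bind_rows(the_list, .id = "in_list") %>%
  filter(ids %in% the_vector)

gives

  in_list ids pval
1       a ABB 0.01
2       b FUR 0.05
3       c DFN 0.05

Edit: this works with multiple matches too. It gives one row per match, even if the same ID has already matched in another sublist; you can then order by in_list or ids as appropriate.

This no longer conforms to the desired output since the question was edited, but I will leave it here as it is simple and contains the same information.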

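
For readers without dplyr, the same long-format idea can be sketched in base R. The data below is made up so that "ABB" occurs in two sublists, which illustrates the one-row-per-match behaviour described above; it is not the question's data.

```r
# Made-up example in which "ABB" occurs in two sublists.
the_list <- list(
  a = data.frame(ids = c("ABB", "SDG"), pval = c(0.01, 0.03)),
  b = data.frame(ids = c("FUR", "ABB"), pval = c(0.05, 0.20))
)
the_vector <- c("ABB", "FUR", "DFN")

# Stack the sublists, recording each row's sublist name in in_list.
long <- do.call(rbind, Map(
  function(df, nm) data.frame(in_list = nm, df, stringsAsFactors = FALSE),
  the_list, names(the_list)
))

# Keep only the rows whose ID was requested.
hits <- long[long$ids %in% the_vector, ]
rownames(hits) <- NULL
hits

# IDs that were requested but found in no sublist.
setdiff(the_vector, hits$ids)
```

As with the dplyr version, IDs that match nowhere (here "DFN") simply produce no row, so absence has to be inferred, for example with `setdiff()`.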

huangapple
  • Posted on June 13, 2023, 16:56:56
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76463247.html