Find elements of a vector in a list and collect information in a data frame

Question

Given a vector of gene IDs, I want to identify which sublists contain one or more of the IDs, together with the associated p-value.
The result should be a list of data frames, each with one column for the gene IDs, a second column indicating as a boolean whether the ID appears, and a third column with the p-value of that gene ID.

Thanks to Paul's question, I realized that a gene ID might occur in several sublists, and the p-value would indicate its specificity. Accordingly, the expected result should be a list with one data frame per sublist, each containing a boolean vector indicating whether the gene ID is present. Sorry for the edit.

Example data:

    the_list <- list(
      a = data.frame(ids = c("ABB", "SDG", "SHD", "DUR"),
                     pval = c(0.01, 0.03, 0.05, 0.05)),
      b = data.frame(ids = c("DYR", "LRH", "FPR", "FUR", "DCTWE", "IRN", "DRB"),
                     pval = c(0.01, 0.03, 0.05, 0.05, 4, 5, 6)),
      c = data.frame(ids = c("SYR", "SDT", "DFN", "FRQ", "DFRR", "SDR"),
                     pval = c(0.01, 0.03, 0.05, 0.05, 5, 2))
    )

    the_vector <- c("ABB", "FUR", "DFN")

    expected_result <- list(
      a = data.frame(
        gene_ids = c("ABB", "FUR", "DFN"),
        in_list = c(TRUE, FALSE, FALSE),
        pval = c(0.01, NA, NA)),
      b = data.frame(
        gene_ids = c("ABB", "FUR", "DFN"),
        in_list = c(FALSE, TRUE, FALSE),
        pval = c(NA, 0.05, NA)),
      c = data.frame(
        gene_ids = c("ABB", "FUR", "DFN"),
        in_list = c(FALSE, FALSE, TRUE),
        pval = c(NA, NA, 0.05))
    )

Answer 1

Score: 4

Base R solution:

    # For each list element, build one data.frame:
    # res_list => list of data.frames
    res_list <- lapply(
      the_list,
      function(x) {
        # Position of each queried ID in this sublist (NA if absent):
        # idx => integer vector
        idx <- match(the_vector, x$ids)
        # Assemble the required data.frame, one row per queried ID:
        data.frame(
          gene_ids = the_vector,
          in_list = !is.na(idx),
          p_val = x$pval[idx],
          stringsAsFactors = FALSE,
          row.names = NULL
        )
      }
    )
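To see what the `match()` approach returns, the sketch below wraps the same logic in a small helper and runs it on the example data from the question. The helper name `lookup_ids` is hypothetical, and the third column is named `pval` (rather than `p_val`) only so that it lines up with the question's `expected_result`.

```r
# Example data, repeated from the question so the block runs on its own.
the_list <- list(
  a = data.frame(ids = c("ABB", "SDG", "SHD", "DUR"),
                 pval = c(0.01, 0.03, 0.05, 0.05)),
  b = data.frame(ids = c("DYR", "LRH", "FPR", "FUR", "DCTWE", "IRN", "DRB"),
                 pval = c(0.01, 0.03, 0.05, 0.05, 4, 5, 6)),
  c = data.frame(ids = c("SYR", "SDT", "DFN", "FRQ", "DFRR", "SDR"),
                 pval = c(0.01, 0.03, 0.05, 0.05, 5, 2))
)
the_vector <- c("ABB", "FUR", "DFN")

# Hypothetical helper: one row per queried ID, in the order of `ids`.
lookup_ids <- function(df, ids) {
  idx <- match(ids, df$ids)  # NA where an ID is absent from this sublist
  data.frame(
    gene_ids = ids,
    in_list = !is.na(idx),
    pval = df$pval[idx],     # indexing with NA yields NA
    stringsAsFactors = FALSE
  )
}

res_list <- lapply(the_list, lookup_ids, ids = the_vector)
res_list$b
```

Here `res_list$b` flags only FUR as present, with a p-value of 0.05. One caveat: `match()` returns the first position only, so if an ID were repeated within a single sublist, only its first p-value would be reported.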

Answer 2

Score: 2

Loop through the list of data.frames and merge:

    lapply(the_list, function(i) {
      res <- merge(data.frame(ids = the_vector), i, all.x = TRUE)
      res$in_list <- !is.na(res$pval)
      res
    })
    # $a
    #   ids pval in_list
    # 1 ABB 0.01    TRUE
    # 2 DFN   NA   FALSE
    # 3 FUR   NA   FALSE
    #
    # $b
    #   ids pval in_list
    # 1 ABB   NA   FALSE
    # 2 DFN   NA   FALSE
    # 3 FUR 0.05    TRUE
    #
    # $c
    #   ids pval in_list
    # 1 ABB   NA   FALSE
    # 2 DFN 0.05    TRUE
    # 3 FUR   NA   FALSE
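Note that `merge()` sorts the rows by the key column, so the output above is ordered ABB, DFN, FUR rather than in the order of `the_vector`, and the column names differ from the question's `expected_result`. If the original order and layout matter, a minimal sketch of one way to restore them (the reordering and renaming steps are my addition, not part of the answer) could look like this:

```r
# Example data, repeated from the question so the block runs on its own.
the_list <- list(
  a = data.frame(ids = c("ABB", "SDG", "SHD", "DUR"),
                 pval = c(0.01, 0.03, 0.05, 0.05)),
  b = data.frame(ids = c("DYR", "LRH", "FPR", "FUR", "DCTWE", "IRN", "DRB"),
                 pval = c(0.01, 0.03, 0.05, 0.05, 4, 5, 6)),
  c = data.frame(ids = c("SYR", "SDT", "DFN", "FRQ", "DFRR", "SDR"),
                 pval = c(0.01, 0.03, 0.05, 0.05, 5, 2))
)
the_vector <- c("ABB", "FUR", "DFN")

res_list <- lapply(the_list, function(df) {
  # Left join: keep every queried ID, with NA where it has no match.
  res <- merge(data.frame(ids = the_vector), df, by = "ids", all.x = TRUE)
  res$in_list <- !is.na(res$pval)
  # merge() sorted the rows by `ids`; put them back in the order of
  # the_vector and arrange the columns as in the question.
  res <- res[match(the_vector, res$ids), c("ids", "in_list", "pval")]
  names(res)[1] <- "gene_ids"
  rownames(res) <- NULL
  res
})
res_list$c
```

This assumes each ID occurs at most once per sublist. It also inherits a limitation of the merge approach: because `in_list` is derived from `is.na(pval)`, an ID that is present but has a missing p-value would be reported as absent.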

Answer 3

Score: 1

    library(dplyr)  # provides bind_rows(), filter() and %>%

    bind_rows(the_list, .id = "in_list") %>%
      filter(ids %in% the_vector)

This gives:

      in_list ids pval
    1       a ABB 0.01
    2       b FUR 0.05
    3       c DFN 0.05

Edit: this works with multiple matches too. It gives one row per match, even if the same ID has already matched in another sublist; you can then order by in_list or ids as appropriate.

This no longer conforms to the desired output since the question was edited, but I will leave it here as it's simple and contains the same information.
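To illustrate the multi-match behaviour without requiring dplyr, here is a base R sketch of the same long-format idea. It is an analogue rather than the answer's code, and the extra row placing ABB in sublist `c` with a p-value of 0.2 is hypothetical, added only to produce a second match.

```r
# Example data from the question, plus one HYPOTHETICAL row:
# ABB also appears in sublist `c` with a made-up p-value of 0.2.
the_list <- list(
  a = data.frame(ids = c("ABB", "SDG", "SHD", "DUR"),
                 pval = c(0.01, 0.03, 0.05, 0.05)),
  b = data.frame(ids = c("DYR", "LRH", "FPR", "FUR", "DCTWE", "IRN", "DRB"),
                 pval = c(0.01, 0.03, 0.05, 0.05, 4, 5, 6)),
  c = data.frame(ids = c("SYR", "SDT", "DFN", "FRQ", "DFRR", "SDR", "ABB"),
                 pval = c(0.01, 0.03, 0.05, 0.05, 5, 2, 0.2))
)
the_vector <- c("ABB", "FUR", "DFN")

# Stack the sublists into one long data frame, recording the sublist
# name in `in_list` (the role played by `.id` in bind_rows()).
long <- do.call(rbind, Map(
  function(df, nm) data.frame(in_list = nm, df, stringsAsFactors = FALSE),
  the_list, names(the_list)
))

# Keep only the queried IDs, then order by ID and sublist.
hits <- long[long$ids %in% the_vector, ]
hits <- hits[order(hits$ids, hits$in_list), ]
rownames(hits) <- NULL
hits
```

With the hypothetical extra row, `hits` has four rows: ABB appears twice (once for `a`, once for `c`), which makes it easy to compare the p-values of one gene ID across sublists.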

huangapple
  • Posted on June 13, 2023, 16:56:56
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76463247.html