2023-06-13 16:56:56


Find elements of a vector in a list and collect information in data frame

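Question

From a vector of gene IDs, I want to identify which sublists contain one or more of the IDs, along with the associated p-values. The result should be a list of data frames, each with one column of gene IDs, a second column indicating as a boolean whether the ID appears, and a third column with the p-value of that gene ID.

Thanks to Paul's question, I realized that a gene ID might appear in several sublists, with the p-value indicating its specificity. Accordingly, the expected result should be a list with one data frame per sublist, including a boolean vector indicating whether each gene ID is present. Sorry for the edit.

# Example data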
the_list <- list(
  a = data.frame(ids = c("ABB", 'SDG', 'SHD', 'DUR'), pval = c(0.01, 0.03, 0.05, 0.05)),
  b = data.frame(ids =c('DYR' ,'LRH' ,'FPR', 'FUR', 'DCTWE', 'IRN', 'DRB'), pval = c(0.01, 0.03, 0.05, 0.05, 4, 5, 6)),
  c = data.frame(ids =c('SYR' ,'SDT', 'DFN' ,'FRQ' ,'DFRR', 'SDR'), pval = c(0.01, 0.03, 0.05, 0.05, 5, 2))
)
the_vector <- c("ABB", 'FUR', 'DFN')
expected_result <- list(
  'a' = data.frame(
    gene_ids = c("ABB", 'FUR', 'DFN'),
    in_list = c(TRUE, FALSE, FALSE),
    pval = c(0.01, NA, NA)
  ),
  'b' = data.frame(
    gene_ids = c("ABB", 'FUR', 'DFN'),
    in_list = c(FALSE, TRUE, FALSE),
    pval = c(NA, 0.05, NA)
  ),
  'c' = data.frame(
    gene_ids = c("ABB", 'FUR', 'DFN'),
    in_list = c(FALSE, FALSE, TRUE),
    pval = c(NA, NA, 0.05)
  )
)


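Answer 1

Score: 4

Base R solution:

# For each list element, build one data.frame:
# res_list => list of data.frames
res_list <- lapply(
  the_list,
  function(x){
    # Position of each queried ID in this sublist (NA if absent):
    # idx => integer vector
    idx <- match(the_vector, x$ids)
    # One row per queried ID, in the order of the_vector:
    data.frame(
      gene_ids = the_vector,
      in_list = !(is.na(idx)),
      pval = x$pval[idx],  # named pval to match expected_result
      stringsAsFactors = FALSE,
      row.names = NULL
    )
  }
)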
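
The edited question notes that a gene ID may appear in several sublists. A minimal sketch of how the `match()` approach behaves in that case follows; `dup_list` and `query` are hypothetical stand-ins for `the_list` and `the_vector`, with "ABB" deliberately placed in both sublists.

```r
# Hypothetical data: "ABB" appears in both sublists with different p-values.
dup_list <- list(
  a = data.frame(ids = c("ABB", "SDG"), pval = c(0.01, 0.03)),
  b = data.frame(ids = c("FUR", "ABB"), pval = c(0.05, 0.20))
)
query <- c("ABB", "FUR", "DFN")

dup_res <- lapply(dup_list, function(x) {
  idx <- match(query, x$ids)  # position of each queried ID, NA if absent
  data.frame(
    gene_ids = query,
    in_list = !is.na(idx),
    pval = x$pval[idx],       # NA index yields NA p-value
    stringsAsFactors = FALSE
  )
})

dup_res$a$pval  # 0.01 NA NA
dup_res$b$pval  # 0.20 0.05 NA
```

Each sublist is handled independently, so an ID shared across sublists gets its own p-value in each data frame. Note that `match()` returns only the first position, so if an ID were repeated within a single sublist, only its first p-value would be reported.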


Answer 2

Score: 2

Loop through the list of data.frames and merge:

lapply(the_list, function(i){
  res <- merge(data.frame(ids = the_vector), i, all.x = TRUE)
  res$in_list <- !is.na(res$pval)
  res
})
# $a
#   ids pval in_list
# 1 ABB 0.01    TRUE
# 2 DFN   NA   FALSE
# 3 FUR   NA   FALSE
# 
# $b
#   ids pval in_list
# 1 ABB   NA   FALSE
# 2 DFN   NA   FALSE
# 3 FUR 0.05    TRUE
# 
# $c
#   ids pval in_list
# 1 ABB   NA   FALSE
# 2 DFN 0.05    TRUE
# 3 FUR   NA   FALSE


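Answer 3

Score: 1

library(dplyr)  # provides %>% and filter()

dplyr::bind_rows(the_list, .id = "in_list") %>%
  filter(ids %in% the_vector)

gives

  in_list ids pval
1       a ABB 0.01
2       b FUR 0.05
3       c DFN 0.05

Edit: this works with multiple matches too. It gives one row per match, even if the same ID has already matched in another sublist; you can then order by in_list or ids as appropriate.

This no longer conforms to the desired output since the question was edited, but I will leave it here as it's simple and contains the same information.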
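
For readers without dplyr, the same long-format idea can be sketched in base R. This is an illustrative equivalent, not part of the original answer; `dup_list` and `query` are hypothetical stand-ins in which "ABB" occurs in two sublists, to show the one-row-per-match behaviour.

```r
# Hypothetical data: "ABB" appears in both sublists.
dup_list <- list(
  a = data.frame(ids = c("ABB", "SDG"), pval = c(0.01, 0.03)),
  b = data.frame(ids = c("FUR", "ABB"), pval = c(0.05, 0.20))
)
query <- c("ABB", "FUR", "DFN")

# Tag each sublist's rows with the sublist name.
tagged <- Map(
  function(df, nm) data.frame(in_list = nm, df, stringsAsFactors = FALSE),
  dup_list, names(dup_list)
)

# Stack everything, then keep only the queried IDs.
long <- do.call(rbind, tagged)
long <- long[long$ids %in% query, ]
rownames(long) <- NULL
long
```

Here "ABB" yields two rows, one for each sublist that contains it, while "DFN" yields none because it is absent from both.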

