June 18, 2023 21:57:34

Filtering columns satisfying abundance distribution among samples in groups in R

Question

I have a data frame that looks like this:

         ColumnA ColumnB  ColumnC group
Sample1     0.05 0.00042     0.12     X
Sample2      0.2 0.00084    0.223     X
Sample3     0.04 0.00045    0.063     X
Sample4    1e-04    0.13    0.013     X
Sample5     0.03  0.0058    3e-04     Y
Sample6     0.05  0.0066  0.00021     Y
Sample7    0.006   0.022  3.2e-06     Y
Sample8    8e-04   0.101  0.00063     Y
Sample9    4e-04   2e-04 0.000233     Z
Sample10   0.008   5e-04  0.00075     Z
Sample11    0.02 8.6e-05  0.00076     Z
Sample12   0.035 4.3e-05  0.00044     Z

I would like to retain a column if, in at least one group, at least 3 of the 4 samples have an abundance greater than 0.01 (Columns A and C in the example above).

Can you please help me find a solution using the dplyr package?

Thank you
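
To make the example reproducible, the table above can be rebuilt roughly as follows. This is a minimal sketch: the object name df is an assumption borrowed from the answers, and the base R count at the end is only there to show which group and column combinations meet the "at least 3 samples above 0.01" rule.

```r
# Rebuild the example table; the object name `df` is an assumption
# borrowed from the answers, not stated in the question.
df <- data.frame(
  ColumnA = c(0.05, 0.2, 0.04, 1e-04, 0.03, 0.05, 0.006, 8e-04,
              4e-04, 0.008, 0.02, 0.035),
  ColumnB = c(0.00042, 0.00084, 0.00045, 0.13, 0.0058, 0.0066, 0.022, 0.101,
              2e-04, 5e-04, 8.6e-05, 4.3e-05),
  ColumnC = c(0.12, 0.223, 0.063, 0.013, 3e-04, 0.00021, 3.2e-06, 0.00063,
              0.000233, 0.00075, 0.00076, 0.00044),
  group = rep(c("X", "Y", "Z"), each = 4),
  row.names = paste0("Sample", 1:12)
)

# Count, per group and column, how many samples exceed 0.01.
# The unary `+` turns the logical matrix into 0/1 integers so rowsum() can add them.
counts <- rowsum(+(df[c("ColumnA", "ColumnB", "ColumnC")] > 0.01), df$group)
counts

# A column qualifies when any group reaches a count of 3 or more.
colnames(counts)[apply(counts >= 3, 2, any)]
```

With this data, group X reaches 3 for ColumnA and 4 for ColumnC, while ColumnB never exceeds 2 in any group, which matches the expected result of keeping Columns A and C.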

Answer 1

Score: 1

You can summarise each column by group, use select + where to pick the columns that pass the check in any group, and store the names of those columns in a vector. Then use select on the original data set:

library(dplyr)  # the .by argument used below requires dplyr >= 1.1.0
nms <- 
  df %>% 
  # Within each group, check whether at least 3 values in each selected column exceed 0.01
  summarise(across(ColumnA:ColumnC, ~ sum(.x > 0.01) >= 3), .by = group) %>% 
  # Keep the logical columns where any group satisfies the condition, then extract their names
  select(where(~ is.logical(.x) && any(.x))) %>% names()

#> nms
#> [1] "ColumnA" "ColumnC"

df %>% 
  select(all_of(nms))

#          ColumnA  ColumnC
# Sample1   0.0500 1.20e-01
# Sample2   0.2000 2.23e-01
# Sample3   0.0400 6.30e-02
# Sample4   0.0001 1.30e-02
# Sample5   0.0300 3.00e-04
# Sample6   0.0500 2.10e-04
# Sample7   0.0060 3.20e-06
# Sample8   0.0008 6.30e-04
# Sample9   0.0004 2.33e-04
# Sample10  0.0080 7.50e-04
# Sample11  0.0200 7.60e-04
# Sample12  0.0350 4.40e-04
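
The result above drops the group column. If it should be kept, and the threshold and minimum count should be adjustable, the same idea could be wrapped in a small helper. This is only a sketch: keep_abundant and its arguments threshold and min_n are hypothetical names, not part of dplyr, and it assumes the grouping column is called group and that missing values should simply not count as passing.

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for the .by argument

# Example data rebuilt from the question (the name `df` is an assumption).
df <- data.frame(
  ColumnA = c(0.05, 0.2, 0.04, 1e-04, 0.03, 0.05, 0.006, 8e-04,
              4e-04, 0.008, 0.02, 0.035),
  ColumnB = c(0.00042, 0.00084, 0.00045, 0.13, 0.0058, 0.0066, 0.022, 0.101,
              2e-04, 5e-04, 8.6e-05, 4.3e-05),
  ColumnC = c(0.12, 0.223, 0.063, 0.013, 3e-04, 0.00021, 3.2e-06, 0.00063,
              0.000233, 0.00075, 0.00076, 0.00044),
  group = rep(c("X", "Y", "Z"), each = 4),
  row.names = paste0("Sample", 1:12)
)

# Hypothetical helper: keep numeric columns in which at least `min_n` values
# exceed `threshold` within at least one group, plus the `group` column itself.
keep_abundant <- function(data, threshold = 0.01, min_n = 3) {
  keep <- data %>%
    # One row per group; TRUE where the group passes for that column.
    # na.rm = TRUE means missing values never count towards min_n.
    summarise(
      across(where(is.numeric), ~ sum(.x > threshold, na.rm = TRUE) >= min_n),
      .by = group
    ) %>%
    # The character `group` column is not logical, so it is excluded here.
    select(where(~ is.logical(.x) && any(.x))) %>%
    names()

  data %>% select(all_of(keep), group)
}

result <- keep_abundant(df)
names(result)
```

Using where(is.numeric) instead of ColumnA:ColumnC avoids hard-coding the column range, at the cost of assuming every numeric column is an abundance column.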

Answer 2

Score: 0

For those who are interested, here is a base R approach (the \(x) lambda shorthand requires R 4.1 or later):

selec <- colSums(aggregate(. ~ group, df, \(x) sum(x > 0.01) >= 3)[-1]) > 0

df[,names(selec)[selec]]
         ColumnA  ColumnC
Sample1   0.0500 1.20e-01
Sample2   0.2000 2.23e-01
Sample3   0.0400 6.30e-02
Sample4   0.0001 1.30e-02
Sample5   0.0300 3.00e-04
Sample6   0.0500 2.10e-04
Sample7   0.0060 3.20e-06
Sample8   0.0008 6.30e-04
Sample9   0.0004 2.33e-04
Sample10  0.0080 7.50e-04
Sample11  0.0200 7.60e-04
Sample12  0.0350 4.40e-04
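
The one-liner above relies on the formula interface of aggregate, which applies the function to every column other than group and, by default, drops rows containing NA. A possible variant that spells out the steps with tapply and only looks at numeric columns is sketched below; the names num_cols and passes are illustrative, and df is rebuilt from the question so the block runs on its own.

```r
# Example data rebuilt from the question (the name `df` is an assumption).
df <- data.frame(
  ColumnA = c(0.05, 0.2, 0.04, 1e-04, 0.03, 0.05, 0.006, 8e-04,
              4e-04, 0.008, 0.02, 0.035),
  ColumnB = c(0.00042, 0.00084, 0.00045, 0.13, 0.0058, 0.0066, 0.022, 0.101,
              2e-04, 5e-04, 8.6e-05, 4.3e-05),
  ColumnC = c(0.12, 0.223, 0.063, 0.013, 3e-04, 0.00021, 3.2e-06, 0.00063,
              0.000233, 0.00075, 0.00076, 0.00044),
  group = rep(c("X", "Y", "Z"), each = 4),
  row.names = paste0("Sample", 1:12)
)

# Restrict the check to numeric columns so `group` is never tested.
num_cols <- names(df)[sapply(df, is.numeric)]

# For each numeric column: count values above 0.01 per group with tapply(),
# then flag the column if any group has at least 3 such values.
passes <- sapply(df[num_cols], function(x) {
  any(tapply(x > 0.01, df$group, sum) >= 3)
})
passes

# drop = FALSE keeps a data frame even if only one column qualifies.
kept <- df[, names(passes)[passes], drop = FALSE]
names(kept)
```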
