Filtering columns satisfying abundance distribution among samples in groups in R

Question

I have a data frame that looks like this:

         ColumnA ColumnB  ColumnC group
Sample1     0.05 0.00042     0.12     X
Sample2      0.2 0.00084    0.223     X
Sample3     0.04 0.00045    0.063     X
Sample4    1e-04    0.13    0.013     X
Sample5     0.03  0.0058    3e-04     Y
Sample6     0.05  0.0066  0.00021     Y
Sample7    0.006   0.022  3.2e-06     Y
Sample8    8e-04   0.101  0.00063     Y
Sample9    4e-04   2e-04 0.000233     Z
Sample10   0.008   5e-04  0.00075     Z
Sample11    0.02 8.6e-05  0.00076     Z
Sample12   0.035 4.3e-05  0.00044     Z

I would like to retain a column if, in at least one group, at least 3 of the 4 samples have an abundance greater than 0.01 in that column (ColumnA and ColumnC in the example above).

Could you please help me find a solution using the dplyr package?

Thank you.
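
For anyone who wants to run the answers below, here is a minimal sketch that rebuilds the example table as a data frame named df and counts, per group and column, how many samples exceed 0.01. It assumes that the sample names are row names and that group is a plain character column; the question does not state either, and the name counts is only illustrative.

```r
# Rebuild the example data (assumption: sample names are row names,
# and `group` is a character column rather than a factor).
df <- data.frame(
  ColumnA = c(0.05, 0.2, 0.04, 1e-04, 0.03, 0.05, 0.006, 8e-04,
              4e-04, 0.008, 0.02, 0.035),
  ColumnB = c(0.00042, 0.00084, 0.00045, 0.13, 0.0058, 0.0066, 0.022, 0.101,
              2e-04, 5e-04, 8.6e-05, 4.3e-05),
  ColumnC = c(0.12, 0.223, 0.063, 0.013, 3e-04, 0.00021, 3.2e-06, 0.00063,
              0.000233, 0.00075, 0.00076, 0.00044),
  group = rep(c("X", "Y", "Z"), each = 4),
  row.names = paste0("Sample", 1:12)
)

# For each group and column, count the samples with abundance above 0.01.
# A column qualifies if any group reaches a count of 3 or more.
counts <- aggregate(. ~ group, df, function(x) sum(x > 0.01))
print(counts)
```

In this table only group X reaches a count of 3 or more, and it does so for ColumnA and ColumnC, which matches the columns the question expects to keep.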

Answer 1

Score: 1

You can use summarise + across to test each column within each group, then select + where to pick the columns that pass in at least one group, and store the names of those columns in a vector. Then use select on the original data set:

library(dplyr)  # the .by argument requires dplyr >= 1.1.0
nms <-
  df %>%
  # Across the selected columns, check whether at least 3 elements in each group are above 0.01
  summarise(across(ColumnA:ColumnC, ~ sum(.x > 0.01) >= 3), .by = group) %>%
  # Select the columns where any group satisfies the condition, and extract their names
  select(where(~ is.logical(.x) && any(.x))) %>% names()

#> nms
#> [1] "ColumnA" "ColumnC"

df %>%
  select(all_of(nms))

#          ColumnA  ColumnC
# Sample1   0.0500 1.20e-01
# Sample2   0.2000 2.23e-01
# Sample3   0.0400 6.30e-02
# Sample4   0.0001 1.30e-02
# Sample5   0.0300 3.00e-04
# Sample6   0.0500 2.10e-04
# Sample7   0.0060 3.20e-06
# Sample8   0.0008 6.30e-04
# Sample9   0.0004 2.33e-04
# Sample10  0.0080 7.50e-04
# Sample11  0.0200 7.60e-04
# Sample12  0.0350 4.40e-04
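
If the threshold or the required number of samples needs to vary, the same pipeline could be wrapped in a small function. The following is only a sketch: keep_abundant_cols and its arguments (group_col, threshold, min_n) are hypothetical names introduced here, not part of dplyr or of the answer above, and it assumes dplyr >= 1.1.0 for the .by argument.

```r
library(dplyr)

# Example data from the question (assumption: `group` is a character column).
df <- data.frame(
  ColumnA = c(0.05, 0.2, 0.04, 1e-04, 0.03, 0.05, 0.006, 8e-04,
              4e-04, 0.008, 0.02, 0.035),
  ColumnB = c(0.00042, 0.00084, 0.00045, 0.13, 0.0058, 0.0066, 0.022, 0.101,
              2e-04, 5e-04, 8.6e-05, 4.3e-05),
  ColumnC = c(0.12, 0.223, 0.063, 0.013, 3e-04, 0.00021, 3.2e-06, 0.00063,
              0.000233, 0.00075, 0.00076, 0.00044),
  group = rep(c("X", "Y", "Z"), each = 4),
  row.names = paste0("Sample", 1:12)
)

# Hypothetical helper: keep the numeric columns for which at least one group
# has `min_n` or more values above `threshold`.
keep_abundant_cols <- function(data, group_col = "group",
                               threshold = 0.01, min_n = 3) {
  passing <- data %>%
    # One row per group; each numeric column becomes TRUE/FALSE for that group.
    summarise(
      across(where(is.numeric), ~ sum(.x > threshold, na.rm = TRUE) >= min_n),
      .by = all_of(group_col)
    ) %>%
    # Keep the logical columns that are TRUE for at least one group.
    select(where(~ is.logical(.x) && any(.x))) %>%
    names()

  select(data, all_of(passing))
}

result <- keep_abundant_cols(df)
names(result)
```

Using where(is.numeric) instead of ColumnA:ColumnC avoids hard-coding the column range, at the cost of assuming that every numeric column is an abundance column.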

Answer 2

Score: 0

For those who are interested, here is a base R approach (the \(x) function shorthand requires R 4.1 or later):

selec <- colSums(aggregate(. ~ group, df, \(x) sum(x > 0.01) >= 3)[-1]) > 0

df[, names(selec)[selec]]
         ColumnA  ColumnC
Sample1   0.0500 1.20e-01
Sample2   0.2000 2.23e-01
Sample3   0.0400 6.30e-02
Sample4   0.0001 1.30e-02
Sample5   0.0300 3.00e-04
Sample6   0.0500 2.10e-04
Sample7   0.0060 3.20e-06
Sample8   0.0008 6.30e-04
Sample9   0.0004 2.33e-04
Sample10  0.0080 7.50e-04
Sample11  0.0200 7.60e-04
Sample12  0.0350 4.40e-04
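
One caveat with df[, cols] is that base R returns a plain vector instead of a data frame when only one column is selected. A possible variant, sketched here with tapply instead of aggregate, adds drop = FALSE to guard against that. The names num_cols, passes and filtered are illustrative, and the sketch assumes that every column other than group holds abundances.

```r
# Example data from the question (assumption: `group` is a character column).
df <- data.frame(
  ColumnA = c(0.05, 0.2, 0.04, 1e-04, 0.03, 0.05, 0.006, 8e-04,
              4e-04, 0.008, 0.02, 0.035),
  ColumnB = c(0.00042, 0.00084, 0.00045, 0.13, 0.0058, 0.0066, 0.022, 0.101,
              2e-04, 5e-04, 8.6e-05, 4.3e-05),
  ColumnC = c(0.12, 0.223, 0.063, 0.013, 3e-04, 0.00021, 3.2e-06, 0.00063,
              0.000233, 0.00075, 0.00076, 0.00044),
  group = rep(c("X", "Y", "Z"), each = 4),
  row.names = paste0("Sample", 1:12)
)

# Every column except the grouping column is treated as an abundance column.
num_cols <- setdiff(names(df), "group")

# For each column: count values above 0.01 within each group, then check
# whether any group reaches at least 3.
passes <- sapply(df[num_cols], function(col) {
  any(tapply(col > 0.01, df$group, sum) >= 3)
})

# drop = FALSE keeps the result a data frame even if only one column passes.
filtered <- df[, num_cols[passes], drop = FALSE]
names(filtered)
```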

huangapple
  • Posted on June 18, 2023, 21:57:34
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76500911.html