Filtering columns that satisfy an abundance distribution among samples within groups in R

Question

I have a data frame that looks like this:

              ColumnA   ColumnB   ColumnC   group
    Sample1   0.05      0.00042   0.12      X
    Sample2   0.2       0.00084   0.223     X
    Sample3   0.04      0.00045   0.063     X
    Sample4   1e-04     0.13      0.013     X
    Sample5   0.03      0.0058    3e-04     Y
    Sample6   0.05      0.0066    0.00021   Y
    Sample7   0.006     0.022     3.2e-06   Y
    Sample8   8e-04     0.101     0.00063   Y
    Sample9   4e-04     2e-04     0.000233  Z
    Sample10  0.008     5e-04     0.00075   Z
    Sample11  0.02      8.6e-05   0.00076   Z
    Sample12  0.035     4.3e-05   0.00044   Z

I would like to retain a column if, in at least one group, at least 3 of the 4 samples have an abundance greater than 0.01 (ColumnA and ColumnC in the example above).

Could you please help me find a solution using the dplyr package?

Thank you
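
For reproducibility, the data frame shown in the question can be rebuilt roughly as follows. This is only a sketch: the object name df is taken from the answers, while the use of row names for the sample labels and the helper name counts are assumptions. The last step counts, per group and column, how many samples exceed 0.01, which makes the selection criterion easy to check by eye.

```r
# Rebuild the example data (sample labels stored as row names: an assumption)
df <- data.frame(
  ColumnA = c(0.05, 0.2, 0.04, 1e-04, 0.03, 0.05, 0.006, 8e-04,
              4e-04, 0.008, 0.02, 0.035),
  ColumnB = c(0.00042, 0.00084, 0.00045, 0.13, 0.0058, 0.0066, 0.022, 0.101,
              2e-04, 5e-04, 8.6e-05, 4.3e-05),
  ColumnC = c(0.12, 0.223, 0.063, 0.013, 3e-04, 0.00021, 3.2e-06, 0.00063,
              0.000233, 0.00075, 0.00076, 0.00044),
  group = rep(c("X", "Y", "Z"), each = 4),
  row.names = paste0("Sample", 1:12)
)

# TRUE/FALSE matrix of "abundance > 0.01", converted to 0/1 integers
above <- (df[c("ColumnA", "ColumnB", "ColumnC")] > 0.01) + 0L

# Sum the 0/1 flags within each group: number of samples above 0.01
counts <- rowsum(above, df$group)
counts
#>   ColumnA ColumnB ColumnC
#> X       3       1       4
#> Y       2       2       0
#> Z       2       0       0
```

A column qualifies when any entry in its column of counts is 3 or more, which holds for ColumnA and ColumnC (both through group X) but not for ColumnB.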

Answer 1

Score: 1

You can use select + where to select columns dynamically, and store the names of those columns in a vector. Then use select on the original data set:

    library(dplyr)  # the .by argument requires dplyr >= 1.1.0

    nms <-
      df %>%
      # Across the selected columns, check whether at least 3 values in each group are above 0.01
      summarise(across(ColumnA:ColumnC, ~ sum(.x > 0.01) >= 3), .by = group) %>%
      # Keep the columns where any group satisfies the condition, then extract their names
      select(where(~ is.logical(.x) && any(.x))) %>%
      names()

    nms
    #> [1] "ColumnA" "ColumnC"

    df %>%
      select(all_of(nms))
    #>          ColumnA  ColumnC
    #> Sample1   0.0500 1.20e-01
    #> Sample2   0.2000 2.23e-01
    #> Sample3   0.0400 6.30e-02
    #> Sample4   0.0001 1.30e-02
    #> Sample5   0.0300 3.00e-04
    #> Sample6   0.0500 2.10e-04
    #> Sample7   0.0060 3.20e-06
    #> Sample8   0.0008 6.30e-04
    #> Sample9   0.0004 2.33e-04
    #> Sample10  0.0080 7.50e-04
    #> Sample11  0.0200 7.60e-04
    #> Sample12  0.0350 4.40e-04
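
If the threshold or the required number of samples is likely to change, the same idea could be wrapped in a small helper. The following is a minimal sketch, not part of the original answer: the function name keep_abundant_cols and its arguments group_col, threshold and min_n are hypothetical, it assumes dplyr >= 1.1.0 (for .by), and it assumes the data contain no columns named threshold or min_n.

```r
library(dplyr)

# Example data from the question (sample labels as row names: an assumption)
df <- data.frame(
  ColumnA = c(0.05, 0.2, 0.04, 1e-04, 0.03, 0.05, 0.006, 8e-04,
              4e-04, 0.008, 0.02, 0.035),
  ColumnB = c(0.00042, 0.00084, 0.00045, 0.13, 0.0058, 0.0066, 0.022, 0.101,
              2e-04, 5e-04, 8.6e-05, 4.3e-05),
  ColumnC = c(0.12, 0.223, 0.063, 0.013, 3e-04, 0.00021, 3.2e-06, 0.00063,
              0.000233, 0.00075, 0.00076, 0.00044),
  group = rep(c("X", "Y", "Z"), each = 4),
  row.names = paste0("Sample", 1:12)
)

# Hypothetical helper: keep numeric columns for which at least `min_n`
# samples exceed `threshold` in at least one group
keep_abundant_cols <- function(data, group_col = "group",
                               threshold = 0.01, min_n = 3) {
  # One row per group, TRUE where the group meets the criterion
  flags <- data %>%
    summarise(
      across(where(is.numeric), ~ sum(.x > threshold, na.rm = TRUE) >= min_n),
      .by = all_of(group_col)
    )

  # Names of the columns for which any group is TRUE
  keep <- flags %>%
    select(-all_of(group_col)) %>%
    select(where(any)) %>%
    names()

  data %>% select(all_of(keep))
}

names(keep_abundant_cols(df))
#> [1] "ColumnA" "ColumnC"
```

Dropping the grouping column before select(where(any)) avoids calling any() on a non-logical column; missing values are ignored here through na.rm = TRUE, which is a design choice rather than something the question specifies.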

Answer 2

Score: 0

For those who are interested, here is a base R approach:

    # For each group, flag columns with at least 3 values above 0.01,
    # then keep the columns flagged in at least one group
    # (the \(x) lambda shorthand requires R >= 4.1)
    selec <- colSums(aggregate(. ~ group, df, \(x) sum(x > 0.01) >= 3)[-1]) > 0
    df[, names(selec)[selec]]
    #>          ColumnA  ColumnC
    #> Sample1   0.0500 1.20e-01
    #> Sample2   0.2000 2.23e-01
    #> Sample3   0.0400 6.30e-02
    #> Sample4   0.0001 1.30e-02
    #> Sample5   0.0300 3.00e-04
    #> Sample6   0.0500 2.10e-04
    #> Sample7   0.0060 3.20e-06
    #> Sample8   0.0008 6.30e-04
    #> Sample9   0.0004 2.33e-04
    #> Sample10  0.0080 7.50e-04
    #> Sample11  0.0200 7.60e-04
    #> Sample12  0.0350 4.40e-04
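
The one-liner above can be unpacked into separate steps to make each stage visible. This is a sketch under a few assumptions: the intermediate names per_group, n_groups_passing and result are introduced here for illustration, function(x) is used instead of \(x) so the code also runs on R versions before 4.1, and drop = FALSE is added so that a single selected column still comes back as a data frame.

```r
# Example data from the question (sample labels as row names: an assumption)
df <- data.frame(
  ColumnA = c(0.05, 0.2, 0.04, 1e-04, 0.03, 0.05, 0.006, 8e-04,
              4e-04, 0.008, 0.02, 0.035),
  ColumnB = c(0.00042, 0.00084, 0.00045, 0.13, 0.0058, 0.0066, 0.022, 0.101,
              2e-04, 5e-04, 8.6e-05, 4.3e-05),
  ColumnC = c(0.12, 0.223, 0.063, 0.013, 3e-04, 0.00021, 3.2e-06, 0.00063,
              0.000233, 0.00075, 0.00076, 0.00044),
  group = rep(c("X", "Y", "Z"), each = 4),
  row.names = paste0("Sample", 1:12)
)

# Step 1: one row per group, TRUE where at least 3 samples exceed 0.01
per_group <- aggregate(. ~ group, df, function(x) sum(x > 0.01) >= 3)

# Step 2: drop the group column and count the groups that pass, per column
n_groups_passing <- colSums(per_group[-1])

# Step 3: keep the columns that pass in at least one group
selec <- n_groups_passing > 0
result <- df[, names(selec)[selec], drop = FALSE]

names(result)
#> [1] "ColumnA" "ColumnC"
```

Here n_groups_passing is 1 for ColumnA and ColumnC (both pass only in group X) and 0 for ColumnB, so comparing against 0 is equivalent to asking whether any group satisfies the condition.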

huangapple
  • Posted on June 18, 2023, at 21:57:34
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76500911.html