Filtering columns satisfying abundance distribution among samples in groups in R
Question
I have a data frame that looks like this:
ColumnA ColumnB ColumnC group
Sample1 0.05 0.00042 0.12 X
Sample2 0.2 0.00084 0.223 X
Sample3 0.04 0.00045 0.063 X
Sample4 1e-04 0.13 0.013 X
Sample5 0.03 0.0058 3e-04 Y
Sample6 0.05 0.0066 0.00021 Y
Sample7 0.006 0.022 3.2e-06 Y
Sample8 8e-04 0.101 0.00063 Y
Sample9 4e-04 2e-04 0.000233 Z
Sample10 0.008 5e-04 0.00075 Z
Sample11 0.02 8.6e-05 0.00076 Z
Sample12 0.035 4.3e-05 0.00044 Z
I would like to retain a column if, in at least one group, at least 3 of the 4 samples have an abundance greater than 0.01 (ColumnA and ColumnC in the example above).
Can you please help me find a solution using the dplyr package?
Thank you
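For anyone who wants to reproduce the example, the table above can be rebuilt roughly as follows. This is only a sketch: it assumes the sample names are row names and that the object is called df, which is the name the answers use.

```r
# Reconstruction of the example table; values are copied from the question.
# Assumption: sample names are stored as row names, not as a column.
df <- data.frame(
  ColumnA = c(0.05, 0.2, 0.04, 1e-04, 0.03, 0.05, 0.006, 8e-04,
              4e-04, 0.008, 0.02, 0.035),
  ColumnB = c(0.00042, 0.00084, 0.00045, 0.13, 0.0058, 0.0066, 0.022, 0.101,
              2e-04, 5e-04, 8.6e-05, 4.3e-05),
  ColumnC = c(0.12, 0.223, 0.063, 0.013, 3e-04, 0.00021, 3.2e-06, 0.00063,
              0.000233, 0.00075, 0.00076, 0.00044),
  group = rep(c("X", "Y", "Z"), each = 4),
  row.names = paste0("Sample", 1:12)
)
```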
Answer 1
Score: 1
You can use select + where to select columns dynamically, and store the names of those columns in a vector. Then use select on the original data set:
library(dplyr)
nms <-
  df %>%
  # Within each group, check whether at least 3 values in each selected column are above 0.01
  summarise(across(ColumnA:ColumnC, ~ sum(.x > 0.01) >= 3), .by = group) %>%
  # Keep the columns where any group satisfies the condition, then extract their names
  select(where(~ is.logical(.x) && any(.x))) %>%
  names()
#> nms
#> [1] "ColumnA" "ColumnC"
df %>%
  select(all_of(nms))
# ColumnA ColumnC
# Sample1 0.0500 1.20e-01
# Sample2 0.2000 2.23e-01
# Sample3 0.0400 6.30e-02
# Sample4 0.0001 1.30e-02
# Sample5 0.0300 3.00e-04
# Sample6 0.0500 2.10e-04
# Sample7 0.0060 3.20e-06
# Sample8 0.0008 6.30e-04
# Sample9 0.0004 2.33e-04
# Sample10 0.0080 7.50e-04
# Sample11 0.0200 7.60e-04
# Sample12 0.0350 4.40e-04
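If the threshold or the minimum count needs to vary, the same pipeline can be wrapped in a small function. The sketch below is illustrative only: keep_abundant_cols, threshold, min_n and the toy data are hypothetical names and values, not part of dplyr or of the question. It assumes the grouping column is called group, that dplyr 1.1.0 or later is installed (for the .by argument), and that missing values should be ignored.

```r
library(dplyr)

# Hypothetical helper: keeps numeric columns where, in at least one group,
# at least `min_n` values exceed `threshold`.
# Assumption: the grouping column is named `group`.
keep_abundant_cols <- function(data, threshold = 0.01, min_n = 3) {
  nms <- data %>%
    summarise(
      across(where(is.numeric), ~ sum(.x > threshold, na.rm = TRUE) >= min_n),
      .by = group
    ) %>%
    # Only the logical summary columns with at least one TRUE survive
    select(where(~ is.logical(.x) && any(.x))) %>%
    names()
  select(data, all_of(nms))
}

# Toy data (made up for illustration): only column `a` passes, via group X.
toy <- data.frame(
  a = c(0.5, 0.2, 0.3, 0, 0, 0, 0, 0),
  b = c(0, 0, 0, 0, 0.02, 0, 0, 0),
  group = rep(c("X", "Y"), each = 4)
)
res <- keep_abundant_cols(toy)
names(res)
```

As in the answer above, the group column is dropped from the result; add it back to the final select if it is needed.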
Answer 2
Score: 0
For those who are interested, here is a base R approach:
selec <- colSums(aggregate(. ~ group, df, \(x) sum(x > 0.01) >= 3)[-1]) > 0
df[,names(selec)[selec]]
ColumnA ColumnC
Sample1 0.0500 1.20e-01
Sample2 0.2000 2.23e-01
Sample3 0.0400 6.30e-02
Sample4 0.0001 1.30e-02
Sample5 0.0300 3.00e-04
Sample6 0.0500 2.10e-04
Sample7 0.0060 3.20e-06
Sample8 0.0008 6.30e-04
Sample9 0.0004 2.33e-04
Sample10 0.0080 7.50e-04
Sample11 0.0200 7.60e-04
Sample12 0.0350 4.40e-04
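The one-liner above can be easier to follow when split into steps. The sketch below uses made-up toy data (the names toy, per_group, passes and kept are illustrative) and the older function(x) syntax, so it should also run on R versions before 4.1. It adds drop = FALSE, which is an addition to the answer rather than part of it: without it, a single surviving column would be returned as a plain vector.

```r
# Toy data (hypothetical): only column `a` passes, via group X.
toy <- data.frame(
  a = c(0.5, 0.2, 0.3, 0, 0, 0, 0, 0),
  b = c(0, 0, 0, 0, 0.02, 0, 0, 0),
  group = rep(c("X", "Y"), each = 4)
)

# Step 1: one row per group; TRUE where at least 3 values exceed 0.01
per_group <- aggregate(. ~ group, toy, function(x) sum(x > 0.01) >= 3)

# Step 2: drop the group column and flag columns that pass in any group
passes <- colSums(per_group[-1]) > 0

# Step 3: subset the original data; drop = FALSE keeps a data frame
# even when only one column is selected
kept <- toy[, names(passes)[passes], drop = FALSE]
```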