2023-04-13 16:42:27

Combine apply function with lapply: calculate mean of groups in df

Question

From two dataframes with single expression values (rows) per sample (cols) of different groups, I want to calculate the mean and median per group.
My solution seems a bit verbose and I wonder if there is a more elegant solution.

Data

# expression values
genes <- paste("gene",1:1000,sep="")
x <- list(
  A = sample(genes,300), 
  B = sample(genes,525), 
  C = sample(genes,440),
  D = sample(genes,350)
)
# expression dataframe
crete_exp_df <- function(gene_nr, sample_nr){
  df <- replicate(sample_nr, rnorm(gene_nr))
  rownames(df) <- paste("Gene", c(1:nrow(df)))
  colnames(df) <- paste("Sample", c(1:ncol(df)))
  return(df)
}
exp1 <- crete_exp_df(50, 20)
exp2 <- crete_exp_df(50, 20)
# sample annotation
san <- data.frame(
  id = colnames(exp1),
  group = sample(1:4, 20, replace = TRUE))

Solution

library(dplyr)  # provides %>%, filter() and pull()
# get ids of samples per group
ids_1 <- san %>% filter(group == 1) %>% pull(id)
ids_2 <- san %>% filter(group == 2) %>% pull(id)
ids_3 <- san %>% filter(group == 3) %>% pull(id)
ids_4 <- san %>% filter(group == 4) %>% pull(id)
id_list <- list(group1 = ids_1, group2 = ids_2, group3 = ids_3, group4 = ids_4)
# function: row means of exp1 for one group of sample ids
get_means_exp1 <- function(id){
  apply(exp1[, id], 1, mean, na.rm = TRUE)
}
# function: row means of exp2 for one group of sample ids
get_means_exp2 <- function(id){
  apply(exp2[, id], 1, mean, na.rm = TRUE)
}
# lapply on exp1
list_means_exp1 <- lapply(id_list, get_means_exp1)
means_exp1 <- as.data.frame(list_means_exp1)
# lapply on exp2
list_means_exp2 <- lapply(id_list, get_means_exp2)
means_exp2 <- as.data.frame(list_means_exp2)

I suppose this can be solved much more elegantly. Specifically, I would like to know how to get the ids per group and how to write a function that works for both data frames.
Looking forward to learning from your solutions, thanks a lot!

Answer 1

Score: 3

Before using apply(., 1, FUN), it is always wise to check whether a vectorized function is available, because these are much faster. For the arithmetic mean of the rows there is base::rowMeans; for the medians we can use matrixStats::rowMedians. For row means you could also use matrixStats::rowMeans2, which is slightly faster. vapply makes sense here: it is similar to lapply, but conveniently yields a matrix and is the fastest of the *apply family, because the result can be pre-allocated. (Note: I used set.seed(42) to create your data.)

So maybe you are looking for this:

vapply(id_list, \(x) rowMeans(exp1[, x]), numeric(dim(exp1)[1]))
#              group1       group2       group3      group4
# Gene 1  -1.35631700 -0.328620048  0.160795323 -0.01011904
# Gene 2   0.33985130  0.432482763 -0.169343033  0.13019294
# Gene 3   0.46623064  0.154045975  0.362607622  0.58710492
# Gene 4   0.17049403 -0.036744170 -0.056742305  1.10934764
# Gene 5  -0.15515465  0.237211068 -0.426415836 -0.50977736
vapply(id_list, \(x) matrixStats::rowMedians(exp1[, x], useNames=TRUE), numeric(dim(exp1)[1]))
#              group1      group2       group3        group4
# Gene 1  -1.22551737 -0.41642403  0.470862918 -1.782411e-01
# Gene 2   0.05680326  0.62277321 -0.512487033  3.943679e-01
# Gene 3   0.58009311 -0.10696651  0.149054062  9.345673e-01
# Gene 4   0.09852832  0.12774134 -0.573525823  1.046751e+00
# Gene 5  -0.44076823  0.11716389 -0.381682466 -8.480807e-01
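
The question also asked how to collect the ids per group and how to write one function that serves both matrices. The following is only a minimal sketch of one way to do that, not part of the original answer. It assumes data shaped like exp1, exp2 and san from the question; make_exp and group_stat are hypothetical helper names, and the seed and the balanced group assignment are assumptions made so the example is reproducible, so the numbers will not match the output shown above.

```r
set.seed(42)  # assumption: chosen only for reproducibility

# Rebuild toy data in the same shape as in the question
make_exp <- function(gene_nr, sample_nr) {
  m <- replicate(sample_nr, rnorm(gene_nr))
  rownames(m) <- paste("Gene", seq_len(nrow(m)))
  colnames(m) <- paste("Sample", seq_len(ncol(m)))
  m
}
exp1 <- make_exp(50, 20)
exp2 <- make_exp(50, 20)
# assumption: balanced groups (5 samples each) instead of fully random ones
san <- data.frame(id = colnames(exp1), group = sample(rep(1:4, 5)))

# A single split() call replaces the four filter()/pull() lines
id_list <- split(san$id, paste0("group", san$group))

# Hypothetical helper: apply a row-wise summary to each group of columns
group_stat <- function(mat, ids, fun = rowMeans) {
  # drop = FALSE keeps a matrix even if a group has only one sample
  vapply(ids, function(x) fun(mat[, x, drop = FALSE]), numeric(nrow(mat)))
}

# The same function works for both matrices
means <- lapply(list(exp1 = exp1, exp2 = exp2), group_stat, ids = id_list)

# Any function returning one value per row can be passed in
medians_exp1 <- group_stat(exp1, id_list,
                           fun = function(m) apply(m, 1, median))
```

Here means$exp1 and means$exp2 are genes-by-groups matrices. If matrixStats is installed, the median function could be swapped for matrixStats::rowMedians as in the answer above.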

Answer 2

Score: 2

So, I worked with the data generation process you provided and came up with a simpler solution. I converted exp1 into a data frame, brought it into tidy format (pivot_longer()), added the groups from the san data frame, and finally applied plain dplyr syntax to summarise your data.

library(tidyverse)
as.data.frame(exp1) %>%
  rownames_to_column("Gene") %>%
  pivot_longer(cols= 2:21, names_to = "id", values_to = "Values") %>%
  left_join(., san) %>%
  group_by(group) %>%
  summarise(mean= mean(Values),
            median= median(Values))
#> Joining with `by = join_by(id)`
#> # A tibble: 4 × 3
#>   group     mean  median
#>   <int>    <dbl>   <dbl>
#> 1     1  0.0803   0.0568
#> 2     2 -0.0383  -0.0387
#> 3     3 -0.00929  0.0356
#> 4     4 -0.0840  -0.0306

Considering your comment: simply group by gene as well, and that gets you the expected output.

library(tidyverse)
as.data.frame(exp1) %>%
  rownames_to_column("Gene") %>%
  pivot_longer(cols= 2:21, names_to = "id", values_to = "Values") %>%
  left_join(., san) %>%
  group_by(group, Gene) %>%
  summarise(mean= mean(Values),
            median= median(Values))
#> Joining with `by = join_by(id)`
#> `summarise()` has grouped output by 'group'. You can override using the
#> `.groups` argument.
#> # A tibble: 200 × 4
#> # Groups:   group [4]
#>    group Gene       mean  median
#>    <int> <chr>     <dbl>   <dbl>
#>  1     1 Gene 1  -0.0642 -0.122 
#>  2     1 Gene 10  0.0151  0.563 
#>  3     1 Gene 11 -0.0585 -0.0367
#>  4     1 Gene 12 -0.978  -0.917 
#>  5     1 Gene 13 -1.01   -1.37  
#>  6     1 Gene 14  0.160  -0.394 
#>  7     1 Gene 15 -0.295  -0.689 
#>  8     1 Gene 16  0.774   0.729 
#>  9     1 Gene 17 -0.356  -0.336 
#> 10     1 Gene 18 -0.741  -0.103 
#> # … with 190 more rows

Created on 2023-04-13 with reprex v2.0.2
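
Since the question asked for something that works for both data frames, the pipeline above could be wrapped in a function. This is a rough sketch under assumptions, not the answerer's code: summarise_groups is a hypothetical name, the toy data are regenerated with an arbitrary seed and balanced groups, and three small changes are made that seemed reasonable (cols = -Gene instead of the hard-coded 2:21, an explicit by = "id", and .groups = "drop" to return an ungrouped tibble without the message).

```r
library(dplyr)
library(tidyr)
library(tibble)

set.seed(42)  # assumption: chosen only for reproducibility
make_exp <- function(gene_nr, sample_nr) {
  m <- replicate(sample_nr, rnorm(gene_nr))
  dimnames(m) <- list(paste("Gene", seq_len(gene_nr)),
                      paste("Sample", seq_len(sample_nr)))
  m
}
exp1 <- make_exp(50, 20)
exp2 <- make_exp(50, 20)
# assumption: balanced groups (5 samples each)
san <- data.frame(id = colnames(exp1), group = sample(rep(1:4, 5)))

# Hypothetical wrapper: mean and median per group and gene
summarise_groups <- function(mat, annotation) {
  as.data.frame(mat) %>%
    rownames_to_column("Gene") %>%
    # pivot every column except Gene, whatever the number of samples
    pivot_longer(cols = -Gene, names_to = "id", values_to = "Values") %>%
    left_join(annotation, by = "id") %>%
    group_by(group, Gene) %>%
    summarise(mean = mean(Values),
              median = median(Values),
              .groups = "drop")
}

res1 <- summarise_groups(exp1, san)
res2 <- summarise_groups(exp2, san)
```

Each result has one row per combination of group and gene.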

Answer 3

Score: 0

As an additional alternative that scales very well, you could use data.table.

### load data.table
library(data.table)
### convert data.frames to data.table
exp1 <- as.data.table(exp1)[, Genes := rownames(exp1), ]
san <- as.data.table(san)
### switch to long format
exp1 <- melt(exp1, id.vars = "Genes", variable.name = "id", value.name = "Expression")
### join based on sample id
exp1Join <- merge.data.table(exp1, san, by = "id")
### compute statistics of choice
exp1Join[, .(mean = mean(Expression), median = median(Expression)), by = .(group, Genes)]

Of course, you can also do everything in one combined table if you want to collect all your data and run computations on the whole data set (across different experiments).

### note: this assumes exp1 and exp2 are still the original matrices
### (re-create them first if the block above has overwritten exp1)
exp1 <- as.data.table(exp1)[, `:=`(Genes = rownames(exp1), Experiment = 1), ]
exp2 <- as.data.table(exp2)[, `:=`(Genes = rownames(exp2), Experiment = 2), ]
exp1 <- melt(exp1, id.vars = c("Genes", "Experiment"), variable.name = "id", value.name = "Expression")
exp2 <- melt(exp2, id.vars = c("Genes", "Experiment"), variable.name = "id", value.name = "Expression")
### combine tables
expCombined <- rbindlist(l = list(exp1, exp2))
expCombined <- merge.data.table(expCombined, san, by = "id")
### compute the mean, median, sd and sample size for every combination of gene, group and experiment
expCombined[,.(mean=mean(Expression),
               median=median(Expression),
               sd=sd(Expression),
               N=.N),
            by=.(group, Genes, Experiment)]
#     group   Genes Experiment        mean      median        sd N
#  1:     1  Gene 1          1 -0.29234057 -0.24008726 0.6278528 5
#  2:     1  Gene 2          1 -0.74158796 -0.82441474 0.6289399 5
#  3:     1  Gene 3          1 -0.49293277 -0.30616603 1.1442834 5
#  4:     1  Gene 4          1 -0.33610311 -0.43948117 0.5331471 5
#  5:     1  Gene 5          1  0.68955333  0.60701836 0.9475727 5
# ---                                                             
#396:     4 Gene 46          2  1.17036249  1.17036249 0.4885201 2
#397:     4 Gene 47          2  0.64894986  0.64894986 0.1122624 2
#398:     4 Gene 48          2 -1.61083175 -1.61083175 0.6319153 2
#399:     4 Gene 49          2 -0.07673634 -0.07673634 0.7263174 2
#400:     4 Gene 50          2 -0.37240955 -0.37240955 0.8037523 2
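
The blocks above overwrite exp1 and exp2 with their long versions. If you would rather keep the original matrices, the conversion could be moved into a small function. This is a sketch under assumptions rather than the answerer's code: to_long and make_exp are hypothetical names, the data are regenerated with an arbitrary seed, and the groups are balanced (5 samples each) so the expected sample size per cell is known.

```r
library(data.table)

set.seed(42)  # assumption: chosen only for reproducibility
make_exp <- function(gene_nr, sample_nr) {
  m <- replicate(sample_nr, rnorm(gene_nr))
  dimnames(m) <- list(paste("Gene", seq_len(gene_nr)),
                      paste("Sample", seq_len(sample_nr)))
  m
}
exp1 <- make_exp(50, 20)
exp2 <- make_exp(50, 20)
# assumption: balanced groups (5 samples each)
san <- data.table(id = colnames(exp1), group = sample(rep(1:4, 5)))

# Hypothetical helper: matrix -> long data.table, input left untouched
to_long <- function(mat, experiment) {
  dt <- as.data.table(mat)
  dt[, Genes := rownames(mat)]
  dt[, Experiment := experiment]
  # variable.factor = FALSE keeps the sample id as character for the join
  melt(dt, id.vars = c("Genes", "Experiment"),
       variable.name = "id", value.name = "Expression",
       variable.factor = FALSE)
}

expCombined <- rbindlist(list(to_long(exp1, 1), to_long(exp2, 2)))
expCombined <- merge(expCombined, san, by = "id")

stats <- expCombined[, .(mean = mean(Expression),
                         median = median(Expression),
                         sd = sd(Expression),
                         N = .N),
                     by = .(group, Genes, Experiment)]
```

Because exp1 and exp2 stay matrices, the function can be re-run or applied to further experiments without re-creating the data.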

For comparison, I also included a small benchmark, for exp1 only, of the approach from the original post, the provided tidyverse solution, the data.table approach, and the vapply approach. Obviously, benchmarks like this make more sense when data sets are large.

Unit: microseconds
      expr       min        lq       mean    median        uq        max neval cld
   TidyWay 57902.546 61651.077 76529.3966 67526.432 79027.012 172911.906   100 a  
     DTWay  2159.780  2490.218  3225.3781  2592.081  2960.918  17196.365   100  b 
    OrgWay  7459.775  8249.155 10667.4395  9224.186 11740.072  27480.962   100   c
 VApplyWay    87.618   133.598   168.3478   146.398   189.990    782.736   100  b
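
The code behind this benchmark was not posted. A comparison of this kind could be set up roughly as follows with the microbenchmark package. This is only a sketch under assumptions: the data are regenerated with an arbitrary seed and balanced groups, only two of the four approaches are timed, and the expression labels merely mirror the table above. Timings depend on the machine, so no output is shown.

```r
library(microbenchmark)

set.seed(42)  # assumption: chosen only for reproducibility
exp1 <- replicate(20, rnorm(50))
dimnames(exp1) <- list(paste("Gene", 1:50), paste("Sample", 1:20))
# assumption: balanced groups (5 samples each)
san <- data.frame(id = colnames(exp1), group = sample(rep(1:4, 5)))
id_list <- split(san$id, paste0("group", san$group))

bm <- microbenchmark(
  # approach from the question: apply() inside lapply()
  OrgWay = as.data.frame(
    lapply(id_list, function(x) apply(exp1[, x], 1, mean, na.rm = TRUE))
  ),
  # approach from Answer 1: vectorised rowMeans() inside vapply()
  VApplyWay = vapply(id_list, function(x) rowMeans(exp1[, x]),
                     numeric(nrow(exp1))),
  times = 100
)
print(bm)
```

The other approaches can be added as further named expressions in the same call.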
