将apply函数与lapply结合使用:计算数据框中各组的均值。

huangapple go评论96阅读模式
英文:

Combine apply function with lapply: calculate mean of groups in df

问题

从具有不同组的每个样本(列)的单个表达值(行)的两个数据框中,我想要计算每个组的平均值和中位数。
我的解决方案似乎有点冗长,我想知道是否有更加优雅的解决方案。

数据

  1. # 表达值
  2. genes <- paste("gene",1:1000,sep="")
  3. x <- list(
  4. A = sample(genes,300),
  5. B = sample(genes,525),
  6. C = sample(genes,440),
  7. D = sample(genes,350)
  8. )
  9. # 表达数据框
  10. crete_exp_df <- function(gene_nr, sample_nr){
  11. df <- replicate(sample_nr, rnorm(gene_nr))
  12. rownames(df) <- paste("Gene", c(1:nrow(df)))
  13. colnames(df) <- paste("Sample", c(1:ncol(df)))
  14. return(df)
  15. }
  16. exp1 <- crete_exp_df(50, 20)
  17. exp2 <- crete_exp_df(50, 20)
  18. # 样本注释
  19. san <- data.frame(
  20. id = colnames(exp1),
  21. group = sample(1:4, 20, replace = TRUE))

解决方案

  1. # 获取每组样本的ID
  2. ids_1 <- san %>% filter(group == 1) %>% pull(id)
  3. ids_2 <- san %>% filter(group == 2) %>% pull(id)
  4. ids_3 <- san %>% filter(group == 3) %>% pull(id)
  5. ids_4 <- san %>% filter(group == 4) %>% pull(id)
  6. id_list <- list(group1 = ids_1, group2 = ids_2, group3 = ids_3, group4 = ids_4)
  7. # 函数计算df1的均值
  8. get_means_exp1 <- function(id){
  9. apply(exp1[, id], 1, mean, na.rm = T)
  10. }
  11. # 函数计算df2的均值
  12. get_means_exp2 <- function(id){
  13. apply(exp2[, id], 1, mean, na.rm = T)
  14. }
  15. # 对df1应用lapply
  16. list_means_exp1 <- lapply(id_list, get_means_exp1)
  17. means_exp1 <- as.data.frame(list_means_exp1)
  18. # 对df2应用lapply
  19. list_means_exp2 <- lapply(id_list, get_means_exp2)
  20. means_exp2 <- as.data.frame(list_means_exp2)

我认为这可以更加优雅地解决。具体来说,如何获取每个组的ID并编写一个适用于两个数据框的函数。
期待从您的解决方案中学到更多,非常感谢!

英文:

From two dataframes with single expression values (rows) per sample (cols) of different groups, I want to calculate the mean and median per group.
My solution seems a bit verbose and I wonder if there is a more elegant solution.

Data

  1. # expression values
  2. genes <- paste("gene",1:1000,sep="")
  3. x <- list(
  4. A = sample(genes,300),
  5. B = sample(genes,525),
  6. C = sample(genes,440),
  7. D = sample(genes,350)
  8. )
  9. # expression dataframe
  10. crete_exp_df <- function(gene_nr, sample_nr){
  11. df <- replicate(sample_nr, rnorm(gene_nr))
  12. rownames(df) <- paste("Gene", c(1:nrow(df)))
  13. colnames(df) <- paste("Sample", c(1:ncol(df)))
  14. return(df)
  15. }
  16. exp1 <- crete_exp_df(50, 20)
  17. exp2 <- crete_exp_df(50, 20)
  18. # sample annotation
  19. san <- data.frame(
  20. id = colnames(exp1),
  21. group = sample(1:4, 20, replace = TRUE))

Solution

  1. # get ids of samples per group
  2. ids_1 <- san %>% filter(group == 1) %>% pull(id)
  3. ids_2 <- san %>% filter(group == 2) %>% pull(id)
  4. ids_3 <- san %>% filter(group == 3) %>% pull(id)
  5. ids_4 <- san %>% filter(group == 4) %>% pull(id)
  6. id_list <- list(group1 = ids_1, group2 = ids_2, group3 = ids_3, group4 = ids_4)
  7. # fct means df1
  8. get_means_exp1 <- function(id){
  9. apply(exp1[, id], 1, mean, na.rm = T)
  10. }
  11. # fct means df2
  12. get_means_exp2 <- function(id){
  13. apply(exp2[, id], 1, mean, na.rm = T)
  14. }
  15. # lapply on df1
  16. list_means_exp1 <- lapply(id_list, get_means_exp1)
  17. means_exp1 <- as.data.frame(list_means_exp1)
  18. # lapply on df2
  19. list_means_exp2 <- lapply(id_list, get_means_exp2)
  20. means_exp2 <- as.data.frame(list_means_exp2)

I suppose this can be solved much more elegant. Specifically, how to get the ids per group and write a function that works for both df.
Looking forwards to learning from your solutions, thanks a lot!

答案1

得分: 3

在使用apply(., 1, FUN)之前,始终明智的做法是检查是否有矢量化的函数可用,因为它们速度更快。对于行的算术均值,可以使用base::rowMeans。对于中位数,我们可以使用matrixStats::rowMedians。对于行均值,还可以使用matrixStats::rowMeans2,它略快一些。在这里使用vapply是有道理的,它类似于lapply,但方便地生成一个矩阵,并且在*apply系列中速度最快,因为我们可以预先分配内存。(注意: 我使用了set.seed(42)来创建您的数据。)

所以也许您正在寻找这个:

  1. vapply(id_list, \(x) rowMeans(exp1[, x]), numeric(dim(exp1)[1]))
  2. # group1 group2 group3 group4
  3. # Gene 1 -1.35631700 -0.328620048 0.160795323 -0.01011904
  4. # Gene 2 0.33985130 0.432482763 -0.169343033 0.13019294
  5. # Gene 3 0.46623064 0.154045975 0.362607622 0.58710492
  6. # Gene 4 0.17049403 -0.036744170 -0.056742305 1.10934764
  7. # Gene 5 -0.15515465 0.237211068 -0.426415836 -0.50977736
  8. vapply(id_list, \(x) matrixStats::rowMedians(exp1[, x], useNames=TRUE), numeric(dim(exp1)[1]))
  9. # group1 group2 group3 group4
  10. # Gene 1 -1.22551737 -0.41642403 0.470862918 -1.782411e-01
  11. # Gene 2 0.05680326 0.62277321 -0.512487033 3.943679e-01
  12. # Gene 3 0.58009311 -0.10696651 0.149054062 9.345673e-01
  13. # Gene 4 0.09852832 0.12774134 -0.573525823 1.046751e+00
  14. # Gene 5 -0.44076823 0.11716389 -0.381682466 -8.480807e-01
英文:

Before using apply(., 1, FUN), it's always wise to check, if there is a vectorized function available because they're much faster. For the arithmetic mean of the rows there is base::rowMeans. For the medians we can use matrixStats::rowMedians. For row means you could also use matrixStats::rowMeans2, which is slightly faster. It makes sense to use vapply here, it is similar to lapply, but conveniently yields a matrix and is fastest in the *apply family, because we can pre-allocate memory. (Note: I used set.seed(42) to create your data.)

So maybe you are looking for this:

  1. vapply(id_list, \(x) rowMeans(exp1[, x]), numeric(dim(exp1)[1]))
  2. # group1 group2 group3 group4
  3. # Gene 1 -1.35631700 -0.328620048 0.160795323 -0.01011904
  4. # Gene 2 0.33985130 0.432482763 -0.169343033 0.13019294
  5. # Gene 3 0.46623064 0.154045975 0.362607622 0.58710492
  6. # Gene 4 0.17049403 -0.036744170 -0.056742305 1.10934764
  7. # Gene 5 -0.15515465 0.237211068 -0.426415836 -0.50977736
  8. vapply(id_list, \(x) matrixStats::rowMedians(exp1[, x], useNames=TRUE), numeric(dim(exp1)[1]))
  9. # group1 group2 group3 group4
  10. # Gene 1 -1.22551737 -0.41642403 0.470862918 -1.782411e-01
  11. # Gene 2 0.05680326 0.62277321 -0.512487033 3.943679e-01
  12. # Gene 3 0.58009311 -0.10696651 0.149054062 9.345673e-01
  13. # Gene 4 0.09852832 0.12774134 -0.573525823 1.046751e+00
  14. # Gene 5 -0.44076823 0.11716389 -0.381682466 -8.480807e-01

答案2

得分: 2

以下是代码的翻译部分:

  1. library(tidyverse)
  2. as.data.frame(exp1) %>%
  3. rownames_to_column("Gene") %>%
  4. pivot_longer(cols= 2:21, names_to = "id", values_to = "Values") %>%
  5. left_join(., san) %>%
  6. group_by(group) %>%
  7. summarise(mean= mean(Values),
  8. median= median(Values))
  9. #> Joining with `by = join_by(id)`
  10. #> # A tibble: 4 × 3
  11. #> group mean median
  12. #> <int> <dbl> <dbl>
  13. #> 1 1 0.0803 0.0568
  14. #> 2 2 -0.0383 -0.0387
  15. #> 3 3 -0.00929 0.0356
  16. #> 4 4 -0.0840 -0.0306

根据您的评论,也可以通过基因分组,获得预期的输出。

  1. library(tidyverse)
  2. as.data.frame(exp1) %>%
  3. rownames_to_column("Gene") %>%
  4. pivot_longer(cols= 2:21, names_to = "id", values_to = "Values") %>%
  5. left_join(., san) %>%
  6. group_by(group, Gene) %>%
  7. summarise(mean= mean(Values),
  8. median= median(Values))
  9. #> Joining with `by = join_by(id)`
  10. #> `summarise()` has grouped output by 'group'. You can override using the
  11. #> `.groups` argument.
  12. #> # A tibble: 200 × 4
  13. #> # Groups: group [4]
  14. #> group Gene mean median
  15. #> <int> <chr> <dbl> <dbl>
  16. #> 1 1 Gene 1 -0.0642 -0.122
  17. #> 2 1 Gene 10 0.0151 0.563
  18. #> 3 1 Gene 11 -0.0585 -0.0367
  19. #> 4 1 Gene 12 -0.978 -0.917
  20. #> 5 1 Gene 13 -1.01 -1.37
  21. #> 6 1 Gene 14 0.160 -0.394
  22. #> 7 1 Gene 15 -0.295 -0.689
  23. #> 8 1 Gene 16 0.774 0.729
  24. #> 9 1 Gene 17 -0.356 -0.336
  25. #> 10 1 Gene 18 -0.741 -0.103
  26. #> # … with 190 more rows

<sup>2023-04-13创建,使用 reprex v2.0.2</sup>

英文:

So, I worked with the data generation process you provided and came up with a more simple solution. I changed exp1 into a dataframe, brought it in tidy format (pivot_longer()), added the groups from the san dataframe and finally applied the simple dplyr syntax to summarise your data.

  1. library(tidyverse)
  2. as.data.frame(exp1) %&gt;%
  3. rownames_to_column(&quot;Gene&quot;) %&gt;%
  4. pivot_longer(cols= 2:21, names_to = &quot;id&quot;, values_to = &quot;Values&quot;) %&gt;%
  5. left_join(., san) %&gt;%
  6. group_by(group) %&gt;%
  7. summarise(mean= mean(Values),
  8. median= median(Values))
  9. #&gt; Joining with `by = join_by(id)`
  10. #&gt; # A tibble: 4 &#215; 3
  11. #&gt; group mean median
  12. #&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt;
  13. #&gt; 1 1 0.0803 0.0568
  14. #&gt; 2 2 -0.0383 -0.0387
  15. #&gt; 3 3 -0.00929 0.0356
  16. #&gt; 4 4 -0.0840 -0.0306

Considering your comment, simply also group by gene and that gets you the expected output.

  1. library(tidyverse)
  2. as.data.frame(exp1) %&gt;%
  3. rownames_to_column(&quot;Gene&quot;) %&gt;%
  4. pivot_longer(cols= 2:21, names_to = &quot;id&quot;, values_to = &quot;Values&quot;) %&gt;%
  5. left_join(., san) %&gt;%
  6. group_by(group, Gene) %&gt;%
  7. summarise(mean= mean(Values),
  8. median= median(Values))
  9. #&gt; Joining with `by = join_by(id)`
  10. #&gt; `summarise()` has grouped output by &#39;group&#39;. You can override using the
  11. #&gt; `.groups` argument.
  12. #&gt; # A tibble: 200 &#215; 4
  13. #&gt; # Groups: group [4]
  14. #&gt; group Gene mean median
  15. #&gt; &lt;int&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;
  16. #&gt; 1 1 Gene 1 -0.0642 -0.122
  17. #&gt; 2 1 Gene 10 0.0151 0.563
  18. #&gt; 3 1 Gene 11 -0.0585 -0.0367
  19. #&gt; 4 1 Gene 12 -0.978 -0.917
  20. #&gt; 5 1 Gene 13 -1.01 -1.37
  21. #&gt; 6 1 Gene 14 0.160 -0.394
  22. #&gt; 7 1 Gene 15 -0.295 -0.689
  23. #&gt; 8 1 Gene 16 0.774 0.729
  24. #&gt; 9 1 Gene 17 -0.356 -0.336
  25. #&gt; 10 1 Gene 18 -0.741 -0.103
  26. #&gt; # … with 190 more rows

<sup>Created on 2023-04-13 with reprex v2.0.2</sup>

答案3

得分: 0

以下是代码部分的翻译:

  1. ### load data.table
  2. library(data.table)
  3. ### convert data.frames to data.table
  4. exp1 <- as.data.table(exp1)[, Genes := rownames(exp1), ]
  5. san <- as.data.table(san)
  6. ### switch to long format
  7. exp1 <- melt(exp1, id.vars = "Genes", variable.name = "id", value.name = "Expression")
  8. ### join based on sample id
  9. exp1Join <- merge.data.table(exp1, san, by = "id")
  10. ### compute statistics of choice
  11. exp1Join[, .(mean = mean(Expression), median = median(Expression)), by = .(group, Genes)]
  1. exp1 <- as.data.table(exp1)[, `:=`(Genes = rownames(exp1), Experiment = 1), ]
  2. exp2 <- as.data.table(exp2)[, `:=`(Genes = rownames(exp2), Experiment = 2), ]
  3. exp1 <- melt(exp1, id.vars = c("Genes", "Experiment"), variable.name = "id", value.name = "Expression")
  4. exp2 <- melt(exp2, id.vars = c("Genes", "Experiment"), variable.name = "id", value.name = "Expression")
  5. ### combine tables
  6. expCombined <- rbindlist(l = list(exp1, exp2))
  7. expCombined <- merge.data.table(expCombined, san, by = "id")
  8. ### compute the mean, median, sd and sample size for every combination of gene, group, and experiment
  9. expCombined[, .(mean = mean(Expression), median = median(Expression), sd = sd(Expression), N = .N), by = .(group, Genes, Experiment)]

希望这些翻译对您有帮助。

英文:

Just as an additional alternative which scales very well you could use data.table.

  1. ### load data.table
  2. library(data.table)
  3. ### convert data.frames to data.table
  4. exp1 &lt;- as.data.table(exp1)[,Genes:=rownames(exp1),]
  5. san &lt;- as.data.table(san)
  6. ### switch to long format
  7. exp1 &lt;- melt(exp1, id.vars = &quot;Genes&quot;, variable.name = &quot;id&quot;, value.name = &quot;Expression&quot;)
  8. ### join based on sample id
  9. exp1Join &lt;- merge.data.table(exp1, san, by = &quot;id&quot;)
  10. ### compute statistics of choice
  11. exp1Join[,.(mean=mean(Expression), median=median(Expression)),by=.(group, Genes)]

Of course you can also do everything in a combined table if you want to collect all your data and perform computations based on the whole dataset (different experiments).

  1. exp1 &lt;- as.data.table(exp1)[,`:=`(Genes=rownames(exp1), Experiment=1),]
  2. exp2 &lt;- as.data.table(exp2)[,`:=`(Genes=rownames(exp2), Experiment=2),]
  3. exp1 &lt;- melt(exp1, id.vars = c(&quot;Genes&quot;, &quot;Experiment&quot;), variable.name = &quot;id&quot;, value.name = &quot;Expression&quot;)
  4. exp2 &lt;- melt(exp2, id.vars = c(&quot;Genes&quot;, &quot;Experiment&quot;), variable.name = &quot;id&quot;, value.name = &quot;Expression&quot;)
  5. ### combine tables
  6. expCombined &lt;- rbindlist(l = list(exp1, exp2))
  7. expCombined &lt;- merge.data.table(expCombined, san, by = &quot;id&quot;)
  8. ### compute the mean, median, sd and sample size for every combination of gene, group and experiment
  9. expCombined[,.(mean=mean(Expression),
  10. median=median(Expression),
  11. sd=sd(Expression),
  12. N=.N),
  13. by=.(group, Genes, Experiment)]
  14. # group Genes Experiment mean median sd N
  15. # 1: 1 Gene 1 1 -0.29234057 -0.24008726 0.6278528 5
  16. # 2: 1 Gene 2 1 -0.74158796 -0.82441474 0.6289399 5
  17. # 3: 1 Gene 3 1 -0.49293277 -0.30616603 1.1442834 5
  18. # 4: 1 Gene 4 1 -0.33610311 -0.43948117 0.5331471 5
  19. # 5: 1 Gene 5 1 0.68955333 0.60701836 0.9475727 5
  20. # ---
  21. #396: 4 Gene 46 2 1.17036249 1.17036249 0.4885201 2
  22. #397: 4 Gene 47 2 0.64894986 0.64894986 0.1122624 2
  23. #398: 4 Gene 48 2 -1.61083175 -1.61083175 0.6319153 2
  24. #399: 4 Gene 49 2 -0.07673634 -0.07673634 0.7263174 2
  25. #400: 4 Gene 50 2 -0.37240955 -0.37240955 0.8037523 2

Also just as comparison I included a small test just for exp1 based on the original post, the provided Tidyverse solution, and the vapply approach. Obviously benchmarks like this make more sense when data sets are large.

  1. Unit: microseconds
  2. expr min lq mean median uq max neval cld
  3. TidyWay 57902.546 61651.077 76529.3966 67526.432 79027.012 172911.906 100 a
  4. DTWay 2159.780 2490.218 3225.3781 2592.081 2960.918 17196.365 100 b
  5. OrgWay 7459.775 8249.155 10667.4395 9224.186 11740.072 27480.962 100 c
  6. VApplyWay 87.618 133.598 168.3478 146.398 189.990 782.736 100 b

huangapple
  • 本文由 发表于 2023年4月13日 16:42:27
  • 转载请务必保留本文链接:https://go.coder-hub.com/76003410.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定