将apply函数与lapply结合使用:计算数据框中各组的均值。

huangapple go评论66阅读模式
英文:

Combine apply function with lapply: calculate mean of groups in df

问题

从具有不同组的每个样本(列)的单个表达值(行)的两个数据框中,我想要计算每个组的平均值和中位数。
我的解决方案似乎有点冗长,我想知道是否有更加优雅的解决方案。

数据

# 表达值
genes <- paste("gene",1:1000,sep="")
x <- list(
  A = sample(genes,300), 
  B = sample(genes,525), 
  C = sample(genes,440),
  D = sample(genes,350)
)

# 表达数据框
crete_exp_df <- function(gene_nr, sample_nr){
  df <- replicate(sample_nr, rnorm(gene_nr))
  rownames(df) <- paste("Gene", c(1:nrow(df)))
  colnames(df) <- paste("Sample", c(1:ncol(df)))
  return(df)
}

exp1 <- crete_exp_df(50, 20)
exp2 <- crete_exp_df(50, 20)

# 样本注释
san <- data.frame(
  id = colnames(exp1),
  group = sample(1:4, 20, replace = TRUE))

解决方案

# 获取每组样本的ID
ids_1 <- san %>% filter(group == 1) %>% pull(id)
ids_2 <- san %>% filter(group == 2) %>% pull(id)
ids_3 <- san %>% filter(group == 3) %>% pull(id)
ids_4 <- san %>% filter(group == 4) %>% pull(id)
id_list <- list(group1 = ids_1, group2 = ids_2, group3 = ids_3, group4 = ids_4)

# 函数计算df1的均值
get_means_exp1 <- function(id){
  apply(exp1[, id], 1, mean, na.rm = T)
} 
# 函数计算df2的均值
get_means_exp2 <- function(id){
  apply(exp2[, id], 1, mean, na.rm = T)
} 
# 对df1应用lapply
list_means_exp1 <- lapply(id_list, get_means_exp1)
means_exp1 <- as.data.frame(list_means_exp1)
# 对df2应用lapply
list_means_exp2 <- lapply(id_list, get_means_exp2)
means_exp2 <- as.data.frame(list_means_exp2)

我认为这可以更加优雅地解决。具体来说,如何获取每个组的ID并编写一个适用于两个数据框的函数。
期待从您的解决方案中学到更多,非常感谢!

英文:

From two dataframes with single expression values (rows) per sample (cols) of different groups, I want to calculate the mean and median per group.
My solution seems a bit verbose and I wonder if there is a more elegant solution.

Data

# expression values
genes <- paste("gene",1:1000,sep="")
x <- list(
  A = sample(genes,300), 
  B = sample(genes,525), 
  C = sample(genes,440),
  D = sample(genes,350)
)

# expression dataframe
crete_exp_df <- function(gene_nr, sample_nr){
  df <- replicate(sample_nr, rnorm(gene_nr))
  rownames(df) <- paste("Gene", c(1:nrow(df)))
  colnames(df) <- paste("Sample", c(1:ncol(df)))
  return(df)
}

exp1 <- crete_exp_df(50, 20)
exp2 <- crete_exp_df(50, 20)

# sample annotation
san <- data.frame(
  id = colnames(exp1),
  group = sample(1:4, 20, replace = TRUE))

Solution

# get ids of samples per group
ids_1 <- san %>% filter(group == 1) %>% pull(id)
ids_2 <- san %>% filter(group == 2) %>% pull(id)
ids_3 <- san %>% filter(group == 3) %>% pull(id)
ids_4 <- san %>% filter(group == 4) %>% pull(id)
id_list <- list(group1 = ids_1, group2 = ids_2, group3 = ids_3, group4 = ids_4)

# fct means df1
get_means_exp1 <- function(id){
  apply(exp1[, id], 1, mean, na.rm = T)
} 
# fct means df2
get_means_exp2 <- function(id){
  apply(exp2[, id], 1, mean, na.rm = T)
} 
# lapply on df1
list_means_exp1 <- lapply(id_list, get_means_exp1)
means_exp1 <- as.data.frame(list_means_exp1)
# lapply on df2
list_means_exp2 <- lapply(id_list, get_means_exp2)
means_exp2 <- as.data.frame(list_means_exp2)

I suppose this can be solved much more elegant. Specifically, how to get the ids per group and write a function that works for both df.
Looking forwards to learning from your solutions, thanks a lot!

答案1

得分: 3

在使用apply(., 1, FUN)之前,始终明智的做法是检查是否有矢量化的函数可用,因为它们速度更快。对于行的算术均值,可以使用base::rowMeans。对于中位数,我们可以使用matrixStats::rowMedians。对于行均值,还可以使用matrixStats::rowMeans2,它略快一些。在这里使用vapply是有道理的,它类似于lapply,但方便地生成一个矩阵,并且在*apply系列中速度最快,因为我们可以预先分配内存。(注意: 我使用了set.seed(42)来创建您的数据。)

所以也许您正在寻找这个:

vapply(id_list, \(x) rowMeans(exp1[, x]), numeric(dim(exp1)[1]))
#              group1       group2       group3      group4
# Gene 1  -1.35631700 -0.328620048  0.160795323 -0.01011904
# Gene 2   0.33985130  0.432482763 -0.169343033  0.13019294
# Gene 3   0.46623064  0.154045975  0.362607622  0.58710492
# Gene 4   0.17049403 -0.036744170 -0.056742305  1.10934764
# Gene 5  -0.15515465  0.237211068 -0.426415836 -0.50977736

vapply(id_list, \(x) matrixStats::rowMedians(exp1[, x], useNames=TRUE), numeric(dim(exp1)[1]))
#              group1      group2       group3        group4
# Gene 1  -1.22551737 -0.41642403  0.470862918 -1.782411e-01
# Gene 2   0.05680326  0.62277321 -0.512487033  3.943679e-01
# Gene 3   0.58009311 -0.10696651  0.149054062  9.345673e-01
# Gene 4   0.09852832  0.12774134 -0.573525823  1.046751e+00
# Gene 5  -0.44076823  0.11716389 -0.381682466 -8.480807e-01
英文:

Before using apply(., 1, FUN), it's always wise to check, if there is a vectorized function available because they're much faster. For the arithmetic mean of the rows there is base::rowMeans. For the medians we can use matrixStats::rowMedians. For row means you could also use matrixStats::rowMeans2, which is slightly faster. It makes sense to use vapply here, it is similar to lapply, but conveniently yields a matrix and is fastest in the *apply family, because we can pre-allocate memory. (Note: I used set.seed(42) to create your data.)

So maybe you are looking for this:

vapply(id_list, \(x) rowMeans(exp1[, x]), numeric(dim(exp1)[1]))
#              group1       group2       group3      group4
# Gene 1  -1.35631700 -0.328620048  0.160795323 -0.01011904
# Gene 2   0.33985130  0.432482763 -0.169343033  0.13019294
# Gene 3   0.46623064  0.154045975  0.362607622  0.58710492
# Gene 4   0.17049403 -0.036744170 -0.056742305  1.10934764
# Gene 5  -0.15515465  0.237211068 -0.426415836 -0.50977736

vapply(id_list, \(x) matrixStats::rowMedians(exp1[, x], useNames=TRUE), numeric(dim(exp1)[1]))
#              group1      group2       group3        group4
# Gene 1  -1.22551737 -0.41642403  0.470862918 -1.782411e-01
# Gene 2   0.05680326  0.62277321 -0.512487033  3.943679e-01
# Gene 3   0.58009311 -0.10696651  0.149054062  9.345673e-01
# Gene 4   0.09852832  0.12774134 -0.573525823  1.046751e+00
# Gene 5  -0.44076823  0.11716389 -0.381682466 -8.480807e-01

答案2

得分: 2

以下是代码的翻译部分:

library(tidyverse)

as.data.frame(exp1) %>%
  rownames_to_column("Gene") %>%
  pivot_longer(cols= 2:21, names_to = "id", values_to = "Values") %>%
  left_join(., san) %>%
  group_by(group) %>%
  summarise(mean= mean(Values),
            median= median(Values))
#> Joining with `by = join_by(id)`
#> # A tibble: 4 × 3
#>   group     mean  median
#>   <int>    <dbl>   <dbl>
#> 1     1  0.0803   0.0568
#> 2     2 -0.0383  -0.0387
#> 3     3 -0.00929  0.0356
#> 4     4 -0.0840  -0.0306

根据您的评论,也可以通过基因分组,获得预期的输出。

library(tidyverse)

as.data.frame(exp1) %>%
  rownames_to_column("Gene") %>%
  pivot_longer(cols= 2:21, names_to = "id", values_to = "Values") %>%
  left_join(., san) %>%
  group_by(group, Gene) %>%
  summarise(mean= mean(Values),
            median= median(Values))
#> Joining with `by = join_by(id)`
#> `summarise()` has grouped output by 'group'. You can override using the
#> `.groups` argument.
#> # A tibble: 200 × 4
#> # Groups:   group [4]
#>    group Gene       mean  median
#>    <int> <chr>     <dbl>   <dbl>
#>  1     1 Gene 1  -0.0642 -0.122 
#>  2     1 Gene 10  0.0151  0.563 
#>  3     1 Gene 11 -0.0585 -0.0367
#>  4     1 Gene 12 -0.978  -0.917 
#>  5     1 Gene 13 -1.01   -1.37  
#>  6     1 Gene 14  0.160  -0.394 
#>  7     1 Gene 15 -0.295  -0.689 
#>  8     1 Gene 16  0.774   0.729 
#>  9     1 Gene 17 -0.356  -0.336 
#> 10     1 Gene 18 -0.741  -0.103 
#> # … with 190 more rows

<sup>2023-04-13创建,使用 reprex v2.0.2</sup>

英文:

So, I worked with the data generation process you provided and came up with a more simple solution. I changed exp1 into a dataframe, brought it in tidy format (pivot_longer()), added the groups from the san dataframe and finally applied the simple dplyr syntax to summarise your data.

library(tidyverse)

as.data.frame(exp1) %&gt;%
  rownames_to_column(&quot;Gene&quot;) %&gt;%
  pivot_longer(cols= 2:21, names_to = &quot;id&quot;, values_to = &quot;Values&quot;) %&gt;%
  left_join(., san) %&gt;%
  group_by(group) %&gt;%
  summarise(mean= mean(Values),
            median= median(Values))
#&gt; Joining with `by = join_by(id)`
#&gt; # A tibble: 4 &#215; 3
#&gt;   group     mean  median
#&gt;   &lt;int&gt;    &lt;dbl&gt;   &lt;dbl&gt;
#&gt; 1     1  0.0803   0.0568
#&gt; 2     2 -0.0383  -0.0387
#&gt; 3     3 -0.00929  0.0356
#&gt; 4     4 -0.0840  -0.0306

Considering your comment, simply also group by gene and that gets you the expected output.

library(tidyverse)

as.data.frame(exp1) %&gt;%
  rownames_to_column(&quot;Gene&quot;) %&gt;%
  pivot_longer(cols= 2:21, names_to = &quot;id&quot;, values_to = &quot;Values&quot;) %&gt;%
  left_join(., san) %&gt;%
  group_by(group, Gene) %&gt;%
  summarise(mean= mean(Values),
            median= median(Values))
#&gt; Joining with `by = join_by(id)`
#&gt; `summarise()` has grouped output by &#39;group&#39;. You can override using the
#&gt; `.groups` argument.
#&gt; # A tibble: 200 &#215; 4
#&gt; # Groups:   group [4]
#&gt;    group Gene       mean  median
#&gt;    &lt;int&gt; &lt;chr&gt;     &lt;dbl&gt;   &lt;dbl&gt;
#&gt;  1     1 Gene 1  -0.0642 -0.122 
#&gt;  2     1 Gene 10  0.0151  0.563 
#&gt;  3     1 Gene 11 -0.0585 -0.0367
#&gt;  4     1 Gene 12 -0.978  -0.917 
#&gt;  5     1 Gene 13 -1.01   -1.37  
#&gt;  6     1 Gene 14  0.160  -0.394 
#&gt;  7     1 Gene 15 -0.295  -0.689 
#&gt;  8     1 Gene 16  0.774   0.729 
#&gt;  9     1 Gene 17 -0.356  -0.336 
#&gt; 10     1 Gene 18 -0.741  -0.103 
#&gt; # … with 190 more rows

<sup>Created on 2023-04-13 with reprex v2.0.2</sup>

答案3

得分: 0

以下是代码部分的翻译:

### load data.table
library(data.table)

### convert data.frames to data.table
exp1 <- as.data.table(exp1)[, Genes := rownames(exp1), ]
san <- as.data.table(san)

### switch to long format
exp1 <- melt(exp1, id.vars = "Genes", variable.name = "id", value.name = "Expression")

### join based on sample id
exp1Join <- merge.data.table(exp1, san, by = "id")

### compute statistics of choice
exp1Join[, .(mean = mean(Expression), median = median(Expression)), by = .(group, Genes)]
exp1 <- as.data.table(exp1)[, `:=`(Genes = rownames(exp1), Experiment = 1), ]
exp2 <- as.data.table(exp2)[, `:=`(Genes = rownames(exp2), Experiment = 2), ]

exp1 <- melt(exp1, id.vars = c("Genes", "Experiment"), variable.name = "id", value.name = "Expression")
exp2 <- melt(exp2, id.vars = c("Genes", "Experiment"), variable.name = "id", value.name = "Expression")

### combine tables
expCombined <- rbindlist(l = list(exp1, exp2))
expCombined <- merge.data.table(expCombined, san, by = "id")

### compute the mean, median, sd and sample size for every combination of gene, group, and experiment
expCombined[, .(mean = mean(Expression), median = median(Expression), sd = sd(Expression), N = .N), by = .(group, Genes, Experiment)]

希望这些翻译对您有帮助。

英文:

Just as an additional alternative which scales very well you could use data.table.

### load data.table
library(data.table)

### convert data.frames to data.table
exp1 &lt;- as.data.table(exp1)[,Genes:=rownames(exp1),]
san &lt;- as.data.table(san)

### switch to long format
exp1 &lt;- melt(exp1, id.vars = &quot;Genes&quot;, variable.name = &quot;id&quot;, value.name = &quot;Expression&quot;)

### join based on sample id
exp1Join &lt;- merge.data.table(exp1, san, by = &quot;id&quot;)

### compute statistics of choice
exp1Join[,.(mean=mean(Expression), median=median(Expression)),by=.(group, Genes)]

Of course you can also do everything in a combined table if you want to collect all your data and perform computations based on the whole dataset (different experiments).

exp1 &lt;- as.data.table(exp1)[,`:=`(Genes=rownames(exp1), Experiment=1),]
exp2 &lt;- as.data.table(exp2)[,`:=`(Genes=rownames(exp2), Experiment=2),]

exp1 &lt;- melt(exp1, id.vars = c(&quot;Genes&quot;, &quot;Experiment&quot;), variable.name = &quot;id&quot;, value.name = &quot;Expression&quot;)
exp2 &lt;- melt(exp2, id.vars = c(&quot;Genes&quot;, &quot;Experiment&quot;), variable.name = &quot;id&quot;, value.name = &quot;Expression&quot;)

### combine tables
expCombined &lt;- rbindlist(l = list(exp1, exp2))
expCombined &lt;- merge.data.table(expCombined, san, by = &quot;id&quot;)

### compute the mean, median, sd and sample size for every combination of gene, group and experiment
expCombined[,.(mean=mean(Expression),
               median=median(Expression),
               sd=sd(Expression),
               N=.N),
            by=.(group, Genes, Experiment)]

#     group   Genes Experiment        mean      median        sd N
#  1:     1  Gene 1          1 -0.29234057 -0.24008726 0.6278528 5
#  2:     1  Gene 2          1 -0.74158796 -0.82441474 0.6289399 5
#  3:     1  Gene 3          1 -0.49293277 -0.30616603 1.1442834 5
#  4:     1  Gene 4          1 -0.33610311 -0.43948117 0.5331471 5
#  5:     1  Gene 5          1  0.68955333  0.60701836 0.9475727 5
# ---                                                             
#396:     4 Gene 46          2  1.17036249  1.17036249 0.4885201 2
#397:     4 Gene 47          2  0.64894986  0.64894986 0.1122624 2
#398:     4 Gene 48          2 -1.61083175 -1.61083175 0.6319153 2
#399:     4 Gene 49          2 -0.07673634 -0.07673634 0.7263174 2
#400:     4 Gene 50          2 -0.37240955 -0.37240955 0.8037523 2

Also just as comparison I included a small test just for exp1 based on the original post, the provided Tidyverse solution, and the vapply approach. Obviously benchmarks like this make more sense when data sets are large.

Unit: microseconds
      expr       min        lq       mean    median        uq        max neval cld
   TidyWay 57902.546 61651.077 76529.3966 67526.432 79027.012 172911.906   100 a  
     DTWay  2159.780  2490.218  3225.3781  2592.081  2960.918  17196.365   100  b 
    OrgWay  7459.775  8249.155 10667.4395  9224.186 11740.072  27480.962   100   c
 VApplyWay    87.618   133.598   168.3478   146.398   189.990    782.736   100  b 

huangapple
  • 本文由 发表于 2023年4月13日 16:42:27
  • 转载请务必保留本文链接:https://go.coder-hub.com/76003410.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定