dplyr – get certain summary statistics for multiple columns of a dataframe


I want to create a table of summary statistics by applying several summary functions to multiple variables. I've managed to do it using summarise and across, but the result is a wide dataframe that is hard to read. Is there a better alternative (perhaps using purrr), or is there an easy way of reshaping the data?

Here is a reproducible example (the funs list contains additional functions I've created myself):

data <- as.data.frame(cbind(estimator1 = rnorm(3), 
                            estimator2 = runif(3)))
funs <- list(mean = mean, median = median)

If I use summarise and across I obtain:

estimator1_mean estimator1_median estimator2_mean estimator2_median
0.9506083          1.138536       0.5789924         0.7598719
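The question does not show the call that produced this wide output. A minimal sketch of what it presumably looked like, assuming dplyr is attached and that the functions are passed to across() as the named list funs (the seed is an addition here for reproducibility, so the numbers will differ from those above):

```r
library(dplyr)

set.seed(42)  # assumption: seed added so the sketch is reproducible
data <- data.frame(estimator1 = rnorm(3), estimator2 = runif(3))
funs <- list(mean = mean, median = median)

# One row, one column per (variable, function) pair.
# With a named list, across() names the outputs "<column>_<function>".
wide <- data |>
  summarise(across(everything(), funs))

names(wide)
```

The column names follow the default "{.col}_{.fn}" pattern, which is what makes the later reshaping step possible.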

What I would like to obtain is:

         estimator1 estimator2
mean     0.9506083  0.5789924        
median   1.138536   0.7598719

Score: 2

Base R approach:

Using sapply:

sapply(data, \(x) sapply(funs, \(f) f(x))) nests two sapply() calls. The outer call iterates over the columns of data; for each column x, the inner call applies every function f in funs to x.

Both \(x) and \(f) are anonymous functions written with the backslash shorthand for function(x) and function(f), which requires R 4.1.0 or later.

Given funs <- list(mean = mean, median = median), the call applies mean() and median() to each column of data and returns a matrix with one row per function and one column per variable:

sapply(data, \(x) sapply(funs, \(f) f(x) ))
       estimator1 estimator2
mean    0.3081365  0.4251447
median  0.2159416  0.3198206
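As a slightly stricter variant of the same idea, the inner sapply() can be swapped for vapply(), which fails loudly if a function in funs returns something other than a single number. This is a sketch rather than part of the original answer; the seed and the conversion to a data frame at the end are assumptions added for illustration.

```r
set.seed(1)  # assumption: seed added so the sketch is reproducible
data <- data.frame(estimator1 = rnorm(3), estimator2 = runif(3))
funs <- list(mean = mean, median = median)

# Outer loop: columns of data. Inner loop: functions in funs.
# vapply() checks that every function returns exactly one numeric value.
res <- sapply(data, function(x) vapply(funs, function(f) f(x), numeric(1)))

# res is a numeric matrix: rows are named after funs, columns after data.
# Optional: convert to a data frame, keeping the statistic names as row names.
res_df <- as.data.frame(res)
```

function(x) is used instead of \(x) here only so that the sketch also runs on R versions older than 4.1.0.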


Score: 1

You can use pivot_longer() with ".value" in names_to. The ".value" sentinel indicates that the corresponding component of the column name defines the name of the output column containing the cell values, overriding values_to entirely. For example:

  data |>
    summarise(across(everything(), list(mean = mean, median = median, var = var))) |>
    tidyr::pivot_longer(cols = everything(), names_to = c(".value", "stats"), names_sep = "_")

  stats  estimator1 estimator2
  <chr>       <dbl>      <dbl>
1 mean        0.221    0.448  
2 median      0.110    0.429  
3 var         0.770    0.00288
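One caveat with names_sep = "_": it relies on the underscore appearing only between the variable name and the function name. If the variable names themselves contain underscores, the names split into more pieces than names_to expects. A possible workaround, sketched below, is to split on the last underscore with names_pattern instead. The column names est_one and est_two are hypothetical, and the sketch assumes the function names in funs contain no underscores.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # assumption: seed added so the sketch is reproducible
# Hypothetical column names that contain underscores themselves.
data <- data.frame(est_one = rnorm(3), est_two = runif(3))
funs <- list(mean = mean, median = median)

res <- data |>
  summarise(across(everything(), funs)) |>
  pivot_longer(
    cols = everything(),
    names_to = c(".value", "stats"),
    # First group: everything before the last underscore (the variable).
    # Second group: the trailing piece without underscores (the function).
    names_pattern = "^(.*)_([^_]+)$"
  )

res
```

The result has one row per statistic and one column per variable, the same layout as above.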

