如何根据列名中的模式计算数据框中的均值?

huangapple go评论94阅读模式
英文:

How to calculate the mean in a data.frame based on patterns in the column names?

问题

我想计算上面示例中每个t1t2t3等的均值。

英文:

I have a data.frame with 21 columns, each three columns are for a specific time-point. I would like to calculate the mean of each of these triplicates.

   geneName  t11  t12  t13  t21  t22  t23  t31  t32  t33  t41  t42  t43  t51 t52  t53  t61  t62  t63
gene1 gene1 3296 5133 3466 2166 1759 2099 1916 1379 1570 2533 1794 1016  800  79  648   99   60  152
gene2 gene2 4210 5505 4173 2736 2748 3052 2409 1944 2237 1158 3475 1488 4023 102  940  265  365  124
...

In the example above, I would like to calculate the mean for the three t1, t2, t3, etc.

答案1

得分: 5

这是一个基本的R方法 -

# 从计算中排除的列编号
columns_not_included <- 1
# 保留我们需要的数据集子集
tmp <- df[-columns_not_included]

cbind.data.frame(df[columns_not_included], 
              sapply(split.default(tmp, 
               sub('\\d$', '', names(tmp))), rowMeans, na.rm =TRUE))

#  geneName       t1       t2       t3       t4       t5       t6
#1    gene1 3965.000 2008.000 1621.667 1781.000  509.000 103.6667
#2    gene2 4629.333 2845.333 2196.667 2040.333 1688.333 251.3333

sub('\\d$', '', names(tmp)) 会移除列名的最后一个数字,以便 t11t12t13 都会返回 t1,其他组也是如此,然后传递给 split.default,根据这些组将数据拆分为数据框的列表。对于每个组,我们使用 rowMeans 计算逐行均值。

英文:

Here is a base R method -

# Column numbers to be ignored from the calculation
columns_not_included &lt;- 1
# Keep only subset of dataset that we need
tmp &lt;- df[-columns_not_included]

cbind.data.frame(df[columns_not_included], 
              sapply(split.default(tmp, 
               sub(&#39;\\d$&#39;, &#39;&#39;, names(tmp))), rowMeans, na.rm =TRUE))

#  geneName       t1       t2       t3       t4       t5       t6
#1    gene1 3965.000 2008.000 1621.667 1781.000  509.000 103.6667
#2    gene2 4629.333 2845.333 2196.667 2040.333 1688.333 251.3333

sub(&#39;\\d$&#39;, &#39;&#39;, names(tmp)) would remove the last number of the column names so that t11, t12 and t13 would all return only t1 and same for other groups which is passed to split.default that splits the data in list of dataframe based on these groups. For each group we calculate row wise mean using rowMeans.

答案2

得分: 2

A tidyverse solution:

library(tidyverse)

df %>%
  pivot_longer(-geneName, names_to = ".value", names_pattern = "(t.)") %>%
  summarise(across(everything(), mean), .by = geneName)

# A tibble: 2 × 7

geneName t1 t2 t3 t4 t5 t6

1 gene1 3965 2008 1622. 1781 509 104.

2 gene2 4629. 2845. 2197. 2040. 1688. 251.


<details>
<summary>英文:</summary>

A `tidyverse` solution:

```r
library(tidyverse)

df %&gt;%
  pivot_longer(-geneName, names_to = &quot;.value&quot;, names_pattern = &quot;(t.)&quot;) %&gt;%
  summarise(across(everything(), mean), .by = geneName)

# # A tibble: 2 &#215; 7
#   geneName    t1    t2    t3    t4    t5    t6
#   &lt;chr&gt;    &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
# 1 gene1    3965  2008  1622. 1781   509   104.
# 2 gene2    4629. 2845. 2197. 2040. 1688.  251.

答案3

得分: 1

另一种基于基础R的方法:

g1 <- t(substr(names(df[-1]), 1, 2))[rep(1, nrow(df)),]
cbind(df[1], tapply(unlist(df[-1]), list(row(g1), g1), mean))

基因名称 t1 t2 t3 t4 t5 t6
1 gene1 3965.000 2008.000 1621.667 1781.000 509.000 103.6667
2 gene2 4629.333 2845.333 2196.667 2040.333 1688.333 251.3333

英文:

Another base R approach:

g1 &lt;- t(substr(names(df[-1]), 1, 2))[rep(1, nrow(df)),]
cbind(df[1], tapply(unlist(df[-1]), list(row(g1), g1), mean))

  geneName       t1       t2       t3       t4       t5       t6
1    gene1 3965.000 2008.000 1621.667 1781.000  509.000 103.6667
2    gene2 4629.333 2845.333 2196.667 2040.333 1688.333 251.3333

答案4

得分: 0

df <- read.table(text =
"geneName  t11  t12  t13  t21  t22  t23  t31  t32  t33  t41  t42  t43  t51 t52  t53  t61  t62  t63
gene1 3296 5133 3466 2166 1759 2099 1916 1379 1570 2533 1794 1016  800  79  648   99   60  152
gene2 4210 5505 4173 2736 2748 3052 2409 1944 2237 1158 3475 1488 4023 102  940  265  365  124", header = TRUE)

library(tidyverse)

df %>% 
  pivot_longer(cols = -geneName) %>% 
  mutate(name = substr(name, 1, 2)) %>%
  group_by(geneName, name) %>%
  summarise(mean = mean(value))
英文:
df&lt;- read.table(text =
&quot;geneName  t11  t12  t13  t21  t22  t23  t31  t32  t33  t41  t42  t43  t51 t52  t53  t61  t62  t63
gene1 3296 5133 3466 2166 1759 2099 1916 1379 1570 2533 1794 1016  800  79  648   99   60  152
gene2 4210 5505 4173 2736 2748 3052 2409 1944 2237 1158 3475 1488 4023 102  940  265  365  124&quot;, header = TRUE)

library(tidyverse)

df %&gt;% 
  pivot_longer(cols = -geneName) %&gt;% 
  mutate(name = substr(name, 1, 2)) %&gt;%
  group_by(geneName, name) %&gt;%
  summarise(mean = mean(value))

# A tibble: 12 &#215; 3
# Groups:   geneName [2]
   geneName name   mean
   &lt;chr&gt;    &lt;chr&gt; &lt;dbl&gt;
 1 gene1    t1    3965 
 2 gene1    t2    2008 
 3 gene1    t3    1622.
 4 gene1    t4    1781 
 5 gene1    t5     509 
 6 gene1    t6     104.
 7 gene2    t1    4629.
 8 gene2    t2    2845.
 9 gene2    t3    2197.
10 gene2    t4    2040.
11 gene2    t5    1688.
12 gene2    t6     251.

答案5

得分: 0

以下是代码的中文翻译部分:

library(tidyverse)

test_data |&gt;
  pivot_longer(-geneName, 
               names_to = &quot;time&quot;, 
               names_pattern = &quot;t(\\d+)&quot;, 
               names_transform = as.numeric) |&gt;
  arrange(geneName, time) |&gt;
  mutate(group = cumsum(time - lag(time, default = first(time)-1)!=1)) |&gt;
  nest(data = -c(geneName, group)) |&gt;
  transmute(geneName = geneName,
            time_range = map_chr(data, ~glue::glue(&quot;t{mn}-t{mx}&quot;, 
                                                mn = min(.x$time),
                                                mx = max(.x$time))),
            mean = map_dbl(data, ~mean(.x$value))) |&gt;
  pivot_wider(names_from = time_range, values_from = mean)
#&gt; # A tibble: 2 x 7
#&gt;   geneName `t11-t13` `t21-t23` `t31-t33` `t41-t43` `t51-t53` `t61-t63`
#&gt;   &lt;chr&gt;        &lt;dbl&gt;     &lt;dbl&gt;     &lt;dbl&gt;     &lt;dbl&gt;     &lt;dbl&gt;     &lt;dbl&gt;
#&gt; 1 gene1        3965      2008      1622.     1781       509       104.
#&gt; 2 gene2        4629.     2845.     2197.     2040.     1688.      251.

希望这对你有帮助。如果有其他翻译需求,请告诉我。

英文:

Here is another option:

library(tidyverse)

test_data |&gt;
  pivot_longer(-geneName, 
               names_to = &quot;time&quot;, 
               names_pattern = &quot;t(\\d+)&quot;, 
               names_transform = as.numeric) |&gt;
  arrange(geneName, time) |&gt;
  mutate(group = cumsum(time - lag(time, default = first(time)-1)!=1)) |&gt;
  nest(data = -c(geneName, group)) |&gt;
  transmute(geneName = geneName,
            time_range = map_chr(data, ~glue::glue(&quot;t{mn}-t{mx}&quot;, 
                                                mn = min(.x$time),
                                                mx = max(.x$time))),
            mean = map_dbl(data, ~mean(.x$value))) |&gt;
  pivot_wider(names_from = time_range, values_from = mean)
#&gt; # A tibble: 2 x 7
#&gt;   geneName `t11-t13` `t21-t23` `t31-t33` `t41-t43` `t51-t53` `t61-t63`
#&gt;   &lt;chr&gt;        &lt;dbl&gt;     &lt;dbl&gt;     &lt;dbl&gt;     &lt;dbl&gt;     &lt;dbl&gt;     &lt;dbl&gt;
#&gt; 1 gene1        3965      2008      1622.     1781       509       104.
#&gt; 2 gene2        4629.     2845.     2197.     2040.     1688.      251.

答案6

得分: 0

尽管我喜欢Ronak的简洁答案,但我仍然更喜欢使用tidyverse语法进行计算。我觉得它更干净、更清晰。

但还是谢谢你的帮助。

test.mean <- test_data %>%
  group_by(gene) %>%
  rowwise() %>%
  mutate("meant1" = mean(c(t11, t12, t13))) %>%
  mutate("meant2" = mean(c(t21, t22, t23))) %>%
  mutate("meant3" = mean(c(t31, t32, t33))) %>%
  mutate("meant4" = mean(c(t41, t42, t43))) %>%
  mutate("meant5" = mean(c(t51, t52, t53))) %>%
  mutate("meant6" = mean(c(t61, t62, t63))) %>%
  select(matches("geneName|mean")) %>%
  column_to_rownames("geneName") %>%
  as.data.frame()
英文:

Even though I like the slick answer from Ronak, I still prefer the calculation using tidyverse grammer. I find it to be cleaner and clearer.

But thanks for the help

test.mean &lt;- test_data |&gt;
group_by(gene)
dplyr::rowwise() |&gt; 
mutate(&quot;meant1&quot; = mean(c(t11,t12,t13))) |&gt;
mutate(&quot;meant2&quot; = mean(c(t21,t22,t23))) |&gt;
mutate(&quot;meant3&quot; = mean(c(t31,t32,t33))) |&gt;
mutate(&quot;meant4&quot; = mean(c(t41,t42,t43))) |&gt;
mutate(&quot;meant5&quot; = mean(c(t51,t52,t53))) |&gt;
mutate(&quot;meant6&quot; = mean(c(t61,t62,t63))) |&gt;
select(matches(&quot;geneName|mean&quot;)) |&gt; 
column_to_rownames(&quot;geneName&quot;) |&gt;
as.data.frame()

huangapple
  • 本文由 发表于 2023年7月4日 21:48:12
  • 转载请务必保留本文链接:https://go.coder-hub.com/76613312.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定