英文:
How to calculate the mean in a data.frame based on patterns in the column names?
问题
我想计算上面示例中每个t1
,t2
,t3
等的均值。
英文:
I have a data.frame with 21 columns, each three columns are for a specific time-point. I would like to calculate the mean of each of these triplicates.
geneName t11 t12 t13 t21 t22 t23 t31 t32 t33 t41 t42 t43 t51 t52 t53 t61 t62 t63
gene1 gene1 3296 5133 3466 2166 1759 2099 1916 1379 1570 2533 1794 1016 800 79 648 99 60 152
gene2 gene2 4210 5505 4173 2736 2748 3052 2409 1944 2237 1158 3475 1488 4023 102 940 265 365 124
...
In the example above, I would like to calculate the mean for the three t1
, t2
, t3
, etc.
答案1
得分: 5
这是一个基本的R方法 -
# 从计算中排除的列编号
columns_not_included <- 1
# 保留我们需要的数据集子集
tmp <- df[-columns_not_included]
cbind.data.frame(df[columns_not_included],
sapply(split.default(tmp,
sub('\\d$', '', names(tmp))), rowMeans, na.rm =TRUE))
# geneName t1 t2 t3 t4 t5 t6
#1 gene1 3965.000 2008.000 1621.667 1781.000 509.000 103.6667
#2 gene2 4629.333 2845.333 2196.667 2040.333 1688.333 251.3333
sub('\\d$', '', names(tmp))
会移除列名的最后一个数字,以便 t11
、 t12
和 t13
都会返回 t1
,其他组也是如此,然后传递给 split.default
,根据这些组将数据拆分为数据框的列表。对于每个组,我们使用 rowMeans
计算逐行均值。
英文:
Here is a base R method -
# Column numbers to be ignored from the calculation
columns_not_included <- 1
# Keep only subset of dataset that we need
tmp <- df[-columns_not_included]
cbind.data.frame(df[columns_not_included],
sapply(split.default(tmp,
sub('\\d$', '', names(tmp))), rowMeans, na.rm =TRUE))
# geneName t1 t2 t3 t4 t5 t6
#1 gene1 3965.000 2008.000 1621.667 1781.000 509.000 103.6667
#2 gene2 4629.333 2845.333 2196.667 2040.333 1688.333 251.3333
sub('\\d$', '', names(tmp))
would remove the last number of the column names so that t11
, t12
and t13
would all return only t1
and same for other groups which is passed to split.default
that splits the data in list of dataframe based on these groups. For each group we calculate row wise mean using rowMeans
.
答案2
得分: 2
A tidyverse
solution:
library(tidyverse)
df %>%
pivot_longer(-geneName, names_to = ".value", names_pattern = "(t.)") %>%
summarise(across(everything(), mean), .by = geneName)
# A tibble: 2 × 7
geneName t1 t2 t3 t4 t5 t6
1 gene1 3965 2008 1622. 1781 509 104.
2 gene2 4629. 2845. 2197. 2040. 1688. 251.
<details>
<summary>英文:</summary>
A `tidyverse` solution:
```r
library(tidyverse)
df %>%
pivot_longer(-geneName, names_to = ".value", names_pattern = "(t.)") %>%
summarise(across(everything(), mean), .by = geneName)
# # A tibble: 2 × 7
# geneName t1 t2 t3 t4 t5 t6
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 gene1 3965 2008 1622. 1781 509 104.
# 2 gene2 4629. 2845. 2197. 2040. 1688. 251.
答案3
得分: 1
另一种基于基础R的方法:
g1 <- t(substr(names(df[-1]), 1, 2))[rep(1, nrow(df)),]
cbind(df[1], tapply(unlist(df[-1]), list(row(g1), g1), mean))
基因名称 t1 t2 t3 t4 t5 t6
1 gene1 3965.000 2008.000 1621.667 1781.000 509.000 103.6667
2 gene2 4629.333 2845.333 2196.667 2040.333 1688.333 251.3333
英文:
Another base R approach:
g1 <- t(substr(names(df[-1]), 1, 2))[rep(1, nrow(df)),]
cbind(df[1], tapply(unlist(df[-1]), list(row(g1), g1), mean))
geneName t1 t2 t3 t4 t5 t6
1 gene1 3965.000 2008.000 1621.667 1781.000 509.000 103.6667
2 gene2 4629.333 2845.333 2196.667 2040.333 1688.333 251.3333
答案4
得分: 0
df <- read.table(text =
"geneName t11 t12 t13 t21 t22 t23 t31 t32 t33 t41 t42 t43 t51 t52 t53 t61 t62 t63
gene1 3296 5133 3466 2166 1759 2099 1916 1379 1570 2533 1794 1016 800 79 648 99 60 152
gene2 4210 5505 4173 2736 2748 3052 2409 1944 2237 1158 3475 1488 4023 102 940 265 365 124", header = TRUE)
library(tidyverse)
df %>%
pivot_longer(cols = -geneName) %>%
mutate(name = substr(name, 1, 2)) %>%
group_by(geneName, name) %>%
summarise(mean = mean(value))
英文:
df<- read.table(text =
"geneName t11 t12 t13 t21 t22 t23 t31 t32 t33 t41 t42 t43 t51 t52 t53 t61 t62 t63
gene1 3296 5133 3466 2166 1759 2099 1916 1379 1570 2533 1794 1016 800 79 648 99 60 152
gene2 4210 5505 4173 2736 2748 3052 2409 1944 2237 1158 3475 1488 4023 102 940 265 365 124", header = TRUE)
library(tidyverse)
df %>%
pivot_longer(cols = -geneName) %>%
mutate(name = substr(name, 1, 2)) %>%
group_by(geneName, name) %>%
summarise(mean = mean(value))
# A tibble: 12 × 3
# Groups: geneName [2]
geneName name mean
<chr> <chr> <dbl>
1 gene1 t1 3965
2 gene1 t2 2008
3 gene1 t3 1622.
4 gene1 t4 1781
5 gene1 t5 509
6 gene1 t6 104.
7 gene2 t1 4629.
8 gene2 t2 2845.
9 gene2 t3 2197.
10 gene2 t4 2040.
11 gene2 t5 1688.
12 gene2 t6 251.
答案5
得分: 0
以下是代码的中文翻译部分:
library(tidyverse)
test_data |>
pivot_longer(-geneName,
names_to = "time",
names_pattern = "t(\\d+)",
names_transform = as.numeric) |>
arrange(geneName, time) |>
mutate(group = cumsum(time - lag(time, default = first(time)-1)!=1)) |>
nest(data = -c(geneName, group)) |>
transmute(geneName = geneName,
time_range = map_chr(data, ~glue::glue("t{mn}-t{mx}",
mn = min(.x$time),
mx = max(.x$time))),
mean = map_dbl(data, ~mean(.x$value))) |>
pivot_wider(names_from = time_range, values_from = mean)
#> # A tibble: 2 x 7
#> geneName `t11-t13` `t21-t23` `t31-t33` `t41-t43` `t51-t53` `t61-t63`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 gene1 3965 2008 1622. 1781 509 104.
#> 2 gene2 4629. 2845. 2197. 2040. 1688. 251.
希望这对你有帮助。如果有其他翻译需求,请告诉我。
英文:
Here is another option:
library(tidyverse)
test_data |>
pivot_longer(-geneName,
names_to = "time",
names_pattern = "t(\\d+)",
names_transform = as.numeric) |>
arrange(geneName, time) |>
mutate(group = cumsum(time - lag(time, default = first(time)-1)!=1)) |>
nest(data = -c(geneName, group)) |>
transmute(geneName = geneName,
time_range = map_chr(data, ~glue::glue("t{mn}-t{mx}",
mn = min(.x$time),
mx = max(.x$time))),
mean = map_dbl(data, ~mean(.x$value))) |>
pivot_wider(names_from = time_range, values_from = mean)
#> # A tibble: 2 x 7
#> geneName `t11-t13` `t21-t23` `t31-t33` `t41-t43` `t51-t53` `t61-t63`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 gene1 3965 2008 1622. 1781 509 104.
#> 2 gene2 4629. 2845. 2197. 2040. 1688. 251.
答案6
得分: 0
尽管我喜欢Ronak的简洁答案,但我仍然更喜欢使用tidyverse语法进行计算。我觉得它更干净、更清晰。
但还是谢谢你的帮助。
test.mean <- test_data %>%
group_by(gene) %>%
rowwise() %>%
mutate("meant1" = mean(c(t11, t12, t13))) %>%
mutate("meant2" = mean(c(t21, t22, t23))) %>%
mutate("meant3" = mean(c(t31, t32, t33))) %>%
mutate("meant4" = mean(c(t41, t42, t43))) %>%
mutate("meant5" = mean(c(t51, t52, t53))) %>%
mutate("meant6" = mean(c(t61, t62, t63))) %>%
select(matches("geneName|mean")) %>%
column_to_rownames("geneName") %>%
as.data.frame()
英文:
Even though I like the slick answer from Ronak, I still prefer the calculation using tidyverse grammer. I find it to be cleaner and clearer.
But thanks for the help
test.mean <- test_data |>
group_by(gene)
dplyr::rowwise() |>
mutate("meant1" = mean(c(t11,t12,t13))) |>
mutate("meant2" = mean(c(t21,t22,t23))) |>
mutate("meant3" = mean(c(t31,t32,t33))) |>
mutate("meant4" = mean(c(t41,t42,t43))) |>
mutate("meant5" = mean(c(t51,t52,t53))) |>
mutate("meant6" = mean(c(t61,t62,t63))) |>
select(matches("geneName|mean")) |>
column_to_rownames("geneName") |>
as.data.frame()
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论