How to get column mean grouped by row labels in R dataframe?
Question
I have a dataframe that looks like this
Fruit | 2021 | 2022 |
---|---|---|
Apples | 12 | 29 |
Bananas | 11 | 31 |
Apples | 44 | 55 |
Oranges | 30 | 73 |
Oranges | 19 | 82 |
Bananas | 24 | 78 |
The fruit names are not ordered (they are listed randomly), so I can't group them by taking n rows at a time. I need the mean of all fruits sold in 2021 and in 2022, as well as the mean sold for apples, oranges, and bananas for each year separately.
My code is
`2021` <- c(mean(df$`2021`), sd(df$`2021`))
`2022` <- c(mean(df$`2022`), sd(df$`2022`))
measure <- c('mean','standard deviation')
df1 <- data.table(measure, `2021`, `2022`)
and the output looks like this:
Measure | 2021 | 2022 |
---|---|---|
mean | 23.3 | 58 |
standard deviation | 12.4 | 23.3 |
But I'm not sure where to start with grouping the rows by name. I need to get something that looks like this
Measure | 2021 | Apples | Bananas | Oranges | 2022 | Apples | Bananas | Oranges |
---|---|---|---|---|---|---|---|---|
mean | 23.3 | | | | 58 | | | |
standard deviation | 12.4 | | | | 23.3 | | | |
(with the appropriate numbers in the blank spaces)
Answer 1
Score: 2
I suggest this might be better (in the long run) in a long format, which this summary can get you started on (`quux` below is the data frame from the question). This is just the mean; it is not hard to repeat it for `sd` and combine the two results:
fruits <- c(NA, "Apples", "Oranges", "Bananas")
lapply(quux[,-1], function(yr) stack(sapply(fruits, function(z) mean(yr[is.na(z) | quux$Fruit %in% z])))) |>
dplyr::bind_rows(.id = "year")
# year values ind
# 1 2021 23.33333 <NA>
# 2 2021 28.00000 Apples
# 3 2021 24.50000 Oranges
# 4 2021 17.50000 Bananas
# 5 2022 58.00000 <NA>
# 6 2022 42.00000 Apples
# 7 2022 77.50000 Oranges
# 8 2022 54.50000 Bananas
where `NA` in `ind` indicates all fruits; otherwise the row is for the individual fruit labeled.
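The answer notes that the same pattern can be repeated for `sd` and combined with the means. One way this might look, as a minimal base-R sketch: `summarise_by` is a hypothetical helper introduced here for illustration (not part of the original answer), and `quux` is rebuilt from the question's sample data.

```r
# Sample data from the question; `quux` is the data frame name used in the answer.
quux <- data.frame(
  Fruit  = c("Apples", "Bananas", "Apples", "Oranges", "Oranges", "Bananas"),
  `2021` = c(12L, 11L, 44L, 30L, 19L, 24L),
  `2022` = c(29L, 31L, 55L, 73L, 82L, 78L),
  check.names = FALSE  # keep the non-syntactic column names 2021 / 2022
)

fruits <- c(NA, "Apples", "Oranges", "Bananas")  # NA stands for "all fruits"

# Hypothetical helper: apply one summary function `f` per year and per fruit,
# returning a long data frame with columns year, fruit, measure, value.
summarise_by <- function(f, label) {
  do.call(rbind, lapply(names(quux)[-1], function(yr) {
    vals <- vapply(fruits, function(z) {
      # is.na(z) selects every row; otherwise only rows of fruit z
      f(quux[[yr]][is.na(z) | quux$Fruit %in% z])
    }, numeric(1))
    data.frame(year = yr, fruit = fruits, measure = label, value = unname(vals))
  }))
}

# Stack the mean rows on top of the sd rows.
res <- rbind(summarise_by(mean, "mean"), summarise_by(sd, "sd"))
res
```

Keeping `measure` as its own column leaves the result in long format, so it can be reshaped to the wide layout from the question later if needed.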
Answer 2
Score: 1
If you put your data in long form, you could use the aggregate function:
a <- aggregate(value ~ year + fruit, data=df, FUN=function(x) c(sd(x),mean(x)))
Where `value` is a column you create to hold the values that are now under `2021` and `2022`. Then create a new column called `year`, which holds `2021` or `2022` accordingly. Long form is almost always the way to go in R.
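The reshaping step described above is not shown in the answer, so here is a minimal base-R sketch of it. It assumes the wide data frame is called `df` with a lowercase `fruit` column, as in the `aggregate` call; the name `long` and the named elements `sd` / `mean` are additions made here for readability.

```r
# Wide data from the question, with the column names the answer assumes.
df <- data.frame(
  fruit  = c("Apples", "Bananas", "Apples", "Oranges", "Oranges", "Bananas"),
  `2021` = c(12L, 11L, 44L, 30L, 19L, 24L),
  `2022` = c(29L, 31L, 55L, 73L, 82L, 78L),
  check.names = FALSE  # keep the non-syntactic column names 2021 / 2022
)

# Stack the two year columns into one `value` column;
# `year` records which column each value came from.
long <- data.frame(
  fruit = rep(df$fruit, times = 2),
  year  = rep(c("2021", "2022"), each = nrow(df)),
  value = c(df[["2021"]], df[["2022"]])
)

a <- aggregate(value ~ year + fruit, data = long,
               FUN = function(x) c(sd = sd(x), mean = mean(x)))

# aggregate() stores the two statistics as a matrix column;
# flatten it into ordinary columns value.sd and value.mean.
a <- do.call(data.frame, a)
a
```

This gives one row per fruit and year. The overall per-year figures could be obtained the same way with the formula `value ~ year`.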
Answer 3
Score: 1
We may use
library(dplyr)
library(tidyr)
library(data.table)
library(stringr)
df1 %>%
  # long format: one row per fruit/year observation
  pivot_longer(cols = where(is.numeric), names_to = 'year') %>%
  as.data.table %>%
  # summaries for every combination of Fruit and year, including totals
  cube(.(Mean = mean(value), SD = sd(value)),
       by = c("Fruit", "year")) %>%
  # drop the grand-total row, where both keys are NA
  filter(!if_all(Fruit:year, is.na)) %>%
  unite(Fruit, Fruit, year, sep = "_", na.rm = TRUE) %>%
  # keep fruit-by-year and year-only rows, drop fruit-only rows
  filter(str_detect(Fruit, "_|\\d+")) %>%
  data.table::transpose(make.names = "Fruit", keep.names = "Measure")
-output
Measure Apples_2021 Apples_2022 Bananas_2021 Bananas_2022 Oranges_2021 Oranges_2022 2021 2022
1: Mean 28.00000 42.00000 17.500000 54.50000 24.500000 77.500000 23.33333 58.00000
2: SD 22.62742 18.38478 9.192388 33.23402 7.778175 6.363961 12.42041 23.57965
---
Or if we want the duplicate column names
df1 %>%
  pivot_longer(cols = where(is.numeric), names_to = 'year') %>%
  as.data.table %>%
  cube(.(Mean = mean(value), SD = sd(value)), by = c("Fruit", "year")) %>%
  # label the year-only rows with the year itself
  mutate(Fruit = coalesce(Fruit, year)) %>%
  drop_na(year) %>%
  # within each year, put the year total before the individual fruits
  arrange(year, str_detect(Fruit, '\\d{4}', negate = TRUE)) %>%
  select(-year) %>%
  data.table::transpose(make.names = "Fruit", keep.names = "Measure")
-output
Measure 2021 Apples Bananas Oranges 2022 Apples Bananas Oranges
1: Mean 23.33333 28.00000 17.500000 24.500000 58.00000 42.00000 54.50000 77.500000
2: SD 12.42041 22.62742 9.192388 7.778175 23.57965 18.38478 33.23402 6.363961
### data
df1 <- structure(list(Fruit = c("Apples", "Bananas", "Apples", "Oranges",
"Oranges", "Bananas"), `2021` = c(12L, 11L, 44L, 30L, 19L, 24L),
`2022` = c(29L, 31L, 55L, 73L, 82L, 78L)),
class = "data.frame", row.names = c(NA, -6L))
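Both pipelines above hinge on what `cube()` returns before any filtering, so it may help to look at that intermediate table on its own. The following is a minimal sketch using only data.table: it swaps `pivot_longer()` for `melt()` and writes `list()` in place of `.()`, so it is not the answer's exact code, and the names `long` and `res` are introduced here for illustration.

```r
library(data.table)

# Sample data from the question.
df1 <- data.frame(
  Fruit  = c("Apples", "Bananas", "Apples", "Oranges", "Oranges", "Bananas"),
  `2021` = c(12L, 11L, 44L, 30L, 19L, 24L),
  `2022` = c(29L, 31L, 55L, 73L, 82L, 78L),
  check.names = FALSE  # keep the non-syntactic column names 2021 / 2022
)

# Long format: one row per fruit/year observation.
long <- melt(as.data.table(df1), id.vars = "Fruit",
             variable.name = "year", value.name = "value",
             variable.factor = FALSE)

# cube() aggregates over every combination of the `by` columns:
# (Fruit, year), Fruit alone, year alone, and the grand total.
# A key that has been aggregated away shows up as NA.
res <- cube(long, list(Mean = mean(value), SD = sd(value)),
            by = c("Fruit", "year"))
res
```

Rows with `NA` in `Fruit` are the per-year totals, rows with `NA` in `year` are the per-fruit totals, and the row with `NA` in both is the grand total. The `filter()` and `drop_na()` steps in the answer exist to discard the totals that the requested table does not need.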