How to get column mean grouped by row labels in an R data frame?

Question

I have a data frame that looks like this:

Fruit    2021  2022
Apples     12    29
Bananas    11    31
Apples     44    55
Oranges    30    73
Oranges    19    82
Bananas    24    78

The fruit names are not ordered, so I can't group them by taking n rows at a time; they are listed randomly. I need the mean of all fruits sold in 2021 and in 2022, as well as the mean sold for apples, oranges, and bananas in each year separately.

My code is:

library(data.table)

# Non-syntactic names such as 2021 must be wrapped in backticks
`2021` <- c(mean(df$`2021`), sd(df$`2021`))
`2022` <- c(mean(df$`2022`), sd(df$`2022`))
measure <- c('mean', 'standard deviation')

df1 <- data.table(measure, `2021`, `2022`)

and the output looks like this:

Measure             2021  2022
mean                23.3  58
standard deviation  12.4  23.6

But I'm not sure where to start with grouping the rows by name. I need to get something that looks like this:

Measure             2021  Apples  Bananas  Oranges  2022  Apples  Bananas  Oranges
mean                23.3                            58
standard deviation  12.4                            23.6

(with the appropriate numbers in the blank spaces)

Answer 1

Score: 2

I suggest the result may be better kept in long format in the long run, and this summary gets you started. It computes only the mean, but it is not hard to repeat for sd and combine with this:

# quux is the data frame from the question
fruits <- c(NA, "Apples", "Oranges", "Bananas")
lapply(quux[,-1], function(yr) stack(sapply(fruits, function(z) mean(yr[is.na(z) | quux$Fruit %in% z])))) |>
  dplyr::bind_rows(.id = "year")
#   year   values     ind
# 1 2021 23.33333    <NA>
# 2 2021 28.00000  Apples
# 3 2021 24.50000 Oranges
# 4 2021 17.50000 Bananas
# 5 2022 58.00000    <NA>
# 6 2022 42.00000  Apples
# 7 2022 77.50000 Oranges
# 8 2022 54.50000 Bananas

where NA in ind means all fruits, and any other value is the individual fruit's label.
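
To illustrate the "repeat for sd and combine" step, here is a minimal base-R sketch. It assumes quux is the data frame from the question, rebuilt here so the block runs on its own; summarise_year() is a hypothetical helper name introduced for illustration, not part of the answer.

```r
# Sample data from the question; `quux` is the name used in the answer above
quux <- data.frame(
  Fruit  = c("Apples", "Bananas", "Apples", "Oranges", "Oranges", "Bananas"),
  `2021` = c(12, 11, 44, 30, 19, 24),
  `2022` = c(29, 31, 55, 73, 82, 78),
  check.names = FALSE  # keep "2021"/"2022" as column names
)

fruits <- c(NA, "Apples", "Oranges", "Bananas")

# Hypothetical helper: one row per entry of `fruits` (NA = all fruits),
# holding both the mean and the sd of one year's column
summarise_year <- function(yr) {
  do.call(rbind, lapply(fruits, function(z) {
    keep <- is.na(z) | quux$Fruit %in% z
    data.frame(ind = z, mean = mean(yr[keep]), sd = sd(yr[keep]))
  }))
}

# Apply the helper to every year column and stack the results
res <- do.call(rbind, lapply(names(quux)[-1], function(nm) {
  cbind(year = nm, summarise_year(quux[[nm]]))
}))
rownames(res) <- NULL
print(res)
```

The result stays in long format, with one row per year and fruit, so mean and sd sit side by side and can be reshaped later if a wide layout is needed.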

Answer 2

Score: 1

If you put your data in long form, you can use the aggregate function:

a <- aggregate(value ~ year + fruit, data = df, FUN = function(x) c(sd = sd(x), mean = mean(x)))

Here value is a column you would create to hold the numbers that are now under 2021 and 2022, and year is a new column containing 2021 or 2022 accordingly. Long form is almost always the way to go in R.
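
The reshaping step described above can be sketched in base R as follows. This is only one possible way to build the long table; the names long, year_cols, and by_year are assumptions made for this example, and the sample data is rebuilt from the question so the block is self-contained.

```r
# Wide sample data from the question
df <- data.frame(
  Fruit  = c("Apples", "Bananas", "Apples", "Oranges", "Oranges", "Bananas"),
  `2021` = c(12, 11, 44, 30, 19, 24),
  `2022` = c(29, 31, 55, 73, 82, 78),
  check.names = FALSE
)

# Wide to long: one row per (fruit, year) observation
year_cols <- setdiff(names(df), "Fruit")
long <- data.frame(
  fruit = rep(df$Fruit, times = length(year_cols)),
  year  = rep(year_cols, each = nrow(df)),
  value = unlist(df[year_cols], use.names = FALSE)
)

# Mean and sd per year and fruit
a <- aggregate(value ~ year + fruit, data = long,
               FUN = function(x) c(mean = mean(x), sd = sd(x)))
# aggregate() returns the two statistics as a matrix column; flatten it
a <- do.call(data.frame, a)

# Overall mean per year, across all fruits
by_year <- aggregate(value ~ year, data = long, FUN = mean)

print(a)
print(by_year)
```

After flattening, the statistics appear as the columns value.mean and value.sd.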

Answer 3

Score: 1

We may use

library(dplyr)
library(tidyr)
library(data.table)
library(stringr)
df1 %>%
  pivot_longer(cols = where(is.numeric), names_to = 'year') %>%
  as.data.table %>%
  # cube() adds subtotal rows (NA in Fruit and/or year) to the grouped summary
  cube(.(Mean = mean(value), SD = sd(value)),
       by = c("Fruit", "year")) %>%
  # drop the grand-total row, where both Fruit and year are NA
  filter(!if_all(Fruit:year, is.na)) %>%
  unite(Fruit, Fruit, year, sep = "_", na.rm = TRUE) %>%
  # keep the fruit-by-year rows and the per-year totals
  filter(str_detect(Fruit, "_|\\d+")) %>%
  data.table::transpose(make.names = "Fruit", keep.names = "Measure")

-output

   Measure Apples_2021 Apples_2022 Bananas_2021 Bananas_2022 Oranges_2021 Oranges_2022     2021     2022
1:    Mean    28.00000    42.00000    17.500000     54.50000    24.500000    77.500000 23.33333 58.00000
2:      SD    22.62742    18.38478     9.192388     33.23402     7.778175     6.363961 12.42041 23.57965

---

Or if we want the duplicate column names

df1 %>%
  pivot_longer(cols = where(is.numeric), names_to = 'year') %>%
  as.data.table %>%
  cube(.(Mean = mean(value), SD = sd(value)), by = c("Fruit", "year")) %>%
  # label the per-year total rows with the year itself
  mutate(Fruit = coalesce(Fruit, year)) %>%
  drop_na(year) %>%
  # within each year, put the year total first, then the fruits
  arrange(year, str_detect(Fruit, '\\d{4}', negate = TRUE)) %>%
  select(-year) %>%
  data.table::transpose(make.names = "Fruit", keep.names = "Measure")

-output

   Measure     2021   Apples   Bananas   Oranges     2022   Apples  Bananas   Oranges
1:    Mean 23.33333 28.00000 17.500000 24.500000 58.00000 42.00000 54.50000 77.500000
2:      SD 12.42041 22.62742  9.192388  7.778175 23.57965 18.38478 33.23402  6.363961

### data

df1 <- structure(list(Fruit = c("Apples", "Bananas", "Apples", "Oranges",
  "Oranges", "Bananas"), `2021` = c(12L, 11L, 44L, 30L, 19L, 24L),
  `2022` = c(29L, 31L, 55L, 73L, 82L, 78L)),
  class = "data.frame", row.names = c(NA, -6L))



huangapple
  • Posted on February 6, 2023, 05:32:42
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75355671.html