2023-02-06 05:32:42

How to get column mean grouped by row labels in R dataframe?

Question

I have a dataframe that looks like this

Fruit	2021	2022
Apples	12	29
Bananas	11	31
Apples	44	55
Oranges	30	73
Oranges	19	82
Bananas	24	78

The fruit names are not ordered, so I can't group them by taking n rows at a time; they're listed in random order. I need the mean of fruits sold in 2021 and in 2022, as well as the mean sold for apples, oranges, and bananas for each year separately.

My code is

2021 <- c(mean(df$2021), sd(df$2021))
2022 <- c(mean(df$2022), sd(df$2022))
measure <- c('mean','standard deviation')
df1 <- data.table(measure,TE,TW,NC,SC,NWC)

and output looks like this:

Measure	2021	2022
mean	23.3	58
standard deviation	12.4	23.3

But I'm not sure where to start with grouping the rows by name. I need to get something that looks like this

Measure	2021	Apples	Bananas	Oranges	2022	Apples	Bananas	Oranges
mean	23.3				58
standard deviation	12.4				23.3

(with the appropriate numbers in the blank spaces)

Answer 1

Score: 2

I suggest this would be better (in the long run) in a long format, and this summary is a start in that direction. It computes only the mean; it is not hard to repeat for sd and combine the results:

fruits <- c(NA, "Apples", "Oranges", "Bananas")
lapply(quux[,-1], function(yr) stack(sapply(fruits, function(z) mean(yr[is.na(z) | quux$Fruit %in% z])))) |>
  dplyr::bind_rows(.id = "year")
#   year   values     ind
# 1 2021 23.33333    <NA>
# 2 2021 28.00000  Apples
# 3 2021 24.50000 Oranges
# 4 2021 17.50000 Bananas
# 5 2022 58.00000    <NA>
# 6 2022 42.00000  Apples
# 7 2022 77.50000 Oranges
# 8 2022 54.50000 Bananas

where NA in ind indicates all fruits, otherwise the individual fruit labeled.
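
As a rough sketch of the "repeat for sd and combine" step, the same NA-means-all-fruits trick can be applied twice, once with mean and once with sd, and the results bound into one long table. This uses base R only; the helper name `summarize_year` and the result name `res` are hypothetical, and the data frame is rebuilt from the question's table under the answer's name `quux`.

```r
# Sample data from the question; `quux` is the data frame name used in the answer.
quux <- data.frame(
  Fruit = c("Apples", "Bananas", "Apples", "Oranges", "Oranges", "Bananas"),
  `2021` = c(12, 11, 44, 30, 19, 24),
  `2022` = c(29, 31, 55, 73, 82, 78),
  check.names = FALSE  # keep "2021"/"2022" as column names
)

# NA stands for "all fruits", as in the answer.
fruits <- c(NA, "Apples", "Oranges", "Bananas")

# Hypothetical helper: apply `fun` to one year column,
# once over all rows (NA) and once per fruit.
summarize_year <- function(yr, fun) {
  vapply(fruits,
         function(z) fun(yr[is.na(z) | quux$Fruit %in% z]),
         numeric(1), USE.NAMES = FALSE)
}

# One block of four rows per year column, stacked into a long table.
res <- do.call(rbind, lapply(names(quux)[-1], function(nm) {
  data.frame(
    year  = nm,
    fruit = fruits,
    mean  = summarize_year(quux[[nm]], mean),
    sd    = summarize_year(quux[[nm]], sd)
  )
}))
res
```

Each row of `res` holds one year and one fruit (or NA for all fruits), with the mean and sd side by side, which can then be reshaped to the wide layout in the question if needed.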

Answer 2

Score: 1

If you put your data in long form, you could use the aggregate function:

a <- aggregate(value ~ year + fruit, data=df, FUN=function(x) c(sd(x),mean(x)))

Here value is a column you would create to hold the values that are now under 2021 and 2022, and year is a new column containing 2021 or 2022 accordingly. Long form is almost always the way to go in R.
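
A minimal sketch of that reshaping in base R, assuming the question's table is held in a data frame called `df`; the name `long` and the lower-case `fruit` column are illustrative choices, not something the answer specifies. The two year columns are stacked into a single `value` column before calling aggregate.

```r
# Sample data from the question (wide format).
df <- data.frame(
  Fruit = c("Apples", "Bananas", "Apples", "Oranges", "Oranges", "Bananas"),
  `2021` = c(12, 11, 44, 30, 19, 24),
  `2022` = c(29, 31, 55, 73, 82, 78),
  check.names = FALSE  # keep "2021"/"2022" as column names
)

# Long format: one row per fruit entry and year.
# `value` stacks the two year columns; `year` records which column each value came from.
long <- data.frame(
  fruit = rep(df$Fruit, times = 2),
  year  = rep(c("2021", "2022"), each = nrow(df)),
  value = c(df[["2021"]], df[["2022"]])
)

# Mean and sd for every fruit/year combination.
a <- aggregate(value ~ year + fruit, data = long,
               FUN = function(x) c(mean = mean(x), sd = sd(x)))

# aggregate() returns the two statistics as a matrix column;
# this flattens it into ordinary columns value.mean and value.sd.
a <- do.call(data.frame, a)
a
```

This covers the per-fruit figures only; the overall figure for each year can be obtained the same way with the formula `value ~ year`.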

Answer 3

Score: 1

<details>
<summary>English:</summary>
We may use

library(dplyr)
library(tidyr)
library(data.table)
library(stringr)
df1 %>%
pivot_longer(cols = where(is.numeric), names_to = 'year') %>%
as.data.table %>%
cube( .(Mean = mean(value), SD = sd(value)),
by = c("Fruit", "year")) %>%
filter(!if_all(Fruit:year, is.na)) %>%
unite(Fruit, Fruit, year, sep = "_", na.rm = TRUE) %>%
filter(str_detect(Fruit, "_|\\d+")) %>%
data.table::transpose(make.names = "Fruit", keep.names = "Measure")

-output

Measure Apples_2021 Apples_2022 Bananas_2021 Bananas_2022 Oranges_2021 Oranges_2022     2021     2022

1: Mean 28.00000 42.00000 17.500000 54.50000 24.500000 77.500000 23.33333 58.00000
2: SD 22.62742 18.38478 9.192388 33.23402 7.778175 6.363961 12.42041 23.57965


---
Or if we want the duplicate column names

df1 %>%
pivot_longer(cols = where(is.numeric), names_to = 'year') %>%
as.data.table %>%
cube( .(Mean = mean(value), SD = sd(value)), by = c("Fruit", "year")) %>%
mutate(Fruit = coalesce(Fruit, year)) %>%
drop_na(year) %>%
arrange(year, str_detect(Fruit, '\\d{4}', negate = TRUE)) %>%
select(-year) %>%
data.table::transpose(make.names = "Fruit", keep.names = "Measure")


-output

Measure 2021 Apples Bananas Oranges 2022 Apples Bananas Oranges
1: Mean 23.33333 28.00000 17.500000 24.500000 58.00000 42.00000 54.50000 77.500000
2: SD 12.42041 22.62742 9.192388 7.778175 23.57965 18.38478 33.23402 6.363961


### data

df1 <- structure(list(Fruit = c("Apples", "Bananas", "Apples", "Oranges",
"Oranges", "Bananas"), `2021` = c(12L, 11L, 44L, 30L, 19L, 24L
), `2022` = c(29L, 31L, 55L, 73L, 82L, 78L)),
class = "data.frame", row.names = c(NA,
-6L))


</details>
