Compressing/combining rows in r df based on multiple conditions

Question

I have a df containing the outputs of a simple game. Every 30 rows, the df switches to a new set of players. Here is a sample with just the first 2 sets of players:

> dput(stackdf)
structure(list(id = structure(list(`dataset_single$id` = c(20230420, 
20230420, 20230420, 20230420, 20230420, 20230420, 20230420, 20230420, 
20230420, 20230420, 20230420, 20230420, 20230420, 20230420, 20230420, 
20230420, 20230420, 20230420, 20230420, 20230420, 20230420, 20230420, 
20230420, 20230420, 20230420, 20230420, 20230420, 20230420, 20230420, 
20230420, 20230424, 20230424, 20230424, 20230424, 20230424, 20230424, 
20230424, 20230424, 20230424, 20230424, 20230424, 20230424, 20230424, 
20230424, 20230424, 20230424, 20230424, 20230424, 20230424, 20230424, 
20230424, 20230424, 20230424, 20230424, 20230424, 20230424, 20230424, 
20230424, 20230424, 20230424)), row.names = c(NA, -60L), class = "data.frame"), 
round = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", 
"11", "12", "13", "14", "15", "16", "17", "18", "19", "20", 
"21", "22", "23", "24", "25", "26", "27", "28", "29", "30", 
"1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", 
"12", "13", "14", "15", "16", "17", "18", "19", "20", "21", 
"22", "23", "24", "25", "26", "27", "28", "29", "30"), win = c(1, 
0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 
1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 
1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 
0, 0), sound = c(1, 1, 2, 1, 2, 1, 1, 2, 2, 2, 1, 1, 2, 3, 
5, 6, 7, 8, 8, 1, 4, 8, 6, 7, 8, 5, 4, 8, 8, 8, 3, 4, 5, 
5, 6, 5, 2, 5, 7, 4, 7, 6, 8, 5, 1, 8, 5, 3, 7, 6, 3, 6, 
6, 5, 5, 8, 1, 6, 6, 5)), row.names = c(NA, -60L), class = c("tbl_df", 
"tbl", "data.frame"))

The "win" column indicates the winner (p1=1, p2=0), and the "sound" column indicates the score of the winner for that round. As you can see, there are streaks where p1 might win a few rounds in a row, and then p2 wins several, etc. When there is a streak, I want to compress the streak so that it becomes one row, where the "sound" column takes a mean of the sound scores in that streak. For example, this:

win sound
1 3
0 4
0 1
1 3
1 2
1 6

would become:

win sound
1 3
0 2.5
1 3.67

There are two things making this tricky: (1) the full df is huge, so I am looking for an automated solution (probably a for loop?); and (2) I need to confine the procedure to one set of players at a time, so that it does not treat rows 27:31 (for example) as one big streak (row 31 actually belongs to a new set of players).

Any help would be great.

Answer 1

Score: 2

It seems the first column, id, is a data frame nested inside a column. I have converted it to a regular column:

stackdf$id <- stackdf$id$`dataset_single$id`
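
To illustrate what that line does, here is a minimal sketch on a made-up three-row data frame (the name `toy` and its values are assumptions for illustration, not the asker's data). It builds a column that is itself a one-column data frame and then replaces it with the plain vector it contains:

```r
# Hypothetical miniature of the structure in the question: the `id` column is
# itself a one-column data frame whose column is named `dataset_single$id`.
toy <- data.frame(win = c(1, 0, 0))
toy$id <- data.frame(`dataset_single$id` = c(20230420, 20230420, 20230424),
                     check.names = FALSE)

is.data.frame(toy$id)   # TRUE: the column is a nested data frame

# Replace the nested data frame with the plain vector it contains.
toy$id <- toy$id$`dataset_single$id`

is.data.frame(toy$id)   # FALSE
is.numeric(toy$id)      # TRUE
```

Once id is a plain vector, it can be used as an ordinary grouping variable.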

Here are two options, one using dplyr and one using data.table, which follow the same logic. Use the data.table option if you need more speed.

dplyr -

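library(dplyr)  # consecutive_id() requires dplyr >= 1.1.0

stackdf %>%
  # one group per streak: same id and same run of consecutive win values
  group_by(id, cons_id = consecutive_id(win)) %>%
  summarise(win = first(win),
            sound = mean(sound),
            .groups = "drop")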

data.table -

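library(data.table)

setDT(stackdf)  # convert to a data.table by reference
# one group per streak: same id and same run of consecutive win values
stackdf[, .(win = first(win), sound = mean(sound)), .(id, cons_id = rleid(win))]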

consecutive_id in dplyr and rleid in data.table each create an identifier that increments every time the value changes.
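
To make the run-identifier idea concrete, here is a rough base R sketch on the six-row example from the question. It is only an illustration of the same logic, not how either package implements it; the names `run_id` and `streaks` are made up for this example.

```r
# Toy data from the question's small example.
win   <- c(1, 0, 0, 1, 1, 1)
sound <- c(3, 4, 1, 3, 2, 6)

# A run id starts at 1 and goes up by one whenever `win` differs from the
# previous value.
run_id <- cumsum(c(TRUE, win[-1] != win[-length(win)]))
run_id
# [1] 1 2 2 3 3 3

# Keep each run's `win` value and average `sound` within the run.
streaks <- data.frame(
  win   = as.vector(tapply(win, run_id, function(v) v[1])),
  sound = as.vector(tapply(sound, run_id, mean))
)
streaks
```

The three resulting rows have sound means of 3, 2.5 and about 3.67, which matches the output the question asks for.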

Answer 2

Score: 1

In the end, this looks like a group-and-summarise task.
Note that the first column appears to be nested (a data frame inside a column), so it is unnested first:

library(tidyr)
library(dplyr)  # requires dplyr >= 1.1.0 for consecutive_id() and .by

stackdf %>%
  unnest(id) %>%
  rename(id = `dataset_single$id`) %>%
  # number the streaks separately within each id
  mutate(x = consecutive_id(win), .by = id) %>%
  summarise(mean_sound = mean(sound), .by = c(id, x, win)) %>%
  select(-x)
         id   win mean_sound
      <dbl> <dbl>      <dbl>
 1 20230420     1       1   
 2 20230420     0       1   
 3 20230420     1       1.5 
 4 20230420     0       1.5 
 5 20230420     1       2   
 6 20230420     0       1.5 
 7 20230420     1       2.75
 8 20230420     0       6   
 9 20230420     1       7   
10 20230420     0       8   
# … with 23 more rows
# ℹ Use `print(n = ...)` to see more rows
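
Grouping by id is what keeps a streak from running across two sets of players. Below is a minimal base R sketch of that boundary rule on made-up data (the data frame `toy`, its values, and the helper column `run` are assumptions for illustration only): a new streak starts whenever win changes or id changes.

```r
# Hypothetical data: the run of 0s in rows 2-4 straddles two ids.
toy <- data.frame(
  id    = c(1, 1, 1, 2, 2),
  win   = c(1, 0, 0, 0, 1),
  sound = c(2, 4, 6, 8, 1)
)
n <- nrow(toy)

# A new streak starts on the first row, when `win` changes, or when `id` changes.
new_run <- c(TRUE,
             toy$win[-1] != toy$win[-n] | toy$id[-1] != toy$id[-n])
toy$run <- cumsum(new_run)

# Average `sound` per streak, then restore the original row order.
result <- aggregate(sound ~ id + run + win, data = toy, FUN = mean)
result <- result[order(result$run), c("id", "win", "sound")]
result
```

Here rows 2-3 (id 1) and row 4 (id 2) end up in separate streaks even though win is 0 in all three, so the sketch returns four rows instead of three.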

huangapple
  • Published on May 28, 2023, 17:17:18
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76350769.html