Compressing/combining rows in an R data frame based on multiple conditions
Question
I have a df containing the outputs of a simple game. Every 30 rows, the df switches to a new set of players. Here is a sample with just the first 2 sets of players:
> dput(stackdf)
structure(list(id = structure(list(`dataset_single$id` = c(20230420,
20230420, 20230420, 20230420, 20230420, 20230420, 20230420, 20230420,
20230420, 20230420, 20230420, 20230420, 20230420, 20230420, 20230420,
20230420, 20230420, 20230420, 20230420, 20230420, 20230420, 20230420,
20230420, 20230420, 20230420, 20230420, 20230420, 20230420, 20230420,
20230420, 20230424, 20230424, 20230424, 20230424, 20230424, 20230424,
20230424, 20230424, 20230424, 20230424, 20230424, 20230424, 20230424,
20230424, 20230424, 20230424, 20230424, 20230424, 20230424, 20230424,
20230424, 20230424, 20230424, 20230424, 20230424, 20230424, 20230424,
20230424, 20230424, 20230424)), row.names = c(NA, -60L), class = "data.frame"),
round = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10",
"11", "12", "13", "14", "15", "16", "17", "18", "19", "20",
"21", "22", "23", "24", "25", "26", "27", "28", "29", "30",
"1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11",
"12", "13", "14", "15", "16", "17", "18", "19", "20", "21",
"22", "23", "24", "25", "26", "27", "28", "29", "30"), win = c(1,
0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0,
1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1,
1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0,
0, 0), sound = c(1, 1, 2, 1, 2, 1, 1, 2, 2, 2, 1, 1, 2, 3,
5, 6, 7, 8, 8, 1, 4, 8, 6, 7, 8, 5, 4, 8, 8, 8, 3, 4, 5,
5, 6, 5, 2, 5, 7, 4, 7, 6, 8, 5, 1, 8, 5, 3, 7, 6, 3, 6,
6, 5, 5, 8, 1, 6, 6, 5)), row.names = c(NA, -60L), class = c("tbl_df",
"tbl", "data.frame"))
The "win" column indicates the winner (p1=1, p2=0), and the "sound" column indicates the score of the winner for that round. As you can see, there are streaks where p1 might win a few rounds in a row, and then p2 wins several, etc. When there is a streak, I want to compress the streak so that it becomes one row, where the "sound" column takes a mean of the sound scores in that streak. For example, this:
win | sound |
---|---|
1 | 3 |
0 | 4 |
0 | 1 |
1 | 3 |
1 | 2 |
1 | 6 |
would become:
win | sound |
---|---|
1 | 3 |
0 | 2.5 |
1 | 3.67 |
There are 2 things making this tricky: 1.) the full df is huge, so I am seeking an automated solution (probably a for loop?); and 2.) I need to confine the program to one set of players at a time so that it does not treat rows 27:31 (for example) as one big streak (row 31 actually belongs to a new set of players).
Any help would be great.
Answer 1
Score: 2
It seems the first id column is a nested data frame inside a column. I have corrected it to a regular column:
stackdf$id <- stackdf$id$`dataset_single$id`
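To see what this fix does, here is a minimal base R sketch on a made-up data frame. The names toy and inner_id are hypothetical and stand in for stackdf and the `dataset_single$id` column; this only illustrates the idea and is not the asker's data.

```r
# Toy data frame (hypothetical) whose id column is itself a data frame
toy <- data.frame(round = 1:3)
toy$id <- data.frame(inner_id = c(20230420, 20230420, 20230424))

# Record whether the column is a nested data frame before the fix
was_nested <- is.data.frame(toy$id)

# Replace the nested data frame with its single inner column
toy$id <- toy$id$inner_id

# After the fix, id is an ordinary numeric vector
is_plain <- is.numeric(toy$id) && !is.data.frame(toy$id)
```

Grouping functions generally expect a plain vector to group on, which is why the column is flattened before any grouping.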
Here are two options, using dplyr and data.table, which follow the same logic. Use the data.table option if you need more speed.
dplyr
-
library(dplyr)
stackdf %>%
  group_by(id, cons_id = consecutive_id(win)) %>%
  summarise(win = first(win),
            sound = mean(sound), .groups = "drop")
data.table
-
library(data.table)
setDT(stackdf)
stackdf[, .(win = first(win), sound = mean(sound)), .(id, cons_id = rleid(win))]
consecutive_id() in dplyr and rleid() in data.table create a unique identifier that increments every time the value changes.
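The run-identifier idea can also be sketched in base R with rle(). This is not how either package implements it, just an illustration on a made-up six-row data frame (toy, key and run_id are hypothetical names). Pasting id and win together means a new run starts whenever the winner changes or a new set of players begins, so a streak never crosses a player boundary.

```r
# Toy data (hypothetical): two sets of players, three rounds each
toy <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2),
  win   = c(1, 0, 0, 0, 1, 1),
  sound = c(3, 4, 1, 3, 2, 6)
)

# A run changes whenever id OR win changes
key  <- paste(toy$id, toy$win)
runs <- rle(key)

# Expand run lengths into a per-row run identifier
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# Keep the first row of each run, and average sound within each run
first_row <- !duplicated(run_id)
compressed <- data.frame(
  id    = toy$id[first_row],
  win   = toy$win[first_row],
  sound = as.vector(tapply(toy$sound, run_id, mean))
)
compressed
```

Rows 3 and 4 both have win = 0 but belong to different ids, so they end up in separate runs and the result has four rows rather than three.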
Answer 2
Score: 1
Ultimately, this looks like a group-and-summarise task.
Note that the first column looks like a nested list, so it is unnested first:
library(tidyr)
library(dplyr) # needs dplyr >= 1.1.0 for consecutive_id() and .by
stackdf %>%
unnest(id) %>%
rename(id = `dataset_single$id`) %>%
mutate(x = consecutive_id(win), .by=id) %>%
summarise(mean_sound = mean(sound), .by = c(id, x, win)) %>%
select(-x)
id win mean_sound
<dbl> <dbl> <dbl>
1 20230420 1 1
2 20230420 0 1
3 20230420 1 1.5
4 20230420 0 1.5
5 20230420 1 2
6 20230420 0 1.5
7 20230420 1 2.75
8 20230420 0 6
9 20230420 1 7
10 20230420 0 8
# … with 23 more rows
# ℹ Use `print(n = ...)` to see more rows