May 28, 2023, 17:17:18


Compressing/combining rows in r df based on multiple conditions

Question


I have a df containing the outputs of a simple game. Every 30 rows, the df switches to a new set of players. Here is a sample with just the first 2 sets of players:

> dput(stackdf)
structure(list(id = structure(list(`dataset_single$id` = c(20230420, 
20230420, 20230420, 20230420, 20230420, 20230420, 20230420, 20230420, 
20230420, 20230420, 20230420, 20230420, 20230420, 20230420, 20230420, 
20230420, 20230420, 20230420, 20230420, 20230420, 20230420, 20230420, 
20230420, 20230420, 20230420, 20230420, 20230420, 20230420, 20230420, 
20230420, 20230424, 20230424, 20230424, 20230424, 20230424, 20230424, 
20230424, 20230424, 20230424, 20230424, 20230424, 20230424, 20230424, 
20230424, 20230424, 20230424, 20230424, 20230424, 20230424, 20230424, 
20230424, 20230424, 20230424, 20230424, 20230424, 20230424, 20230424, 
20230424, 20230424, 20230424)), row.names = c(NA, -60L), class = "data.frame"), 
    round = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", 
    "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", 
    "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", 
    "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", 
    "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", 
    "22", "23", "24", "25", "26", "27", "28", "29", "30"), win = c(1, 
    0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 
    1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 
    1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 
    0, 0), sound = c(1, 1, 2, 1, 2, 1, 1, 2, 2, 2, 1, 1, 2, 3, 
    5, 6, 7, 8, 8, 1, 4, 8, 6, 7, 8, 5, 4, 8, 8, 8, 3, 4, 5, 
    5, 6, 5, 2, 5, 7, 4, 7, 6, 8, 5, 1, 8, 5, 3, 7, 6, 3, 6, 
    6, 5, 5, 8, 1, 6, 6, 5)), row.names = c(NA, -60L), class = c("tbl_df", 
"tbl", "data.frame"))

The "win" column indicates the winner (p1=1, p2=0), and the "sound" column indicates the score of the winner for that round. As you can see, there are streaks where p1 might win a few rounds in a row, and then p2 wins several, etc. When there is a streak, I want to compress the streak so that it becomes one row, where the "sound" column takes a mean of the sound scores in that streak. For example, this:

win	sound
1	3
0	4
0	1
1	3
1	2
1	6

would become:

win	sound
1	3
0	2.5
1	3.67

There are two things making this tricky: (1) the full df is huge, so I am looking for an automated solution (probably a for loop?); and (2) I need to confine the procedure to one set of players at a time, so that it does not treat rows 27:31 (for example) as one big streak, since row 31 belongs to a new set of players.

Any help would be great

Answer 1

Score: 2


It seems the first id column is a nested dataframe inside a column. I have corrected it to a regular column.

stackdf$id <- stackdf$id$`dataset_single$id`

Here are two options, one using dplyr and one using data.table, which follow the same logic. Use the data.table option if you need more speed.

dplyr -

library(dplyr)  # consecutive_id() requires dplyr >= 1.1.0
stackdf %>%
  # one group per streak of identical `win` values within each id
  group_by(id, cons_id = consecutive_id(win)) %>%
  summarise(win = first(win),
            sound = mean(sound), .groups = "drop")

data.table -

library(data.table)  
setDT(stackdf)
stackdf[, .(win = first(win), sound = mean(sound)), .(id, cons_id = rleid(win))]

consecutive_id() in dplyr and rleid() in data.table each create an identifier that increments every time the value changes.
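To make the run-identifier idea concrete, here is a minimal base R sketch on the small six-row example from the question. It does not call consecutive_id() or rleid(); the `cumsum()` expression is only an assumed stand-in that produces the same kind of identifier, and the variable names `run_id` and `streaks` are illustrative.

```r
# Toy data: the six-row example from the question
win   <- c(1, 0, 0, 1, 1, 1)
sound <- c(3, 4, 1, 3, 2, 6)

# Run identifier: starts at 1 and increments whenever `win`
# differs from the value on the previous row
run_id <- cumsum(c(TRUE, win[-1] != win[-length(win)]))

# One row per streak: the streak's winner and the mean of its sound scores
streaks <- data.frame(
  win   = as.vector(tapply(win, run_id, function(v) v[1])),
  sound = as.vector(tapply(sound, run_id, mean))
)

print(run_id)
print(streaks)
```

The three resulting rows correspond to the three streaks in the question's expected output (means 3, 2.5, and about 3.67).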

Answer 2

Score: 1


In the end, this looks like a group-and-summarise task.
Note that the first column looks like a nested list:

library(tidyr)
library(dplyr)  # requires dplyr >= 1.1.0 for consecutive_id() and .by
stackdf %>%
  unnest(id) %>%
  rename(id = `dataset_single$id`) %>%
  mutate(x = consecutive_id(win), .by = id) %>%
  summarise(mean_sound = mean(sound), .by = c(id, x, win)) %>%
  select(-x)

         id   win mean_sound
      <dbl> <dbl>      <dbl>
 1 20230420     1       1   
 2 20230420     0       1   
 3 20230420     1       1.5 
 4 20230420     0       1.5 
 5 20230420     1       2   
 6 20230420     0       1.5 
 7 20230420     1       2.75
 8 20230420     0       6   
 9 20230420     1       7   
10 20230420     0       8   
# … with 23 more rows
# ℹ Use `print(n = ...)` to see more rows
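The question's second concern, that a streak must not continue across two sets of players, can be checked on a small made-up example. The sketch below uses base R only; the ids 1 and 2 and the sound values are invented for illustration, and `aggregate()` is an assumed substitute for the grouped summarise shown above.

```r
# Toy data (hypothetical): the first set ends with win = 0
# and the second set starts with win = 0
df <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2),
  win   = c(1, 0, 0, 0, 0, 1),
  sound = c(2, 4, 8, 3, 5, 6)
)

# Global run identifier that ignores id: rows 2 to 5 share one run
run_id <- cumsum(c(TRUE, df$win[-1] != df$win[-nrow(df)]))

# Grouping by id as well as run_id splits that run at the id boundary
out <- aggregate(sound ~ id + run_id + win, data = df, FUN = mean)
out <- out[order(out$id, out$run_id), c("id", "win", "sound")]

print(out)
```

Here the four consecutive zeros yield two rows, one per id, rather than a single four-row streak.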
