2023年6月29日 22:31:47go评论86阅读模式

英文:

Grouping data in R only for consecutive rows in same category

问题

这是我正在处理的模拟数据框的示例：

df_in <- data.frame(
  time = c(1,2,3,4,5,6,7,8,9,10),
  state = c(1,1,1,2,2,3,1,1,1,2),
  returns = c(0.5,0.2,0.3,0.4,0.1,0.2,1.1,0.8,0.5,0.2)
)
   time state returns
1     1     1     0.5
2     2     1     0.2
3     3     1     0.3
4     4     2     0.4
5     5     2     0.1
6     6     3     0.2
7     7     1     1.1
8     8     1     0.8
9     9     1     0.5
10   10     2     0.2

我想要对数据进行聚合，以便将具有相同 'state' 的连续行的 'returns' 值一起计算，最好是复利计算，但对于此应用程序，简单求和也可以。

我希望输出的结果类似于以下内容：

df_out <- data.frame(
  time_start = c(1,4,6,7,10),
  time_end = c(3,5,6,9,10),
  state = c(1,2,3,1,2),
  returns = c(1.0,0.5,0.2,2.4,0.2)
)
  time_start time_end state returns
1          1        3     1     1.0
2          4        5     2     0.5
3          6        6     3     0.2
4          7        9     1     2.4
5         10       10     2     0.2

在这个示例中，我只是将 'returns' 简单地累加在一起，但如前所述，如果可能的话，复合收益会更好。

有另一个线程接近我试图实现的目标，但它使用一个标识符 'ID'，而这里的 'state' 没有。我还没有尝试太多，但考虑使用 dplyr 包中的方法，因为它们似乎对这样的任务非常高效。我对该包完全不熟悉。

英文:

Here is a mock data frame of what I'm dealing with:

df_in &lt;- data.frame(
  time = c(1,2,3,4,5,6,7,8,9,10),
  state = c(1,1,1,2,2,3,1,1,1,2),
  returns = c(0.5,0.2,0.3,0.4,0.1,0.2,1.1,0.8,0.5,0.2)
)
   time state returns
1     1     1     0.5
2     2     1     0.2
3     3     1     0.3
4     4     2     0.4
5     5     2     0.1
6     6     3     0.2
7     7     1     1.1
8     8     1     0.8
9     9     1     0.5
10   10     2     0.2

I want to aggregate the data so, that returns in consecutive rows with the same 'state' would be calulated together, preferably compounded together, but a simple sum is also fine for this application.

What I would like the output to look like would be something like this:

df_out &lt;- data.frame(
  time_start = c(1,4,6,7,10),
  time_end = c(3,5,6,9,10),
  state = c(1,2,3,1,2),
  returns = c(1.0,0.5,0.2,2.4,0.2)
)
  time_start time_end state returns
1          1        3     1     1.0
2          4        5     2     0.5
3          6        6     3     0.2
4          7        9     1     2.4
5         10       10     2     0.2

In the example I have the returns simply summed together, but as previously said, compounded returns would be optimal if manageably possible.

There is another thread which is fairly close to what I'm trying to achieve, but it uses an identifying 'ID', which the states here don't have.

I haven't really tried too much, but was thinking of using methods in the dplyr package, as they seem fairly efficient for something like this. I'm completely new to the package though.

答案1

得分: 2

你可以使用 consecutive_id 函数来创建这个新的分组，它会为一起出现的行生成唯一的 id。然后，你可以为每个这样的分组计算你所需的摘要统计信息。

library(dplyr)
df_in %>%
  mutate(consecutive_group = consecutive_id(state)) %>%
  summarise(time_start = min(time, na.rm = TRUE), 
            time_end = max(time, na.rm = TRUE), 
            state = first(state), 
            returns = sum(returns, na.rm = TRUE), .by = consecutive_group) %>%
  select(-consecutive_group)
#  time_start time_end state returns
#1          1        3     1     1.0
#2          4        5     2     0.5
#3          6        6     3     0.2
#4          7        9     1     2.4
#5         10       10     2     0.2

请注意，这是一段R代码，用于创建新的分组并计算摘要统计信息。

英文:

You may use consecutive_id function to create this new group which will have a unique id for rows that occur together. You may then calculate your preferred summary statistics for each of this group.

library(dplyr)
df_in %&gt;%
  mutate(consecutive_group = consecutive_id(state)) %&gt;%
  summarise(time_start = min(time, na.rm = TRUE), 
            time_end = max(time, na.rm = TRUE), 
            state = first(state), 
            returns = sum(returns, na.rm = TRUE), .by = consecutive_group) %&gt;%
  select(-consecutive_group)
#  time_start time_end state returns
#1          1        3     1     1.0
#2          4        5     2     0.5
#3          6        6     3     0.2
#4          7        9     1     2.4
#5         10       10     2     0.2

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

在R中仅针对相同类别的连续行分组数据。

问题

答案1

如何在R数据框中设置一致的小数分隔符？

在数据框中运行一系列连续列的计算。

在数据框中基于重复数字序列创建标识列。

用ggplot在R中制作2D随机漫步轨迹动画。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。