在R中仅针对相同类别的连续行分组数据。

huangapple go评论59阅读模式
英文:

Grouping data in R only for consecutive rows in same category

问题

这是我正在处理的模拟数据框的示例:

df_in <- data.frame(
  time = c(1,2,3,4,5,6,7,8,9,10),
  state = c(1,1,1,2,2,3,1,1,1,2),
  returns = c(0.5,0.2,0.3,0.4,0.1,0.2,1.1,0.8,0.5,0.2)
)
   time state returns
1     1     1     0.5
2     2     1     0.2
3     3     1     0.3
4     4     2     0.4
5     5     2     0.1
6     6     3     0.2
7     7     1     1.1
8     8     1     0.8
9     9     1     0.5
10   10     2     0.2

我想要对数据进行聚合,以便将具有相同 'state' 的连续行的 'returns' 值一起计算,最好是复利计算,但对于此应用程序,简单求和也可以。

我希望输出的结果类似于以下内容:

df_out <- data.frame(
  time_start = c(1,4,6,7,10),
  time_end = c(3,5,6,9,10),
  state = c(1,2,3,1,2),
  returns = c(1.0,0.5,0.2,2.4,0.2)
)

  time_start time_end state returns
1          1        3     1     1.0
2          4        5     2     0.5
3          6        6     3     0.2
4          7        9     1     2.4
5         10       10     2     0.2

在这个示例中,我只是将 'returns' 简单地累加在一起,但如前所述,如果可能的话,复合收益会更好。

有另一个 线程 接近我试图实现的目标,但它使用一个标识符 'ID',而这里的 'state' 没有。我还没有尝试太多,但考虑使用 dplyr 包中的方法,因为它们似乎对这样的任务非常高效。我对该包完全不熟悉。

英文:

Here is a mock data frame of what I'm dealing with:

df_in &lt;- data.frame(
  time = c(1,2,3,4,5,6,7,8,9,10),
  state = c(1,1,1,2,2,3,1,1,1,2),
  returns = c(0.5,0.2,0.3,0.4,0.1,0.2,1.1,0.8,0.5,0.2)
)
   time state returns
1     1     1     0.5
2     2     1     0.2
3     3     1     0.3
4     4     2     0.4
5     5     2     0.1
6     6     3     0.2
7     7     1     1.1
8     8     1     0.8
9     9     1     0.5
10   10     2     0.2

I want to aggregate the data so, that returns in consecutive rows with the same 'state' would be calulated together, preferably compounded together, but a simple sum is also fine for this application.

What I would like the output to look like would be something like this:

df_out &lt;- data.frame(
  time_start = c(1,4,6,7,10),
  time_end = c(3,5,6,9,10),
  state = c(1,2,3,1,2),
  returns = c(1.0,0.5,0.2,2.4,0.2)
)

  time_start time_end state returns
1          1        3     1     1.0
2          4        5     2     0.5
3          6        6     3     0.2
4          7        9     1     2.4
5         10       10     2     0.2

In the example I have the returns simply summed together, but as previously said, compounded returns would be optimal if manageably possible.

There is another thread which is fairly close to what I'm trying to achieve, but it uses an identifying 'ID', which the states here don't have.

I haven't really tried too much, but was thinking of using methods in the dplyr package, as they seem fairly efficient for something like this. I'm completely new to the package though.

答案1

得分: 2

你可以使用 consecutive_id 函数来创建这个新的分组,它会为一起出现的行生成唯一的 id。然后,你可以为每个这样的分组计算你所需的摘要统计信息。

library(dplyr)

df_in %>%
  mutate(consecutive_group = consecutive_id(state)) %>%
  summarise(time_start = min(time, na.rm = TRUE), 
            time_end = max(time, na.rm = TRUE), 
            state = first(state), 
            returns = sum(returns, na.rm = TRUE), .by = consecutive_group) %>%
  select(-consecutive_group)

#  time_start time_end state returns
#1          1        3     1     1.0
#2          4        5     2     0.5
#3          6        6     3     0.2
#4          7        9     1     2.4
#5         10       10     2     0.2

请注意,这是一段R代码,用于创建新的分组并计算摘要统计信息。

英文:

You may use consecutive_id function to create this new group which will have a unique id for rows that occur together. You may then calculate your preferred summary statistics for each of this group.

library(dplyr)

df_in %&gt;%
  mutate(consecutive_group = consecutive_id(state)) %&gt;%
  summarise(time_start = min(time, na.rm = TRUE), 
            time_end = max(time, na.rm = TRUE), 
            state = first(state), 
            returns = sum(returns, na.rm = TRUE), .by = consecutive_group) %&gt;%
  select(-consecutive_group)

#  time_start time_end state returns
#1          1        3     1     1.0
#2          4        5     2     0.5
#3          6        6     3     0.2
#4          7        9     1     2.4
#5         10       10     2     0.2

huangapple
  • 本文由 发表于 2023年6月29日 22:31:47
  • 转载请务必保留本文链接:https://go.coder-hub.com/76582058.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定