英文:
Grouping data in R only for consecutive rows in same category
问题
这是我正在处理的模拟数据框的示例:
df_in <- data.frame(
time = c(1,2,3,4,5,6,7,8,9,10),
state = c(1,1,1,2,2,3,1,1,1,2),
returns = c(0.5,0.2,0.3,0.4,0.1,0.2,1.1,0.8,0.5,0.2)
)
time state returns
1 1 1 0.5
2 2 1 0.2
3 3 1 0.3
4 4 2 0.4
5 5 2 0.1
6 6 3 0.2
7 7 1 1.1
8 8 1 0.8
9 9 1 0.5
10 10 2 0.2
我想要对数据进行聚合,以便将具有相同 'state' 的连续行的 'returns' 值一起计算,最好是复利计算,但对于此应用程序,简单求和也可以。
我希望输出的结果类似于以下内容:
df_out <- data.frame(
time_start = c(1,4,6,7,10),
time_end = c(3,5,6,9,10),
state = c(1,2,3,1,2),
returns = c(1.0,0.5,0.2,2.4,0.2)
)
time_start time_end state returns
1 1 3 1 1.0
2 4 5 2 0.5
3 6 6 3 0.2
4 7 9 1 2.4
5 10 10 2 0.2
在这个示例中,我只是将 'returns' 简单地累加在一起,但如前所述,如果可能的话,复合收益会更好。
有另一个 线程 接近我试图实现的目标,但它使用一个标识符 'ID',而这里的 'state' 没有。我还没有尝试太多,但考虑使用 dplyr 包中的方法,因为它们似乎对这样的任务非常高效。我对该包完全不熟悉。
英文:
Here is a mock data frame of what I'm dealing with:
df_in <- data.frame(
time = c(1,2,3,4,5,6,7,8,9,10),
state = c(1,1,1,2,2,3,1,1,1,2),
returns = c(0.5,0.2,0.3,0.4,0.1,0.2,1.1,0.8,0.5,0.2)
)
time state returns
1 1 1 0.5
2 2 1 0.2
3 3 1 0.3
4 4 2 0.4
5 5 2 0.1
6 6 3 0.2
7 7 1 1.1
8 8 1 0.8
9 9 1 0.5
10 10 2 0.2
I want to aggregate the data so, that returns in consecutive rows with the same 'state' would be calulated together, preferably compounded together, but a simple sum is also fine for this application.
What I would like the output to look like would be something like this:
df_out <- data.frame(
time_start = c(1,4,6,7,10),
time_end = c(3,5,6,9,10),
state = c(1,2,3,1,2),
returns = c(1.0,0.5,0.2,2.4,0.2)
)
time_start time_end state returns
1 1 3 1 1.0
2 4 5 2 0.5
3 6 6 3 0.2
4 7 9 1 2.4
5 10 10 2 0.2
In the example I have the returns simply summed together, but as previously said, compounded returns would be optimal if manageably possible.
There is another thread which is fairly close to what I'm trying to achieve, but it uses an identifying 'ID', which the states here don't have.
I haven't really tried too much, but was thinking of using methods in the dplyr package, as they seem fairly efficient for something like this. I'm completely new to the package though.
答案1
得分: 2
你可以使用 consecutive_id
函数来创建这个新的分组,它会为一起出现的行生成唯一的 id
。然后,你可以为每个这样的分组计算你所需的摘要统计信息。
library(dplyr)
df_in %>%
mutate(consecutive_group = consecutive_id(state)) %>%
summarise(time_start = min(time, na.rm = TRUE),
time_end = max(time, na.rm = TRUE),
state = first(state),
returns = sum(returns, na.rm = TRUE), .by = consecutive_group) %>%
select(-consecutive_group)
# time_start time_end state returns
#1 1 3 1 1.0
#2 4 5 2 0.5
#3 6 6 3 0.2
#4 7 9 1 2.4
#5 10 10 2 0.2
请注意,这是一段R代码,用于创建新的分组并计算摘要统计信息。
英文:
You may use consecutive_id
function to create this new group which will have a unique id
for rows that occur together. You may then calculate your preferred summary statistics for each of this group.
library(dplyr)
df_in %>%
mutate(consecutive_group = consecutive_id(state)) %>%
summarise(time_start = min(time, na.rm = TRUE),
time_end = max(time, na.rm = TRUE),
state = first(state),
returns = sum(returns, na.rm = TRUE), .by = consecutive_group) %>%
select(-consecutive_group)
# time_start time_end state returns
#1 1 3 1 1.0
#2 4 5 2 0.5
#3 6 6 3 0.2
#4 7 9 1 2.4
#5 10 10 2 0.2
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论