英文:
How to group events that happened close in time
问题
我想要通过为发生在彼此在时间上接近的事件分配一个“组ID”来对数据行进行分组。
例如,考虑有一台计算机的事件日志,您想要将发生在彼此接近的事件分组在一起。
tibble::tribble(
~event_id, ~happened_at,
"xyz", "2023-07-31 13:35:06",
"tsv", "2023-07-31 13:35:07",
"abc", "2023-07-31 13:41:30",
"fgh", "2023-07-31 13:42:05",
"dda", "2023-07-31 13:42:12",
"ggf", "2023-08-01 4:43:15",
"oor", "2023-08-01 13:49:36",
"wqe", "2023-08-01 14:33:10",
"oop", "2023-08-01 14:34:14"
)
我想要添加另一列来表示事件组,如果2个或更多事件在几秒钟内发生,它们应该获得相同的“组ID”。否则,单个事件将获得自己的组ID。
# 期望的输出
tibble::tribble(
~event_id, ~happened_at, ~group_id,
"xyz", "2023-07-31 13:35:06", 1,
"tsv", "2023-07-31 13:35:07", 1,
"abc", "2023-07-31 13:41:30", 2,
"fgh", "2023-07-31 13:42:05", 2,
"dda", "2023-07-31 13:42:12", 2,
"ggf", "2023-08-01 4:43:15", 3,
"oor", "2023-08-01 13:49:36", 4,
"wqe", "2023-08-01 14:33:10", 5,
"oop", "2023-08-01 14:34:14", 5
)
虽然这似乎是一个基本的问题,但我想不出如何做到。对于这样的问题,有任何想法或“最佳实践”吗?
英文:
I want to group – by assigning a "group ID" – rows of data for events that happened in temporal proximity to each other.
For example, consider having a log of events from a computer, and you want to group together events that happened close to each other.
tibble::tribble(
~event_id, ~happened_at,
"xyz", "2023-07-31 13:35:06",
"tsv", "2023-07-31 13:35:07",
"abc", "2023-07-31 13:41:30",
"fgh", "2023-07-31 13:42:05",
"dda", "2023-07-31 13:42:12",
"ggf", "2023-08-01 4:43:15",
"oor", "2023-08-01 13:49:36",
"wqe", "2023-08-01 14:33:10",
"oop", "2023-08-01 14:34:14"
)
I want to add another column that accounts for groups of events, such that if 2 or more events happened within several seconds, they should get the same "group ID". Otherwise, single events would get a group ID of their own.
# desired output
tibble::tribble(
~event_id, ~happened_at, ~group_id,
"xyz", "2023-07-31 13:35:06", 1,
"tsv", "2023-07-31 13:35:07", 1,
"abc", "2023-07-31 13:41:30", 2,
"fgh", "2023-07-31 13:42:05", 2,
"dda", "2023-07-31 13:42:12", 2,
"ggf", "2023-08-01 4:43:15", 3,
"oor", "2023-08-01 13:49:36", 4,
"wqe", "2023-08-01 14:33:10", 5,
"oop", "2023-08-01 14:34:14", 5
)
While this seems like a basic problem, I cannot think of a way to do it. Any idea or "best practice" on such a thing?
答案1
得分: 1
需要首先定义接近性。为了演示目的,我已将其定义为2分钟,但您可以根据您的数据集选择它。
library(dplyr)
library(lubridate)
cutoff_secs <- 120L
df %>%
mutate(happened_at = ymd_hms(happened_at)) %>%
arrange(happened_at) %>%
mutate(group_id = cumsum(as.integer(difftime(happened_at,
lag(happened_at, default = first(happened_at)),
units = "secs")) > cutoff_secs) + 1)
# event_id happened_at group_id
#1 xyz 2023-07-31 13:35:06 1
#2 tsv 2023-07-31 13:35:07 1
#3 abc 2023-07-31 13:41:30 2
#4 fgh 2023-07-31 13:42:05 2
#5 dda 2023-07-31 13:42:12 2
#6 ggf 2023-08-01 04:43:15 3
#7 oor 2023-08-01 13:49:36 4
#8 wqe 2023-08-01 14:33:10 5
#9 oop 2023-08-01 14:34:14 5
在difftime
中,我们计算了happened_at
的连续值之间的差异,如果它们超过了cutoff_secs
,则增加了group_id
的计数。
英文:
You need to define the proximity first. For demonstration purposes, I have defined it as 2 mins but you can chose it based on your dataset.
library(dplyr)
library(lubridate)
cutoff_secs <- 120L
df %>%
mutate(happened_at = ymd_hms(happened_at)) %>%
arrange(happened_at) %>%
mutate(group_id = cumsum(as.integer(difftime(happened_at,
lag(happened_at, default = first(happened_at)),
units = "secs")) > cutoff_secs) + 1)
# event_id happened_at group_id
# <chr> <dttm> <dbl>
#1 xyz 2023-07-31 13:35:06 1
#2 tsv 2023-07-31 13:35:07 1
#3 abc 2023-07-31 13:41:30 2
#4 fgh 2023-07-31 13:42:05 2
#5 dda 2023-07-31 13:42:12 2
#6 ggf 2023-08-01 04:43:15 3
#7 oor 2023-08-01 13:49:36 4
#8 wqe 2023-08-01 14:33:10 5
#9 oop 2023-08-01 14:34:14 5
In difftime
we calculate the difference between the consecutive values of happened_at
and if they exceed cutoff_secs
increment the count of group_id
.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论