如何将发生在时间接近的事件分组。

huangapple go评论98阅读模式
英文:

How to group events that happened close in time

问题

我想要通过为发生在彼此在时间上接近的事件分配一个“组ID”来对数据行进行分组。

例如,考虑有一台计算机的事件日志,您想要将发生在彼此接近的事件分组在一起。

tibble::tribble(
  ~event_id,          ~happened_at,
      "xyz", "2023-07-31 13:35:06",
      "tsv", "2023-07-31 13:35:07",
      "abc", "2023-07-31 13:41:30",
      "fgh", "2023-07-31 13:42:05",
      "dda", "2023-07-31 13:42:12",
      "ggf",  "2023-08-01 4:43:15",
      "oor", "2023-08-01 13:49:36",
      "wqe", "2023-08-01 14:33:10",
      "oop", "2023-08-01 14:34:14"
  )

我想要添加另一列来表示事件组,如果2个或更多事件在几秒钟内发生,它们应该获得相同的“组ID”。否则,单个事件将获得自己的组ID。

# 期望的输出
tibble::tribble(
  ~event_id,          ~happened_at, ~group_id,
      "xyz", "2023-07-31 13:35:06",        1,
      "tsv", "2023-07-31 13:35:07",        1,
      "abc", "2023-07-31 13:41:30",        2,
      "fgh", "2023-07-31 13:42:05",        2,
      "dda", "2023-07-31 13:42:12",        2,
      "ggf",  "2023-08-01 4:43:15",        3,
      "oor", "2023-08-01 13:49:36",        4,
      "wqe", "2023-08-01 14:33:10",        5,
      "oop", "2023-08-01 14:34:14",        5
  )

虽然这似乎是一个基本的问题,但我想不出如何做到。对于这样的问题,有任何想法或“最佳实践”吗?

英文:

I want to group – by assigning a "group ID" – rows of data for events that happened in temporal proximity to each other.

For example, consider having a log of events from a computer, and you want to group together events that happened close to each other.

tibble::tribble(
  ~event_id,          ~happened_at,
      "xyz", "2023-07-31 13:35:06",
      "tsv", "2023-07-31 13:35:07",
      "abc", "2023-07-31 13:41:30",
      "fgh", "2023-07-31 13:42:05",
      "dda", "2023-07-31 13:42:12",
      "ggf",  "2023-08-01 4:43:15",
      "oor", "2023-08-01 13:49:36",
      "wqe", "2023-08-01 14:33:10",
      "oop", "2023-08-01 14:34:14"
  )

I want to add another column that accounts for groups of events, such that if 2 or more events happened within several seconds, they should get the same "group ID". Otherwise, single events would get a group ID of their own.

# desired output
tibble::tribble(
  ~event_id,          ~happened_at, ~group_id,
      "xyz", "2023-07-31 13:35:06",        1,
      "tsv", "2023-07-31 13:35:07",        1,
      "abc", "2023-07-31 13:41:30",        2,
      "fgh", "2023-07-31 13:42:05",        2,
      "dda", "2023-07-31 13:42:12",        2,
      "ggf",  "2023-08-01 4:43:15",        3,
      "oor", "2023-08-01 13:49:36",        4,
      "wqe", "2023-08-01 14:33:10",        5,
      "oop", "2023-08-01 14:34:14",        5
  )

While this seems like a basic problem, I cannot think of a way to do it. Any idea or "best practice" on such a thing?

答案1

得分: 1

需要首先定义接近性。为了演示目的,我已将其定义为2分钟,但您可以根据您的数据集选择它。

library(dplyr)
library(lubridate)

cutoff_secs <- 120L

df %>%
  mutate(happened_at = ymd_hms(happened_at)) %>%
  arrange(happened_at) %>%
  mutate(group_id = cumsum(as.integer(difftime(happened_at, 
                           lag(happened_at, default = first(happened_at)), 
                           units = "secs")) > cutoff_secs) + 1)

#  event_id happened_at         group_id
#1 xyz      2023-07-31 13:35:06        1
#2 tsv      2023-07-31 13:35:07        1
#3 abc      2023-07-31 13:41:30        2
#4 fgh      2023-07-31 13:42:05        2
#5 dda      2023-07-31 13:42:12        2
#6 ggf      2023-08-01 04:43:15        3
#7 oor      2023-08-01 13:49:36        4
#8 wqe      2023-08-01 14:33:10        5
#9 oop      2023-08-01 14:34:14        5 

difftime中,我们计算了happened_at的连续值之间的差异,如果它们超过了cutoff_secs,则增加了group_id的计数。

英文:

You need to define the proximity first. For demonstration purposes, I have defined it as 2 mins but you can chose it based on your dataset.

library(dplyr)
library(lubridate)

cutoff_secs &lt;- 120L 

df %&gt;%
  mutate(happened_at = ymd_hms(happened_at)) %&gt;%
  arrange(happened_at) %&gt;%
  mutate(group_id = cumsum(as.integer(difftime(happened_at, 
                           lag(happened_at, default = first(happened_at)), 
                           units = &quot;secs&quot;)) &gt; cutoff_secs) + 1)

#  event_id happened_at         group_id
#  &lt;chr&gt;    &lt;dttm&gt;                 &lt;dbl&gt;
#1 xyz      2023-07-31 13:35:06        1
#2 tsv      2023-07-31 13:35:07        1
#3 abc      2023-07-31 13:41:30        2
#4 fgh      2023-07-31 13:42:05        2
#5 dda      2023-07-31 13:42:12        2
#6 ggf      2023-08-01 04:43:15        3
#7 oor      2023-08-01 13:49:36        4
#8 wqe      2023-08-01 14:33:10        5
#9 oop      2023-08-01 14:34:14        5 

In difftime we calculate the difference between the consecutive values of happened_at and if they exceed cutoff_secs increment the count of group_id.

huangapple
  • 本文由 发表于 2023年8月10日 22:40:14
  • 转载请务必保留本文链接:https://go.coder-hub.com/76876798.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定