如何将发生在时间接近的事件分组。

huangapple go评论151阅读模式
英文:

How to group events that happened close in time

问题

我想要通过为发生在彼此在时间上接近的事件分配一个“组ID”来对数据行进行分组。

例如,考虑有一台计算机的事件日志,您想要将发生在彼此接近的事件分组在一起。

  1. tibble::tribble(
  2. ~event_id, ~happened_at,
  3. "xyz", "2023-07-31 13:35:06",
  4. "tsv", "2023-07-31 13:35:07",
  5. "abc", "2023-07-31 13:41:30",
  6. "fgh", "2023-07-31 13:42:05",
  7. "dda", "2023-07-31 13:42:12",
  8. "ggf", "2023-08-01 4:43:15",
  9. "oor", "2023-08-01 13:49:36",
  10. "wqe", "2023-08-01 14:33:10",
  11. "oop", "2023-08-01 14:34:14"
  12. )

我想要添加另一列来表示事件组,如果2个或更多事件在几秒钟内发生,它们应该获得相同的“组ID”。否则,单个事件将获得自己的组ID。

  1. # 期望的输出
  2. tibble::tribble(
  3. ~event_id, ~happened_at, ~group_id,
  4. "xyz", "2023-07-31 13:35:06", 1,
  5. "tsv", "2023-07-31 13:35:07", 1,
  6. "abc", "2023-07-31 13:41:30", 2,
  7. "fgh", "2023-07-31 13:42:05", 2,
  8. "dda", "2023-07-31 13:42:12", 2,
  9. "ggf", "2023-08-01 4:43:15", 3,
  10. "oor", "2023-08-01 13:49:36", 4,
  11. "wqe", "2023-08-01 14:33:10", 5,
  12. "oop", "2023-08-01 14:34:14", 5
  13. )

虽然这似乎是一个基本的问题,但我想不出如何做到。对于这样的问题,有任何想法或“最佳实践”吗?

英文:

I want to group – by assigning a "group ID" – rows of data for events that happened in temporal proximity to each other.

For example, consider having a log of events from a computer, and you want to group together events that happened close to each other.

  1. tibble::tribble(
  2. ~event_id, ~happened_at,
  3. "xyz", "2023-07-31 13:35:06",
  4. "tsv", "2023-07-31 13:35:07",
  5. "abc", "2023-07-31 13:41:30",
  6. "fgh", "2023-07-31 13:42:05",
  7. "dda", "2023-07-31 13:42:12",
  8. "ggf", "2023-08-01 4:43:15",
  9. "oor", "2023-08-01 13:49:36",
  10. "wqe", "2023-08-01 14:33:10",
  11. "oop", "2023-08-01 14:34:14"
  12. )

I want to add another column that accounts for groups of events, such that if 2 or more events happened within several seconds, they should get the same "group ID". Otherwise, single events would get a group ID of their own.

  1. # desired output
  2. tibble::tribble(
  3. ~event_id, ~happened_at, ~group_id,
  4. "xyz", "2023-07-31 13:35:06", 1,
  5. "tsv", "2023-07-31 13:35:07", 1,
  6. "abc", "2023-07-31 13:41:30", 2,
  7. "fgh", "2023-07-31 13:42:05", 2,
  8. "dda", "2023-07-31 13:42:12", 2,
  9. "ggf", "2023-08-01 4:43:15", 3,
  10. "oor", "2023-08-01 13:49:36", 4,
  11. "wqe", "2023-08-01 14:33:10", 5,
  12. "oop", "2023-08-01 14:34:14", 5
  13. )

While this seems like a basic problem, I cannot think of a way to do it. Any idea or "best practice" on such a thing?

答案1

得分: 1

需要首先定义接近性。为了演示目的,我已将其定义为2分钟,但您可以根据您的数据集选择它。

  1. library(dplyr)
  2. library(lubridate)
  3. cutoff_secs <- 120L
  4. df %>%
  5. mutate(happened_at = ymd_hms(happened_at)) %>%
  6. arrange(happened_at) %>%
  7. mutate(group_id = cumsum(as.integer(difftime(happened_at,
  8. lag(happened_at, default = first(happened_at)),
  9. units = "secs")) > cutoff_secs) + 1)
  10. # event_id happened_at group_id
  11. #1 xyz 2023-07-31 13:35:06 1
  12. #2 tsv 2023-07-31 13:35:07 1
  13. #3 abc 2023-07-31 13:41:30 2
  14. #4 fgh 2023-07-31 13:42:05 2
  15. #5 dda 2023-07-31 13:42:12 2
  16. #6 ggf 2023-08-01 04:43:15 3
  17. #7 oor 2023-08-01 13:49:36 4
  18. #8 wqe 2023-08-01 14:33:10 5
  19. #9 oop 2023-08-01 14:34:14 5

difftime中,我们计算了happened_at的连续值之间的差异,如果它们超过了cutoff_secs,则增加了group_id的计数。

英文:

You need to define the proximity first. For demonstration purposes, I have defined it as 2 mins but you can chose it based on your dataset.

  1. library(dplyr)
  2. library(lubridate)
  3. cutoff_secs &lt;- 120L
  4. df %&gt;%
  5. mutate(happened_at = ymd_hms(happened_at)) %&gt;%
  6. arrange(happened_at) %&gt;%
  7. mutate(group_id = cumsum(as.integer(difftime(happened_at,
  8. lag(happened_at, default = first(happened_at)),
  9. units = &quot;secs&quot;)) &gt; cutoff_secs) + 1)
  10. # event_id happened_at group_id
  11. # &lt;chr&gt; &lt;dttm&gt; &lt;dbl&gt;
  12. #1 xyz 2023-07-31 13:35:06 1
  13. #2 tsv 2023-07-31 13:35:07 1
  14. #3 abc 2023-07-31 13:41:30 2
  15. #4 fgh 2023-07-31 13:42:05 2
  16. #5 dda 2023-07-31 13:42:12 2
  17. #6 ggf 2023-08-01 04:43:15 3
  18. #7 oor 2023-08-01 13:49:36 4
  19. #8 wqe 2023-08-01 14:33:10 5
  20. #9 oop 2023-08-01 14:34:14 5

In difftime we calculate the difference between the consecutive values of happened_at and if they exceed cutoff_secs increment the count of group_id.

huangapple
  • 本文由 发表于 2023年8月10日 22:40:14
  • 转载请务必保留本文链接:https://go.coder-hub.com/76876798.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定