2023年8月10日 22:40:14go评论151阅读模式

英文:

How to group events that happened close in time

问题

我想要通过为发生在彼此在时间上接近的事件分配一个“组ID”来对数据行进行分组。

例如，考虑有一台计算机的事件日志，您想要将发生在彼此接近的事件分组在一起。

tibble::tribble(
  ~event_id,          ~happened_at,
      "xyz", "2023-07-31 13:35:06",
      "tsv", "2023-07-31 13:35:07",
      "abc", "2023-07-31 13:41:30",
      "fgh", "2023-07-31 13:42:05",
      "dda", "2023-07-31 13:42:12",
      "ggf",  "2023-08-01 4:43:15",
      "oor", "2023-08-01 13:49:36",
      "wqe", "2023-08-01 14:33:10",
      "oop", "2023-08-01 14:34:14"
  )

我想要添加另一列来表示事件组，如果2个或更多事件在几秒钟内发生，它们应该获得相同的“组ID”。否则，单个事件将获得自己的组ID。

# 期望的输出
tibble::tribble(
  ~event_id,          ~happened_at, ~group_id,
      "xyz", "2023-07-31 13:35:06",        1,
      "tsv", "2023-07-31 13:35:07",        1,
      "abc", "2023-07-31 13:41:30",        2,
      "fgh", "2023-07-31 13:42:05",        2,
      "dda", "2023-07-31 13:42:12",        2,
      "ggf",  "2023-08-01 4:43:15",        3,
      "oor", "2023-08-01 13:49:36",        4,
      "wqe", "2023-08-01 14:33:10",        5,
      "oop", "2023-08-01 14:34:14",        5
  )

虽然这似乎是一个基本的问题，但我想不出如何做到。对于这样的问题，有任何想法或“最佳实践”吗？

英文:

I want to group – by assigning a "group ID" – rows of data for events that happened in temporal proximity to each other.

For example, consider having a log of events from a computer, and you want to group together events that happened close to each other.

tibble::tribble(
  ~event_id,          ~happened_at,
      &quot;xyz&quot;, &quot;2023-07-31 13:35:06&quot;,
      &quot;tsv&quot;, &quot;2023-07-31 13:35:07&quot;,
      &quot;abc&quot;, &quot;2023-07-31 13:41:30&quot;,
      &quot;fgh&quot;, &quot;2023-07-31 13:42:05&quot;,
      &quot;dda&quot;, &quot;2023-07-31 13:42:12&quot;,
      &quot;ggf&quot;,  &quot;2023-08-01 4:43:15&quot;,
      &quot;oor&quot;, &quot;2023-08-01 13:49:36&quot;,
      &quot;wqe&quot;, &quot;2023-08-01 14:33:10&quot;,
      &quot;oop&quot;, &quot;2023-08-01 14:34:14&quot;
  )

I want to add another column that accounts for groups of events, such that if 2 or more events happened within several seconds, they should get the same "group ID". Otherwise, single events would get a group ID of their own.

# desired output
tibble::tribble(
  ~event_id,          ~happened_at, ~group_id,
      &quot;xyz&quot;, &quot;2023-07-31 13:35:06&quot;,        1,
      &quot;tsv&quot;, &quot;2023-07-31 13:35:07&quot;,        1,
      &quot;abc&quot;, &quot;2023-07-31 13:41:30&quot;,        2,
      &quot;fgh&quot;, &quot;2023-07-31 13:42:05&quot;,        2,
      &quot;dda&quot;, &quot;2023-07-31 13:42:12&quot;,        2,
      &quot;ggf&quot;,  &quot;2023-08-01 4:43:15&quot;,        3,
      &quot;oor&quot;, &quot;2023-08-01 13:49:36&quot;,        4,
      &quot;wqe&quot;, &quot;2023-08-01 14:33:10&quot;,        5,
      &quot;oop&quot;, &quot;2023-08-01 14:34:14&quot;,        5
  )

While this seems like a basic problem, I cannot think of a way to do it. Any idea or "best practice" on such a thing?

答案1

得分: 1

需要首先定义接近性。为了演示目的，我已将其定义为2分钟，但您可以根据您的数据集选择它。

library(dplyr)
library(lubridate)
cutoff_secs <- 120L
df %>%
  mutate(happened_at = ymd_hms(happened_at)) %>%
  arrange(happened_at) %>%
  mutate(group_id = cumsum(as.integer(difftime(happened_at, 
                           lag(happened_at, default = first(happened_at)), 
                           units = "secs")) > cutoff_secs) + 1)
#  event_id happened_at         group_id
#1 xyz      2023-07-31 13:35:06        1
#2 tsv      2023-07-31 13:35:07        1
#3 abc      2023-07-31 13:41:30        2
#4 fgh      2023-07-31 13:42:05        2
#5 dda      2023-07-31 13:42:12        2
#6 ggf      2023-08-01 04:43:15        3
#7 oor      2023-08-01 13:49:36        4
#8 wqe      2023-08-01 14:33:10        5
#9 oop      2023-08-01 14:34:14        5

在difftime中，我们计算了happened_at的连续值之间的差异，如果它们超过了cutoff_secs，则增加了group_id的计数。

英文:

You need to define the proximity first. For demonstration purposes, I have defined it as 2 mins but you can chose it based on your dataset.

library(dplyr)
library(lubridate)
cutoff_secs &lt;- 120L 
df %&gt;%
  mutate(happened_at = ymd_hms(happened_at)) %&gt;%
  arrange(happened_at) %&gt;%
  mutate(group_id = cumsum(as.integer(difftime(happened_at, 
                           lag(happened_at, default = first(happened_at)), 
                           units = &quot;secs&quot;)) &gt; cutoff_secs) + 1)
#  event_id happened_at         group_id
#  &lt;chr&gt;    &lt;dttm&gt;                 &lt;dbl&gt;
#1 xyz      2023-07-31 13:35:06        1
#2 tsv      2023-07-31 13:35:07        1
#3 abc      2023-07-31 13:41:30        2
#4 fgh      2023-07-31 13:42:05        2
#5 dda      2023-07-31 13:42:12        2
#6 ggf      2023-08-01 04:43:15        3
#7 oor      2023-08-01 13:49:36        4
#8 wqe      2023-08-01 14:33:10        5
#9 oop      2023-08-01 14:34:14        5

In difftime we calculate the difference between the consecutive values of happened_at and if they exceed cutoff_secs increment the count of group_id.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

如何将发生在时间接近的事件分组。

问题

答案1

如何使用dbSendQuery在R中将不同的CSV文件导入到一个SQL数据库表中？

在R中使用ggplot2绘制复合2D密度和气泡图。

‘Can’t use `!!!` at top level.’ 的意思是什么，如何解决这个问题？

DataFrame列中一个特定值的频率

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。