Python & pandas: Batching data where difference between timestamps < set value

Question

I'm trying to create a data set in Python (preferably with pandas) that groups together all rows where the time between the end_time of one entry and the start_time of the next entry is less than 10 minutes.

Example data:

| activity | start_time | end_time |
| -------- | ---------- | -------- |
| foo | 9:08:34am | 9:11:27am |
| bar | 9:12:14am | 10:28:41am |
| baz | 2:38:11pm | 2:41:19pm |
| bay | 2:41:33pm | 2:48:53pm |

In the above, the solution would batch together foo/bar rows as one output, and baz/bay rows for another.
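
To make the grouping rule concrete, here is a minimal sketch that loads the sample rows and computes the idle time between each row and the one before it. The `fmt` string and the `gap_to_previous` column are illustrative names, and the time format (12-hour clock, am/pm suffix, no date) is assumed from the example data.

```python
import pandas as pd

# Sample rows copied from the example data.
df = pd.DataFrame(
    {
        "activity": ["foo", "bar", "baz", "bay"],
        "start_time": ["9:08:34am", "9:12:14am", "2:38:11pm", "2:41:33pm"],
        "end_time": ["9:11:27am", "10:28:41am", "2:41:19pm", "2:48:53pm"],
    }
)

# Assumed format: 12-hour clock with an am/pm suffix and no date part.
fmt = "%I:%M:%S%p"
start = pd.to_datetime(df["start_time"], format=fmt)
end = pd.to_datetime(df["end_time"], format=fmt)

# Idle time between the previous row's end and this row's start.
# The first row has no predecessor, so its gap is NaT.
df["gap_to_previous"] = start - end.shift()
print(df)
```

The gaps for bar and bay are well under 10 minutes, while the gap before baz is several hours, which is why foo/bar and baz/bay form two separate batches.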

Some traits of the data:

  • No times overlap (i.e. at most one entry has a start_time before and an end_time after any given time)
  • There may be hundreds or thousands of rows per "batch"
  • A batch may span midnight

I realize this may well be a common problem, but I can't figure out how to solve it elegantly, or frankly, how to google it. Any thoughts appreciated.

Answer 1

Score: 1

Try this:

    import pandas as pd

    # Assumed format: 12-hour clock with am/pm suffix, as in the example data.
    start_time = pd.to_datetime(df["start_time"], format="%I:%M:%S%p")
    end_time = pd.to_datetime(df["end_time"], format="%I:%M:%S%p")

    # Since your data does not have a date, every time gets the same dummy date.
    # A row is considered to be on the next day if its start_time is before the
    # previous row's start_time. This requires your data to be sorted already.
    is_next_day = start_time.diff().dt.total_seconds() < 0
    # A row whose end_time is before its own start_time ends after midnight.
    ends_next_day = end_time < start_time

    # Increment the dummy date to handle the midnight crossover
    delta = pd.to_timedelta(is_next_day.cumsum(), "D")
    start_time += delta
    end_time += delta + pd.to_timedelta(ends_next_day.astype(int), "D")

    # Assign each row to a batch number based on the gap to the previous row.
    # Rows less than 10 minutes apart stay together, so a new batch starts
    # whenever the gap is 10 minutes or more.
    gap = pd.Timedelta(minutes=10)
    exceed_gap = (start_time - end_time.shift()) >= gap
    df["batch_number"] = exceed_gap.cumsum()
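
As a self-contained illustration of the same idea, the sketch below wraps the steps in a hypothetical helper, `assign_batches`, and runs it on made-up rows that include a midnight crossing. The helper name, its parameters, the `start`/`end` output columns, and the extra sample rows are all assumptions for this example. It also assumes the rows are sorted chronologically, that no row lasts 24 hours or more, and that times look like `9:08:34am`. It moves the end_time of any row that itself crosses midnight onto the next day.

```python
import pandas as pd


def assign_batches(df, max_gap=pd.Timedelta(minutes=10), fmt="%I:%M:%S%p"):
    """Return a copy of df with dated start/end columns and a batch_number.

    Hypothetical helper: assumes rows are sorted chronologically and that
    start_time/end_time are time-of-day strings without a date.
    """
    start = pd.to_datetime(df["start_time"], format=fmt)
    end = pd.to_datetime(df["end_time"], format=fmt)

    # A row that ends "before" it starts runs past midnight.
    crosses_midnight = end < start
    # Each backwards jump in start time means a new calendar day began.
    day = (start.diff() < pd.Timedelta(0)).cumsum()

    start = start + pd.to_timedelta(day, unit="D")
    end = end + pd.to_timedelta(day + crosses_midnight.astype(int), unit="D")

    # A new batch starts when the idle time reaches max_gap.
    new_batch = (start - end.shift()) >= max_gap
    return df.assign(start=start, end=end, batch_number=new_batch.cumsum())


# Made-up rows: the second batch runs through midnight.
df = pd.DataFrame(
    {
        "activity": ["foo", "bar", "baz", "bay", "qux"],
        "start_time": ["9:08:34am", "9:12:14am", "11:50:00pm", "11:59:30pm", "12:06:00am"],
        "end_time": ["9:11:27am", "10:28:41am", "11:58:00pm", "12:04:10am", "12:20:00am"],
    }
)

batched = assign_batches(df)

# One output row per batch: its activities and overall time span.
summary = batched.groupby("batch_number").agg(
    activities=("activity", list),
    batch_start=("start", "min"),
    batch_end=("end", "max"),
)
print(summary)
```

Because `batch_number` is an ordinary column, any `groupby` aggregation can be used to produce the per-batch output. The dates shown in `batch_start` and `batch_end` are dummy dates and only their differences are meaningful.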

huangapple
  • Posted on March 4, 2023 at 01:58:16
  • Source: https://go.coder-hub.com/75630411.html