Python & pandas: Batching data where difference between timestamps < set value

Question

I'm trying to create a data set in Python (preferably with pandas) that groups together all rows where the time between one entry's end_time and the next entry's start_time is less than 10 minutes.

Example data:

| activity | start_time | end_time   |
| -------- | ---------- | ---------- |
| foo      | 9:08:34am  | 9:11:27am  |
| bar      | 9:12:14am  | 10:28:41am |
| baz      | 2:38:11pm  | 2:41:19pm  |
| bay      | 2:41:33pm  | 2:48:53pm  |

In the above, the solution would batch together foo/bar rows as one output, and baz/bay rows for another.

Some traits of the data:

  • No times overlap (i.e., for any given time there is at most one entry whose start_time is before it and whose end_time is after it)
  • There may be hundreds/thousands of rows per "batch"
  • A batch may go through midnight

I realize this may well be a common problem, but I can't figure out how to solve it elegantly or, frankly, how to google it. Any thoughts appreciated.

Answer 1

Score: 1

Try this:

import pandas as pd

start_time = pd.to_datetime(df["start_time"])
end_time = pd.to_datetime(df["end_time"])

# Since your data has no date, every value gets the same dummy date. A row is
# considered to be on the next day if its start_time is before the previous
# row's start_time. This requires your data to be sorted already.
is_next_day = start_time.diff().dt.total_seconds() < 0

# A row whose end_time is before its own start_time runs past midnight, so its
# end_time needs one extra day.
ends_next_day = (end_time < start_time).astype(int)

# Increment the dummy date to handle the midnight crossover
delta = pd.to_timedelta(is_next_day.cumsum(), "D")
start_time += delta
end_time += delta + pd.to_timedelta(ends_next_day, "D")

# Assign each row a batch number based on the gap to the previous row.
# "min" replaces the "T" alias, which newer pandas versions deprecate.
# A gap of exactly 10 minutes starts a new batch, because the question only
# groups gaps shorter than 10 minutes.
gap = pd.to_timedelta(10, "min")
exceed_gap = (start_time - end_time.shift()) >= gap
df["batch_number"] = exceed_gap.cumsum()
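For reference, here is a minimal end-to-end sketch of the same idea applied to the sample data from the question. The helper name `assign_batches`, its parameters, and the explicit `%I:%M:%S%p` time format are assumptions made for illustration; they are not part of the original answer. The sketch also assumes the rows are already sorted and that no single row lasts longer than 24 hours.

```python
import pandas as pd


def assign_batches(df, max_gap="10min", time_format="%I:%M:%S%p"):
    """Return a copy of df with a batch_number column (hypothetical helper)."""
    start = pd.to_datetime(df["start_time"], format=time_format)
    end = pd.to_datetime(df["end_time"], format=time_format)

    # A row that starts earlier than the previous row started is on the next day.
    day_offset = (start.diff() < pd.Timedelta(0)).cumsum()
    # A row that ends earlier than it starts runs past midnight.
    wraps = (end < start).astype(int)

    start = start + pd.to_timedelta(day_offset, unit="D")
    end = end + pd.to_timedelta(day_offset + wraps, unit="D")

    # A new batch begins when the gap to the previous row reaches max_gap.
    new_batch = (start - end.shift()) >= pd.Timedelta(max_gap)

    out = df.copy()
    out["batch_number"] = new_batch.cumsum()
    return out


df = pd.DataFrame(
    {
        "activity": ["foo", "bar", "baz", "bay"],
        "start_time": ["9:08:34am", "9:12:14am", "2:38:11pm", "2:41:33pm"],
        "end_time": ["9:11:27am", "10:28:41am", "2:41:19pm", "2:48:53pm"],
    }
)

batched = assign_batches(df)

# One output row per batch: member activities plus first start and last end.
summary = batched.groupby("batch_number").agg(
    activities=("activity", list),
    start_time=("start_time", "first"),
    end_time=("end_time", "last"),
)
print(summary)
```

With the sample data, foo and bar end up in batch 0 and baz and bay in batch 1. If the real data carries full dates, the dummy-date handling becomes unnecessary and only the gap comparison and `cumsum` are needed.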

Posted by huangapple on March 4, 2023
Source: https://go.coder-hub.com/75630411.html