Python & pandas: Batching data where difference between timestamps < set value
Question
I'm trying to build a data set in Python (preferably with pandas) that groups together consecutive rows where the gap between one entry's end_time and the next entry's start_time is less than 10 minutes.
Example data:
activity | start_time | end_time |
---|---|---|
foo | 9:08:34am | 9:11:27am |
bar | 9:12:14am | 10:28:41am |
baz | 2:38:11pm | 2:41:19pm |
bay | 2:41:33pm | 2:48:53pm |
In the example above, the solution would batch the foo/bar rows into one output and the baz/bay rows into another.
Some traits of the data:
- No times overlap (i.e. at most one entry has its start_time before and its end_time after any given time)
- There may be hundreds/thousands of rows per "batch"
- A batch may go through midnight
I realize this may well be a common problem, but I can't quite figure out how to solve it elegantly or, frankly, how to google it. Any thoughts appreciated.
Answer 1
Score: 1
Try this:
import pandas as pd

start_time = pd.to_datetime(df["start_time"])
end_time = pd.to_datetime(df["end_time"])
# Since your data has no date, we have to assign a dummy date. A row is
# considered to be in the next day if its start_time is before the previous
# row's start_time. This requires your data to be sorted already.
is_next_day = start_time.diff().dt.total_seconds() < 0
# Increment the dummy date to handle midnight crossover
delta = pd.to_timedelta(is_next_day.cumsum(), "D")
start_time += delta
end_time += delta
# A row that itself spans midnight ends on the day after it starts
end_time += pd.to_timedelta((end_time < start_time).astype(int), "D")
# Assign each row a batch number based on the gap to the previous row
# ("min" replaces the deprecated "T" unit alias)
gap = pd.to_timedelta(10, "min")
exceed_gap = (start_time - end_time.shift()) > gap
df["batch_number"] = exceed_gap.cumsum()
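To try this on the question's sample rows, the same idea can be wrapped in a small function. This is only a sketch under a few assumptions: the helper name `assign_batches`, its `max_gap` and `fmt` parameters, and the `%I:%M:%S%p` format string are mine (guessed from the sample times), and the rows are assumed to be sorted chronologically with no gap longer than a day between consecutive starts.

```python
import pandas as pd


def assign_batches(df, max_gap="10min", fmt="%I:%M:%S%p"):
    """Return a copy of df with a batch_number column (hypothetical helper)."""
    # Parsing with an explicit format puts every time on the same dummy date.
    start = pd.to_datetime(df["start_time"], format=fmt)
    end = pd.to_datetime(df["end_time"], format=fmt)

    # A start earlier than the previous row's start means the clock
    # wrapped past midnight, so bump the dummy date by one day from there on.
    day = (start.diff() < pd.Timedelta(0)).cumsum()
    start = start + pd.to_timedelta(day, unit="D")
    end = end + pd.to_timedelta(day, unit="D")
    # A row that itself spans midnight ends one day after it starts.
    end = end + pd.to_timedelta((end < start).astype(int), unit="D")

    # A new batch begins wherever the gap to the previous row's end is too large.
    new_batch = (start - end.shift()) > pd.Timedelta(max_gap)
    out = df.copy()
    out["batch_number"] = new_batch.cumsum()
    return out


df = pd.DataFrame({
    "activity": ["foo", "bar", "baz", "bay"],
    "start_time": ["9:08:34am", "9:12:14am", "2:38:11pm", "2:41:33pm"],
    "end_time": ["9:11:27am", "10:28:41am", "2:41:19pm", "2:48:53pm"],
})
batched = assign_batches(df)
# One list of activities per batch
print(batched.groupby("batch_number")["activity"].agg(list))
```

With the sample data, foo and bar should land in batch 0 and baz and bay in batch 1. From there, `groupby("batch_number")` gives you each batch as its own group for whatever aggregation you need.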