March 4, 2023, 01:58:16

Python & pandas: Batching data where difference between timestamps < set value

Question

I'm trying to create a data set in Python (preferably with pandas) that groups together all rows where the time between the end_time of one entry and the start_time of the next one is less than 10 minutes.

Example data:

| activity | start_time | end_time   |
| -------- | ---------- | ---------- |
| foo      | 9:08:34am  | 9:11:27am  |
| bar      | 9:12:14am  | 10:28:41am |
| baz      | 2:38:11pm  | 2:41:19pm  |
| bay      | 2:41:33pm  | 2:48:53pm  |

In the above, the solution would batch the foo/bar rows together as one output and the baz/bay rows as another.

Some traits of the data:

* No times overlap (i.e., there is at most one entry with start_time before and end_time after any given time)
* There may be hundreds/thousands of rows per "batch"
* A batch may go through midnight

I realize this may well be a common problem, but I can't figure out quite how to (elegantly) solve it, or frankly, quite how to elegantly google it. Any thoughts appreciated.

Answer 1

Score: 1

Try this:

import pandas as pd

start_time = pd.to_datetime(df["start_time"])
end_time = pd.to_datetime(df["end_time"])
# Since your data does not have a date, we have to assign a dummy date. A row is
# considered to be in the next day if its start_time is before the previous
# row's start_time. This requires your data to be sorted already.
is_next_day = start_time.diff().dt.total_seconds() < 0
# Increment the dummy date to handle midnight crossover
delta = pd.to_timedelta(is_next_day.cumsum(), "D")
start_time += delta
end_time += delta
# Assign each row to a batch number based on the gap to the previous row.
# "min" is used for minutes; the older "T" alias is deprecated in recent pandas.
gap = pd.to_timedelta(10, "min")
exceed_gap = (start_time - end_time.shift()) > gap
df["batch_number"] = exceed_gap.cumsum()
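
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea applied to the sample data from the question. The function name `assign_batches` and the `max_gap` parameter are my own, not part of the answer above. It also makes two assumptions the answer does not: a row whose end_time is earlier than its start_time is treated as running past midnight, and a gap of exactly 10 minutes starts a new batch (the question asks for gaps strictly under 10 minutes to be grouped). Like the answer, it relies on the rows already being in chronological order.

```python
import pandas as pd


def assign_batches(df, max_gap="10min"):
    """Return a batch number per row (hypothetical helper, rows must be sorted)."""
    # Parse clock times such as "9:08:34am"; every value gets the same dummy date.
    start = pd.to_datetime(df["start_time"], format="%I:%M:%S%p")
    end = pd.to_datetime(df["end_time"], format="%I:%M:%S%p")

    # A start earlier than the previous row's start means a new day has begun.
    day = (start.diff() < pd.Timedelta(0)).cumsum()
    start = start + pd.to_timedelta(day, unit="D")
    end = end + pd.to_timedelta(day, unit="D")

    # Assumption: an end before its own start means the row runs past midnight.
    end = end + pd.to_timedelta((end < start).astype(int), unit="D")

    # A new batch starts when the gap to the previous row's end reaches max_gap.
    new_batch = (start - end.shift()) >= pd.Timedelta(max_gap)
    return new_batch.cumsum()


df = pd.DataFrame(
    {
        "activity": ["foo", "bar", "baz", "bay"],
        "start_time": ["9:08:34am", "9:12:14am", "2:38:11pm", "2:41:33pm"],
        "end_time": ["9:11:27am", "10:28:41am", "2:41:19pm", "2:48:53pm"],
    }
)
df["batch_number"] = assign_batches(df)
print(df)
```

With the batch number in place, `df.groupby("batch_number")` gives one group per batch. Note that a gap of a full day or more between rows cannot be detected from clock times alone, so real timestamps with dates are preferable if they are available.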
