Ensuring a minimum time interval between successive observations in a Pandas dataframe
Question
I have a pandas dataframe that is sorted by a date column, and I wish to enforce a minimum time interval between observations. Say for simplicity this window is 10 minutes: if my first observation occurred at 8:05 AM, then the second observation must occur at 8:15 AM or later. Any observations occurring between 8:05 and 8:15 AM must be dropped from the dataframe. Say, without loss of generality, that after dropping observations the second observation occurs at 8:17 AM. Then all observations between 8:17 and 8:27 AM are dropped to find the third data point, and the process continues.
I have a script which works, but it iterates over the rows one at a time and is excruciatingly slow because the dataframe has hundreds of thousands of rows. My current script (window is the minimum threshold in minutes):
cur_time = df.iloc[0].Date
for idx, row in df[1:].iterrows():
    time_diff = (row.Date - cur_time).total_seconds()
    if time_diff > window * 60:
        cur_time = row.Date
    else:
        df.drop(idx, inplace=True)
Can anyone think of a more speed-optimized way of doing this operation? If I make the Date column the index, are there built-in functions readily available for this?
Edit: After doing further research, the function I'm looking for is similar to df.resample(window + 'M').first(). However, the issue with using it is that my data set is sparsely spaced, i.e. I don't have data for every minute; the gap between data points could be 1 second or it could be 1 month.
Answer 1
Score: 1
Given the condition you mentioned in the comments, I don't think you can vectorize the whole computation. However, you can scan the dataset much faster:
import numpy as np

window = 10

# convert the dates to a numpy array of seconds
arr = df['Date'].values.astype(float) / 1e9

# compute a dense boolean matrix using numpy broadcasting
m = arr - arr[:, None] > window * 60

locs = []  # list of valid observations
idx = 0    # the first date is always valid
while True:
    # keep the current observation
    locs.append(idx)
    if m[idx].sum() == 0:
        # no more observations to check
        break
    # jump to the next valid observation
    idx = np.argmax(m[idx])

out = df.iloc[locs]
Output:
>>> out
Date
0 2023-06-01 00:02:10
3 2023-06-01 00:14:20
8 2023-06-01 00:24:42
11 2023-06-01 00:35:35
13 2023-06-01 00:48:39
>>> locs
[0, 3, 8, 11, 13]
import numpy as np
import pandas as pd
np.random.seed(42)
offsets = pd.to_timedelta(np.random.randint(0, 60*60, 20), unit='s')
df = (pd.DataFrame({'Date': pd.Timestamp('2023-06-01') + offsets})
        .sort_values('Date', ignore_index=True))
print(df)
# Output
Date
0 2023-06-01 00:02:10 # OK, first value is always valid
1 2023-06-01 00:05:30
2 2023-06-01 00:07:46
3 2023-06-01 00:14:20 # OK, 00:02:10 + 10min < 00:14:20
4 2023-06-01 00:18:15
5 2023-06-01 00:18:50
6 2023-06-01 00:20:38
7 2023-06-01 00:21:34
8 2023-06-01 00:24:42 # OK, 00:14:20 + 10min < 00:24:42
9 2023-06-01 00:27:18
10 2023-06-01 00:28:05
11 2023-06-01 00:35:35 # OK, 00:24:42 + 10min < 00:35:35
12 2023-06-01 00:36:09
13 2023-06-01 00:48:39 # OK, 00:35:35 + 10min < 00:48:39
14 2023-06-01 00:51:32
15 2023-06-01 00:52:51
16 2023-06-01 00:52:54
17 2023-06-01 00:56:20
18 2023-06-01 00:57:24
19 2023-06-01 00:58:27
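A note on the approach above: the broadcast matrix m is N x N, so for hundreds of thousands of rows it can require a lot of memory. Since the Date column is already sorted, a memory-friendlier sketch (my own variant, not part of the original answer) can find each next valid observation with np.searchsorted; assuming the same strict > comparison, it should produce the same locs:

import numpy as np

def min_gap_filter(dates, window_minutes):
    # positional indices of rows to keep; assumes `dates` is sorted ascending
    arr = np.asarray(dates, dtype='datetime64[s]').astype(np.int64)  # seconds since epoch
    keep = []
    idx = 0
    n = len(arr)
    while idx < n:
        keep.append(idx)
        # first position strictly more than `window_minutes` after the kept row
        idx = np.searchsorted(arr, arr[idx] + window_minutes * 60, side='right')
    return keep

# usage with the example frame above:
# out = df.iloc[min_gap_filter(df['Date'].values, 10)]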
Answer 2
Score: 1
To all future people looking at this thread:
What I found to work for my specific example is:
# my date column is named 'Date'
df.set_index('Date', inplace=True)
# window is the minimum threshold in minutes
df = df.resample(str(window) + 'min').first()
# Name is a mandatory field in the original dataset
df.dropna(subset=['Name'], inplace=True)
# txTime is a duplicate column I made of the original Date column;
# resample puts the observations into new time slots, but I want to keep the original dates
df.set_index('txTime', inplace=True)
Although in my specific case this code creates hundreds of thousands of NA rows via the resample call, it is still lightning fast compared to my original iterrows approach.
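For readers who want this as a single self-contained snippet, a minimal sketch of the same steps could look like the following (assuming, as above, columns named 'Date', 'Name', and 'txTime', and window as an integer number of minutes; the helper name resample_first is mine, not from the original answer):

import pandas as pd

def resample_first(df, window):
    # keep the first observation of each fixed `window`-minute bin,
    # drop the empty bins, and restore the original timestamps from txTime
    return (df.set_index('Date')
              .resample(f'{window}min')
              .first()
              .dropna(subset=['Name'])
              .set_index('txTime'))

# filtered = resample_first(df, 10)

Note that, unlike the iterative approach in the question, this uses fixed clock-aligned bins rather than a rolling window starting at each kept observation, so it only approximates the original requirement.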