确保在Pandas数据框中连续观测之间存在最小时间间隔

huangapple go评论72阅读模式
英文:

Ensuring a minimum time interval between successive observations in a Pandas dataframe

问题

我有一个按日期列排序的pandas数据框。然而,我希望确保观察之间有最小的时间间隔。简单来说,假设这个窗口是10分钟,这意味着如果我的第一个观察是在上午8:05发生的,那么第二个观察必须至少在上午8:15发生。在上午8:05-8:15之间发生的任何观察都必须从数据框中删除。假设在删除观察后,第二个观察在上午8:17发生。然后,在上午8:17-8:27之间删除所有观察,以找到第三个数据点,并且这个过程继续进行。

我有一个可以工作的脚本,但是它逐行迭代行,并且对于具有数十万行的数据框来说速度非常慢。我的当前脚本(窗口是以分钟为单位的最小阈值):

cur_time = df.iloc[0].Date

for idx, row in df[1:].iterrows():
    time_diff = (row.Date - cur_time).total_seconds()
    if time_diff > window*60:
        cur_time = row.Date
    else:
        df.drop(idx, inplace=True)

有没有人能想到一个更加速度优化的方法来执行这个操作?如果我将日期列作为索引,是否有现成的函数可以执行这个功能?

编辑:经过进一步的研究,我正在寻找的函数类似于 df.resample(window + 'M').first()。然而,使用这个函数的问题是我的数据集稀疏。也就是说,我不是每分钟都有数据,数据点之间的间隔可能是1秒或1个月。

英文:

I have a pandas dataframe which is sorted by a date column. However I wish to ensure a minimum time interval between observations. Say for simplicity this window is 10 minutes, what this means is that if my first observation occurred at 8:05 AM then the second observation must occur at at least 8:15 AM. Any observations occurring between 8:05-8:15 AM must be dropped from the dataframe. Say without loss of generality that after dropping observations the second observation occurs at 8:17 AM. Then all observations between 8:17-8:27 AM are dropped to find the third data point and this process continues.

I have a script which works but iterates over the rows one at a time and is excruciatingly slow as the dataframe has hundreds of thousands of rows. My current script (window is the minimum threshold in minutes):

cur_time = df.iloc[0].Date

for idx, row in df[1:].iterrows():

time_diff = (row.Date - cur_time).total_seconds()

if time_diff > window*60:

    cur_time = row.Date

else:

    trades_df.drop(idx, inplace=True)

Can anyone think of a more speed optimized way of doing this operation? If I switch to the Date column as the index are there functions readily available for performing this function?

Edit: After doing further research the function that I'm looking for is similar to df.resample(window + 'M').first(). However the issue with using this is that my data set is sparsely spaced. I.e. I don't have data for every minute, the gap between data points could be 1 second or it could be 1 month.

答案1

得分: 1

根据您在评论中提到的条件,我认为您不能对整个代码进行向量化。但是,您可以更快地浏览数据集:

window = 10
# 将日期转换为numpy数组(以秒为单位)
arr = df['Date'].values.astype(float) / 1e9
# 使用numpy广播计算密集矩阵
m = arr - arr[:, None] > window * 60
locs = []  # 有效观察的列表
idx = 0  # 第一个日期始终有效

while True:
    # 添加当前观察
    locs.append(idx)
    if m[idx].sum() == 0:
        # 没有更多要检查的观察
        break
    # 下一个有效观察
    idx = np.argmax(m[idx])

out = df.iloc[locs]

输出:

>>> out
                  Date
0  2023-06-01 00:02:10
3  2023-06-01 00:14:20
8  2023-06-01 00:24:42
11 2023-06-01 00:35:35
13 2023-06-01 00:48:39

>>> locs
[0, 3, 8, 11, 13]

最小可复现示例

import numpy as np
import pandas as pd

np.random.seed(42)
offsets = pd.to_timedelta(np.random.randint(0, 60*60, 20), unit='S')
df = (pd.DataFrame({'Date': pd.Timestamp('2023-06-01') + offsets})
        .sort_values('Date', ignore_index=True))
print(df)

# 输出
                  Date
0  2023-06-01 00:02:10  # OK,第一个值始终有效
1  2023-06-01 00:05:30
2  2023-06-01 00:07:46
3  2023-06-01 00:14:20  # OK,00:02:10 + 10分钟 < 00:14:20
4  2023-06-01 00:18:15
5  2023-06-01 00:18:50
6  2023-06-01 00:20:38
7  2023-06-01 00:21:34
8  2023-06-01 00:24:42  # OK,00:14:20 + 10分钟 < 00:24:42
9  2023-06-01 00:27:18
10 2023-06-01 00:28:05
11 2023-06-01 00:35:35  # OK,00:24:42 + 10分钟 < 00:35:35
12 2023-06-01 00:36:09
13 2023-06-01 00:48:39  # OK,00:35:35 + 10分钟 < 00:48:39
14 2023-06-01 00:51:32
15 2023-06-01 00:52:51
16 2023-06-01 00:52:54
17 2023-06-01 00:56:20
18 2023-06-01 00:57:24
19 2023-06-01 00:58:27
英文:

With your condition mentioned in comments, I think you can't vectorize whole code. However, you can browse through the dataset faster:

window = 10
# convert date as numpy array (in seconds)
arr = df[&#39;Date&#39;].values.astype(float) / 1e9
# compute dense matrix using numpy broadcasting
m = arr - arr[:, None] &gt; window * 60
locs = []  # list of valid observations
idx = 0  # first date is always valid
while True:
# append the current observation
locs.append(idx)
if m[idx].sum() == 0:
# no more observations to check
break
# next valid observation
idx = np.argmax(m[idx])
out = df.iloc[locs]

Output:

&gt;&gt;&gt; out
Date
0  2023-06-01 00:02:10
3  2023-06-01 00:14:20
8  2023-06-01 00:24:42
11 2023-06-01 00:35:35
13 2023-06-01 00:48:39
&gt;&gt;&gt; locs
[0, 3, 8, 11, 13]

Minimal Reproducible Example:

import numpy as np
import pandas as pd
np.random.seed(42)
offsets = pd.to_timedelta(np.random.randint(0, 60*60, 20), unit=&#39;S&#39;)
df = (pd.DataFrame({&#39;Date&#39;: pd.Timestamp(&#39;2023-06-01&#39;) + offsets})
.sort_values(&#39;Date&#39;, ignore_index=True))
print(df)
# Output
Date
0  2023-06-01 00:02:10  # OK, first value is always valid
1  2023-06-01 00:05:30
2  2023-06-01 00:07:46
3  2023-06-01 00:14:20  # OK, 00:02:10 + 10min &lt; 00:14:20
4  2023-06-01 00:18:15
5  2023-06-01 00:18:50
6  2023-06-01 00:20:38
7  2023-06-01 00:21:34
8  2023-06-01 00:24:42  # OK, 00:14:20 + 10min &lt; 00:24:42
9  2023-06-01 00:27:18
10 2023-06-01 00:28:05
11 2023-06-01 00:35:35  # OK, 00:24:42 + 10min &lt; 00:35:35
12 2023-06-01 00:36:09
13 2023-06-01 00:48:39  # OK, 00:35:35 + 10min &lt; 00:48:39
14 2023-06-01 00:51:32
15 2023-06-01 00:52:51
16 2023-06-01 00:52:54
17 2023-06-01 00:56:20
18 2023-06-01 00:57:24
19 2023-06-01 00:58:27

答案2

得分: 1

给所有未来查看此线程的人:

我发现对于我的特定示例有效的方法是:

我的日期列由 'Date' 给出

df.set_index('Date', inplace=True)

窗口是以分钟为单位的最小阈值

df.resample(window + 'min').first()

在原始数据集中,Name 是一个必填字段

df.dropna('Name', inplace=True)

txTime 是我从原始日期列创建的重复列。
这是因为重新采样为观察值放入了新的时间段,但我想保留原始日期。

df.reindex('txTime', inplace=True)

尽管在我的特定情况下,此代码通过重新采样函数创建了数十万个 NA 行,但与我原来的迭代行方法相比,它仍然非常快。

英文:

To all future people looking at this thread:<br/>

What I found to work for my specific example is:<br/>

>My date column is given by 'Date'<br/>

df.set_index('Date', inplace=True)<br/>

>window is the minimum threshold in minutes<br/>

df.resample(window +'min').first()

>Name is a mandatory field in the original dataset<br/>

df.dropna('Name', inplace=True)<br/>

> txTime is a duplicate column I made of the original Date column.
This is because resample puts in new time slots for the observations
but I want to keep the original date

df.reindex('txTime', inplace=True)<br/>

Although in my specific case this code creates hundreds of thousands of NA rows from the resample function, it is still lightning fast compared to my original iterrows method.

huangapple
  • 本文由 发表于 2023年6月1日 20:00:06
  • 转载请务必保留本文链接:https://go.coder-hub.com/76381633.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定