Ensuring a minimum time interval between successive observations in a Pandas dataframe
Question
I have a pandas dataframe that is sorted by a date column, and I wish to enforce a minimum time interval between observations. Say for simplicity this window is 10 minutes: if my first observation occurred at 8:05 AM, then the second observation must occur at 8:15 AM or later. Any observations occurring between 8:05 and 8:15 AM must be dropped from the dataframe. Say, without loss of generality, that after dropping observations the second observation occurs at 8:17 AM. Then all observations between 8:17 and 8:27 AM are dropped to find the third data point, and the process continues.
I have a script which works, but it iterates over the rows one at a time and is excruciatingly slow because the dataframe has hundreds of thousands of rows. My current script (window is the minimum threshold in minutes):
cur_time = df.iloc[0].Date
for idx, row in df[1:].iterrows():
    time_diff = (row.Date - cur_time).total_seconds()
    if time_diff > window * 60:
        cur_time = row.Date
    else:
        df.drop(idx, inplace=True)
Can anyone think of a more speed-optimized way of doing this operation? If I make the Date column the index, are there built-in functions readily available for this?
Edit: After doing further research, the function I'm looking for is similar to df.resample(window + 'M').first(). However, the issue with using it is that my data set is sparsely spaced, i.e. I don't have data for every minute; the gap between data points could be 1 second or it could be 1 month.
Answer 1
Score: 1
Given the condition you mentioned in the comments, I don't think you can vectorize the whole computation. However, you can scan the dataset much faster:
import numpy as np

window = 10

# convert the dates to a numpy array of seconds
arr = df['Date'].values.astype(float) / 1e9

# compute a dense boolean matrix using numpy broadcasting
m = arr - arr[:, None] > window * 60

locs = []  # list of valid observations
idx = 0    # the first date is always valid
while True:
    # keep the current observation
    locs.append(idx)
    if m[idx].sum() == 0:
        # no more observations to check
        break
    # jump to the next valid observation
    idx = np.argmax(m[idx])

out = df.iloc[locs]
Output:
>>> out
Date
0 2023-06-01 00:02:10
3 2023-06-01 00:14:20
8 2023-06-01 00:24:42
11 2023-06-01 00:35:35
13 2023-06-01 00:48:39
>>> locs
[0, 3, 8, 11, 13]
import numpy as np
import pandas as pd
np.random.seed(42)
offsets = pd.to_timedelta(np.random.randint(0, 60*60, 20), unit='s')
df = (pd.DataFrame({'Date': pd.Timestamp('2023-06-01') + offsets})
        .sort_values('Date', ignore_index=True))
print(df)
# Output
Date
0 2023-06-01 00:02:10 # OK, first value is always valid
1 2023-06-01 00:05:30
2 2023-06-01 00:07:46
3 2023-06-01 00:14:20 # OK, 00:02:10 + 10min < 00:14:20
4 2023-06-01 00:18:15
5 2023-06-01 00:18:50
6 2023-06-01 00:20:38
7 2023-06-01 00:21:34
8 2023-06-01 00:24:42 # OK, 00:14:20 + 10min < 00:24:42
9 2023-06-01 00:27:18
10 2023-06-01 00:28:05
11 2023-06-01 00:35:35 # OK, 00:24:42 + 10min < 00:35:35
12 2023-06-01 00:36:09
13 2023-06-01 00:48:39 # OK, 00:35:35 + 10min < 00:48:39
14 2023-06-01 00:51:32
15 2023-06-01 00:52:51
16 2023-06-01 00:52:54
17 2023-06-01 00:56:20
18 2023-06-01 00:57:24
19 2023-06-01 00:58:27
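A note on the approach above: the broadcast matrix m is N x N, so for hundreds of thousands of rows it can require a lot of memory. Since the Date column is already sorted, a memory-friendlier sketch (my own variant, not part of the original answer) can find each next valid observation with np.searchsorted; assuming the same strict > comparison, it should produce the same locs:

import numpy as np

def min_gap_filter(dates, window_minutes):
    # positional indices of rows to keep; assumes `dates` is sorted ascending
    arr = np.asarray(dates, dtype='datetime64[s]').astype(np.int64)  # seconds since epoch
    keep = []
    idx = 0
    n = len(arr)
    while idx < n:
        keep.append(idx)
        # first position strictly more than `window_minutes` after the kept row
        idx = np.searchsorted(arr, arr[idx] + window_minutes * 60, side='right')
    return keep

# usage with the example frame above:
# out = df.iloc[min_gap_filter(df['Date'].values, 10)]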
Answer 2
Score: 1
To all future people looking at this thread:
What I found to work for my specific example is:
# my date column is named 'Date'
df.set_index('Date', inplace=True)
# window is the minimum threshold in minutes
df = df.resample(str(window) + 'min').first()
# Name is a mandatory field in the original dataset
df.dropna(subset=['Name'], inplace=True)
# txTime is a duplicate column I made of the original Date column;
# resample puts the observations into new time slots, but I want to keep the original dates
df.set_index('txTime', inplace=True)
Although in my specific case this code creates hundreds of thousands of NA rows via the resample call, it is still lightning fast compared to my original iterrows approach.
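For readers who want this as a single self-contained snippet, a minimal sketch of the same steps could look like the following (assuming, as above, columns named 'Date', 'Name', and 'txTime', and window as an integer number of minutes; the helper name resample_first is mine, not from the original answer):

import pandas as pd

def resample_first(df, window):
    # keep the first observation of each fixed `window`-minute bin,
    # drop the empty bins, and restore the original timestamps from txTime
    return (df.set_index('Date')
              .resample(f'{window}min')
              .first()
              .dropna(subset=['Name'])
              .set_index('txTime'))

# filtered = resample_first(df, 10)

Note that, unlike the iterative approach in the question, this uses fixed clock-aligned bins rather than a rolling window starting at each kept observation, so it only approximates the original requirement.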