英文:
How to speed up resample idxmax and idxmin column calculations in pandas?
问题
def transform(x):
x["open_first"] = x["open"].iloc[0]
x["close_last"] = x["close"].iloc[-1]
x["high_max"] = x["high"].max()
x["low_min"] = x["low"].min()
x["high_idxmax"] = x["high"].idxmax()
x["low_idxmin"] = x["low"].idxmin()
x["volume_sum"] = x["volume"].sum()
x["high_first"] = x["high_idxmax"] < x["low_idxmin"]
return x
df_resampled = df.resample("5min").apply(transform)
英文:
I have 1 minute ohlcv data in a pandas dataframe.
I want to resample it with 5 minute intervals and see if the high of the 5 minutes was hit first, or the low, all the while keeping the ohlcv values.
Input is like
datetime | open | high | low | close |
---|---|---|---|---|
2022-01-01 10:00:00 | 10 | 12 | 9 | 11 |
2022-01-01 10:01:00 | 11 | 14 | 9 | 12 |
2022-01-01 10:02:00 | 12 | 12 | 8 | 10 |
2022-01-01 10:03:00 | 10 | 15 | 9 | 11 |
2022-01-01 10:04:00 | 10 | 12 | 8 | 11 |
2022-01-01 10:05:00 | 11 | 11 | 9 | 10 |
Output is like
datetime | open | high | low | close | high_first |
---|---|---|---|---|---|
2022-01-01 10:00:00 | 10 | 15 | 8 | 11 | 0 |
2022-01-01 10:05:00 | 11 | 11 | 9 | 10 | 1 |
First, I tried the simple way:
I would find the indexes of where high would reach its max and low would reach its min, then I would add another column comparing those.
df.resample("15min").agg({"high": ["idxmax", "max"], "low": ["idxmin", "min"]})
But I got an error:
ValueError: attempt to get argmax of an empty sequence
Because my data is not continuous (it cuts off at holidays)
So, left to my own devices, I wrote an apply function:
def transform(x):
x["open_first"] = x["open"][0]
x["close_last"] = x["close"][-1]
x["high_max"] = x["high"].max()
x["low_min"] = x["low"].min()
x["high_idxmax"] = x["high"].idxmax()
x["low_idxmin"] = x["low"].idxmin()
x["volume_sum"] = x["volume"].sum()
x["high_first"] = x["high_idxmax"] < x["low_idxmin"]
return x
But it is very slow. Is it possible to make it faster and avoid df.apply
?
答案1
得分: 1
你可以使用 DatetimeIndex.floor
处理 DatetimeIndex
的不连续 5 分钟数值:
df1 = (df.groupby(df.index.floor('5min'))
.agg(open_first=('open','first'),
close_last=('close','last'),
high_max=('high','max'),
low_min=('low','min'),
high_idxmax=('high','idxmax'),
low_idxmin=('low','idxmin'))
.assign(high_first = lambda x: x["high_idxmax"] < x["low_idxmin"]))
print (df1)
open_first close_last high_max low_min \
datetime
2022-01-01 10:00:00 10 11 15 8
2022-01-01 10:05:00 11 10 11 9
high_idxmax low_idxmin high_first
datetime
2022-01-01 10:00:00 2022-01-01 10:03:00 2022-01-01 10:02:00 False
2022-01-01 10:05:00 2022-01-01 10:05:00 2022-01-01 10:05:00 False
你的解决方案创建了连续的索引,对于 idxmax/idmin
需要使用 lambda 函数:
df1 = (df.groupby(pd.Grouper(freq='5Min'))
.agg(open_first=('open','first'),
close_last=('close','last'),
high_max=('high','max'),
low_min=('low','min'),
high_idxmax=('high',lambda x : np.nan if x.count() == 0 else x.idxmax()),
low_idxmin=('low',lambda x : np.nan if x.count() == 0 else x.idxmin()))
.assign(high_first = lambda x: x["high_idxmax"] < x["low_idxmin"]))
print (df1)
英文:
You can use DatetimeIndex.floor
for 5 minutes values with not continuous values of DatetimeIndex
:
df1 = (df.groupby(df.index.floor('5min'))
.agg(open_first=('open','first'),
close_last=('close','last'),
high_max=('high','max'),
low_min=('low','min'),
high_idxmax=('high','idxmax'),
low_idxmin=('low','idxmin'))
.assign(high_first = lambda x: x["high_idxmax"] < x["low_idxmin"]))
print (df1)
open_first close_last high_max low_min \
datetime
2022-01-01 10:00:00 10 11 15 8
2022-01-01 10:05:00 11 10 11 9
high_idxmax low_idxmin high_first
datetime
2022-01-01 10:00:00 2022-01-01 10:03:00 2022-01-01 10:02:00 False
2022-01-01 10:05:00 2022-01-01 10:05:00 2022-01-01 10:05:00 False
Your solution create consecutive index, for idxmax/idmin
are necessary lambda functions:
df1 = (df.groupby(pd.Grouper(freq='5Min'))
.agg(open_first=('open','first'),
close_last=('close','last'),
high_max=('high','max'),
low_min=('low','min'),
high_idxmax=('high',lambda x : np.nan if x.count() == 0 else x.idxmax()),
low_idxmin=('low',lambda x : np.nan if x.count() == 0 else x.idxmin()))
.assign(high_first = lambda x: x["high_idxmax"] < x["low_idxmin"]))
print (df1)
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论