如何加速 pandas 中 resample、idxmax 和 idxmin 列的计算?

huangapple go评论106阅读模式
英文:

How to speed up resample idxmax and idxmin column calculations in pandas?

问题

  1. def transform(x):
  2. x["open_first"] = x["open"].iloc[0]
  3. x["close_last"] = x["close"].iloc[-1]
  4. x["high_max"] = x["high"].max()
  5. x["low_min"] = x["low"].min()
  6. x["high_idxmax"] = x["high"].idxmax()
  7. x["low_idxmin"] = x["low"].idxmin()
  8. x["volume_sum"] = x["volume"].sum()
  9. x["high_first"] = x["high_idxmax"] < x["low_idxmin"]
  10. return x
  11. df_resampled = df.resample("5min").apply(transform)
英文:

I have 1 minute ohlcv data in a pandas dataframe.
I want to resample it with 5 minute intervals and see if the high of the 5 minutes was hit first, or the low, all the while keeping the ohlcv values.

Input is like

datetime open high low close
2022-01-01 10:00:00 10 12 9 11
2022-01-01 10:01:00 11 14 9 12
2022-01-01 10:02:00 12 12 8 10
2022-01-01 10:03:00 10 15 9 11
2022-01-01 10:04:00 10 12 8 11
2022-01-01 10:05:00 11 11 9 10

Output is like

datetime open high low close high_first
2022-01-01 10:00:00 10 15 8 11 0
2022-01-01 10:05:00 11 11 9 10 1

First, I tried the simple way:

I would find the indexes of where high would reach its max and low would reach its min, then I would add another column comparing those.

  1. df.resample(&quot;15min&quot;).agg({&quot;high&quot;: [&quot;idxmax&quot;, &quot;max&quot;], &quot;low&quot;: [&quot;idxmin&quot;, &quot;min&quot;]})

But I got an error:

ValueError: attempt to get argmax of an empty sequence

Because my data is not continuous (it cuts off at holidays)

So, left to my own devices, I wrote an apply function:

  1. def transform(x):
  2. x[&quot;open_first&quot;] = x[&quot;open&quot;][0]
  3. x[&quot;close_last&quot;] = x[&quot;close&quot;][-1]
  4. x[&quot;high_max&quot;] = x[&quot;high&quot;].max()
  5. x[&quot;low_min&quot;] = x[&quot;low&quot;].min()
  6. x[&quot;high_idxmax&quot;] = x[&quot;high&quot;].idxmax()
  7. x[&quot;low_idxmin&quot;] = x[&quot;low&quot;].idxmin()
  8. x[&quot;volume_sum&quot;] = x[&quot;volume&quot;].sum()
  9. x[&quot;high_first&quot;] = x[&quot;high_idxmax&quot;] &lt; x[&quot;low_idxmin&quot;]
  10. return x

But it is very slow. Is it possible to make it faster and avoid df.apply ?

答案1

得分: 1

你可以使用 DatetimeIndex.floor 处理 DatetimeIndex 的不连续 5 分钟数值:

  1. df1 = (df.groupby(df.index.floor('5min'))
  2. .agg(open_first=('open','first'),
  3. close_last=('close','last'),
  4. high_max=('high','max'),
  5. low_min=('low','min'),
  6. high_idxmax=('high','idxmax'),
  7. low_idxmin=('low','idxmin'))
  8. .assign(high_first = lambda x: x["high_idxmax"] < x["low_idxmin"]))
  9. print (df1)
  10. open_first close_last high_max low_min \
  11. datetime
  12. 2022-01-01 10:00:00 10 11 15 8
  13. 2022-01-01 10:05:00 11 10 11 9
  14. high_idxmax low_idxmin high_first
  15. datetime
  16. 2022-01-01 10:00:00 2022-01-01 10:03:00 2022-01-01 10:02:00 False
  17. 2022-01-01 10:05:00 2022-01-01 10:05:00 2022-01-01 10:05:00 False

你的解决方案创建了连续的索引,对于 idxmax/idmin 需要使用 lambda 函数:

  1. df1 = (df.groupby(pd.Grouper(freq='5Min'))
  2. .agg(open_first=('open','first'),
  3. close_last=('close','last'),
  4. high_max=('high','max'),
  5. low_min=('low','min'),
  6. high_idxmax=('high',lambda x : np.nan if x.count() == 0 else x.idxmax()),
  7. low_idxmin=('low',lambda x : np.nan if x.count() == 0 else x.idxmin()))
  8. .assign(high_first = lambda x: x["high_idxmax"] < x["low_idxmin"]))
  9. print (df1)
英文:

You can use DatetimeIndex.floor for 5 minutes values with not continuous values of DatetimeIndex:

  1. df1 = (df.groupby(df.index.floor(&#39;5min&#39;))
  2. .agg(open_first=(&#39;open&#39;,&#39;first&#39;),
  3. close_last=(&#39;close&#39;,&#39;last&#39;),
  4. high_max=(&#39;high&#39;,&#39;max&#39;),
  5. low_min=(&#39;low&#39;,&#39;min&#39;),
  6. high_idxmax=(&#39;high&#39;,&#39;idxmax&#39;),
  7. low_idxmin=(&#39;low&#39;,&#39;idxmin&#39;))
  8. .assign(high_first = lambda x: x[&quot;high_idxmax&quot;] &lt; x[&quot;low_idxmin&quot;]))
  9. print (df1)
  10. open_first close_last high_max low_min \
  11. datetime
  12. 2022-01-01 10:00:00 10 11 15 8
  13. 2022-01-01 10:05:00 11 10 11 9
  14. high_idxmax low_idxmin high_first
  15. datetime
  16. 2022-01-01 10:00:00 2022-01-01 10:03:00 2022-01-01 10:02:00 False
  17. 2022-01-01 10:05:00 2022-01-01 10:05:00 2022-01-01 10:05:00 False

Your solution create consecutive index, for idxmax/idmin are necessary lambda functions:

  1. df1 = (df.groupby(pd.Grouper(freq=&#39;5Min&#39;))
  2. .agg(open_first=(&#39;open&#39;,&#39;first&#39;),
  3. close_last=(&#39;close&#39;,&#39;last&#39;),
  4. high_max=(&#39;high&#39;,&#39;max&#39;),
  5. low_min=(&#39;low&#39;,&#39;min&#39;),
  6. high_idxmax=(&#39;high&#39;,lambda x : np.nan if x.count() == 0 else x.idxmax()),
  7. low_idxmin=(&#39;low&#39;,lambda x : np.nan if x.count() == 0 else x.idxmin()))
  8. .assign(high_first = lambda x: x[&quot;high_idxmax&quot;] &lt; x[&quot;low_idxmin&quot;]))
  9. print (df1)

huangapple
  • 本文由 发表于 2023年3月15日 20:26:11
  • 转载请务必保留本文链接:https://go.coder-hub.com/75744669.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定