2023年3月15日 20:26:11go评论106阅读模式

英文:

How to speed up resample idxmax and idxmin column calculations in pandas?

问题

def transform(x):
    x["open_first"] = x["open"].iloc[0]
    x["close_last"] = x["close"].iloc[-1]
    x["high_max"] = x["high"].max()
    x["low_min"] = x["low"].min()
    x["high_idxmax"] = x["high"].idxmax()
    x["low_idxmin"] = x["low"].idxmin()
    x["volume_sum"] = x["volume"].sum()
    x["high_first"] = x["high_idxmax"] < x["low_idxmin"]
    return x
df_resampled = df.resample("5min").apply(transform)

英文:

I have 1 minute ohlcv data in a pandas dataframe.
I want to resample it with 5 minute intervals and see if the high of the 5 minutes was hit first, or the low, all the while keeping the ohlcv values.

Input is like

datetime	open	high	low	close
2022-01-01 10:00:00	10	12	9	11
2022-01-01 10:01:00	11	14	9	12
2022-01-01 10:02:00	12	12	8	10
2022-01-01 10:03:00	10	15	9	11
2022-01-01 10:04:00	10	12	8	11
2022-01-01 10:05:00	11	11	9	10

Output is like

datetime	open	high	low	close	high_first
2022-01-01 10:00:00	10	15	8	11	0
2022-01-01 10:05:00	11	11	9	10	1

First, I tried the simple way:

I would find the indexes of where high would reach its max and low would reach its min, then I would add another column comparing those.

df.resample(&quot;15min&quot;).agg({&quot;high&quot;: [&quot;idxmax&quot;, &quot;max&quot;], &quot;low&quot;: [&quot;idxmin&quot;, &quot;min&quot;]})

But I got an error:

ValueError: attempt to get argmax of an empty sequence

Because my data is not continuous (it cuts off at holidays)

So, left to my own devices, I wrote an apply function:

def transform(x):
    x[&quot;open_first&quot;] = x[&quot;open&quot;][0]
    x[&quot;close_last&quot;] = x[&quot;close&quot;][-1]
    x[&quot;high_max&quot;] = x[&quot;high&quot;].max()
    x[&quot;low_min&quot;] = x[&quot;low&quot;].min()
    x[&quot;high_idxmax&quot;] = x[&quot;high&quot;].idxmax()
    x[&quot;low_idxmin&quot;] = x[&quot;low&quot;].idxmin()
    x[&quot;volume_sum&quot;] = x[&quot;volume&quot;].sum()
    x[&quot;high_first&quot;] = x[&quot;high_idxmax&quot;] &lt; x[&quot;low_idxmin&quot;]
    return x

But it is very slow. Is it possible to make it faster and avoid df.apply ?

答案1

得分: 1

你可以使用 DatetimeIndex.floor 处理 DatetimeIndex 的不连续 5 分钟数值：

df1 = (df.groupby(df.index.floor('5min'))
        .agg(open_first=('open','first'),
             close_last=('close','last'),
             high_max=('high','max'),
             low_min=('low','min'),
             high_idxmax=('high','idxmax'),
             low_idxmin=('low','idxmin'))
        .assign(high_first = lambda x: x["high_idxmax"] < x["low_idxmin"]))
print (df1)
                     open_first  close_last  high_max  low_min  \
datetime                                                         
2022-01-01 10:00:00          10          11        15        8   
2022-01-01 10:05:00          11          10        11        9   
                            high_idxmax          low_idxmin  high_first  
datetime                                                                 
2022-01-01 10:00:00 2022-01-01 10:03:00 2022-01-01 10:02:00       False  
2022-01-01 10:05:00 2022-01-01 10:05:00 2022-01-01 10:05:00       False

你的解决方案创建了连续的索引，对于 idxmax/idmin 需要使用 lambda 函数：

df1 = (df.groupby(pd.Grouper(freq='5Min'))
        .agg(open_first=('open','first'),
             close_last=('close','last'),
             high_max=('high','max'),
             low_min=('low','min'),
             high_idxmax=('high',lambda x : np.nan if x.count() == 0 else x.idxmax()),
             low_idxmin=('low',lambda x : np.nan if x.count() == 0 else x.idxmin()))
        .assign(high_first = lambda x: x["high_idxmax"] < x["low_idxmin"]))
print (df1)

英文:

You can use DatetimeIndex.floor for 5 minutes values with not continuous values of DatetimeIndex:

df1 = (df.groupby(df.index.floor(&#39;5min&#39;))
        .agg(open_first=(&#39;open&#39;,&#39;first&#39;),
             close_last=(&#39;close&#39;,&#39;last&#39;),
             high_max=(&#39;high&#39;,&#39;max&#39;),
             low_min=(&#39;low&#39;,&#39;min&#39;),
             high_idxmax=(&#39;high&#39;,&#39;idxmax&#39;),
             low_idxmin=(&#39;low&#39;,&#39;idxmin&#39;))
        .assign(high_first = lambda x: x[&quot;high_idxmax&quot;] &lt; x[&quot;low_idxmin&quot;]))
print (df1)
                     open_first  close_last  high_max  low_min  \
datetime                                                         
2022-01-01 10:00:00          10          11        15        8   
2022-01-01 10:05:00          11          10        11        9   
                            high_idxmax          low_idxmin  high_first  
datetime                                                                 
2022-01-01 10:00:00 2022-01-01 10:03:00 2022-01-01 10:02:00       False  
2022-01-01 10:05:00 2022-01-01 10:05:00 2022-01-01 10:05:00       False

Your solution create consecutive index, for idxmax/idmin are necessary lambda functions:

df1 = (df.groupby(pd.Grouper(freq=&#39;5Min&#39;))
        .agg(open_first=(&#39;open&#39;,&#39;first&#39;),
             close_last=(&#39;close&#39;,&#39;last&#39;),
             high_max=(&#39;high&#39;,&#39;max&#39;),
             low_min=(&#39;low&#39;,&#39;min&#39;),
             high_idxmax=(&#39;high&#39;,lambda x : np.nan if x.count() == 0 else x.idxmax()),
             low_idxmin=(&#39;low&#39;,lambda x : np.nan if x.count() == 0 else x.idxmin()))
        .assign(high_first = lambda x: x[&quot;high_idxmax&quot;] &lt; x[&quot;low_idxmin&quot;]))
print (df1)

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

如何加速 pandas 中 resample、idxmax 和 idxmin 列的计算？

问题

答案1

在分段式色标中绘制不同颜色之间的分隔线。

How to print all duplicates key, including full paths, and optionally with values, for nested JSON in Python?

为什么程序返回一个空响应

使用 re.sub() 如何替换所有字面上的Unicode空格？

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。