Pandas groupby apply (nested) slow

Question


I have a dataframe with 'category' and 'number' columns. I want to create a new column 'avg_of_largest_2_from_prev_5', computed after grouping by 'category' as the average of the two largest 'number' values among the previous 5 rows of the group, excluding the current row. For example, if the previous five values within a category are 4179, 2878, 5834, 1054 and 8673, the two largest are 8673 and 5834, so the new column gets (8673 + 5834) / 2 = 7253.5.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
n_rows = 10000
data = {'category': np.random.randint(1, 1000, n_rows), 'number': np.random.randint(1, n_rows, n_rows)}
df = pd.DataFrame(data)

%timeit df['avg_of_largest_2_from_prev_5'] = df.groupby('category')['number'].apply(lambda x: x.shift(1).rolling(5, min_periods=0).apply(lambda y: pd.Series(y).nlargest(2).mean()))
```

```python
# keep only the first category to inspect the result
df = df[df['category'] == df['category'].values[0]]
df
```

Output:

```
4.55 s ± 34.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

```
      category  number  avg_of_largest_2_from_prev_5
0          511    4179                           NaN
392        511    2878                        4179.0
1292       511    5834                        3528.5
1350       511    1054                        5006.5
1639       511    8673                        5006.5
3145       511    8506                        7253.5
4176       511     947                        8589.5
4471       511     151                        8589.5
4735       511    5326                        8589.5
4965       511    4827                        8589.5
5046       511    9792                        6916.0
5316       511    3772                        7559.0
5535       511    1095                        7559.0
5722       511    5619                        7559.0
5732       511     700                        7705.5
6825       511    1156                        7705.5
6877       511    7240                        4695.5
8100       511    2381                        6429.5
8398       511    2376                        6429.5
```

This operation takes 36 seconds with 10k rows and 1k categories. When I try it on a 1M+ row dataframe, it takes around 8 minutes. I think there should be a faster way to do what I'm trying to do, and I'd appreciate any suggestions.

Answer 1

Score: 1

A different solution is to use Polars (https://www.pola.rs/) instead of Pandas.

In my tests, the Pandas version takes 5.33 seconds, while the Polars version takes 30.1 ms, i.e. about 178 times faster.

You can install Polars with `pip install polars`.

Here is the code:

```python
import polars as pl

# convert the dataframe to Polars
df_polars = pl.from_pandas(df)

# run the calculation: number the rows within each category, shift 'number'
# by one so the current row is excluded, then average the two largest shifted
# values inside a 5-row rolling window per category
df_polars_result = df_polars.with_columns(
        pl.lit(1).alias('ones')
    ).with_columns(
        i = pl.col('ones').cumsum().over('category').cast(pl.Int32),
        num_shifted = pl.col('number').shift(1).over('category')
        ) \
    .groupby_rolling(index_column = 'i', period='5i', by='category').agg(
        pl.last('number'),
        avg_of_largest_2_from_prev_5 =
            pl.col('num_shifted').sort(descending=True).slice(0, 2).mean()
)
```

30.1 ms ± 6.37 ms per loop

```python
# Convert back to Pandas if needed
df_pandas_result = df_polars_result.to_pandas(use_pyarrow_extension_array=True)
```
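Note that the snippet above targets the Polars API of early 2023; later releases renamed `cumsum` to `cum_sum` and `groupby_rolling` to `rolling`. A minimal sketch of the same pipeline, assuming a current Polars 1.x API:

```python
# Sketch for newer Polars (assumption: Polars >= 1.0, where cumsum became
# cum_sum and groupby_rolling became rolling with a group_by argument).
df_polars_result = (
    df_polars
    .with_columns(pl.lit(1).alias('ones'))
    .with_columns(
        i=pl.col('ones').cum_sum().over('category').cast(pl.Int32),
        num_shifted=pl.col('number').shift(1).over('category'),
    )
    .rolling(index_column='i', period='5i', group_by='category')
    .agg(
        pl.last('number'),
        avg_of_largest_2_from_prev_5=pl.col('num_shifted')
        .sort(descending=True, nulls_last=True)  # keep nulls out of the top two
        .slice(0, 2)
        .mean(),
    )
)
```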


Example of the result:

```
┌──────────┬─────┬────────┬──────────────────────────────┐
│ category ┆ i   ┆ number ┆ avg_of_largest_2_from_prev_5 │
│ ---      ┆ --- ┆ ---    ┆ ---                          │
│ i64      ┆ i32 ┆ i64    ┆ f64                          │
╞══════════╪═════╪════════╪══════════════════════════════╡
│ 511      ┆ 1   ┆ 4179   ┆ null                         │
│ 511      ┆ 2   ┆ 2878   ┆ 4179.0                       │
│ 511      ┆ 3   ┆ 5834   ┆ 3528.5                       │
│ 511      ┆ 4   ┆ 1054   ┆ 5006.5                       │
│ ...      ┆ ... ┆ ...    ┆ ...                          │
│ 511      ┆ 16  ┆ 1156   ┆ 7705.5                       │
│ 511      ┆ 17  ┆ 7240   ┆ 4695.5                       │
│ 511      ┆ 18  ┆ 2381   ┆ 6429.5                       │
│ 511      ┆ 19  ┆ 2376   ┆ 6429.5                       │
└──────────┴─────┴────────┴──────────────────────────────┘
```
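One caveat: `groupby_rolling` returns the rows ordered by category, not in the original row order. A minimal sketch for writing the result back into the Pandas frame, assuming the pipeline is extended to carry the original row position (`row_nr` is an illustrative name, added here with `with_row_count` from that era's API):

```python
# Hedged sketch: carry the original row position through the pipeline.
# 'row_nr' is illustrative and must be created before the rolling aggregation.
df_polars = pl.from_pandas(df).with_row_count('row_nr')

# ... rerun the pipeline above, also keeping pl.last('row_nr') in .agg(...),
# then restore the original order and write the column back:
aligned = df_polars_result.sort('row_nr')
df['avg_of_largest_2_from_prev_5'] = aligned['avg_of_largest_2_from_prev_5'].to_pandas()
```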
Answer 2

Score: 0

Your sample:
```python
import time

start = time.time()
df['avg_of_largest_2_from_prev_5'] = df.groupby('category')['number'].apply(
    lambda x: x.shift(1).rolling(5, min_periods=0).apply(lambda y: pd.Series(y).nlargest(2).mean()))
end = time.time()

>>> end - start
1.6836352348327637
```

Speed improvement

```python
start = time.time()
df['avg_of_largest_2_from_prev_5'] = \
    (df.groupby('category')['number']
       .apply(lambda s: s.shift(1).rolling(5, min_periods=0).agg(lambda s: s.nlargest(2).mean()))
    )
end = time.time()

>>> end - start
1.623293399810791
```

Speed improvement is 3.7% compared to the sample.

With the chosen sample we can't really see the efficiency of the `agg` function compared to the `apply` function. If, on the other hand, we reduce the number of categories to 10, so that each group feeds correspondingly more rows to the `agg` function, the change becomes very significant.

```python
import pandas as pd
import numpy as np
import time

np.random.seed(123)
n_rows = 10000
data = {'category': np.random.randint(1, 10, n_rows), 'number': np.random.randint(1, n_rows, n_rows)}
df = pd.DataFrame(data)

start = time.time()
df['avg_of_largest_2_from_prev_5'] = df.groupby('category')['number'].apply(
    lambda x: x.shift(1).rolling(5, min_periods=0).apply(lambda y: pd.Series(y).nlargest(2).mean()))
end = time.time()
print("First method", end - start)

start = time.time()
df['avg_of_largest_2_from_prev_5'] = \
    (df.groupby('category')['number']
       .apply(lambda s: s.shift(1).rolling(5, min_periods=0).agg(lambda s: s.nlargest(2).mean()))
    )
end = time.time()
print("Second method", end - start)
```

```
First method 1.420635461807251
Second method 1.339367151260376
```

Improvement is 6% in this case.

And now I test with the following parameters:

```python
n_rows = 1000000
data = {'category': np.random.randint(1, 10, n_rows)...
```

and get these results, an improvement of 9.5%:

```
First method 139.25211119651794
Second method 127.48456478118896
```

Conclusion

The more data you have and the fewer categories you have, the more significantly the second method outperforms the first.
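For reference, the same statistic can also be computed without any per-window Python callback. This is a minimal sketch, not from either answer above: it materializes the five per-category lags with `groupby().shift()` and reduces them in NumPy, matching the question's `rolling(5, min_periods=0).nlargest(2).mean()` semantics under that assumption.

```python
import numpy as np
import pandas as pd

# Previous five 'number' values within each category, one column per lag.
g = df.groupby('category')['number']
lags = np.column_stack([g.shift(k).to_numpy() for k in range(1, 6)])

# Push missing lags to -inf so they never enter the top 2, sort each row
# ascending: the two largest values end up in the last two columns.
filled = np.where(np.isnan(lags), -np.inf, lags)
top2 = np.sort(filled, axis=1)[:, -2:]
top2[np.isinf(top2)] = np.nan  # restore missing slots

# Mean of the (up to) two largest; rows with no history stay NaN.
valid = np.isfinite(top2).sum(axis=1)
df['avg_of_largest_2_from_prev_5'] = np.where(
    valid > 0, np.nansum(top2, axis=1) / np.maximum(valid, 1), np.nan
)
```

Since this does only a handful of vectorized passes over the data, its cost is largely independent of the number of categories; benchmark it against the `agg` variant on your own data.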


Posted by huangapple on 2023-03-07 05:16:26. If reposting, please keep the original link: https://go.coder-hub.com/75655926.html