Pandas groupby apply (nested) slow

I have a dataframe with 'category' and 'number' columns. I want to create a new column 'avg_of_largest_2_from_prev_5': within each 'category' group, it is the mean of the two largest 'number' values among the previous 5 rows, excluding the current row. For example, if the previous five values within a category are 4179, 2878, 5834, 1054, 8673, the new column for the next row holds (8673 + 5834) / 2 = 7253.5.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
n_rows = 10000
data = {'category': np.random.randint(1, 1000, n_rows), 'number': np.random.randint(1, n_rows, n_rows)}
df = pd.DataFrame(data)

%timeit df['avg_of_largest_2_from_prev_5'] = df.groupby('category')['number'].apply(lambda x: x.shift(1).rolling(5, min_periods=0).apply(lambda y: pd.Series(y).nlargest(2).mean()))

# Keep only the first category to inspect the result
df = df[df['category'] == df['category'].values[0]]
df
```

out:
4.55 s ± 34.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

```
        category  number  avg_of_largest_2_from_prev_5
0            511    4179                           NaN
392          511    2878                        4179.0
1292         511    5834                        3528.5
1350         511    1054                        5006.5
1639         511    8673                        5006.5
3145         511    8506                        7253.5
4176         511     947                        8589.5
4471         511     151                        8589.5
4735         511    5326                        8589.5
4965         511    4827                        8589.5
5046         511    9792                        6916.0
5316         511    3772                        7559.0
5535         511    1095                        7559.0
5722         511    5619                        7559.0
5732         511     700                        7705.5
6825         511    1156                        7705.5
6877         511    7240                        4695.5
8100         511    2381                        6429.5
8398         511    2376                        6429.5
```

This operation takes 36 seconds with 10k rows and 1k categories. When I try it on a dataframe with 1M+ rows, it takes around 8 minutes. I think there should be a faster way to do what I'm trying to do, and I'd appreciate any suggestions.

Answer 1

Score: 1

A different solution is to use Polars (https://www.pola.rs/) instead of Pandas.

In my tests, the Pandas function takes 5.33 seconds, while the Polars function takes 30.1 ms, i.e. about 178 times faster.

You can install Polars with `pip install polars`.

Here is the code:

```python
import polars as pl

# Convert the dataframe to Polars
df_polars = pl.from_pandas(df)

# Run the calculation: 'i' is a per-category row counter used as the rolling
# index, and 'num_shifted' excludes the current row via shift(1), so each
# 5-row window covers only the previous 5 rows of the group
df_polars_result = (
    df_polars
    .with_columns(pl.lit(1).alias('ones'))
    .with_columns(
        i=pl.col('ones').cumsum().over('category').cast(pl.Int32),
        num_shifted=pl.col('number').shift(1).over('category'),
    )
    .groupby_rolling(index_column='i', period='5i', by='category')
    .agg(
        pl.last('number'),
        avg_of_largest_2_from_prev_5=pl.col('num_shifted').sort(descending=True).slice(0, 2).mean(),
    )
)
```

30.1 ms ± 6.37 ms per loop

```python
# Convert back to Pandas if needed
df_pandas_result = df_polars_result.to_pandas(use_pyarrow_extension_array=True)
```


Example of the result:

```
┌──────────┬─────┬────────┬──────────────────────────────┐
│ category ┆ i   ┆ number ┆ avg_of_largest_2_from_prev_5 │
│ ---      ┆ --- ┆ ---    ┆ ---                          │
│ i64      ┆ i32 ┆ i64    ┆ f64                          │
╞══════════╪═════╪════════╪══════════════════════════════╡
│ 511      ┆ 1   ┆ 4179   ┆ null                         │
│ 511      ┆ 2   ┆ 2878   ┆ 4179.0                       │
│ 511      ┆ 3   ┆ 5834   ┆ 3528.5                       │
│ 511      ┆ 4   ┆ 1054   ┆ 5006.5                       │
│ ...      ┆ ... ┆ ...    ┆ ...                          │
│ 511      ┆ 16  ┆ 1156   ┆ 7705.5                       │
│ 511      ┆ 17  ┆ 7240   ┆ 4695.5                       │
│ 511      ┆ 18  ┆ 2381   ┆ 6429.5                       │
│ 511      ┆ 19  ┆ 2376   ┆ 6429.5                       │
└──────────┴─────┴────────┴──────────────────────────────┘
```
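One way to double-check that the pandas and Polars results agree is to rebuild the same per-category row counter on the pandas side and join the two results on it. A minimal sketch, assuming `df` still contains all categories with the pandas 'avg_of_largest_2_from_prev_5' column already computed (i.e. before the single-category filter in the question):

```python
import numpy as np

# Rebuild the per-category counter 'i' that the Polars query created
df_check = df.copy()
df_check['i'] = df_check.groupby('category').cumcount() + 1

# Align the two results on (category, i) and compare the computed columns
merged = df_check.merge(
    df_polars_result.to_pandas()[['category', 'i', 'avg_of_largest_2_from_prev_5']],
    on=['category', 'i'],
    suffixes=('_pandas', '_polars'),
)
assert np.allclose(
    merged['avg_of_largest_2_from_prev_5_pandas'].astype(float),
    merged['avg_of_largest_2_from_prev_5_polars'].astype(float),
    equal_nan=True,
)
```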
# Answer 2

**Score**: 0

Your sample:
```python
import time

start = time.time()
df['avg_of_largest_2_from_prev_5'] = df.groupby('category')['number'].apply(lambda x: x.shift(1).rolling(5, min_periods=0).apply(lambda y: pd.Series(y).nlargest(2).mean()))
end = time.time()

>>> end-start
1.6836352348327637
```

Speed improvement:

```python
start = time.time()
df['avg_of_largest_2_from_prev_5'] = (
    df.groupby('category')['number']
      .apply(lambda s: s.shift(1).rolling(5, min_periods=0).agg(lambda s: s.nlargest(2).mean()))
)
end = time.time()

>>> end-start
1.623293399810791
```

Speed improvement is 3.7% compared to the sample.

With the chosen sample we can't really see the efficiency of the agg function compared to the apply function. If, on the other hand, we reduce the number of categories to 10, so that correspondingly more rows are processed per group by the agg function, the change becomes very significant.

```python
import pandas as pd
import numpy as np
import time

np.random.seed(123)
n_rows = 10000
data = {'category': np.random.randint(1, 10, n_rows), 'number': np.random.randint(1, n_rows, n_rows)}
df = pd.DataFrame(data)

# First method: rolling apply with a nested pandas nlargest
start = time.time()
df['avg_of_largest_2_from_prev_5'] = df.groupby('category')['number'].apply(lambda x: x.shift(1).rolling(5, min_periods=0).apply(lambda y: pd.Series(y).nlargest(2).mean()))
end = time.time()
print("First method", end-start)

# Second method: same rolling window, but reduced with agg
start = time.time()
df['avg_of_largest_2_from_prev_5'] = (
    df.groupby('category')['number']
      .apply(lambda s: s.shift(1).rolling(5, min_periods=0).agg(lambda s: s.nlargest(2).mean()))
)
end = time.time()
print("Second method", end-start)
```

First method 1.420635461807251
Second method 1.339367151260376

Improvement is 6% in this case.

And now, I test the following parameters:

```python
n_rows = 1000000
data = {'category': np.random.randint(1, 10, n_rows)...
```

and get these results, an improvement of 9.5%:

First method 139.25211119651794
Second method 127.48456478118896

Conclusion

The more data you have and the fewer categories you have, the more significant the second method's performance advantage becomes.
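A possible further micro-optimization in the same spirit (a sketch, not benchmarked in this answer): pass `raw=True` so the rolling reducer receives a plain NumPy array instead of building a pandas Series per window, and take the top-2 mean with `np.sort`. The `top2_mean` helper below is illustrative, with the NaN introduced by `shift(1)` filtered out by hand:

```python
import numpy as np

def top2_mean(window):
    # The shifted series feeds NaN into the earliest windows; drop them first
    valid = window[~np.isnan(window)]
    if valid.size == 0:
        return np.nan
    # Mean of the two largest values (or of the single value if only one exists)
    return np.sort(valid)[-2:].mean()

df['avg_of_largest_2_from_prev_5'] = (
    df.groupby('category')['number']
      .apply(lambda s: s.shift(1).rolling(5, min_periods=1).apply(top2_mean, raw=True))
)
```

The grouping and windowing logic stays identical to the methods above; the potential win comes from skipping the per-window `pd.Series` construction and `nlargest` machinery.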

