# Pandas groupby apply (nested) slow

## Question
I have a dataframe with 'category' and 'number' columns. I want to create a new column 'avg_of_largest_2_from_prev_5', calculated after grouping by 'category': for each row it is the mean of the two largest 'number' values among the previous 5 rows of the group, excluding the current row.
```python
import numpy as np
import pandas as pd

np.random.seed(123)
n_rows = 10000
data = {'category': np.random.randint(1, 1000, n_rows), 'number': np.random.randint(1, n_rows, n_rows)}
df = pd.DataFrame(data)

%timeit df['avg_of_largest_2_from_prev_5'] = df.groupby('category')['number'].apply(lambda x: x.shift(1).rolling(5, min_periods=0).apply(lambda y: pd.Series(y).nlargest(2).mean()))

# show only the rows belonging to the first category
df = df[df['category'] == df['category'].values[0]]
df
```
Output:

```
4.55 s ± 34.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

```
      category  number  avg_of_largest_2_from_prev_5
0          511    4179                           NaN
392        511    2878                        4179.0
1292       511    5834                        3528.5
1350       511    1054                        5006.5
1639       511    8673                        5006.5
3145       511    8506                        7253.5
4176       511     947                        8589.5
4471       511     151                        8589.5
4735       511    5326                        8589.5
4965       511    4827                        8589.5
5046       511    9792                        6916.0
5316       511    3772                        7559.0
5535       511    1095                        7559.0
5722       511    5619                        7559.0
5732       511     700                        7705.5
6825       511    1156                        7705.5
6877       511    7240                        4695.5
8100       511    2381                        6429.5
8398       511    2376                        6429.5
```
This operation takes 36 seconds with 10k rows and 1k categories. When I try it on a dataframe with 1M+ rows, it takes around 8 minutes. I think there should be a faster way to do what I'm trying to do, and I'd appreciate any suggestions.
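To make the expected output concrete, two of the values in the table above can be reproduced by hand; the lists below are read directly from the 'number' column of category 511:

```python
# Hand check for index 1292 (third row of category 511): only two
# previous values exist at that point, so both are the "top 2".
prev = [4179, 2878]
top2 = sorted(prev, reverse=True)[:2]
print(sum(top2) / len(top2))  # 3528.5

# Hand check for index 5046: the previous 5 values in category 511.
prev = [8506, 947, 151, 5326, 4827]
top2 = sorted(prev, reverse=True)[:2]  # [8506, 5326]
print(sum(top2) / len(top2))  # 6916.0
```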
## Answer 1

**Score**: 1
A different solution is to use Polars (https://www.pola.rs/) instead of Pandas.
In my tests, the Pandas version takes 5.33 seconds, while the Polars version takes 30.1 ms, so about 178 times faster.
You can install Polars with `pip install polars`.
Here is the code:
```python
import polars as pl

# convert the dataframe to Polars
df_polars = pl.from_pandas(df)

# run the calculation
df_polars_result = df_polars.with_columns(
    pl.lit(1).alias('ones')
).with_columns(
    i=pl.col('ones').cumsum().over('category').cast(pl.Int32),
    num_shifted=pl.col('number').shift(1).over('category'),
).groupby_rolling(index_column='i', period='5i', by='category').agg(
    pl.last('number'),
    avg_of_largest_2_from_prev_5=pl.col('num_shifted').sort(descending=True).slice(0, 2).mean(),
)
```

```
30.1 ms ± 6.37 ms per loop
```
```python
# Convert back to Pandas if needed
df_pandas_result = df_polars_result.to_pandas(use_pyarrow_extension_array=True)
```

Example of the result:
```
┌──────────┬─────┬────────┬──────────────────────────────┐
│ category ┆ i   ┆ number ┆ avg_of_largest_2_from_prev_5 │
│ ---      ┆ --- ┆ ---    ┆ ---                          │
│ i64      ┆ i32 ┆ i64    ┆ f64                          │
╞══════════╪═════╪════════╪══════════════════════════════╡
│ 511      ┆ 1   ┆ 4179   ┆ null                         │
│ 511      ┆ 2   ┆ 2878   ┆ 4179.0                       │
│ 511      ┆ 3   ┆ 5834   ┆ 3528.5                       │
│ 511      ┆ 4   ┆ 1054   ┆ 5006.5                       │
│ ...      ┆ ... ┆ ...    ┆ ...                          │
│ 511      ┆ 16  ┆ 1156   ┆ 7705.5                       │
│ 511      ┆ 17  ┆ 7240   ┆ 4695.5                       │
│ 511      ┆ 18  ┆ 2381   ┆ 6429.5                       │
│ 511      ┆ 19  ┆ 2376   ┆ 6429.5                       │
└──────────┴─────┴────────┴──────────────────────────────┘
```
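For reference, `i` is a per-category row counter, and `groupby_rolling` with `period='5i'` gives each row a window over itself and the previous four rows of its category; since `num_shifted` is `number` shifted by one within the category, that window spans exactly the five previous values, excluding the current row. Note that `groupby_rolling` and `cumsum` belong to the Polars API of the time; recent releases renamed them. A rough equivalent assuming Polars >= 1.0 (a hypothetical port, not benchmarked; check your installed version's docs):

```python
import polars as pl

df_polars = pl.from_pandas(df)

# Same query with post-1.0 names: cum_sum instead of cumsum, and
# DataFrame.rolling(..., group_by=...) instead of groupby_rolling(..., by=...).
df_polars_result = df_polars.with_columns(
    pl.lit(1).alias('ones')
).with_columns(
    i=pl.col('ones').cum_sum().over('category').cast(pl.Int32),
    num_shifted=pl.col('number').shift(1).over('category'),
).rolling(index_column='i', period='5i', group_by='category').agg(
    pl.last('number'),
    avg_of_largest_2_from_prev_5=pl.col('num_shifted')
        .sort(descending=True)
        .slice(0, 2)
        .mean(),
)
```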
## Answer 2

**Score**: 0
Your sample:

```python
import time

# df as built in the question (10k rows, 1k categories)
start = time.time()
df['avg_of_largest_2_from_prev_5'] = df.groupby('category')['number'].apply(
    lambda x: x.shift(1).rolling(5, min_periods=0).apply(lambda y: pd.Series(y).nlargest(2).mean())
)
end = time.time()

>>> end - start
1.6836352348327637
```
Speed improvement:

```python
start = time.time()
df['avg_of_largest_2_from_prev_5'] = (
    df.groupby('category')['number']
      .apply(lambda s: s.shift(1).rolling(5, min_periods=0).agg(lambda s: s.nlargest(2).mean()))
)
end = time.time()

>>> end - start
1.623293399810791
```
The speed improvement is 3.7% compared to the sample. With the chosen sample we can't really see the efficiency of the `agg` function compared to the `apply` function. If, on the other hand, we reduce the number of categories to 10, so that each group feeds correspondingly more rows to the `agg` function, the change becomes much more noticeable.
```python
import pandas as pd
import numpy as np
import time

np.random.seed(123)
n_rows = 10000
data = {'category': np.random.randint(1, 10, n_rows), 'number': np.random.randint(1, n_rows, n_rows)}
df = pd.DataFrame(data)

start = time.time()
df['avg_of_largest_2_from_prev_5'] = df.groupby('category')['number'].apply(
    lambda x: x.shift(1).rolling(5, min_periods=0).apply(lambda y: pd.Series(y).nlargest(2).mean())
)
end = time.time()
print("First method", end - start)

start = time.time()
df['avg_of_largest_2_from_prev_5'] = (
    df.groupby('category')['number']
      .apply(lambda s: s.shift(1).rolling(5, min_periods=0).agg(lambda s: s.nlargest(2).mean()))
)
end = time.time()
print("Second method", end - start)
```
```
First method 1.420635461807251
Second method 1.339367151260376
```
The improvement is 6% in this case.
And now, testing with the following parameters:

```python
n_rows = 1000000
data = {'category': np.random.randint(1, 10, n_rows)...
```

I get these results, with an improvement of 9.5%:

```
First method 139.25211119651794
Second method 127.48456478118896
```
Conclusion

The more data you have and the fewer categories you have, the more significant the second method's performance advantage becomes.
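As a further idea beyond swapping `apply` for `agg` (not benchmarked in this answer): most of the remaining cost is the per-window Python work, namely building a `pd.Series` and calling `nlargest` for every window. Passing raw NumPy arrays to `rolling.apply` via `raw=True` and selecting the top two values with `np.partition` avoids that overhead. A minimal sketch, assuming the `df` from the question; `top2_mean` is an illustrative helper name, not part of the answers above:

```python
import numpy as np
import pandas as pd

def top2_mean(window: np.ndarray) -> float:
    """Mean of the two largest non-NaN values in the window."""
    vals = window[~np.isnan(window)]
    if vals.size == 0:
        return np.nan
    if vals.size == 1:
        return float(vals[0])
    # np.partition places the two largest values in the last two slots
    return float(np.partition(vals, vals.size - 2)[-2:].mean())

# transform keeps the original index, so the result aligns with df
df['avg_of_largest_2_from_prev_5'] = df.groupby('category')['number'].transform(
    lambda s: s.shift(1).rolling(5, min_periods=0).apply(top2_mean, raw=True)
)
```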