Multi-threading for a Python Pandas Groupby Task
Question
I would like to optimize a task that aggregates data with groupby in Python Pandas. The dataset has over 20 million records, and I would like to try multi-threading.
df:
userid  var1  ...  var20
     1   323  ...    450
     1   443  ...    357
     2   467  ...    587
     3   235  ...    345
     3   578  ...    768
     4   354  ...    365
Code:
varlistdic = {"var1" : ["mean","max","min"],"var2" : ["mean","max","min"],.,"var20" : "max"}
gr=df.groupby(['userid'])
df_agg=gr.agg(varlistdic)
Without multi-threading, the process takes about 40 minutes. I would appreciate any assistance.
Answer 1
Score: 1
There are many ways to do this, for example with joblib or pandarallel. In plain Python, you can use the multiprocessing module:
import pandas as pd
import multiprocessing as mp
import time

varlistdic = {"var1": ["mean", "max", "min"],
              "var2": ["mean", "max", "min"],
              "var20": "max"}

def process_user(user, df):
    # Aggregate one user's rows and label the result with that user id.
    return pd.concat([df.agg(varlistdic)], keys=[user])

if __name__ == '__main__':
    # Load your data here
    # df = pd.read_csv(...)

    start = time.time()
    with mp.Pool(mp.cpu_count()) as pool:
        # df.groupby('user') yields (user, sub-dataframe) pairs, which
        # starmap unpacks into process_user's two arguments.
        data = pool.starmap(process_user, df.groupby('user'))
    out = pd.concat(data).unstack().dropna(how='all', axis=1)
    end = time.time()

    print(f"Elapsed time: {end - start:.2f} seconds")
An input dataframe like the following:
>>> df
    user  var1  var2  var20
0      6   123   146    226
1      4   171   129    172
2      8   111   274    226
3      1   171   203    157
4      8   189   199    142
..   ...   ...   ...    ...
95     8   116   290    228
96     5   140   163    202
97     3   137   253    231
98     8   182   141    141
99     7   147   111    238

[100 rows x 4 columns]
returns an output like:
>>> out
    var1                     var2                     var20
     max        mean    min   max        mean    min    max
1  196.0  151.666667  104.0  299.0  198.444444  119.0  228.0
2  187.0  139.111111  110.0  286.0  215.222222  120.0  229.0
3  186.0  143.428571  116.0  286.0  212.428571  103.0  235.0
4  197.0  143.357143  104.0  291.0  213.357143  116.0  234.0
5  173.0  136.800000  100.0  266.0  178.400000  102.0  237.0
6  194.0  153.727273  123.0  299.0  201.636364  110.0  228.0
7  188.0  151.733333  105.0  287.0  189.200000  111.0  238.0
8  193.0  159.928571  105.0  290.0  213.357143  109.0  230.0
9  196.0  151.875000  110.0  298.0  178.812500  102.0  228.0
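If you would rather not manage the pool yourself, the pandarallel option mentioned at the top patches groupby with a parallel_apply method. A minimal sketch, assuming pandarallel is installed (pip install pandarallel); I have not benchmarked it at the 20-million-row scale:

import pandas as pd
from pandarallel import pandarallel

# Starts one worker per core; call once per session.
pandarallel.initialize(progress_bar=False)

varlistdic = {"var1": ["mean", "max", "min"],
              "var2": ["mean", "max", "min"],
              "var20": "max"}

def agg_user(g):
    return g.agg(varlistdic)

# df = pd.read_csv(...)
# parallel_apply distributes the groups across the workers; the
# unstack/dropna mirrors the reshaping of the multiprocessing version.
out = (df.groupby('user')
         .parallel_apply(agg_user)
         .unstack()
         .dropna(how='all', axis=1))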