Python Pandas组by任务中的多线程

huangapple go评论88阅读模式
英文:

Multi-threading in for Python Pandas Groupby Task

问题

I would like to optimize a task to aggregate data using groupby in Python Pandas. The dataset has over 20 million records. I would like to try multi-threading.

Code:

  1. varlistdic = {"var1": ["mean", "max", "min"], "var2": ["mean", "max", "min"], ..., "var20": "max"}
  2. gr = df.groupby(['userid'])
  3. df_agg = gr.agg(varlistdic)

I would appreciate assistance.

Without mult-threading, the process time was about 40 minutes.

英文:

I would like to optimize a task to aggregate data using groupby in Python Pandas. The dataset has over 20 million records. I would like to try multi-threading.

  1. df:
  2. userid var1 ... var20
  3. 1 323 ... 450
  4. 1 443 ... 357
  5. 2 467 ... 587
  6. 3 235 ... 345
  7. 3 578 ... 768
  8. 4 354 ... 365

Code:

  1. varlistdic = {"var1" : ["mean","max","min"],"var2" : ["mean","max","min"],.,"var20" : "max"}
  2. gr=df.groupby(['userid'])
  3. df_agg=gr.agg(varlistdic)

I would appreciate assistance.

Without mult-threading, the process time was about 40 minutes.

答案1

得分: 1

你有很多方法可以做到这一点,比如joblibpandarallel。在Python中,你可以使用multiprocessing模块:

  1. import pandas as pd
  2. import multiprocessing as mp
  3. import time
  4. varlistdic = {"var1": ["mean", "max", "min"],
  5. "var2": ["mean", "max", "min"],
  6. "var20": "max"}
  7. def process_user(user, df):
  8. return pd.concat([df.agg(varlistdic)], keys=[user])
  9. if __name__ == '__main__':
  10. # 在这里加载你的数据
  11. # df = pd.read_csv(...)
  12. start = time.time()
  13. with mp.Pool(mp.cpu_count()) as pool:
  14. data = pool.starmap(process_user, df.groupby('user'))
  15. out = pd.concat(data).unstack().dropna(how='all', axis=1)
  16. end = time.time()
  17. print(f"Elapsed time: {end - start:.2f} seconds")

一个类似的输入数据框如下所示:

  1. df
  2. user var1 var2 var20
  3. 0 6 123 146 226
  4. 1 4 171 129 172
  5. 2 8 111 274 226
  6. 3 1 171 203 157
  7. 4 8 189 199 142
  8. .. ... ... ... ...
  9. 95 8 116 290 228
  10. 96 5 140 163 202
  11. 97 3 137 253 231
  12. 98 8 182 141 141
  13. 99 7 147 111 238
  14. [100 rows x 4 columns]

返回一个类似的输出:

  1. out
  2. var1 var2 var20
  3. max mean min max mean min max
  4. 1 196.0 151.666667 104.0 299.0 198.444444 119.0 228.0
  5. 2 187.0 139.111111 110.0 286.0 215.222222 120.0 229.0
  6. 3 186.0 143.428571 116.0 286.0 212.428571 103.0 235.0
  7. 4 197.0 143.357143 104.0 291.0 213.357143 116.0 234.0
  8. 5 173.0 136.800000 100.0 266.0 178.400000 102.0 237.0
  9. 6 194.0 153.727273 123.0 299.0 201.636364 110.0 228.0
  10. 7 188.0 151.733333 105.0 287.0 189.200000 111.0 238.0
  11. 8 193.0 159.928571 105.0 290.0 213.357143 109.0 230.0
  12. 9 196.0 151.875000 110.0 298.0 178.812500 102.0 228.0
英文:

You have many solutions to do that like joblib or pandarallel. In python, you can use multiprocessing module:

  1. import pandas as pd
  2. import multiprocessing as mp
  3. import time
  4. varlistdic = {"var1" : ["mean","max","min"],
  5. "var2" : ["mean","max","min"],
  6. "var20" : "max"}
  7. def process_user(user, df):
  8. return pd.concat([df.agg(varlistdic)], keys=[user])
  9. if __name__ == '__main__':
  10. # Load your data here
  11. # df = pd.read_csv(...)
  12. start = time.time()
  13. with mp.Pool(mp.cpu_count()) as pool:
  14. data = pool.starmap(process_user, df.groupby('user'))
  15. out = pd.concat(data).unstack().dropna(how='all', axis=1)
  16. end = time.time()
  17. print(f"Elapsed time: {end - start:.2f} seconds")

An input dataframe like:

  1. >>> df
  2. user var1 var2 var20
  3. 0 6 123 146 226
  4. 1 4 171 129 172
  5. 2 8 111 274 226
  6. 3 1 171 203 157
  7. 4 8 189 199 142
  8. .. ... ... ... ...
  9. 95 8 116 290 228
  10. 96 5 140 163 202
  11. 97 3 137 253 231
  12. 98 8 182 141 141
  13. 99 7 147 111 238
  14. [100 rows x 4 columns]

returns an output like:

  1. >>> out
  2. var1 var2 var20
  3. max mean min max mean min max
  4. 1 196.0 151.666667 104.0 299.0 198.444444 119.0 228.0
  5. 2 187.0 139.111111 110.0 286.0 215.222222 120.0 229.0
  6. 3 186.0 143.428571 116.0 286.0 212.428571 103.0 235.0
  7. 4 197.0 143.357143 104.0 291.0 213.357143 116.0 234.0
  8. 5 173.0 136.800000 100.0 266.0 178.400000 102.0 237.0
  9. 6 194.0 153.727273 123.0 299.0 201.636364 110.0 228.0
  10. 7 188.0 151.733333 105.0 287.0 189.200000 111.0 238.0
  11. 8 193.0 159.928571 105.0 290.0 213.357143 109.0 230.0
  12. 9 196.0 151.875000 110.0 298.0 178.812500 102.0 228.0

huangapple
  • 本文由 发表于 2023年6月29日 22:37:46
  • 转载请务必保留本文链接:https://go.coder-hub.com/76582103.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定