Multi-threading for a Python Pandas Groupby Task

Question

I would like to speed up a task that aggregates data using groupby in Python Pandas. The dataset has over 20 million records, and I would like to try multi-threading.

df:
userid  var1  ...  var20
1       323   ...  450
1       443   ...  357
2       467   ...  587
3       235   ...  345
3       578   ...  768
4       354   ...  365

Code:

varlistdic = {"var1": ["mean", "max", "min"], "var2": ["mean", "max", "min"], ..., "var20": "max"}
gr = df.groupby(['userid'])
df_agg = gr.agg(varlistdic)

I would appreciate any assistance.

Without multi-threading, the processing time was about 40 minutes.

Answer 1

Score: 1

There are several ways to do this, such as joblib or pandarallel. In plain Python, you can use the multiprocessing module:

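import multiprocessing as mp
import time

import pandas as pd

# Aggregations to compute for each column
varlistdic = {"var1": ["mean", "max", "min"],
              "var2": ["mean", "max", "min"],
              "var20": "max"}

def process_user(user, df):
    # Aggregate one user's rows; keys=[user] adds the user ID as the outer index level
    return pd.concat([df.agg(varlistdic)], keys=[user])

if __name__ == '__main__':
    # Load your data here
    # df = pd.read_csv(...)

    start = time.time()
    # groupby yields (user, sub-frame) pairs, which starmap unpacks into process_user
    with mp.Pool(mp.cpu_count()) as pool:
        data = pool.starmap(process_user, df.groupby('user'))
    # Move the statistic names into the columns, then drop the all-NaN
    # columns for statistics that were not requested (e.g. the mean of var20)
    out = pd.concat(data).unstack().dropna(how='all', axis=1)
    end = time.time()
    print(f"Elapsed time: {end - start:.2f} seconds")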

An input dataframe like:

>>> df
    user  var1  var2  var20
0      6   123   146    226
1      4   171   129    172
2      8   111   274    226
3      1   171   203    157
4      8   189   199    142
..   ...   ...   ...    ...
95     8   116   290    228
96     5   140   163    202
97     3   137   253    231
98     8   182   141    141
99     7   147   111    238

[100 rows x 4 columns]

returns an output like:

>>> out
    var1                      var2                     var20
     max        mean    min    max        mean    min    max
1  196.0  151.666667  104.0  299.0  198.444444  119.0  228.0
2  187.0  139.111111  110.0  286.0  215.222222  120.0  229.0
3  186.0  143.428571  116.0  286.0  212.428571  103.0  235.0
4  197.0  143.357143  104.0  291.0  213.357143  116.0  234.0
5  173.0  136.800000  100.0  266.0  178.400000  102.0  237.0
6  194.0  153.727273  123.0  299.0  201.636364  110.0  228.0
7  188.0  151.733333  105.0  287.0  189.200000  111.0  238.0
8  193.0  159.928571  105.0  290.0  213.357143  109.0  230.0
9  196.0  151.875000  110.0  298.0  178.812500  102.0  228.0
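
The reshaping at the end of the code above (`unstack()` followed by `dropna(how='all', axis=1)`) may be easier to follow on a tiny frame. The sketch below runs the same per-user aggregation in a single process; the three-row frame and the two-column spec are made up for illustration and are not part of the original data.

```python
import pandas as pd

# Illustrative spec and data (assumptions, not the asker's dataset)
spec = {"var1": ["mean", "max", "min"], "var20": "max"}
df = pd.DataFrame({
    "user": [1, 1, 2],
    "var1": [10, 20, 30],
    "var20": [5, 7, 9],
})

# Same per-user step as process_user, without the pool
pieces = [pd.concat([group.agg(spec)], keys=[user])
          for user, group in df.groupby("user")]

# Rows are (user, statistic) pairs; columns are var1 and var20
stacked = pd.concat(pieces)

# unstack moves the statistic level into the columns:
# 2 variables x 3 statistics = 6 columns, two of them entirely NaN
# because only "max" was requested for var20
wide = stacked.unstack()

# Dropping the all-NaN columns leaves the four requested statistics
out = wide.dropna(how="all", axis=1)
print(out)
```

The `dropna` step is only needed because the spec mixes a list of statistics with a single statistic; if every column asked for the same statistics, `unstack()` alone would give the final layout.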

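With many distinct users, sending one task per user to the pool can spend a noticeable share of the time on scheduling and data transfer. One possible variation, sketched below, is to split the users into a few chunks and let each worker run an ordinary `groupby().agg()` on its chunk. The helper names (`aggregate_chunk`, `split_by_user`, `chunked_agg`), the chunk count, and the random demo data are all assumptions for illustration. The sketch uses a thread pool from the standard library's `concurrent.futures`, since the question asks about multi-threading.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

# Same statistics as the answer's spec; every value is a list here so the
# result has uniform two-level columns without a cleanup step.
AGG_SPEC = {
    "var1": ["mean", "max", "min"],
    "var2": ["mean", "max", "min"],
    "var20": ["max"],
}


def aggregate_chunk(chunk):
    """Run an ordinary groupby/agg on one slice of the data."""
    return chunk.groupby("user").agg(AGG_SPEC)


def split_by_user(df, n_chunks):
    """Split df into at most n_chunks parts without splitting any user."""
    # factorize gives each distinct user an integer code; the modulo
    # sends every row of a given user to the same bucket.
    bucket = pd.factorize(df["user"])[0] % n_chunks
    return [part for _, part in df.groupby(bucket)]


def chunked_agg(df, n_chunks, executor):
    """Aggregate each chunk on the executor and stitch the results together."""
    parts = split_by_user(df, n_chunks)
    results = list(executor.map(aggregate_chunk, parts))
    return pd.concat(results).sort_index()


if __name__ == "__main__":
    rng = np.random.default_rng(0)  # arbitrary seed, demo data only
    df = pd.DataFrame({
        "user": rng.integers(1, 10, size=100),
        "var1": rng.integers(100, 200, size=100),
        "var2": rng.integers(100, 300, size=100),
        "var20": rng.integers(140, 240, size=100),
    })
    with ThreadPoolExecutor(max_workers=4) as executor:
        out = chunked_agg(df, n_chunks=4, executor=executor)
    print(out)
```

Passing a `concurrent.futures.ProcessPoolExecutor` instead gives process-based parallelism, as in the answer above; in that case the call should stay under the `if __name__ == "__main__":` guard, and each chunk is pickled on its way to a worker. Whether threads help at all depends on how much of the aggregation runs with the GIL released, so it is worth timing both variants against the plain `df.groupby("user").agg(AGG_SPEC)` on a sample first.
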
huangapple
  • Posted on June 29, 2023, 22:37:46
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76582103.html