避免在Pandas数据框中使用for循环遍历列值,而是使用函数。

huangapple go评论71阅读模式
英文:

Avoid for loops over colum values in a pandas dataframe with a function

问题

对于这个问题,有一种更有效的方法可以避免使用多重循环,提高性能。您可以使用Pandas的groupby和transform函数来实现。以下是如何重写您的metrics函数以提高性能的方法:

import pandas as pd

def metrics(df):
    df['out_MSE'] = df.groupby(['Level', 'Kontogruppe', 'model'])['actual_value'].transform('sum') / df.groupby(['Level', 'Kontogruppe', 'model'])['forecast_value'].transform('sum')
    return df

# 调用函数
df = metrics(df)

这种方法将数据按照'Level'、'Kontogruppe'和'model'分组,并使用transform函数计算每个组的总和,然后将结果广播到原始DataFrame的每一行。这比显式的嵌套循环要高效得多,特别是对于大型数据框。

这是一种更Pythonic的方法,能够更简洁地实现您的目标,同时提高性能。

英文:

I have the following structur of a dataframe:

df = pd.DataFrame({'Level': ["a","b", "c"], 'Kontogruppe': ["a", "a", "b"], 
                   'model': ["alpha", "beta", "alpha"], 'MSE': [0, 1 ,1],
                   'actual_value': [1,2,3], 'forecast_value': [2,2,2]}) 
                   

For this dataframe I run severel functions, for example:

def metrics(df):
    df_map= pd.DataFrame({'Level': ["a"], 'Kontogruppe': ["a"],
                            'model': ["alpha"], 'MSE': [0]})
    for i in df['Level'].unique():
        for j in df['Kontogruppe'].unique():
            for k in df['model'].unique():
                df_lkm = df.loc[(df['Level'] == i) & (df['Kontogruppe'] == j) & 
                                    (df['model'] == k)]
                if df_lkm.empty:
                    out_MSE = 10000000000                   

                else:
                    out_MSE = sum(df_lkm['actual_value'])/sum(df_lkm['forecast_value'])                    

                df_map_map = pd.DataFrame({'Level': [i], 'Kontogruppe': [j], 'model': [k], 
                                        'out_MSE': [out_MSE]}) 
                df_map = pd.concat([df_map, df_map_map])
            

                
    df = pd.merge(df, df_map, how='left', on=['Level', 'Kontogruppe', 'model'])   
                    
    return df

df = metrics(df)

so basically I loop over the unique column values and filter the dataframe based on this.
In this case I get for every Level, Kontogruppe and model the value 'out_MSE' gets calculated over all entries of actual_values and forecast_values. And is appended as a value for every row in a new column.

For this problem is there are more efficient way to this?
Is there any pythonic way in general to avoid this for loops, my dataframe is big and this costs a lot of performance.

答案1

得分: 1

Here is the translated content from your request:

如果我理解正确,您可能只想要一个简单的 groupby.sum,然后进行一些后处理。因为您只关心现有的组合,所以没有必要遍历所有组合并分配一个大值。

(df.groupby(['Level', 'Kontogruppe', 'model'], as_index=False)
   [['actual_value', 'forecast_value']].sum()
   .eval('out_MSE = actual_value/forecast_value')
)

输出:

  Level Kontogruppe  model  actual_value  forecast_value  out_MSE
0     a           a  alpha             1               2      0.5
1     b           a   beta             2               2      1.0
2     c           b  alpha             3               2      1.5

用于比较的您的代码输出:

  Level Kontogruppe  model  MSE_x  actual_value  forecast_value  MSE_y  out_MSE
0     a           a  alpha      0             1               2    0.0      NaN
1     a           a  alpha      0             1               2    NaN      0.5
2     b           a   beta      1             2               2    NaN      1.0
3     c           b  alpha      1             3               2    NaN      1.5

(Note: The translated code and output are provided without additional comments or explanations.)

英文:

If I understand correctly, you might just want a simple groupby.sum with a bit of post-processing. Because you only care about the existing combinations, there is no need to loop over all of them and assign a large value.

(df.groupby(['Level', 'Kontogruppe', 'model'], as_index=False)
   [['actual_value', 'forecast_value']].sum()
   .eval('out_MSE = actual_value/forecast_value')
)

Output:

  Level Kontogruppe  model  actual_value  forecast_value  out_MSE
0     a           a  alpha             1               2      0.5
1     b           a   beta             2               2      1.0
2     c           b  alpha             3               2      1.5

Output of your code for comparison:

  Level Kontogruppe  model  MSE_x  actual_value  forecast_value  MSE_y  out_MSE
0     a           a  alpha      0             1               2    0.0      NaN
1     a           a  alpha      0             1               2    NaN      0.5
2     b           a   beta      1             2               2    NaN      1.0
3     c           b  alpha      1             3               2    NaN      1.5

huangapple
  • 本文由 发表于 2023年8月4日 21:36:25
  • 转载请务必保留本文链接:https://go.coder-hub.com/76836454.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定