August 4, 2023, 21:36:25

Avoid for loops over column values in a pandas DataFrame with a function

Question

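There is a more efficient way to handle this that avoids the nested loops and improves performance: pandas' groupby and transform. The metrics function can be rewritten as follows:

import pandas as pd

def metrics(df):
    keys = ['Level', 'Kontogruppe', 'model']
    # Group once, then broadcast each group's sums back onto its rows
    sums = df.groupby(keys)[['actual_value', 'forecast_value']].transform('sum')
    df['out_MSE'] = sums['actual_value'] / sums['forecast_value']
    return df

# Call the function
df = metrics(df)

This groups the data by 'Level', 'Kontogruppe' and 'model', uses transform to compute each group's sums, and broadcasts the result to every row of the original DataFrame. Because the work is done in a single vectorized pass over the existing groups instead of one filter per combination of unique values, it is much faster than the explicit nested loops, especially on large dataframes.

It is also the more Pythonic approach: shorter, and it only touches combinations that actually occur in the data.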
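To make the broadcasting step concrete, here is a minimal sketch on hypothetical toy data (the frame `toy` and its values are not from the original post). It contrasts a plain grouped sum, which returns one value per group, with `transform('sum')`, which returns one value per original row:

```python
import pandas as pd

# Hypothetical toy data: the first two rows belong to the same group.
toy = pd.DataFrame({
    'Level': ['a', 'a', 'b'],
    'Kontogruppe': ['a', 'a', 'a'],
    'model': ['alpha', 'alpha', 'beta'],
    'actual_value': [1, 3, 2],
    'forecast_value': [2, 2, 4],
})

keys = ['Level', 'Kontogruppe', 'model']
grouped = toy.groupby(keys)['actual_value']

per_group = grouped.sum()           # one value per group, indexed by the keys
per_row = grouped.transform('sum')  # one value per row, indexed like `toy`

print(len(per_group), len(per_row))  # 2 3
print(per_row.tolist())              # [4, 4, 2]
```

Because the transformed result keeps the index of the original frame, it can be assigned straight to a new column without a merge.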

I have a dataframe with the following structure:

df = pd.DataFrame({'Level': ["a", "b", "c"], 'Kontogruppe': ["a", "a", "b"],
                   'model': ["alpha", "beta", "alpha"], 'MSE': [0, 1, 1],
                   'actual_value': [1, 2, 3], 'forecast_value': [2, 2, 2]})

I run several functions on this dataframe, for example:

def metrics(df):
    df_map = pd.DataFrame({'Level': ["a"], 'Kontogruppe': ["a"],
                           'model': ["alpha"], 'MSE': [0]})
    for i in df['Level'].unique():
        for j in df['Kontogruppe'].unique():
            for k in df['model'].unique():
                df_lkm = df.loc[(df['Level'] == i) & (df['Kontogruppe'] == j) &
                                (df['model'] == k)]
                if df_lkm.empty:
                    out_MSE = 10000000000
                else:
                    out_MSE = sum(df_lkm['actual_value'])/sum(df_lkm['forecast_value'])

                df_map_map = pd.DataFrame({'Level': [i], 'Kontogruppe': [j], 'model': [k],
                                           'out_MSE': [out_MSE]})
                df_map = pd.concat([df_map, df_map_map])

    df = pd.merge(df, df_map, how='left', on=['Level', 'Kontogruppe', 'model'])

    return df

df = metrics(df)

So basically I loop over the unique values of each column and filter the dataframe on every combination.
For every combination of Level, Kontogruppe and model, the value out_MSE is calculated over all matching entries of actual_value and forecast_value, and it is then appended to every matching row in a new column.

Is there a more efficient way to do this?
Is there a Pythonic way in general to avoid these for loops? My dataframe is big and the loops cost a lot of performance.

Answer 1

Score: 1

If I understand correctly, you might just want a simple groupby.sum with a bit of post-processing. Because you only care about the existing combinations, there is no need to loop over all of them and assign a large value.

(df.groupby(['Level', 'Kontogruppe', 'model'], as_index=False)
   [['actual_value', 'forecast_value']].sum()
   .eval('out_MSE = actual_value/forecast_value')
)

Output:

  Level Kontogruppe  model  actual_value  forecast_value  out_MSE
0     a           a  alpha             1               2      0.5
1     b           a   beta             2               2      1.0
2     c           b  alpha             3               2      1.5

Output of your code for comparison:

  Level Kontogruppe  model  MSE_x  actual_value  forecast_value  MSE_y  out_MSE
0     a           a  alpha      0             1               2    0.0      NaN
1     a           a  alpha      0             1               2    NaN      0.5
2     b           a   beta      1             2               2    NaN      1.0
3     c           b  alpha      1             3               2    NaN      1.5
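
If the ratio is needed on every row of the original dataframe, as the loop-based function in the question intends, one option is to merge the aggregated table back on the three key columns. This is a minimal sketch using the sample data from the question; the names `keys`, `sums` and `result` are illustrative and not part of the original post:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'Level': ["a", "b", "c"], 'Kontogruppe': ["a", "a", "b"],
                   'model': ["alpha", "beta", "alpha"], 'MSE': [0, 1, 1],
                   'actual_value': [1, 2, 3], 'forecast_value': [2, 2, 2]})

keys = ['Level', 'Kontogruppe', 'model']

# One row per existing (Level, Kontogruppe, model) combination
sums = (df.groupby(keys, as_index=False)[['actual_value', 'forecast_value']]
          .sum())
sums['out_MSE'] = sums['actual_value'] / sums['forecast_value']

# Attach only the new column to the original rows; a left join keeps
# every row of df and preserves its order
result = df.merge(sums[keys + ['out_MSE']], on=keys, how='left')
print(result)
```

Merging only `keys + ['out_MSE']` avoids the `MSE_x`/`MSE_y` suffixes and the duplicated first row seen in the comparison output, which come from the seed row that `df_map` starts with in the question's function.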
