英文:
Avoid for loops over colum values in a pandas dataframe with a function
问题
对于这个问题,有一种更有效的方法可以避免使用多重循环,提高性能。您可以使用Pandas的groupby和transform函数来实现。以下是如何重写您的metrics函数以提高性能的方法:
import pandas as pd
def metrics(df):
df['out_MSE'] = df.groupby(['Level', 'Kontogruppe', 'model'])['actual_value'].transform('sum') / df.groupby(['Level', 'Kontogruppe', 'model'])['forecast_value'].transform('sum')
return df
# 调用函数
df = metrics(df)
这种方法将数据按照'Level'、'Kontogruppe'和'model'分组,并使用transform函数计算每个组的总和,然后将结果广播到原始DataFrame的每一行。这比显式的嵌套循环要高效得多,特别是对于大型数据框。
这是一种更Pythonic的方法,能够更简洁地实现您的目标,同时提高性能。
英文:
I have the following structur of a dataframe:
df = pd.DataFrame({'Level': ["a","b", "c"], 'Kontogruppe': ["a", "a", "b"],
'model': ["alpha", "beta", "alpha"], 'MSE': [0, 1 ,1],
'actual_value': [1,2,3], 'forecast_value': [2,2,2]})
For this dataframe I run severel functions, for example:
def metrics(df):
df_map= pd.DataFrame({'Level': ["a"], 'Kontogruppe': ["a"],
'model': ["alpha"], 'MSE': [0]})
for i in df['Level'].unique():
for j in df['Kontogruppe'].unique():
for k in df['model'].unique():
df_lkm = df.loc[(df['Level'] == i) & (df['Kontogruppe'] == j) &
(df['model'] == k)]
if df_lkm.empty:
out_MSE = 10000000000
else:
out_MSE = sum(df_lkm['actual_value'])/sum(df_lkm['forecast_value'])
df_map_map = pd.DataFrame({'Level': [i], 'Kontogruppe': [j], 'model': [k],
'out_MSE': [out_MSE]})
df_map = pd.concat([df_map, df_map_map])
df = pd.merge(df, df_map, how='left', on=['Level', 'Kontogruppe', 'model'])
return df
df = metrics(df)
so basically I loop over the unique column values and filter the dataframe based on this.
In this case I get for every Level, Kontogruppe and model the value 'out_MSE' gets calculated over all entries of actual_values and forecast_values. And is appended as a value for every row in a new column.
For this problem is there are more efficient way to this?
Is there any pythonic way in general to avoid this for loops, my dataframe is big and this costs a lot of performance.
答案1
得分: 1
Here is the translated content from your request:
如果我理解正确,您可能只想要一个简单的 groupby.sum
,然后进行一些后处理。因为您只关心现有的组合,所以没有必要遍历所有组合并分配一个大值。
(df.groupby(['Level', 'Kontogruppe', 'model'], as_index=False)
[['actual_value', 'forecast_value']].sum()
.eval('out_MSE = actual_value/forecast_value')
)
输出:
Level Kontogruppe model actual_value forecast_value out_MSE
0 a a alpha 1 2 0.5
1 b a beta 2 2 1.0
2 c b alpha 3 2 1.5
用于比较的您的代码输出:
Level Kontogruppe model MSE_x actual_value forecast_value MSE_y out_MSE
0 a a alpha 0 1 2 0.0 NaN
1 a a alpha 0 1 2 NaN 0.5
2 b a beta 1 2 2 NaN 1.0
3 c b alpha 1 3 2 NaN 1.5
(Note: The translated code and output are provided without additional comments or explanations.)
英文:
If I understand correctly, you might just want a simple groupby.sum
with a bit of post-processing. Because you only care about the existing combinations, there is no need to loop over all of them and assign a large value.
(df.groupby(['Level', 'Kontogruppe', 'model'], as_index=False)
[['actual_value', 'forecast_value']].sum()
.eval('out_MSE = actual_value/forecast_value')
)
Output:
Level Kontogruppe model actual_value forecast_value out_MSE
0 a a alpha 1 2 0.5
1 b a beta 2 2 1.0
2 c b alpha 3 2 1.5
Output of your code for comparison:
Level Kontogruppe model MSE_x actual_value forecast_value MSE_y out_MSE
0 a a alpha 0 1 2 0.0 NaN
1 a a alpha 0 1 2 NaN 0.5
2 b a beta 1 2 2 NaN 1.0
3 c b alpha 1 3 2 NaN 1.5
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论