Alternatives for nested loops for dataframes with multiindex

Question

I have a large DataFrame with a MultiIndex, and I need to do some simple arithmetic on it to create a new column. The problem is that it takes a long time. For now I use a nested loop, but I can't think of a more Pythonic solution. The code looks like this:

for element1 in short_list:
    for element2 in long_list:
        df.loc[(element1, element2), ('name1', 'name2')] = abs(df.loc[(element1, element2), ('name3', 'name4')] - df.loc[(element1, element2), ('name5', 'name6')] * another_list[element1])

I tried searching for solutions that use .groupby or other iterators, but either I don't understand how they work or they don't fit my needs. I'd appreciate any help with this.

Answer 1

Score: 1

Operations on DataFrames are often slow because they are not vectorized. If you can express your operation in vectorized form, that is, on whole columns of the DataFrame at once rather than on individual cells in a loop, you should see a performance improvement. If you can provide an example DataFrame I can help you better, but here are two things to try for now:

First, another_list is indexed by element1, which suggests it can be converted to a Series (or a DataFrame) indexed by those same values, so that pandas can align it with the DataFrame automatically.

import pandas as pd
another_series = pd.Series([another_list[e] for e in short_list], index=short_list)
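
As a minimal sketch of that conversion: the values below are hypothetical stand-ins, and another_list is assumed to be a plain list indexed by integer keys from short_list. Looking up each factor explicitly keeps every value paired with its key even when short_list covers only some positions of another_list.

```python
import pandas as pd

# Hypothetical data: short_list holds only some positions of another_list.
short_list = [0, 2]
another_list = [2.0, 99.0, 0.5]

# Look up each factor by key, exactly as the loop's another_list[element1] does.
another_series = pd.Series(
    [another_list[key] for key in short_list],
    index=short_list,
)
```

If another_list were a dict instead, pd.Series(another_list) would build the same kind of keyed Series directly.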

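Second, you are multiplying ('name5', 'name6') by the matching element of another_list, subtracting the product from ('name3', 'name4'), and taking the absolute value. Assuming the keys in short_list are the values of the first index level (or the data can easily be unstacked to that layout), you can probably do this in a vectorized way. Passing level=0 to .mul makes the broadcast across the first index level explicit; a plain * may fail to align a MultiIndex Series with a flat-index Series unless the index names match:

df[('name1', 'name2')] = (df[('name3', 'name4')] - df[('name5', 'name6')].mul(another_series, level=0)).abs()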
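
To check the vectorized form against the original loop, here is a small self-contained sketch. The DataFrame, the index values, and the factors are all made up for illustration; only the column names come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data shaped like the question's DataFrame.
short_list = [0, 1]
long_list = ["x", "y", "z"]
another_list = [2.0, 3.0]

index = pd.MultiIndex.from_product([short_list, long_list])
columns = pd.MultiIndex.from_tuples([("name3", "name4"), ("name5", "name6")])
df = pd.DataFrame(
    np.arange(12, dtype=float).reshape(6, 2), index=index, columns=columns
)

# Reference result computed cell by cell, as in the question's nested loop.
expected = [
    abs(
        df.loc[(e1, e2), ("name3", "name4")]
        - df.loc[(e1, e2), ("name5", "name6")] * another_list[e1]
    )
    for e1 in short_list
    for e2 in long_list
]

# Vectorized version: broadcast the per-key factor across index level 0.
another_series = pd.Series(another_list, index=short_list)
scaled = df[("name5", "name6")].mul(another_series, level=0)
df[("name1", "name2")] = (df[("name3", "name4")] - scaled).abs()
```

One difference to keep in mind: the loop only writes rows whose keys appear in short_list and long_list, while the vectorized assignment writes every row. Rows whose first-level key is missing from another_series would get NaN, so a boolean mask may be needed if only a subset should be updated.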

Posted by huangapple on June 18, 2023 at 20:30:23
Source: https://go.coder-hub.com/76500550.html