Alternatives for nested loops for dataframes with multiindex
Question
I have a large dataframe with a MultiIndex, and I need to do some simple arithmetic on it to create a new column. The problem is that it takes a long time. For now I use a nested loop, but I can't think of a more Pythonic solution. The code looks like this:
for element1 in short_list:
    for element2 in long_list:
        df.loc[(element1, element2), ('name1', 'name2')] = abs(df.loc[(element1, element2), ('name3', 'name4')] - df.loc[(element1, element2), ('name5', 'name6')] * another_list[element1])
I tried to search for solutions like using .groupby or other iterators, but either I don't understand how they function, or they don't fit my needs. I'd appreciate any help with this.
Answer 1
Score: 1
Operations on dataframes are often slow because they are not vectorized. If you can express your operation in vectorized form, you should see a performance improvement: the operation then runs on whole columns of the dataframe at once rather than on individual cells in a loop. If you can provide an example dataframe I can help you better, but for now you can try two things:
First, it seems that another_list is indexed by element1, which suggests it can be converted to a Series (or a DataFrame) whose index holds the element1 values, so that pandas can align it with the dataframe for you.
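# Key each factor by the element1 value that selects it, mirroring another_list[element1].
another_series = pd.Series({element1: another_list[element1] for element1 in short_list})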
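To see what that alignment amounts to, here is a minimal sketch with made-up data. It assumes another_list can be looked up by the values in short_list (a dict is used here as a stand-in) and that element1 is the first level of the row index; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for the question's objects.
short_list = ["a", "b"]
long_list = [1, 2, 3]
another_list = {"a": 10.0, "b": 100.0}  # looked up as another_list[element1]

# One factor per element1 value, keyed by that value.
another_series = pd.Series({e: another_list[e] for e in short_list})

# A two-level row index like the one the nested loop walks over.
index = pd.MultiIndex.from_product([short_list, long_list])

# Look up the factor for every row from the first index level.
factors = index.get_level_values(0).map(another_series)

print(list(factors))
```

Each row picks up the factor that belongs to its element1 value, which is the same lookup the loop performs one cell at a time.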
Second, you are multiplying ('name5', 'name6') by the matching element of another_list, subtracting that product from ('name3', 'name4'), and taking the absolute value. Assuming these columns all share the same row MultiIndex with element1 as its first level (or the data can easily be unstacked into that shape), you can probably do this in a vectorized way:
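# level=0 matches another_series against the first level of the row index (element1).
df[('name1', 'name2')] = (df[('name3', 'name4')] - df[('name5', 'name6')].mul(another_series, level=0)).abs()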
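As a rough check that the vectorized form matches the nested loop, the sketch below builds a small made-up frame with two-level rows and two-level columns. The data, and the use of a dict for another_list, are assumptions for illustration only; your real frame may need a different level argument.

```python
import numpy as np
import pandas as pd

# Made-up data shaped like the question's frame.
short_list = ["a", "b"]
long_list = [1, 2, 3]
another_list = {"a": 10.0, "b": 100.0}

rows = pd.MultiIndex.from_product([short_list, long_list])
cols = pd.MultiIndex.from_tuples([("name3", "name4"), ("name5", "name6")])
df = pd.DataFrame(np.arange(12, dtype=float).reshape(6, 2), index=rows, columns=cols)

# Reference values computed cell by cell, as in the original nested loop.
expected = [
    abs(
        df.loc[(e1, e2), ("name3", "name4")]
        - df.loc[(e1, e2), ("name5", "name6")] * another_list[e1]
    )
    for e1 in short_list
    for e2 in long_list
]

# Vectorized version: broadcast the per-element1 factor across level 0.
another_series = pd.Series({e: another_list[e] for e in short_list})
df[("name1", "name2")] = (
    df[("name3", "name4")] - df[("name5", "name6")].mul(another_series, level=0)
).abs()

print(df[("name1", "name2")].tolist() == expected)
```

Note that the vectorized assignment covers every row of the frame. Rows whose first-level value has no entry in another_series come out as NaN, whereas the loop would simply never touch them.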