Question

I have a large dataframe with a MultiIndex, and I need to do some simple mathematical operations on it to create a new column. The problem is that it takes a lot of time. For now I use a nested loop, but I can't think of a more pythonic solution. The code looks like this:

for element1 in short_list:
    for element2 in long_list:
        df.loc[(element1, element2), ('name1', 'name2')] = abs(df.loc[(element1, element2), ('name3', 'name4')] - df.loc[(element1, element2), ('name5', 'name6')] * another_list[element1])

I searched for solutions using .groupby or other iterators, but either I don't understand how they work or they don't fit my needs. I'd appreciate any help with this.

Answer 1

Score: 1

Operations on dataframes are often slow because they are not vectorized. If you can express your operation in vectorized form, that is, applied to whole columns of the dataframe at once rather than to individual cells in a loop, you should see a performance improvement. An example dataframe would let me help more precisely, but you can try two things for now:

First, another_list appears to be indexed by element1, which suggests it can be converted to a Series (or a DataFrame) with matching index labels so that pandas can align it automatically.

another_series = pd.Series(another_list, index=short_list)
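
As a rough sketch of how that alignment could work, the snippet below builds the Series and multiplies it into a two-level Series. The sample labels and values are invented, and another_list is assumed to be a dict keyed by the first-level labels; the question does not show the real data.

```python
import pandas as pd

# Hypothetical sample data: the real short_list / another_list are not shown in the question.
short_list = ["a", "b"]
another_list = {"a": 2.0, "b": 10.0}  # assumed to be indexable by element1

# Build a Series keyed by the first-level labels.
another_series = pd.Series([another_list[k] for k in short_list], index=short_list)

# A Series with a two-level index, standing in for one column of df.
idx = pd.MultiIndex.from_product([short_list, [0, 1, 2]])
col = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx)

# level=0 matches another_series against the first index level and
# broadcasts each factor over all second-level entries.
scaled = col.mul(another_series, level=0)
print(scaled)
```

Every row under "a" is multiplied by 2.0 and every row under "b" by 10.0, which is what another_list[element1] does inside the loop.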

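Second, you are multiplying ('name5', 'name6') by the matching element of another_list, subtracting that product from ('name3', 'name4'), and taking the absolute value. Assuming the first level of the row index holds the element1 labels (or the data can easily be unstacked into that shape), you can probably do this in a vectorized way. Using .mul with level=0 broadcasts another_series across the first index level; a plain * may fail to align when the index levels are unnamed:

df[('name1', 'name2')] = (df[('name3', 'name4')] - df[('name5', 'name6')].mul(another_series, level=0)).abs()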
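
To check that a vectorized expression matches the loop, something like the following could be run on a small frame. This is only a sketch: the frame is randomly generated, another_list is assumed to be a dict keyed by the first-level labels, and the vectorized line uses Series.mul with level=0 to broadcast over the first index level.

```python
import numpy as np
import pandas as pd

# Hypothetical data; names mirror the question, values are invented.
short_list = ["x", "y"]
long_list = list(range(4))
another_list = {"x": 0.5, "y": 3.0}  # assumed to be indexable by element1

rows = pd.MultiIndex.from_product([short_list, long_list])
cols = pd.MultiIndex.from_tuples([("name3", "name4"), ("name5", "name6")])
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(len(rows), 2)), index=rows, columns=cols)

# Reference result computed with the nested loop from the question.
looped = df.copy()
looped[("name1", "name2")] = np.nan  # create the target column up front
for element1 in short_list:
    for element2 in long_list:
        looped.loc[(element1, element2), ("name1", "name2")] = abs(
            looped.loc[(element1, element2), ("name3", "name4")]
            - looped.loc[(element1, element2), ("name5", "name6")] * another_list[element1]
        )

# Vectorized version: one expression over whole columns.
another_series = pd.Series(another_list)
vectorized = df.copy()
vectorized[("name1", "name2")] = (
    df[("name3", "name4")] - df[("name5", "name6")].mul(another_series, level=0)
).abs()

# Both approaches should agree on every row.
print(np.allclose(looped[("name1", "name2")], vectorized[("name1", "name2")]))
```

Note that the vectorized form writes to every row of df, whereas the loop only touches the (element1, element2) pairs it visits; if the lists cover only part of the index, the rows would need to be selected first.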
