2023-05-18 00:02:37


How to divide one pandas dataframe (pivot table) by another if columns names are different?

Question


What I am trying to do

I am calculating certain pivot tables with pandas, and I want to divide one by the other.

An example (with fictional data) is below, where I have data on the sales of certain items, and I want to calculate things like total sales in $, total # of items sold, unit price, etc.

Another example is to calculate weighted averages, where in the last step you need to divide by the weight.

The data has multi-indices because it is the result of slicing and dicing by various classifications (e.g. how many high-quality vs low-quality widgets you have sold in London vs New York).

A minimum reproducible example

In the example below, I create the dataframe piv_all_sales with dimensions 7x8:

7 rows: 2 regions x 3 products + 1 row for the total
8 columns: 2 metrics (gross and net sales) x (3 types of quality (low, medium, high) + 1 column for the total)

piv_count counts how many items I have sold, and has dimensions 7x4.

I want to divide the former by the latter, but the DataFrame.div() method doesn't work, presumably because the two dataframes have different column names.

An additional complication is that piv_all_sales has 8 columns while piv_count has only 4.

import numpy as np
import pandas as pd
rng = np.random.default_rng(seed=42)
df = pd.DataFrame()
df['region'] = np.repeat(['USA','Canada'],12)
df['product'] = np.tile(['apples','strawberries','bananas'],8)
df['quality'] = np.repeat(['high','medium','low'],8)
df['net sales'] = rng.integers(low=0, high=100, size=24)
df['gross sales'] = rng.integers(low=50, high=150, size=24)
df['# items sold'] = rng.integers(low=1, high=20, size=24)
piv_net_sales = pd.pivot_table(data=df,
                           values=['net sales'],
                           index=['region','product'],
                           columns=['quality'],
                           aggfunc='sum',
                           margins=True)
piv_all_sales = pd.pivot_table(data=df,
                           values=['net sales','gross sales'],
                           index=['region','product'],
                           columns=['quality'],
                           aggfunc='sum',
                           margins=True)
piv_count = pd.pivot_table(data=df,
                           values=['# items sold'],
                           index=['region','product'],
                           columns=['quality'],
                           aggfunc='sum',
                           margins=True)

What I have tried

I don't know how to divide the (7x8) dataframe by the (7x4) one.

So I started by trying to divide a 7x4 by a 7x4, i.e. using the dataframe which has only the net sales, not the net and gross together. However, neither of the following works:

out1 = piv_net_sales / piv_count
out2 = piv_net_sales.div(piv_count)

Both produce a 7x8 dataframe of all NaNs, presumably because pandas looks for, and doesn't find, columns with the same names.
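The all-NaN result can be reproduced and explained with a short sketch. pandas aligns the two frames on their column labels before dividing, and here the top column level differs ('net sales' versus '# items sold'), so no pair of columns matches. Assuming the quality labels underneath are identical in both tables, one possible fix is to drop the top level with droplevel before dividing. The make_pivot helper is hypothetical (it is not part of the question), and the item counts may differ from the question's because gross sales is not generated here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
df = pd.DataFrame({
    "region": np.repeat(["USA", "Canada"], 12),
    "product": np.tile(["apples", "strawberries", "bananas"], 8),
    "quality": np.repeat(["high", "medium", "low"], 8),
    "net sales": rng.integers(low=0, high=100, size=24),
    "# items sold": rng.integers(low=1, high=20, size=24),
})


def make_pivot(value_column):
    """Hypothetical helper: one pivot table per metric, with margins."""
    return pd.pivot_table(
        data=df,
        values=[value_column],
        index=["region", "product"],
        columns=["quality"],
        aggfunc="sum",
        margins=True,
    )


piv_net_sales = make_pivot("net sales")
piv_count = make_pivot("# items sold")

# Column labels are tuples such as ('net sales', 'high') and
# ('# items sold', 'high'); none of them match, so the result holds
# the union of both sets of columns, filled with NaN.
all_nan = piv_net_sales / piv_count
print(all_nan.shape, all_nan.isna().all().all())

# Dropping the top column level leaves only the quality labels,
# which are the same in both tables, so the division aligns.
unit_price = piv_net_sales.droplevel(0, axis=1) / piv_count.droplevel(0, axis=1)
print(unit_price.round(2))
```

The result keeps the row index and the quality columns, but loses the 'net sales' label on top; it can be added back if needed, for example with pd.concat and a dictionary key.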

Partial, inefficient solution

The only thing which kind of works is converting each dataframe to a numpy array, and then dividing the two arrays. This way it no longer matters that the column names are different. However:

it is very inelegant and tedious, because I'd have to convert the dataframes to numpy arrays and then recreate the dataframes with the right indices
I still don't know how to divide the 7x8 dataframe by the 7x4; maybe split the 7x8 into two 7x4 arrays, calculate each, and then combine them again?

# works, but is neither elegant nor efficient
out3 = piv_net_sales.to_numpy() / piv_count.to_numpy()
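A slightly less tedious variant of the NumPy route, sketched here as one possibility: only the divisor needs to be converted to an array. A dataframe divided by an array of the same shape keeps its own index and columns, so nothing has to be rebuilt. For the 7x8 case the count array can be repeated side by side with np.tile. This is purely positional: it assumes that both tables list the quality columns in the same order and that piv_all_sales holds one contiguous block of columns per metric. The make_pivot helper is hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
df = pd.DataFrame({
    "region": np.repeat(["USA", "Canada"], 12),
    "product": np.tile(["apples", "strawberries", "bananas"], 8),
    "quality": np.repeat(["high", "medium", "low"], 8),
})
df["net sales"] = rng.integers(low=0, high=100, size=24)
df["gross sales"] = rng.integers(low=50, high=150, size=24)
df["# items sold"] = rng.integers(low=1, high=20, size=24)


def make_pivot(value_columns):
    """Hypothetical helper wrapping the pivot_table call from the question."""
    return pd.pivot_table(
        data=df,
        values=value_columns,
        index=["region", "product"],
        columns=["quality"],
        aggfunc="sum",
        margins=True,
    )


piv_net_sales = make_pivot(["net sales"])
piv_all_sales = make_pivot(["net sales", "gross sales"])
piv_count = make_pivot(["# items sold"])

counts = piv_count.to_numpy()  # plain 7x4 array, labels discarded

# 7x4 divided by 7x4: the result keeps the labels of piv_net_sales.
out3 = piv_net_sales / counts

# 7x8 divided by 7x4: repeat the counts once per metric block.
n_metrics = piv_all_sales.shape[1] // counts.shape[1]
out_all = piv_all_sales / np.tile(counts, (1, n_metrics))
print(out_all.round(2))
```

Because no labels are checked, a reordered or missing column would silently produce wrong numbers, which is the main argument for a label-based approach instead.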

Answer 1

Score: 2


Following your approach, and if I understand you correctly, you can use:

out = (
    pd.concat([piv_all_sales, piv_count], axis=1).stack(1)
        .assign(**{"gross sales": lambda x: x["gross sales"].div(x["# items sold"]),
                   "net sales": lambda x: x["net sales"].div(x["# items sold"])})
        .unstack(2)[["gross sales", "net sales"]]
        .reindex_like(piv_all_sales) # restore the initial row and column order
)

Output:

print(out)
                    gross sales                    net sales                   
quality                    high   low medium   All      high   low medium   All
region product                                                                 
Canada apples               NaN  9.89   6.70  8.05       NaN  4.44   4.08  4.23
       bananas              NaN 24.62  11.00 19.85       NaN 11.85  10.14 11.25
       strawberries         NaN  7.66   8.80  8.02       NaN  3.56   5.07  4.04
USA    apples             13.15   NaN  35.00 15.33      2.19   NaN   3.00  2.27
       bananas             7.67   NaN   4.79  6.23      6.25   NaN   4.88  5.56
       strawberries       12.61   NaN   9.20 11.26      8.22   NaN   3.47  6.34
All                       11.20 11.56   8.07 10.02      5.38  5.39   4.71  5.11

Step-by-step explanation:

concat places the two pivot tables side by side, giving a single 7x12 dataframe with the columns of both.

.stack(1) unpivots the second column level (the quality: medium, high, low, plus the All margin), moving it from the columns to the rows.

assign divides each sales field by # items sold. The ** unpacks the dictionary into keyword arguments, which is needed because the column names contain spaces and so cannot be written as ordinary keyword arguments.

unstack pivots again, putting the quality back as a column level.

reindex_like restores the row and column order of the initial dataframe.
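The same pipeline can be split into named intermediate steps to make each stage easier to inspect. The sketch below does that and then cross-checks the result against a shorter alternative that is not part of the answer: dividing each metric's 7x4 block by the counts and gluing the blocks back together with pd.concat. The make_pivot helper and the intermediate variable names are hypothetical. The number of rows in the stacked frame is not printed because it can depend on the pandas version (newer versions of stack keep all-NaN rows).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
df = pd.DataFrame({
    "region": np.repeat(["USA", "Canada"], 12),
    "product": np.tile(["apples", "strawberries", "bananas"], 8),
    "quality": np.repeat(["high", "medium", "low"], 8),
})
df["net sales"] = rng.integers(low=0, high=100, size=24)
df["gross sales"] = rng.integers(low=50, high=150, size=24)
df["# items sold"] = rng.integers(low=1, high=20, size=24)


def make_pivot(value_columns):
    """Hypothetical helper wrapping the pivot_table call from the question."""
    return pd.pivot_table(
        data=df,
        values=value_columns,
        index=["region", "product"],
        columns=["quality"],
        aggfunc="sum",
        margins=True,
    )


piv_all_sales = make_pivot(["net sales", "gross sales"])
piv_count = make_pivot(["# items sold"])

# Step 1: side-by-side tables, 8 + 4 = 12 columns.
combined = pd.concat([piv_all_sales, piv_count], axis=1)

# Step 2: move quality into the row index; one column per metric remains.
stacked = combined.stack(1)

# Step 3: divide both sales columns by the item counts, row by row.
ratios = stacked.assign(**{
    "gross sales": lambda x: x["gross sales"].div(x["# items sold"]),
    "net sales": lambda x: x["net sales"].div(x["# items sold"]),
})

# Steps 4 and 5: pivot quality back into the columns, keep the sales
# metrics only, and restore the original row and column order.
out = ratios.unstack(2)[["gross sales", "net sales"]].reindex_like(piv_all_sales)

# Alternative sketch: divide each metric block by the counts, where both
# operands share the same quality column labels, then recombine.
counts = piv_count["# items sold"]
metrics = piv_all_sales.columns.get_level_values(0).unique()
check = pd.concat(
    {metric: piv_all_sales[metric] / counts for metric in metrics}, axis=1
).reindex_like(piv_all_sales)

print(out.round(2))
```

Both routes rely on labels rather than positions, so they keep working if the quality columns come in a different order in the two tables.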
