如何将一个pandas数据帧(数据透视表)除以另一个,如果列名不同?

huangapple go评论60阅读模式
英文:

How to divide one pandas dataframe (pivot table) by another if columns names are different?

问题

我正在尝试使用pandas计算特定的数据透视表,并希望将其中一个除以另一个。下面是一个最小可复制的示例,其中我创建了名为piv_all_sales的数据帧和名为piv_count的另一个数据帧。我想要将前者除以后者,但DataFrame.div()方法似乎不起作用,可能是因为两个数据帧具有不同的列名。此外,piv_all_sales有8列,而piv_count只有4列。

我尝试过以下方法,但都没有成功:

  1. out1 = piv_net_sales / piv_count
  2. out2 = piv_net_sales.div(piv_count)

这两种方法都产生了一个包含所有NaN值的7x8数据帧。

我还尝试了将每个数据帧转换为NumPy数组,然后对这两个数组进行除法运算,这样就不再关心列名是否不同。但这种方法不够优雅且繁琐。

有没有更好的方法来解决这个问题?

英文:

What I am trying to do

I am calculating certain pivot tables with pandas , and I want to divide one by the other.

An example (with fictional data) is below, where I have data on the sales of certain items, and I want to calculate things like total sales in $, total # of items sold, unit price, etc.

Another example is to calculate weighted averages, where in the last step you need to divide by the weight.

The data has multi-indices because it is the result of slicing and dicing by various classifications (e.g. how many high-quality vs low-quality widgets you have sold in London vs New York).

A minimum reproducible example

In the example below, I create the dataframe piv_all_sales with dimensions 7x8:

  • 7 rows: 2 regions x 3 products + 1 row for the total
  • 8 columns: 2 metrics (gross and net sales) x (3 types of quality (low, medium, high) + 1 column for the total)

piv_all_sales looks like this:
如何将一个pandas数据帧(数据透视表)除以另一个,如果列名不同?

piv_count counts how many items I have sold, and has dimensions 7x4:
如何将一个pandas数据帧(数据透视表)除以另一个,如果列名不同?

I want to divide the former by the latter - but the DataFrame.div() method doesn't work - presumably because the two dataframes have different column names.

An additional complication is that piv_all_sales has 8 columns while piv_count has 4

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
df=pd.DataFrame()
df['region'] = np.repeat(['USA','Canada'],12)
df['product'] = np.tile(['apples','strawberries','bananas'],8)
df['quality'] = np.repeat(['high','medium','low'],8)

df['net sales'] = rng.integers(low=0, high=100, size=24)
df['gross sales'] = rng.integers(low=50, high=150, size=24)
df['# items sold'] = rng.integers(low=1, high=20, size=24)


piv_net_sales = pd.pivot_table(data=df,
                           values=['net sales'],
                           index=['region','product'],
                           columns=['quality'],
                           aggfunc='sum',
                           margins=True)

piv_all_sales = pd.pivot_table(data=df,
                           values=['net sales','gross sales'],
                           index=['region','product'],
                           columns=['quality'],
                           aggfunc='sum',
                           margins=True)

piv_count = pd.pivot_table(data=df,
                           values=['# items sold'],
                           index=['region','product'],
                           columns=['quality'],
                           aggfunc='sum',
                           margins=True)

What I have tried

I wouldn't know how to divide the (7x8) dataframe by the (7x4) one.

So I started by trying to divide a 7x4 by a 7x4, ie using the dataframe which has only the net sales, not the net and gross together. However, neither works:

out1 = piv_net_sales / piv_count
out2 = piv_net_sales.div(piv_count)

presumably because pandas looks for, and doesn't find, columns with the same names?

Neither works because both produce a 7x8 dataframe of all nans

Partial, inefficient solution

The only thing which kind of works is converting each dataframe to a numpy array, and then dividid the two arrays. This way it no longer matters that the column names were different. However:

  • it is very inelegant and tedious, because I'd have to convert the dataframes to numpy arrays and the recreate the dataframes with the right indices

  • I still don't know how to divide the 7x8 dataframe by the 7x4; maybe split the 7x8 into 2 (7x4) arrays, calculate each, and then combine them again?

    works but not very elegant nor efficient

    out3 = piv_net_sales.to_numpy() / piv_count.to_numpy()

Similar questions

I have found some similar questions, e.g.

https://stackoverflow.com/questions/54431994/how-to-divide-two-pandas-pivoted-tables

but I have been unable to adapt those answers to my case.

答案1

得分: 2

根据您的方法,如果我理解正确,您可以使用以下代码:

out = (
    pd.concat([piv_all_sales, piv_count], axis=1).stack(1)
        .assign(**{"gross sales": lambda x: x["gross sales"].div(x["# items sold"]),
                   "net sales": lambda x: x["net sales"].div(x["# items sold"])})
        .unstack(2)[["gross sales", "net sales"]]
        .reindex_like(piv_all_sales) # 恢复初始顺序
)

print(out)

输出:

                    gross sales                net sales               
quality                    high   low medium   All    high   low medium   All
region product                                                          
Canada apples               NaN  9.89   6.70  8.05     NaN  4.44   4.08  4.23
       bananas              NaN 24.62  11.00 19.85     NaN 11.85  10.14 11.25
       strawberries         NaN  7.66   8.80  8.02     NaN  3.56   5.07  4.04
USA    apples             13.15   NaN  35.00 15.33    2.19   NaN   3.00  2.27
       bananas             7.67   NaN   4.79  6.23    6.25   NaN   4.88  5.56
       strawberries       12.61   NaN   9.20 11.26    8.22   NaN   3.47  6.34
All                       11.20 11.56   8.07 10.02    5.38  5.39   4.71  5.11

步骤说明:

  • concat 创建如上所示的输出。
  • .stack(1) 将第二个索引(quality: mid, high, low)从列移到行。
  • assign 通过“# items sold”除法计算字段 - 注意 ** 表示关键字参数。
  • unstack 重新进行数据透视,将quality再次放回列索引。
  • reindex_like 将列按照初始数据框的顺序重新排列。
英文:

Following your approach and if I understand you correctly, you can use :

out = (
    pd.concat([piv_all_sales, piv_count], axis=1).stack(1)
        .assign(**{"gross sales": lambda x: x["gross sales"].div(x["# items sold"]),
                   "net sales": lambda x: x["net sales"].div(x["# items sold"])})
        .unstack(2)[["gross sales", "net sales"]]
        .reindex_like(piv_all_sales) #to restore back the initial order
)

Output :

print(out)

                    gross sales                    net sales                   
quality                    high   low medium   All      high   low medium   All
region product                                                                 
Canada apples               NaN  9.89   6.70  8.05       NaN  4.44   4.08  4.23
       bananas              NaN 24.62  11.00 19.85       NaN 11.85  10.14 11.25
       strawberries         NaN  7.66   8.80  8.02       NaN  3.56   5.07  4.04
USA    apples             13.15   NaN  35.00 15.33      2.19   NaN   3.00  2.27
       bananas             7.67   NaN   4.79  6.23      6.25   NaN   4.88  5.56
       strawberries       12.61   NaN   9.20 11.26      8.22   NaN   3.47  6.34
All                       11.20 11.56   8.07 10.02      5.38  5.39   4.71  5.11

Step-by-step explanation:

concat creates an output like this:

如何将一个pandas数据帧(数据透视表)除以另一个,如果列名不同?

.stack(1) unpivots the second index (the quality: mid, high, low) moving it from the columns to the rows:

如何将一个pandas数据帧(数据透视表)除以另一个,如果列名不同?

assign calculates the fields dividing them by the # items sold - note the ** meaning the keyword arguments:

如何将一个pandas数据帧(数据透视表)除以另一个,如果列名不同?

unstack redoes a pivot, putting the quality as a column index again
如何将一个pandas数据帧(数据透视表)除以另一个,如果列名不同?

reindex_like puts the columns in the same order as the initial dataframe

如何将一个pandas数据帧(数据透视表)除以另一个,如果列名不同?

huangapple
  • 本文由 发表于 2023年5月18日 00:02:37
  • 转载请务必保留本文链接:https://go.coder-hub.com/76274063.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定