英文:
How to divide one pandas dataframe (pivot table) by another if columns names are different?
问题
我正在尝试使用pandas
计算特定的数据透视表,并希望将其中一个除以另一个。下面是一个最小可复制的示例,其中我创建了名为piv_all_sales
的数据帧和名为piv_count
的另一个数据帧。我想要将前者除以后者,但DataFrame.div()
方法似乎不起作用,可能是因为两个数据帧具有不同的列名。此外,piv_all_sales
有8列,而piv_count
只有4列。
我尝试过以下方法,但都没有成功:
out1 = piv_net_sales / piv_count
out2 = piv_net_sales.div(piv_count)
这两种方法都产生了一个包含所有NaN值的7x8数据帧。
我还尝试了将每个数据帧转换为NumPy数组,然后对这两个数组进行除法运算,这样就不再关心列名是否不同。但这种方法不够优雅且繁琐。
有没有更好的方法来解决这个问题?
英文:
What I am trying to do
I am calculating certain pivot tables with pandas
, and I want to divide one by the other.
An example (with fictional data) is below, where I have data on the sales of certain items, and I want to calculate things like total sales in $, total # of items sold, unit price, etc.
Another example is to calculate weighted averages, where in the last step you need to divide by the weight.
The data has multi-indices because it is the result of slicing and dicing by various classifications (e.g. how many high-quality vs low-quality widgets you have sold in London vs New York).
A minimum reproducible example
In the example below, I create the dataframe piv_all_sales
with dimensions 7x8:
- 7 rows: 2 regions x 3 products + 1 row for the total
- 8 columns: 2 metrics (gross and net sales) x (3 types of quality (low, medium, high) + 1 column for the total)
piv_all_sales
looks like this:
piv_count
counts how many items I have sold, and has dimensions 7x4:
I want to divide the former by the latter - but the DataFrame.div()
method doesn't work - presumably because the two dataframes have different column names.
An additional complication is that piv_all_sales
has 8 columns while piv_count
has 4
import numpy as np
import pandas as pd
rng = np.random.default_rng(seed=42)
df=pd.DataFrame()
df['region'] = np.repeat(['USA','Canada'],12)
df['product'] = np.tile(['apples','strawberries','bananas'],8)
df['quality'] = np.repeat(['high','medium','low'],8)
df['net sales'] = rng.integers(low=0, high=100, size=24)
df['gross sales'] = rng.integers(low=50, high=150, size=24)
df['# items sold'] = rng.integers(low=1, high=20, size=24)
piv_net_sales = pd.pivot_table(data=df,
values=['net sales'],
index=['region','product'],
columns=['quality'],
aggfunc='sum',
margins=True)
piv_all_sales = pd.pivot_table(data=df,
values=['net sales','gross sales'],
index=['region','product'],
columns=['quality'],
aggfunc='sum',
margins=True)
piv_count = pd.pivot_table(data=df,
values=['# items sold'],
index=['region','product'],
columns=['quality'],
aggfunc='sum',
margins=True)
What I have tried
I wouldn't know how to divide the (7x8) dataframe by the (7x4) one.
So I started by trying to divide a 7x4 by a 7x4, ie using the dataframe which has only the net sales, not the net and gross together. However, neither works:
out1 = piv_net_sales / piv_count
out2 = piv_net_sales.div(piv_count)
presumably because pandas looks for, and doesn't find, columns with the same names?
Neither works because both produce a 7x8 dataframe of all nans
Partial, inefficient solution
The only thing which kind of works is converting each dataframe to a numpy array, and then dividid the two arrays. This way it no longer matters that the column names were different. However:
-
it is very inelegant and tedious, because I'd have to convert the dataframes to numpy arrays and the recreate the dataframes with the right indices
-
I still don't know how to divide the 7x8 dataframe by the 7x4; maybe split the 7x8 into 2 (7x4) arrays, calculate each, and then combine them again?
works but not very elegant nor efficient
out3 = piv_net_sales.to_numpy() / piv_count.to_numpy()
Similar questions
I have found some similar questions, e.g.
https://stackoverflow.com/questions/54431994/how-to-divide-two-pandas-pivoted-tables
but I have been unable to adapt those answers to my case.
答案1
得分: 2
根据您的方法,如果我理解正确,您可以使用以下代码:
out = (
pd.concat([piv_all_sales, piv_count], axis=1).stack(1)
.assign(**{"gross sales": lambda x: x["gross sales"].div(x["# items sold"]),
"net sales": lambda x: x["net sales"].div(x["# items sold"])})
.unstack(2)[["gross sales", "net sales"]]
.reindex_like(piv_all_sales) # 恢复初始顺序
)
print(out)
输出:
gross sales net sales
quality high low medium All high low medium All
region product
Canada apples NaN 9.89 6.70 8.05 NaN 4.44 4.08 4.23
bananas NaN 24.62 11.00 19.85 NaN 11.85 10.14 11.25
strawberries NaN 7.66 8.80 8.02 NaN 3.56 5.07 4.04
USA apples 13.15 NaN 35.00 15.33 2.19 NaN 3.00 2.27
bananas 7.67 NaN 4.79 6.23 6.25 NaN 4.88 5.56
strawberries 12.61 NaN 9.20 11.26 8.22 NaN 3.47 6.34
All 11.20 11.56 8.07 10.02 5.38 5.39 4.71 5.11
步骤说明:
concat
创建如上所示的输出。.stack(1)
将第二个索引(quality: mid, high, low)从列移到行。assign
通过“# items sold”除法计算字段 - 注意 ** 表示关键字参数。unstack
重新进行数据透视,将quality再次放回列索引。reindex_like
将列按照初始数据框的顺序重新排列。
英文:
Following your approach and if I understand you correctly, you can use :
out = (
pd.concat([piv_all_sales, piv_count], axis=1).stack(1)
.assign(**{"gross sales": lambda x: x["gross sales"].div(x["# items sold"]),
"net sales": lambda x: x["net sales"].div(x["# items sold"])})
.unstack(2)[["gross sales", "net sales"]]
.reindex_like(piv_all_sales) #to restore back the initial order
)
Output :
print(out)
gross sales net sales
quality high low medium All high low medium All
region product
Canada apples NaN 9.89 6.70 8.05 NaN 4.44 4.08 4.23
bananas NaN 24.62 11.00 19.85 NaN 11.85 10.14 11.25
strawberries NaN 7.66 8.80 8.02 NaN 3.56 5.07 4.04
USA apples 13.15 NaN 35.00 15.33 2.19 NaN 3.00 2.27
bananas 7.67 NaN 4.79 6.23 6.25 NaN 4.88 5.56
strawberries 12.61 NaN 9.20 11.26 8.22 NaN 3.47 6.34
All 11.20 11.56 8.07 10.02 5.38 5.39 4.71 5.11
Step-by-step explanation:
concat
creates an output like this:
.stack(1)
unpivots the second index (the quality: mid, high, low) moving it from the columns to the rows:
assign
calculates the fields dividing them by the # items sold - note the ** meaning the keyword arguments:
unstack
redoes a pivot, putting the quality as a column index again
reindex_like
puts the columns in the same order as the initial dataframe
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论