英文:
How to drop irrelevant indexes in multi index pivot pandas
问题
我有2个索引,即年份和月份。我正在使用数据透视表来显示产品的计数。
现在,在我的数据框中,假设2020年没有七月和八月的记录。但是数据透视表会显示这些月份和值为0的情况。我不希望数据透视表显示这些无关的行(在数据框中不存在),因为它们会使我的最终数据透视表非常长。如何减少这个?
这是我的示例数据框:
df = pd.DataFrame({'Product Type': ['Fruits', 'Fruits', 'Vegetable', 'Vegetable', 'Vegetable', 'Vegetable', 'Fruits', 'Fruits', 'Vegetables', 'Cars', 'Cars', 'Cars', 'Bikes', 'Bikes'],
'Product': ['Apple', 'Banana', 'Apple', 'Apple', 'Brocoli', 'Carrot', 'Apple', 'Banana', 'Brocoli', 'BMW M3', 'BMW M3', 'BMW M3', 'Hayabusa', 'Hayabusa'],
'Amount': [4938, 3285, 4947, 1516, 2212, 3778, 1110, 4436, 1049, 494, 2818, 3737, 954, 4074],
})
以及迄今为止的我的代码:
import pandas as pd
import numpy as np
df = pd.read_csv('try.csv')
bins = [0, 1000, 2000, 5000, float(np.inf)]
labels = ['0-1000', '1000-2000', '2000-5000', '5000+']
df['bins'] = pd.cut(df['Amount'], bins=bins, labels=labels, right=True)
pivot = df.pivot_table(index=['Product Type', 'Product'], columns='bins', aggfunc='size')
pivot.dropna(inplace=True)
pivot
期望输出:
Amount 0-1000 1000-2000 2000-5000 5000+
Product Type Product
Bikes Hayabusa 1 0 1 0
Cars BMW M3 1 0 2 0
Fruits Apple 0 1 1 0
Banana 0 0 2 0
Vegetable Apple 0 1 1 0
Brocoli 0 0 1 0
Carrot 0 0 1 0
Vegetables Brocoli 0 1 0 0
在数据框中,Bikes只包含'Hayabusa',我想要在我的数据透视表的Bike类别中包含它。我应该如何做这个?
英文:
I have 2 indexes say: year and month. And I’m using pivot to display the count of products.
Now, in my df, let’s say there are no records of month july and august in 2020. But the pivot will show these months and values 0. I don’t want the pivot to show these irrelevant rows (which are not present in the df) as they make my final pivot very long. How to reduce this?
Here is my example df:
df = pd.DataFrame({'Product Type': ['Fruits', 'Fruits', 'Vegetable', 'Vegetable', 'Vegetable', 'Vegetable', 'Fruits', 'Fruits', 'Vegetables', 'Cars', 'Cars', 'Cars', 'Bikes', 'Bikes'],
'Product': ['Apple', 'Banana', 'Apple', 'Apple', 'Brocoli', 'Carrot', 'Apple', 'Banana', 'Brocoli', 'BMW M3', 'BMW M3', 'BMW M3', 'Hayabusa', 'Hayabusa'],
'Amount': [4938, 3285, 4947, 1516, 2212, 3778, 1110, 4436, 1049, 494, 2818, 3737, 954, 4074],
})
And my code so far:
import pandas as pd
import numpy as np
df = pd.read_csv('try.csv')
bins = [0,1000,2000,5000,float(np.inf)]
labels = ['0-1000','1000-2000','2000-5000','5000+']
df['bins'] = pd.cut(df['Amount'],bins=bins, labels=labels, right=True)
pivot = df.pivot_table(index=['Product Type','Product'],columns='bins', aggfunc='size')
pivot.dropna(inplace=True)
pivot
Expected ouput:
Amount 0-1000 1000-2000 2000-5000 5000+
Product Type Product
Bikes Hayabusa 1 0 1 0
Cars BMW M3 1 0 2 0
Fruits Apple 0 1 1 0
Banana 0 0 2 0
Vegetable Apple 0 1 1 0
Brocoli 0 0 1 0
Carrot 0 0 1 0
Vegetables Brocoli 0 1 0 0
In the df, Bikes only contains 'hayabusa', which i want in my pivot's Bike category. How should I do this?
答案1
得分: 0
以下是翻译好的部分:
没有确切了解你的数据是什么样子的,以下是我尝试提供答案的内容:
这里是一些示例数据:
details = {
'year':[2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,
2021,2021,2021,2021,2021,2021,2021,2021,2021,2021,2021,2021,],
'month':[1,2,3,4,5,6,7,8,9,10,11,12,
1,2,3,4,5,6,7,8,9,10,11,12,],
'product_count':[102,67,36,23,7,6,np.nan,np.nan,5,3,3,2,
33,36,53,49,42,56,63,39,42,40,54,19,],
'product_category':['Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars',
'Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars',]
}
df = pd.DataFrame(details)
我已经将2020年7月和2020年8月的数据设置为缺失/NaN。
考虑到你提到要使用年份和月份作为索引进行数据透视,我假设你的操作可能类似于以下内容:
(df
.pivot(
index=['year','month'],
columns=['product_category'],
values=['product_count'])
.dropna()
)
请注意,在透视操作的末尾添加了".dropna()",这将从输出中排除了2020年7月和2020年8月的数据。
英文:
Without knowing exactly what your data looks like, the below is my attempt at providing an answer:
Here is some example data:
details = {
'year':[2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,
2021,2021,2021,2021,2021,2021,2021,2021,2021,2021,2021,2021,],
'month':[1,2,3,4,5,6,7,8,9,10,11,12,
1,2,3,4,5,6,7,8,9,10,11,12,],
'product_count':[102,67,36,23,7,6,np.nan,np.nan,5,3,3,2,
33,36,53,49,42,56,63,39,42,40,54,19,],
'product_category':['Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars',
'Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars',]
}
df = pd.DataFrame(details)
I have set data in July 2020 and August 2020 to be missing/NaN.
Considering you state using a pivot with indexes of year and month, I am assuming something like the below:
(df
.pivot(
index=['year','month'],
columns=['product_category'],
values=['product_count'])
.dropna()
)
Notice how chaining ".dropna()" on the end of the pivot excludes July 2020 and August 2020 from the output.
答案2
得分: 0
bins = [0, 1000, 2000, 5000, np.inf]
labels = ['0-1000', '1000-2000', '2000-5000', '5000+']
out = pd.crosstab([df['Product Type'], df['Product']],
pd.cut(df['Amount'], bins=bins, labels=labels)
).reindex(columns=labels, fill_value=0)
输出结果:
Amount 0-1000 1000-2000 2000-5000 5000+
Product Type Product
Bikes Hayabusa 1 0 1 0
Cars BMW M3 1 0 2 0
Fruits Apple 0 1 1 0
Banana 0 0 2 0
Vegetable Apple 0 1 1 0
Brocoli 0 0 1 0
Carrot 0 0 1 0
Vegetables Brocoli 0 1 0 0
使用的输入数据:
df = pd.DataFrame({'Product Type': ['Fruits', 'Fruits', 'Vegetable', 'Vegetable', 'Vegetable', 'Vegetable', 'Fruits', 'Fruits', 'Vegetables', 'Cars', 'Cars', 'Cars', 'Bikes', 'Bikes'],
'Product': ['Apple', 'Banana', 'Apple', 'Apple', 'Brocoli', 'Carrot', 'Apple', 'Banana', 'Brocoli', 'BMW M3', 'BMW M3', 'BMW M3', 'Hayabusa', 'Hayabusa'],
'Amount': [4938, 3285, 4947, 1516, 2212, 3778, 1110, 4436, 1049, 494, 2818, 3737, 954, 4074],
})
英文:
bins = [0, 1000, 2000, 5000, np.inf]
labels = ['0-1000', '1000-2000', '2000-5000', '5000+']
out = pd.crosstab([df['Product Type'], df['Product']],
pd.cut(df['Amount'], bins=bins, labels=labels)
).reindex(columns=labels, fill_value=0)
Output:
Amount 0-1000 1000-2000 2000-5000 5000+
Product Type Product
Bikes Hayabusa 1 0 1 0
Cars BMW M3 1 0 2 0
Fruits Apple 0 1 1 0
Banana 0 0 2 0
Vegetable Apple 0 1 1 0
Brocoli 0 0 1 0
Carrot 0 0 1 0
Vegetables Brocoli 0 1 0 0
Used input:
df = pd.DataFrame({'Product Type': ['Fruits', 'Fruits', 'Vegetable', 'Vegetable', 'Vegetable', 'Vegetable', 'Fruits', 'Fruits', 'Vegetables', 'Cars', 'Cars', 'Cars', 'Bikes', 'Bikes'],
'Product': ['Apple', 'Banana', 'Apple', 'Apple', 'Brocoli', 'Carrot', 'Apple', 'Banana', 'Brocoli', 'BMW M3', 'BMW M3', 'BMW M3', 'Hayabusa', 'Hayabusa'],
'Amount': [4938, 3285, 4947, 1516, 2212, 3778, 1110, 4436, 1049, 494, 2818, 3737, 954, 4074],
})
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论