如何在多重索引的 Pandas 透视表中删除无关的索引

huangapple go评论66阅读模式
英文:

How to drop irrelevant indexes in multi index pivot pandas

问题

我有2个索引,即年份和月份。我正在使用数据透视表来显示产品的计数。

现在,在我的数据框中,假设2020年没有七月和八月的记录。但是数据透视表会显示这些月份和值为0的情况。我不希望数据透视表显示这些无关的行(在数据框中不存在),因为它们会使我的最终数据透视表非常长。如何减少这个?

这是我的示例数据框:

df = pd.DataFrame({'Product Type': ['Fruits', 'Fruits', 'Vegetable', 'Vegetable', 'Vegetable', 'Vegetable', 'Fruits', 'Fruits', 'Vegetables', 'Cars', 'Cars', 'Cars', 'Bikes', 'Bikes'],
                   'Product': ['Apple', 'Banana', 'Apple', 'Apple', 'Brocoli', 'Carrot', 'Apple', 'Banana', 'Brocoli', 'BMW M3', 'BMW M3', 'BMW M3', 'Hayabusa', 'Hayabusa'],
                   'Amount': [4938, 3285, 4947, 1516, 2212, 3778, 1110, 4436, 1049, 494, 2818, 3737, 954, 4074],
                  })

以及迄今为止的我的代码:

import pandas as pd
import numpy as np
df = pd.read_csv('try.csv')
bins = [0, 1000, 2000, 5000, float(np.inf)]
labels = ['0-1000', '1000-2000', '2000-5000', '5000+']
df['bins'] = pd.cut(df['Amount'], bins=bins, labels=labels, right=True)
pivot = df.pivot_table(index=['Product Type', 'Product'], columns='bins', aggfunc='size')
pivot.dropna(inplace=True)
pivot

期望输出:

Amount                 0-1000  1000-2000  2000-5000  5000+
Product Type Product                                      
Bikes        Hayabusa       1          0          1      0
Cars         BMW M3         1          0          2      0
Fruits       Apple          0          1          1      0
             Banana         0          0          2      0
Vegetable    Apple          0          1          1      0
             Brocoli        0          0          1      0
             Carrot         0          0          1      0
Vegetables   Brocoli        0          1          0      0

在数据框中,Bikes只包含'Hayabusa',我想要在我的数据透视表的Bike类别中包含它。我应该如何做这个?

英文:

I have 2 indexes say: year and month. And I’m using pivot to display the count of products.

Now, in my df, let’s say there are no records of month july and august in 2020. But the pivot will show these months and values 0. I don’t want the pivot to show these irrelevant rows (which are not present in the df) as they make my final pivot very long. How to reduce this?

Here is my example df:

df = pd.DataFrame({'Product Type': ['Fruits', 'Fruits', 'Vegetable', 'Vegetable', 'Vegetable', 'Vegetable', 'Fruits', 'Fruits', 'Vegetables', 'Cars', 'Cars', 'Cars', 'Bikes', 'Bikes'],
                   'Product': ['Apple', 'Banana', 'Apple', 'Apple', 'Brocoli', 'Carrot', 'Apple', 'Banana', 'Brocoli', 'BMW M3', 'BMW M3', 'BMW M3', 'Hayabusa', 'Hayabusa'],
                   'Amount': [4938, 3285, 4947, 1516, 2212, 3778, 1110, 4436, 1049, 494, 2818, 3737, 954, 4074],
                  })

And my code so far:

import pandas as pd
import numpy as np
df = pd.read_csv('try.csv')
bins = [0,1000,2000,5000,float(np.inf)]
labels = ['0-1000','1000-2000','2000-5000','5000+']
df['bins'] = pd.cut(df['Amount'],bins=bins, labels=labels, right=True)
pivot = df.pivot_table(index=['Product Type','Product'],columns='bins', aggfunc='size')
pivot.dropna(inplace=True)
pivot

Expected ouput:

Amount                 0-1000  1000-2000  2000-5000  5000+
Product Type Product                                      
Bikes        Hayabusa       1          0          1      0
Cars         BMW M3         1          0          2      0
Fruits       Apple          0          1          1      0
             Banana         0          0          2      0
Vegetable    Apple          0          1          1      0
             Brocoli        0          0          1      0
             Carrot         0          0          1      0
Vegetables   Brocoli        0          1          0      0

In the df, Bikes only contains 'hayabusa', which i want in my pivot's Bike category. How should I do this?

答案1

得分: 0

以下是翻译好的部分:

没有确切了解你的数据是什么样子的,以下是我尝试提供答案的内容:

这里是一些示例数据:

details = {
    'year':[2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,
            2021,2021,2021,2021,2021,2021,2021,2021,2021,2021,2021,2021,],
    'month':[1,2,3,4,5,6,7,8,9,10,11,12,
             1,2,3,4,5,6,7,8,9,10,11,12,],
    'product_count':[102,67,36,23,7,6,np.nan,np.nan,5,3,3,2,
                     33,36,53,49,42,56,63,39,42,40,54,19,],
    'product_category':['Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars',
                        'Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars',]
}

df = pd.DataFrame(details)

我已经将2020年7月和2020年8月的数据设置为缺失/NaN。

考虑到你提到要使用年份和月份作为索引进行数据透视,我假设你的操作可能类似于以下内容:

(df
 .pivot(
    index=['year','month'], 
    columns=['product_category'], 
    values=['product_count'])
 .dropna()
)

请注意,在透视操作的末尾添加了".dropna()",这将从输出中排除了2020年7月和2020年8月的数据。

如何在多重索引的 Pandas 透视表中删除无关的索引

英文:

Without knowing exactly what your data looks like, the below is my attempt at providing an answer:

Here is some example data:

details = {
    'year':[2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,
            2021,2021,2021,2021,2021,2021,2021,2021,2021,2021,2021,2021,],
    'month':[1,2,3,4,5,6,7,8,9,10,11,12,
             1,2,3,4,5,6,7,8,9,10,11,12,],
    'product_count':[102,67,36,23,7,6,np.nan,np.nan,5,3,3,2,
                     33,36,53,49,42,56,63,39,42,40,54,19,],
    'product_category':['Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars',
                        'Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars',]
                        }

df = pd.DataFrame(details) 

I have set data in July 2020 and August 2020 to be missing/NaN.

Considering you state using a pivot with indexes of year and month, I am assuming something like the below:

(df
 .pivot(
    index=['year','month'], 
    columns=['product_category'], 
    values=['product_count'])
 .dropna()
)

Notice how chaining ".dropna()" on the end of the pivot excludes July 2020 and August 2020 from the output.

如何在多重索引的 Pandas 透视表中删除无关的索引

答案2

得分: 0

使用 cutcrosstab

bins = [0, 1000, 2000, 5000, np.inf]
labels = ['0-1000', '1000-2000', '2000-5000', '5000+']

out = pd.crosstab([df['Product Type'], df['Product']],
                  pd.cut(df['Amount'], bins=bins, labels=labels)
                 ).reindex(columns=labels, fill_value=0)

输出结果:

Amount                 0-1000  1000-2000  2000-5000  5000+
Product Type Product                                      
Bikes        Hayabusa       1          0          1      0
Cars         BMW M3         1          0          2      0
Fruits       Apple          0          1          1      0
             Banana         0          0          2      0
Vegetable    Apple          0          1          1      0
             Brocoli        0          0          1      0
             Carrot         0          0          1      0
Vegetables   Brocoli        0          1          0      0

使用的输入数据:

df = pd.DataFrame({'Product Type': ['Fruits', 'Fruits', 'Vegetable', 'Vegetable', 'Vegetable', 'Vegetable', 'Fruits', 'Fruits', 'Vegetables', 'Cars', 'Cars', 'Cars', 'Bikes', 'Bikes'],
                   'Product': ['Apple', 'Banana', 'Apple', 'Apple', 'Brocoli', 'Carrot', 'Apple', 'Banana', 'Brocoli', 'BMW M3', 'BMW M3', 'BMW M3', 'Hayabusa', 'Hayabusa'],
                   'Amount': [4938, 3285, 4947, 1516, 2212, 3778, 1110, 4436, 1049, 494, 2818, 3737, 954, 4074],
                  })
英文:

Use cut and crosstab:

bins = [0, 1000, 2000, 5000, np.inf]
labels = ['0-1000', '1000-2000', '2000-5000', '5000+']

out = pd.crosstab([df['Product Type'], df['Product']],
                  pd.cut(df['Amount'], bins=bins, labels=labels)
                 ).reindex(columns=labels, fill_value=0)

Output:

Amount                 0-1000  1000-2000  2000-5000  5000+
Product Type Product                                      
Bikes        Hayabusa       1          0          1      0
Cars         BMW M3         1          0          2      0
Fruits       Apple          0          1          1      0
             Banana         0          0          2      0
Vegetable    Apple          0          1          1      0
             Brocoli        0          0          1      0
             Carrot         0          0          1      0
Vegetables   Brocoli        0          1          0      0

Used input:

df = pd.DataFrame({'Product Type': ['Fruits', 'Fruits', 'Vegetable', 'Vegetable', 'Vegetable', 'Vegetable', 'Fruits', 'Fruits', 'Vegetables', 'Cars', 'Cars', 'Cars', 'Bikes', 'Bikes'],
                   'Product': ['Apple', 'Banana', 'Apple', 'Apple', 'Brocoli', 'Carrot', 'Apple', 'Banana', 'Brocoli', 'BMW M3', 'BMW M3', 'BMW M3', 'Hayabusa', 'Hayabusa'],
                   'Amount': [4938, 3285, 4947, 1516, 2212, 3778, 1110, 4436, 1049, 494, 2818, 3737, 954, 4074],
                  })

huangapple
  • 本文由 发表于 2023年6月22日 17:30:28
  • 转载请务必保留本文链接:https://go.coder-hub.com/76530445.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定