2023年6月22日 17:30:28go评论78阅读模式

英文:

How to drop irrelevant indexes in multi index pivot pandas

问题

我有2个索引，即年份和月份。我正在使用数据透视表来显示产品的计数。

现在，在我的数据框中，假设2020年没有七月和八月的记录。但是数据透视表会显示这些月份和值为0的情况。我不希望数据透视表显示这些无关的行（在数据框中不存在），因为它们会使我的最终数据透视表非常长。如何减少这个？

这是我的示例数据框：

df = pd.DataFrame({'Product Type': ['Fruits', 'Fruits', 'Vegetable', 'Vegetable', 'Vegetable', 'Vegetable', 'Fruits', 'Fruits', 'Vegetables', 'Cars', 'Cars', 'Cars', 'Bikes', 'Bikes'],
                   'Product': ['Apple', 'Banana', 'Apple', 'Apple', 'Brocoli', 'Carrot', 'Apple', 'Banana', 'Brocoli', 'BMW M3', 'BMW M3', 'BMW M3', 'Hayabusa', 'Hayabusa'],
                   'Amount': [4938, 3285, 4947, 1516, 2212, 3778, 1110, 4436, 1049, 494, 2818, 3737, 954, 4074],
                  })

以及迄今为止的我的代码：

import pandas as pd
import numpy as np
df = pd.read_csv('try.csv')
bins = [0, 1000, 2000, 5000, float(np.inf)]
labels = ['0-1000', '1000-2000', '2000-5000', '5000+']
df['bins'] = pd.cut(df['Amount'], bins=bins, labels=labels, right=True)
pivot = df.pivot_table(index=['Product Type', 'Product'], columns='bins', aggfunc='size')
pivot.dropna(inplace=True)
pivot

期望输出：

Amount                 0-1000  1000-2000  2000-5000  5000+
Product Type Product                                      
Bikes        Hayabusa       1          0          1      0
Cars         BMW M3         1          0          2      0
Fruits       Apple          0          1          1      0
             Banana         0          0          2      0
Vegetable    Apple          0          1          1      0
             Brocoli        0          0          1      0
             Carrot         0          0          1      0
Vegetables   Brocoli        0          1          0      0

在数据框中，Bikes只包含'Hayabusa'，我想要在我的数据透视表的Bike类别中包含它。我应该如何做这个？

英文:

I have 2 indexes say: year and month. And I’m using pivot to display the count of products.

Now, in my df, let’s say there are no records of month july and august in 2020. But the pivot will show these months and values 0. I don’t want the pivot to show these irrelevant rows (which are not present in the df) as they make my final pivot very long. How to reduce this?

Here is my example df:

df = pd.DataFrame({&#39;Product Type&#39;: [&#39;Fruits&#39;, &#39;Fruits&#39;, &#39;Vegetable&#39;, &#39;Vegetable&#39;, &#39;Vegetable&#39;, &#39;Vegetable&#39;, &#39;Fruits&#39;, &#39;Fruits&#39;, &#39;Vegetables&#39;, &#39;Cars&#39;, &#39;Cars&#39;, &#39;Cars&#39;, &#39;Bikes&#39;, &#39;Bikes&#39;],
                   &#39;Product&#39;: [&#39;Apple&#39;, &#39;Banana&#39;, &#39;Apple&#39;, &#39;Apple&#39;, &#39;Brocoli&#39;, &#39;Carrot&#39;, &#39;Apple&#39;, &#39;Banana&#39;, &#39;Brocoli&#39;, &#39;BMW M3&#39;, &#39;BMW M3&#39;, &#39;BMW M3&#39;, &#39;Hayabusa&#39;, &#39;Hayabusa&#39;],
                   &#39;Amount&#39;: [4938, 3285, 4947, 1516, 2212, 3778, 1110, 4436, 1049, 494, 2818, 3737, 954, 4074],
                  })

And my code so far:

import pandas as pd
import numpy as np
df = pd.read_csv(&#39;try.csv&#39;)
bins = [0,1000,2000,5000,float(np.inf)]
labels = [&#39;0-1000&#39;,&#39;1000-2000&#39;,&#39;2000-5000&#39;,&#39;5000+&#39;]
df[&#39;bins&#39;] = pd.cut(df[&#39;Amount&#39;],bins=bins, labels=labels, right=True)
pivot = df.pivot_table(index=[&#39;Product Type&#39;,&#39;Product&#39;],columns=&#39;bins&#39;, aggfunc=&#39;size&#39;)
pivot.dropna(inplace=True)
pivot

Expected ouput:

Amount                 0-1000  1000-2000  2000-5000  5000+
Product Type Product                                      
Bikes        Hayabusa       1          0          1      0
Cars         BMW M3         1          0          2      0
Fruits       Apple          0          1          1      0
             Banana         0          0          2      0
Vegetable    Apple          0          1          1      0
             Brocoli        0          0          1      0
             Carrot         0          0          1      0
Vegetables   Brocoli        0          1          0      0

In the df, Bikes only contains 'hayabusa', which i want in my pivot's Bike category. How should I do this?

答案1

得分: 0

以下是翻译好的部分：

没有确切了解你的数据是什么样子的，以下是我尝试提供答案的内容：

这里是一些示例数据：

details = {
    'year':[2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,
            2021,2021,2021,2021,2021,2021,2021,2021,2021,2021,2021,2021,],
    'month':[1,2,3,4,5,6,7,8,9,10,11,12,
             1,2,3,4,5,6,7,8,9,10,11,12,],
    'product_count':[102,67,36,23,7,6,np.nan,np.nan,5,3,3,2,
                     33,36,53,49,42,56,63,39,42,40,54,19,],
    'product_category':['Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars',
                        'Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars','Cars',]
}

df = pd.DataFrame(details)

我已经将2020年7月和2020年8月的数据设置为缺失/NaN。

考虑到你提到要使用年份和月份作为索引进行数据透视，我假设你的操作可能类似于以下内容：

(df
 .pivot(
    index=['year','month'], 
    columns=['product_category'], 
    values=['product_count'])
 .dropna()
)

请注意，在透视操作的末尾添加了".dropna()"，这将从输出中排除了2020年7月和2020年8月的数据。

英文:

Without knowing exactly what your data looks like, the below is my attempt at providing an answer:

Here is some example data:

details = {
    &#39;year&#39;:[2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,
            2021,2021,2021,2021,2021,2021,2021,2021,2021,2021,2021,2021,],
    &#39;month&#39;:[1,2,3,4,5,6,7,8,9,10,11,12,
             1,2,3,4,5,6,7,8,9,10,11,12,],
    &#39;product_count&#39;:[102,67,36,23,7,6,np.nan,np.nan,5,3,3,2,
                     33,36,53,49,42,56,63,39,42,40,54,19,],
    &#39;product_category&#39;:[&#39;Cars&#39;,&#39;Cars&#39;,&#39;Cars&#39;,&#39;Cars&#39;,&#39;Cars&#39;,&#39;Cars&#39;,&#39;Cars&#39;,&#39;Cars&#39;,&#39;Cars&#39;,&#39;Cars&#39;,&#39;Cars&#39;,&#39;Cars&#39;,
                        &#39;Cars&#39;,&#39;Cars&#39;,&#39;Cars&#39;,&#39;Cars&#39;,&#39;Cars&#39;,&#39;Cars&#39;,&#39;Cars&#39;,&#39;Cars&#39;,&#39;Cars&#39;,&#39;Cars&#39;,&#39;Cars&#39;,&#39;Cars&#39;,]
                        }

df = pd.DataFrame(details)

I have set data in July 2020 and August 2020 to be missing/NaN.

Considering you state using a pivot with indexes of year and month, I am assuming something like the below:

(df
 .pivot(
    index=[&#39;year&#39;,&#39;month&#39;], 
    columns=[&#39;product_category&#39;], 
    values=[&#39;product_count&#39;])
 .dropna()
)

Notice how chaining ".dropna()" on the end of the pivot excludes July 2020 and August 2020 from the output.

答案2

得分: 0

使用 cut 和 crosstab：

bins = [0, 1000, 2000, 5000, np.inf]
labels = ['0-1000', '1000-2000', '2000-5000', '5000+']

out = pd.crosstab([df['Product Type'], df['Product']],
                  pd.cut(df['Amount'], bins=bins, labels=labels)
                 ).reindex(columns=labels, fill_value=0)

输出结果：

Amount                 0-1000  1000-2000  2000-5000  5000+
Product Type Product                                      
Bikes        Hayabusa       1          0          1      0
Cars         BMW M3         1          0          2      0
Fruits       Apple          0          1          1      0
             Banana         0          0          2      0
Vegetable    Apple          0          1          1      0
             Brocoli        0          0          1      0
             Carrot         0          0          1      0
Vegetables   Brocoli        0          1          0      0

使用的输入数据：

df = pd.DataFrame({'Product Type': ['Fruits', 'Fruits', 'Vegetable', 'Vegetable', 'Vegetable', 'Vegetable', 'Fruits', 'Fruits', 'Vegetables', 'Cars', 'Cars', 'Cars', 'Bikes', 'Bikes'],
                   'Product': ['Apple', 'Banana', 'Apple', 'Apple', 'Brocoli', 'Carrot', 'Apple', 'Banana', 'Brocoli', 'BMW M3', 'BMW M3', 'BMW M3', 'Hayabusa', 'Hayabusa'],
                   'Amount': [4938, 3285, 4947, 1516, 2212, 3778, 1110, 4436, 1049, 494, 2818, 3737, 954, 4074],
                  })

英文:

Use cut and crosstab:

bins = [0, 1000, 2000, 5000, np.inf]
labels = [&#39;0-1000&#39;, &#39;1000-2000&#39;, &#39;2000-5000&#39;, &#39;5000+&#39;]

out = pd.crosstab([df[&#39;Product Type&#39;], df[&#39;Product&#39;]],
                  pd.cut(df[&#39;Amount&#39;], bins=bins, labels=labels)
                 ).reindex(columns=labels, fill_value=0)

Output:

Amount                 0-1000  1000-2000  2000-5000  5000+
Product Type Product                                      
Bikes        Hayabusa       1          0          1      0
Cars         BMW M3         1          0          2      0
Fruits       Apple          0          1          1      0
             Banana         0          0          2      0
Vegetable    Apple          0          1          1      0
             Brocoli        0          0          1      0
             Carrot         0          0          1      0
Vegetables   Brocoli        0          1          0      0

Used input:

df = pd.DataFrame({&#39;Product Type&#39;: [&#39;Fruits&#39;, &#39;Fruits&#39;, &#39;Vegetable&#39;, &#39;Vegetable&#39;, &#39;Vegetable&#39;, &#39;Vegetable&#39;, &#39;Fruits&#39;, &#39;Fruits&#39;, &#39;Vegetables&#39;, &#39;Cars&#39;, &#39;Cars&#39;, &#39;Cars&#39;, &#39;Bikes&#39;, &#39;Bikes&#39;],
                   &#39;Product&#39;: [&#39;Apple&#39;, &#39;Banana&#39;, &#39;Apple&#39;, &#39;Apple&#39;, &#39;Brocoli&#39;, &#39;Carrot&#39;, &#39;Apple&#39;, &#39;Banana&#39;, &#39;Brocoli&#39;, &#39;BMW M3&#39;, &#39;BMW M3&#39;, &#39;BMW M3&#39;, &#39;Hayabusa&#39;, &#39;Hayabusa&#39;],
                   &#39;Amount&#39;: [4938, 3285, 4947, 1516, 2212, 3778, 1110, 4436, 1049, 494, 2818, 3737, 954, 4074],
                  })

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

如何在多重索引的 Pandas 透视表中删除无关的索引

问题

答案1

答案2

如何在 aiogram 状态期间检测内联 Python 按钮？

用向前填充替换高值

Pandas基于百分比变化和初始值的新累积列

Pandas 计算“自高点以来的柱数”

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论