2023-05-11 12:16:42

How to include unobserved values of categorical variables in Pandas value_counts

Question

In Pandas, calling value_counts on a Categorical Series returns a count for every possible category, even when that count is zero. This behavior is subtle, lightly documented, and maybe seldom cared about, but hold on tight: we're going down a rabbit hole.

Let's say I define a DataFrame with two Categorical columns like this:

import pandas as pd
abc_categorical = pd.CategoricalDtype(['a', 'b', 'c'], ordered=True)
size_categorical = pd.CategoricalDtype(['s', 'm', 'l'], ordered=True)

df = pd.DataFrame({
    'a':     pd.Categorical(list('aaabababababa'), dtype=abc_categorical),
    'tsize': pd.Categorical(list('smmssmlmlslsl'), dtype=size_categorical),
})

	a	tsize
0	a	s
1	a	m
2	a	m
3	b	s
4	a	s
5	b	m
6	a	l
7	b	m
8	a	l
9	b	s
10	a	l
11	b	s
12	a	l

Series.value_counts

I have no rows where the 'a' column has the value 'c', so Series.value_counts gives us a zero count for it.

df.a.value_counts(sort=False)

a    8
b    5
c    0
Name: a, dtype: int64

Because we bothered to define a Categorical, Pandas knows that the column 'a' could take on the value 'c', but that never occurs in the data, so we get a 0. So far, so good.
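To make the contrast concrete, here is a small sketch that stores the same letters twice, once with the categorical dtype and once as plain objects. The variable names (letters, as_categorical, as_object) are mine and only for illustration.

```python
import pandas as pd

# Same letters as column 'a' above, stored two ways.
letters = list('aaabababababa')
abc_categorical = pd.CategoricalDtype(['a', 'b', 'c'], ordered=True)

as_categorical = pd.Series(letters, dtype=abc_categorical)
as_object = pd.Series(letters, dtype=object)

# The categorical Series reports every declared category, including the unobserved 'c'.
cat_counts = as_categorical.value_counts(sort=False)

# The object Series only knows about values that actually occur.
obj_counts = as_object.value_counts(sort=False)

print(cat_counts)
print(obj_counts)
```

Only the categorical version carries an entry for 'c', because the dtype is the only place where 'c' is recorded at all.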

DataFrame.value_counts

However, if I call DataFrame.value_counts to count combinations of values across both columns, I don't get a zero for the 'b' & 'l' combination, which never occurs, and I get nothing at all for 'c'.

df.value_counts(sort=False)

a  tsize
a  s        2
   m        2
   l        4
b  s        3
   m        2
dtype: int64

Bummer!

Crosstab?

The pandas.crosstab function does a little better, giving a zero count for 'b' & 'l', but still leaves out the 'c' values.

pd.crosstab(df.a, df.tsize)

tsize   s   m   l
a
    a   2   2   4
    b   3   2   0
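As a possible aside, crosstab has a dropna parameter, and my understanding of its documentation is that passing dropna=False keeps rows and columns for categories that never occur. The sketch below rebuilds the same DataFrame and assumes that behavior; it is worth checking against the Pandas version in use.

```python
import pandas as pd

abc_categorical = pd.CategoricalDtype(['a', 'b', 'c'], ordered=True)
size_categorical = pd.CategoricalDtype(['s', 'm', 'l'], ordered=True)
df = pd.DataFrame({
    'a':     pd.Categorical(list('aaabababababa'), dtype=abc_categorical),
    'tsize': pd.Categorical(list('smmssmlmlslsl'), dtype=size_categorical),
})

# dropna=False asks crosstab not to discard rows or columns that are entirely empty,
# which for categorical inputs should include the unobserved category 'c'.
table = pd.crosstab(df['a'], df['tsize'], dropna=False)
print(table)
```

If it behaves as documented, the result is a 3x3 table whose 'c' row is all zeros.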

Expected results

I think DataFrame.value_counts should return something like this:

            count
a   tsize
a   s       2
    m       2
    l       4
b   s       3
    m       2
    l       0
c   s       0
    m       0
    l       0

DataFrame.value_counts should behave like Series.value_counts, or at least provide an option to do so, maybe "value_counts(include_zeros=True)". For that matter, crosstab should do likewise.

My actual question

My question: is there a concise, idiomatic way to get Pandas to produce the counts I'm looking for here, zeros included?

Note: asked in the context of Pandas 2.0.1.

Answer 1

Score: 1

Yep, DataFrame.groupby respects Categorical!

Groupby

counts = df.groupby(["a", "tsize"]).size()
counts

DataFrameGroupBy.size returns the number of rows in each group as a Series whose index is the Cartesian product of the categories of the two columns.

a  tsize
a  s        2
   m        2
   l        4
b  s        3
   m        2
   l        0
c  s        0
   m        0
   l        0
dtype: int64
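Wrapped up as a reusable function, the idea might look like the sketch below. The helper name value_counts_with_zeros is hypothetical (it is not part of Pandas), and I pass observed=False explicitly rather than relying on the default.

```python
import pandas as pd

abc_categorical = pd.CategoricalDtype(['a', 'b', 'c'], ordered=True)
size_categorical = pd.CategoricalDtype(['s', 'm', 'l'], ordered=True)
df = pd.DataFrame({
    'a':     pd.Categorical(list('aaabababababa'), dtype=abc_categorical),
    'tsize': pd.Categorical(list('smmssmlmlslsl'), dtype=size_categorical),
})


def value_counts_with_zeros(frame, columns):
    """Hypothetical helper: count value combinations, keeping unobserved categories."""
    # observed=False keeps every combination of categories, observed or not.
    return frame.groupby(columns, observed=False).size().rename('count')


counts = value_counts_with_zeros(df, ['a', 'tsize'])
print(counts)
```

Calling counts.reset_index() afterwards gives a flat DataFrame with a 'count' column, close to the layout sketched in the question.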

Awesome! All the counts and the zeros. Also, note that the T-shirt sizes are in proper small, medium, large order rather than alphabetical order. This is thanks to another smart bit of Pandas kit.

CategoricalIndex

The way the index prints itself makes it look like an array of tuples of strings, but that's not what's under the hood.

counts.index

MultiIndex([('a', 's'),
            ('a', 'm'),
            ('a', 'l'),
            ('b', 's'),
            ('b', 'm'),
            ('b', 'l'),
            ('c', 's'),
            ('c', 'm'),
            ('c', 'l')],
           names=['a', 'tsize'])

Pandas provides a CategoricalIndex, and we can see it at work if we drill into the levels of the MultiIndex.

counts.index.get_level_values('a')

CategoricalIndex(['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'], categories=['a', 'b', 'c'], ordered=True, dtype='category', name='a')

counts.index.get_level_values('tsize')

CategoricalIndex(['s', 'm', 'l', 's', 'm', 'l', 's', 'm', 'l'], categories=['s', 'm', 'l'], ordered=True, dtype='category', name='tsize')
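A quick way to see why this matters is to sort the size labels twice: once as a CategoricalIndex and once as plain Python strings. This is only a sketch, and the variable names are mine.

```python
import pandas as pd

abc_categorical = pd.CategoricalDtype(['a', 'b', 'c'], ordered=True)
size_categorical = pd.CategoricalDtype(['s', 'm', 'l'], ordered=True)
df = pd.DataFrame({
    'a':     pd.Categorical(list('aaabababababa'), dtype=abc_categorical),
    'tsize': pd.Categorical(list('smmssmlmlslsl'), dtype=size_categorical),
})
counts = df.groupby(['a', 'tsize'], observed=False).size()

# The first three entries of the 'tsize' level, still a CategoricalIndex.
tsize_level = counts.index.get_level_values('tsize')[:3]

# Sorting the CategoricalIndex follows the declared category order.
categorical_order = tsize_level.sort_values().tolist()

# Sorting the same labels as plain strings falls back to alphabetical order.
string_order = sorted(tsize_level.tolist())

print(type(tsize_level).__name__, categorical_order, string_order)
```

The categorical sort keeps small, medium, large, while the string sort puts 'l' first.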

Pivot table

For completeness, pivot_table also respects Categoricals and gives the same results as groupby.

pd.pivot_table(df, index=["a", "tsize"], aggfunc='size')

Observed option

What if you don't want the zeros? Both groupby and pivot_table accept an 'observed' parameter, which defaults to False in Pandas 2.0.1. If observed is set to True, we only get the observed values for categorical groupers, replicating the behavior of DataFrame.value_counts.
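Here is a minimal side-by-side sketch of the two settings on the same DataFrame. As far as I know, Pandas releases from 2.1 onward warn when observed is left unspecified for categorical groupers, so spelling it out seems the safer habit.

```python
import pandas as pd

abc_categorical = pd.CategoricalDtype(['a', 'b', 'c'], ordered=True)
size_categorical = pd.CategoricalDtype(['s', 'm', 'l'], ordered=True)
df = pd.DataFrame({
    'a':     pd.Categorical(list('aaabababababa'), dtype=abc_categorical),
    'tsize': pd.Categorical(list('smmssmlmlslsl'), dtype=size_categorical),
})

# observed=False: one row per combination of categories, zeros included.
with_zeros = df.groupby(['a', 'tsize'], observed=False).size()

# observed=True: only combinations that actually occur in the data.
observed_only = df.groupby(['a', 'tsize'], observed=True).size()

print(len(with_zeros), len(observed_only))
```

The first result has nine rows and the second has five, the same five combinations that DataFrame.value_counts reported in the question.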

Nice!
