How to include unobserved values of categorical variables in Pandas value_counts

Question

In Pandas, calling value_counts on a Categorical Series ensures that every possible value gets a count, even when that count is zero. This behavior is subtle, lightly documented, and perhaps seldom cared about, but hold on tight: we're going down a rabbit hole.

Let's say I define a DataFrame with two Categorical columns like this:

import pandas as pd
abc_categorical = pd.CategoricalDtype(['a', 'b', 'c'], ordered=True)
size_categorical = pd.CategoricalDtype(['s', 'm', 'l'], ordered=True)

df = pd.DataFrame({
        'a':     pd.Categorical(list('aaabababababa'), dtype=abc_categorical),
        'tsize': pd.Categorical(list('smmssmlmlslsl'), dtype=size_categorical),
    }
)
	a	tsize
0	a	s
1	a	m
2	a	m
3	b	s
4	a	s
5	b	m
6	a	l
7	b	m
8	a	l
9	b	s
10	a	l
11	b	s
12	a	l

Series.value_counts

I have no rows where the 'a' column has the value 'c', so Series.value_counts gives us a zero count for it.

df.a.value_counts(sort=False)
a    8
b    5
c    0
Name: a, dtype: int64

Because we bothered to define a Categorical, Pandas knows that the column 'a' could take on the value 'c', but that never occurs in the data, so we get a 0. So far, so good.

DataFrame.value_counts

However, if I call DataFrame.value_counts to count combinations of values across both columns, I get no zero for the 'b' & 'l' combination, which has no rows, and nothing at all for 'c'.

df.value_counts(sort=False)
a  tsize
a  s        2
   m        2
   l        4
b  s        3
   m        2
dtype: int64

Bummer!

Crosstab?

The pandas.crosstab function does a little better, giving a zero count for 'b' & 'l', but still leaves out the 'c' values.

pd.crosstab(df.a, df.tsize)
tsize   s   m   l
a
    a   2   2   4
    b   3   2   0

Expected results

I think value_counts should return something like this:

            count
a   tsize
a   s       2
    m       2
    l       4
b   s       3
    m       2
    l       0
c   s       0
    m       0
    l       0

DataFrame.value_counts should behave like Series.value_counts, or at least provide an option to do so, maybe "value_counts(include_zeros=True)". For that matter, crosstab should do likewise.

My actual question

My question: Is there a concise, idiomatic way to get Pandas to produce the counts I'm looking for here, including the zeros?

Note: asked in context of Pandas 2.0.1

Answer 1

Score: 1

Yep, DataFrame.groupby respects Categorical!

Groupby

counts = df.groupby(["a", "tsize"]).size()
counts

DataFrameGroupBy.size returns the number of rows in each group as a Series whose index is the cartesian product of the two categorical types.

a  tsize
a  s        2
   m        2
   l        4
b  s        3
   m        2
   l        0
c  s        0
   m        0
   l        0
dtype: int64
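To get from this Series to the table sketched under "Expected results" in the question, one option is to name the counts column. The snippet below is a minimal sketch rather than part of the original answer: the variable name count_table is made up for illustration, and observed=False is passed explicitly on the assumption that you do not want to rely on the installed version's default.

```python
import pandas as pd

abc_categorical = pd.CategoricalDtype(["a", "b", "c"], ordered=True)
size_categorical = pd.CategoricalDtype(["s", "m", "l"], ordered=True)

df = pd.DataFrame({
    "a": pd.Categorical(list("aaabababababa"), dtype=abc_categorical),
    "tsize": pd.Categorical(list("smmssmlmlslsl"), dtype=size_categorical),
})

# observed=False keeps every category combination, including those with no rows.
# to_frame("count") turns the resulting Series into a one-column DataFrame.
count_table = df.groupby(["a", "tsize"], observed=False).size().to_frame("count")
print(count_table)
```

If you prefer flat columns over a MultiIndex, calling reset_index(name="count") on the Series instead of to_frame("count") should give one row per combination with 'a', 'tsize' and 'count' columns.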

Awesome! All the counts and the zeros. Also, note that the T-shirt sizes are in proper small, medium, large order rather than alphabetical order. This is thanks to another smart bit of Pandas kit.

CategoricalIndex

The way the index prints itself makes it look like an array of tuples of strings, but that's not what's under the hood.

counts.index
MultiIndex([('a', 's'),
            ('a', 'm'),
            ('a', 'l'),
            ('b', 's'),
            ('b', 'm'),
            ('b', 'l'),
            ('c', 's'),
            ('c', 'm'),
            ('c', 'l')],
           names=['a', 'tsize'])

Pandas provides a CategoricalIndex, and we can see it at work if we drill into the levels of the MultiIndex.

counts.index.get_level_values('a')
CategoricalIndex(['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'], categories=['a', 'b', 'c'], ordered=True, dtype='category', name='a')
counts.index.get_level_values('tsize')
CategoricalIndex(['s', 'm', 'l', 's', 'm', 'l', 's', 'm', 'l'], categories=['s', 'm', 'l'], ordered=True, dtype='category', name='tsize')
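A quick way to confirm that the size ordering comes from the categorical dtype and not from string sorting is to inspect the level values directly. This is a small sketch; the names tsize_level and alphabetical are only illustrative.

```python
import pandas as pd

abc_categorical = pd.CategoricalDtype(["a", "b", "c"], ordered=True)
size_categorical = pd.CategoricalDtype(["s", "m", "l"], ordered=True)

df = pd.DataFrame({
    "a": pd.Categorical(list("aaabababababa"), dtype=abc_categorical),
    "tsize": pd.Categorical(list("smmssmlmlslsl"), dtype=size_categorical),
})

counts = df.groupby(["a", "tsize"], observed=False).size()

# The level values carry the categorical dtype, including its declared order.
tsize_level = counts.index.get_level_values("tsize")
print(type(tsize_level).__name__)    # class name of the level values
print(list(tsize_level.categories))  # declared category order
print(tsize_level.ordered)           # whether the categories are ordered

# For contrast: plain string sorting of the same labels.
alphabetical = sorted(tsize_level.categories)
print(alphabetical)
```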

Pivot table

For completeness, pivot_table also respects Categoricals and gives the same results as groupby.

pd.pivot_table(df, index=["a", "tsize"], aggfunc='size')
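The question also tried pandas.crosstab, which dropped the 'c' row. As far as I can tell from the crosstab documentation, passing dropna=False keeps categories that never occur in the data, so the wide layout can include the zeros too. Treat this as a sketch to verify on your own pandas version; the variable name wide is just for illustration.

```python
import pandas as pd

abc_categorical = pd.CategoricalDtype(["a", "b", "c"], ordered=True)
size_categorical = pd.CategoricalDtype(["s", "m", "l"], ordered=True)

df = pd.DataFrame({
    "a": pd.Categorical(list("aaabababababa"), dtype=abc_categorical),
    "tsize": pd.Categorical(list("smmssmlmlslsl"), dtype=size_categorical),
})

# dropna=False asks crosstab not to drop rows/columns that have no entries,
# which for categorical inputs means unobserved categories are retained.
wide = pd.crosstab(df["a"], df["tsize"], dropna=False)
print(wide)
```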

Observed option

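What if you don't want the zeros? Both groupby and pivot_table accept an 'observed' parameter, which defaults to False in Pandas 2.0.1. If observed is set to True, we only get the observed values for categorical groupers, replicating the behavior of DataFrame.value_counts. Later releases (2.1 onward for groupby) warn that this default is going to change to True, so it is safer to pass observed=False explicitly when you want the zeros.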
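The effect of the observed parameter is easiest to see side by side. The following is a minimal sketch using the question's data; the names full and seen are made up for illustration.

```python
import pandas as pd

abc_categorical = pd.CategoricalDtype(["a", "b", "c"], ordered=True)
size_categorical = pd.CategoricalDtype(["s", "m", "l"], ordered=True)

df = pd.DataFrame({
    "a": pd.Categorical(list("aaabababababa"), dtype=abc_categorical),
    "tsize": pd.Categorical(list("smmssmlmlslsl"), dtype=size_categorical),
})

# observed=False: one row per combination of categories, zeros included.
full = df.groupby(["a", "tsize"], observed=False).size()

# observed=True: only combinations that actually occur in the data.
seen = df.groupby(["a", "tsize"], observed=True).size()

print(len(full), len(seen))
```

Both results account for the same 13 rows of data; they differ only in whether the empty combinations are listed.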

Nice!

huangapple
  • Posted on May 11, 2023 at 12:16:42
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76224122.html