Axial inconsistency of pandas.diff

Question

Consider the dataframe:

import pandas as pd

df = pd.DataFrame({'col': [True, False]})

The following code works:

df['col'].diff()

The result is:

0     NaN
1    True
Name: col, dtype: object

However, the code:

df.T.diff(axis=1)

gives the error:

numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.

Is that a bug?

Answer 1

Score: 1

It seems like this behaviour is intentional as per GH15856. Subtracting boolean arrays with the - operator is not (or no longer) supported in NumPy. Other arithmetic operators are not affected in the same way; for example, + on two boolean arrays acts as a logical OR and returns booleans.

With diff on axis=1, pandas computes the difference between consecutive elements along the columns axis (which happens to hold booleans here because of the transposition). Since NumPy performs that subtraction under the hood, a TypeError is raised.

print(df.T)

        0      1
col  True  False

np.array(False) - np.array(True)

TypeError: numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.

This can be counterintuitive, because the same operation succeeds with Python booleans:

False - True
# returns -1

But @seberg explains why:

> This is a pretty old deprecation, although I do seem to remember some
> discussion about only deprecating the unary operator -False not True -
> True. Note that Python booleans are different from NumPy ones, they
> are practically integers. NumPy booleans behave much less like
> integers, if you add two booleans you get a boolean again, etc.
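
The difference seberg describes can be checked directly. The following is a minimal sketch, assuming a NumPy version in which boolean subtraction raises TypeError (as in the traceback above); the variable names are illustrative only.

```python
import numpy as np

a = np.array([True, False])
b = np.array([False, False])

# Python bools are a subclass of int, so subtraction yields an int.
py_result = False - True

# NumPy bool addition stays boolean (it behaves like a logical OR).
added = a + a

# NumPy bool subtraction is rejected with a TypeError.
try:
    a - b
    subtract_error = None
except TypeError as exc:
    subtract_error = str(exc)

# XOR is the replacement suggested by the error message.
xored = np.logical_xor(a, b)

print(py_result, added, xored)
print(subtract_error)
```

Only the subtraction fails here; the addition and the XOR both return boolean arrays.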

Answer 2

Score: 1

The behavior you're seeing appears to be at odds with the docs, which clearly state:

> For boolean dtypes, this uses operator.xor() rather than operator.sub(). The result is calculated according to current dtype in DataFrame, however dtype of the result is always float64.

Also interesting is the following test:

import pandas as pd

df = pd.DataFrame({'col': [True, False], 'col2': [True, False]})

print("", "df:", sep='\n')
print(df, df.dtypes, sep='\n')

print("", "diff of df:", sep='\n')
res = df.diff()
print(res, res.dtypes, sep='\n')

print("", "diff of df['col']:", sep='\n')
res = df['col'].diff()
print(res, res.dtypes, sep='\n')

print("", "df.T:", sep='\n')
res = df.T
print(res, res.dtypes, sep='\n')

print("", "diff(axis=0) of df.T:", sep='\n')
res = df.T.diff(axis=0)
print(res, res.dtypes, sep='\n')

print("", "df.T.astype(object):", sep='\n')
res = df.T.astype(object)
print(res, res.dtypes, sep='\n')

print("", "diff(axis=1) of df.T.astype(object):", sep='\n')
res = df.T.astype(object).diff(axis=1)
print(res, res.dtypes, sep='\n')

# The bool-dtype case is the one that raises, so guard it.
try:
    print("", "diff(axis=1) of df.T:", sep='\n')
    res = df.T.diff(axis=1)
    print(res, res.dtypes, sep='\n')
except TypeError:
    print('got TypeError')

Output:

df:
     col   col2
0   True   True
1  False  False
col     bool
col2    bool
dtype: object

diff of df:
    col  col2
0   NaN   NaN
1  True  True
col     object
col2    object
dtype: object

diff of df['col']:
0     NaN
1    True
Name: col, dtype: object
object

df.T:
         0      1
col   True  False
col2  True  False
0    bool
1    bool
dtype: object

diff(axis=0) of df.T:
          0      1
col     NaN    NaN
col2  False  False
0    object
1    object
dtype: object

df.T.astype(object):
         0      1
col   True  False
col2  True  False
0    object
1    object
dtype: object

diff(axis=1) of df.T.astype(object):
        0   1
col   NaN  -1
col2  NaN  -1
0    object
1    object
dtype: object

diff(axis=1) of df.T:
got TypeError

If we change the column types to object using astype() before calling diff(axis=1), no error is raised, and the result appears to treat the boolean values as integers and perform integer subtraction (consistent with Python's False - True evaluating to -1).

However, as OP points out, the same operation without astype(object) raises "TypeError: numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.", despite the claim in the diff() docs that "For boolean dtypes, this uses operator.xor() rather than operator.sub()."
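
If the XOR result described in the docs is what you want along axis=1, one possible workaround is to apply XOR to adjacent columns by hand. This is only a sketch under stated assumptions: bool_diff_axis1 is a hypothetical helper, not a pandas API, and it assumes every column of the frame is boolean. It leaves the first column as NaN to mirror the shape of a diff() result.

```python
import numpy as np
import pandas as pd


def bool_diff_axis1(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: XOR each column with the column to its left."""
    values = frame.to_numpy(dtype=bool)
    # Object dtype so that NaN and booleans can share one array.
    out = np.full(values.shape, np.nan, dtype=object)
    # Element-wise XOR of columns 1..n-1 with columns 0..n-2.
    out[:, 1:] = np.logical_xor(values[:, 1:], values[:, :-1])
    return pd.DataFrame(out, index=frame.index, columns=frame.columns)


df = pd.DataFrame({'col': [True, False], 'col2': [True, False]})
result = bool_diff_axis1(df.T)
print(result)
```

Casting to integers first, as in df.T.astype(int).diff(axis=1), is another option when a signed difference is preferred over XOR.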

huangapple
  • Posted on April 11, 2023 at 02:07:46
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75979562.html