Axial inconsistency of pandas.diff
Question
Consider the dataframe:
import pandas as pd
df = pd.DataFrame({'col': [True, False]})
The following code works:
df['col'].diff()
The result is:
0 NaN
1 True
Name: col, dtype: object
However, the code:
df.T.diff(axis=1)
gives the error:
numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.
Is that a bug?
Answer 1
Score: 1
It seems this behaviour is intentional, as per GH15856. In NumPy, subtraction (the - operator) between boolean arrays is not (or no longer) supported; addition and multiplication are still allowed, but they return booleans rather than integers.
With diff on axis=1, pandas tries to compute the difference between consecutive elements along the columns axis (which holds only booleans here because of the transposition), and since NumPy does that computation under the hood, a TypeError is raised.
print(df.T)
0 1
col True False
np.array(False) - np.array(True)
TypeError: numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.
This can be counterintuitive, because the same operation succeeds with Python booleans:
False - True
# returns -1
But @seberg explains why:
> This is a pretty old deprecation, although I do seem to remember some
> discussion about only deprecating the unary operator -False not True -
> True. Note that Python booleans are different from NumPy ones, they
> are practically integers. NumPy booleans behave much less like
> integers, if you add two booleans you get a boolean again, etc.
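The difference described in the quote can be checked directly. The following is a minimal sketch, assuming only that NumPy is installed; the variable names are illustrative and not taken from the original answer.

```python
import numpy as np

# Python booleans are a subclass of int, so arithmetic yields integers.
py_sub = False - True   # -1
py_add = True + True    # 2

# NumPy booleans stay booleans under addition (it acts as a logical OR) ...
np_add = np.True_ + np.True_

# ... while subtraction is rejected outright.
try:
    np.array(False) - np.array(True)
    np_sub_error = None
except TypeError as exc:
    np_sub_error = exc

# The alternative suggested by the error message.
np_xor = np.logical_xor(np.array(False), np.array(True))

print(py_sub, py_add, np_add, np_xor)
print(type(np_sub_error).__name__)
```

The XOR result answers "are the two values different?", which is usually what a diff over booleans is meant to express.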
Answer 2
Score: 1
The behavior you're seeing appears to be at odds with the docs, which clearly state:
> For boolean dtypes, this uses operator.xor() rather than operator.sub(). The result is calculated according to current dtype in DataFrame, however dtype of the result is always float64.
Also interesting is the following test:
import pandas as pd

df = pd.DataFrame({'col': [True, False], 'col2': [True, False]})
print("", "df:", sep='\n')
print(df, df.dtypes, sep='\n')
print("", "diff of df:", sep='\n')
res = df.diff()
print(res, res.dtypes, sep='\n')
print("", "diff of df['col']:", sep='\n')
res = df['col'].diff()
print(res, res.dtypes, sep='\n')
print("", "df.T:", sep='\n')
res = df.T
print(res, res.dtypes, sep='\n')
print("", "diff(axis=0) of df.T:", sep='\n')
res = df.T.diff(axis=0)
print(res, res.dtypes, sep='\n')
print("", "df.T.astype(object):", sep='\n')
res = df.T.astype(object)
print(res, res.dtypes, sep='\n')
print("", "diff(axis=1) of df.T.astype(object):", sep='\n')
res = df.T.astype(object).diff(axis=1)
print(res, res.dtypes, sep='\n')
# The bool-dtype case is the one reported to fail, so guard it.
try:
    print("", "diff(axis=1) of df.T:", sep='\n')
    res = df.T.diff(axis=1)
    print(res, res.dtypes, sep='\n')
except TypeError:
    print('got TypeError')
Output:
df:
col col2
0 True True
1 False False
col bool
col2 bool
dtype: object
diff of df:
col col2
0 NaN NaN
1 True True
col object
col2 object
dtype: object
diff of df['col']:
0 NaN
1 True
Name: col, dtype: object
object
df.T:
0 1
col True False
col2 True False
0 bool
1 bool
dtype: object
diff(axis=0) of df.T:
0 1
col NaN NaN
col2 False False
0 object
1 object
dtype: object
df.T.astype(object):
0 1
col True False
col2 True False
0 object
1 object
dtype: object
diff(axis=1) of df.T.astype(object):
0 1
col NaN -1
col2 NaN -1
0 object
1 object
dtype: object
diff(axis=1) of df.T:
got TypeError
If we change the column types to object using astype() before the call to diff(axis=1), no error is raised and the result appears to cast the boolean values to int prior to performing the diff using integer subtraction.
However, as OP points out, this same operation without astype(object) raises the TypeError "numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.", despite the claim in the diff() docs that "For boolean dtypes, this uses operator.xor() rather than operator.sub()".
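Until the axis=1 case matches the docs, the error can be sidestepped by stating explicitly which kind of difference is wanted. The sketch below is a possible workaround rather than an official pandas recipe: the names wide, int_diff and changed are hypothetical, and it assumes every column of the transposed frame holds booleans.

```python
import pandas as pd

df = pd.DataFrame({'col': [True, False], 'col2': [True, False]})
wide = df.T  # the booleans now run along axis=1

# Option 1: cast to int first, giving numeric differences
# with NaN in the first column.
int_diff = wide.astype(int).diff(axis=1)

# Option 2: XOR each column with its left neighbour,
# i.e. "did the value change?". The first column has no
# neighbour, so it is dropped instead of being filled with NaN.
values = wide.to_numpy(dtype=bool)
changed = pd.DataFrame(
    values[:, 1:] ^ values[:, :-1],
    index=wide.index,
    columns=wide.columns[1:],
)

print(int_diff)
print(changed)
```

Option 1 reproduces the integer result of the astype(object) experiment above, while Option 2 mirrors the XOR semantics that the docs describe for boolean dtypes.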