Axial inconsistency of pandas.diff

Question

Consider the dataframe:

    import pandas as pd

    df = pd.DataFrame({'col': [True, False]})

The following code works:

    df['col'].diff()

The result is:

    0     NaN
    1    True
    Name: col, dtype: object

However, the code:

    df.T.diff(axis=1)

gives the error:

    numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.

Is that a bug?

Answer 1

Score: 1

This behaviour appears to be intentional, as per GH15856. NumPy does not support subtraction (the - operator) between boolean arrays. Other operators such as + and * are still defined on booleans, but they act as logical OR and AND rather than as integer arithmetic.

With diff on axis=1, pandas computes the difference between consecutive elements along the columns axis, which holds booleans here because of the transposition. Since that computation is delegated to NumPy, a TypeError is raised:

    import numpy as np

    print(df.T)
    #         0      1
    # col  True  False

    np.array(False) - np.array(True)
    # TypeError: numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.
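To make the error message's suggestion concrete, here is a minimal sketch of the alternatives it names. The arrays `a` and `b` are made up for illustration and are not from the question.

```python
import numpy as np

# Illustrative arrays, not taken from the question.
a = np.array([True, False])
b = np.array([True, True])

# Element-wise XOR keeps the boolean dtype.
xor_result = a ^ b

# Function form of the same operation.
logical_result = np.logical_xor(a, b)

# For a signed, arithmetic difference, cast to int first.
int_diff = a.astype(int) - b.astype(int)

# Plain subtraction on boolean arrays raises TypeError.
try:
    a - b
    subtract_error = None
except TypeError as exc:
    subtract_error = str(exc)

print(xor_result, logical_result, int_diff, subtract_error, sep='\n')
```

Which alternative is appropriate depends on whether you want a "changed / not changed" flag (XOR) or a signed difference (cast to int).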

This can be counterintuitive, because the same operation succeeds with Python booleans:

    False - True
    # returns -1

But @seberg explains why:

> This is a pretty old deprecation, although I do seem to remember some
> discussion about only deprecating the unary operator -False not True -
> True. Note that Python booleans are different from NumPy ones, they
> are practically integers. NumPy booleans behave much less like
> integers, if you add two booleans you get a boolean again, etc.
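The difference described in the quote can be checked directly. The following is a small sketch; the variable names are only for illustration.

```python
import numpy as np

# Python booleans are a subclass of int, so arithmetic yields integers.
py_sum = True + True      # 2
py_diff = False - True    # -1

# NumPy boolean addition acts as logical OR and stays boolean.
np_sum = np.array([True]) + np.array([True])

print(py_sum, py_diff, np_sum, np_sum.dtype)
```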

Answer 2

Score: 1

The behavior you're seeing appears to be at odds with the docs, which clearly state:

> For boolean dtypes, this uses operator.xor() rather than operator.sub(). The result is calculated according to current dtype in DataFrame, however dtype of the result is always float64.
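As a quick check of what the quoted sentence means along the default axis, the sketch below compares Series.diff() with a manual XOR of consecutive elements. The Series `s` is made-up data, and the comparison assumes diff() behaves as in the question's own output.

```python
import pandas as pd

# Illustrative data, not from the question.
s = pd.Series([True, False, False, True])
d = s.diff()

# Manual XOR of each element with its predecessor.
values = s.tolist()
manual = [cur ^ prev for prev, cur in zip(values, values[1:])]

print(d)
print(manual)
```

The first element of the diff is NaN because it has no predecessor. The remaining entries should match the manual XOR list.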

Also interesting is the following test:

    import pandas as pd

    df = pd.DataFrame({'col': [True, False], 'col2': [True, False]})
    print("", "df:", sep='\n')
    print(df, df.dtypes, sep='\n')

    print("", "diff of df:", sep='\n')
    res = df.diff()
    print(res, res.dtypes, sep='\n')

    print("", "diff of df['col']:", sep='\n')
    res = df['col'].diff()
    print(res, res.dtypes, sep='\n')

    print("", "df.T:", sep='\n')
    res = df.T
    print(res, res.dtypes, sep='\n')

    print("", "diff(axis=0) of df.T:", sep='\n')
    res = df.T.diff(axis=0)
    print(res, res.dtypes, sep='\n')

    print("", "df.T.astype(object):", sep='\n')
    res = df.T.astype(object)
    print(res, res.dtypes, sep='\n')

    print("", "diff(axis=1) of df.T.astype(object):", sep='\n')
    res = df.T.astype(object).diff(axis=1)
    print(res, res.dtypes, sep='\n')

    # The plain boolean case along axis=1 is the one that fails.
    try:
        print("", "diff(axis=1) of df.T:", sep='\n')
        res = df.T.diff(axis=1)
        print(res, res.dtypes, sep='\n')
    except TypeError:
        print('got TypeError')

Output:

    df:
         col   col2
    0   True   True
    1  False  False
    col     bool
    col2    bool
    dtype: object

    diff of df:
        col  col2
    0   NaN   NaN
    1  True  True
    col     object
    col2    object
    dtype: object

    diff of df['col']:
    0     NaN
    1    True
    Name: col, dtype: object
    object

    df.T:
             0      1
    col   True  False
    col2  True  False
    0    bool
    1    bool
    dtype: object

    diff(axis=0) of df.T:
              0      1
    col     NaN    NaN
    col2  False  False
    0    object
    1    object
    dtype: object

    df.T.astype(object):
             0      1
    col   True  False
    col2  True  False
    0    object
    1    object
    dtype: object

    diff(axis=1) of df.T.astype(object):
            0   1
    col   NaN  -1
    col2  NaN  -1
    0    object
    1    object
    dtype: object

    diff(axis=1) of df.T:
    got TypeError

If we change the column types to object using astype() before the call to diff(axis=1), no error is raised and the result appears to cast the boolean values to int prior to performing the diff using integer subtraction.

However, as OP points out, the same operation without astype(object) raises TypeError: numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead. This is despite the claim in the diff() docs that, for boolean dtypes, diff uses operator.xor() rather than operator.sub().
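If you just need a result along axis=1, two workarounds follow from the outputs above. This is only a sketch, assuming diff() along the default axis behaves as shown in the test. The name `wide` is made up here for the transposed frame.

```python
import pandas as pd

df = pd.DataFrame({'col': [True, False], 'col2': [True, False]})
wide = df.T  # hypothetical name for the transposed, all-boolean frame

# Option 1: diff along the default axis (the path that worked above),
# then transpose back. Gives XOR-style True/False "changed" flags.
xor_diff = wide.T.diff().T

# Option 2: cast to int first to get signed differences.
int_diff = wide.astype(int).diff(axis=1)

print(xor_diff)
print(int_diff)
```

Option 1 matches the XOR semantics described in the docs. Option 2 matches the integer-subtraction result seen with astype(object), except that the values come back as numbers rather than objects.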

huangapple
  • Published on 2023-04-11 02:07:46
  • Please keep this link when reposting: https://go.coder-hub.com/75979562.html