Unreported integer overflow in pandas arithmetic and pandas.eval() when using np.int16, np.int32

# Question

I need to process a rather huge dataframe (~80M records), and memory consumption is essentially the issue, so columns with numerical data are shrunk to the smallest possible dtype, like np.int8, np.int16, np.int32. At some stage I need to compute a new column, using some math on existing columns, and it needs int64 capacity. That is where most pandas arithmetic constructions fail. It took me a while to track down that the reason was integer overflow: in simple words, calculations like

`newCol = col16 * col16`,
`newCol = col32 * value16`,

often produce an incorrect result, even though newCol is created as int64. Here is a simple explicit example: calculate `newCol = A * 100000`, which for any `A = aaaaa` should obviously compute to a value like `aaaaa00000`.

however, see below:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1 << 7, 1 << 15, size=5), columns=list('A'), dtype=np.int16)
df.eval('Q = A * 100000', inplace=True)  # 1st naive approach off the top of my head
df['W'] = df['A'] * 100000

# trying to use a const c = int64() to force the expression evaluator to use int64
c = np.int64(10000)
df.eval('R = @c * A', inplace=True)

# trying to create a new int64 col first and use it in the calc:
df['T'] = 0  # this creates new col 'T' dtype=int64 filled with 0
df.eval('T = 100000 * A', inplace=True)

# trying to force int64 via the 1st operand 'S', which is int64
df['S'] = 0
df['S'] = df['S'] + df['A'] * 100000

# here, finally, an approach that works; the calculation uses int64 instructions:
df['X'] = 1
df.eval('X = X * 100000 * A', inplace=True)

# just preformatting
pd.set_option('display.max_columns', None)
pd.options.display.width = 222
df.index = [''] * len(df)

print(df)
df.info()
```

```
    A           Q           W           R           T           S           X
 3396   339600000   339600000   339600000   339600000   339600000   339600000
26508 -1644167296 -1644167296 -1644167296 -1644167296 -1644167296  2650800000
27942 -1500767296 -1500767296 -1500767296 -1500767296 -1500767296  2794200000
 3441   344100000   344100000   344100000   344100000   344100000   344100000
27880 -1506967296 -1506967296 -1506967296 -1506967296 -1506967296  2788000000
<class 'pandas.core.frame.DataFrame'>
Index: 5 entries,  to 
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       5 non-null      int16
 1   Q       5 non-null      int32
 2   W       5 non-null      int32
 3   R       5 non-null      int32
 4   T       5 non-null      int64
 5   S       5 non-null      int64
 6   X       5 non-null      int64
dtypes: int16(1), int32(3), int64(3)
memory usage: 230.0+ bytes
```
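The wrap-around in columns Q/W/R/T/S can be reproduced with plain NumPy, independent of pandas; a minimal sketch (values chosen to overflow int16):

```python
import numpy as np

a16 = np.array([26508], dtype=np.int16)
b16 = np.array([100], dtype=np.int16)

# int16 * int16 stays int16, so 26508 * 100 = 2650800 wraps around
wrapped = a16 * b16

# upcasting one operand first keeps the full product in int64
correct = a16.astype(np.int64) * b16

print(wrapped.dtype, wrapped[0])
print(correct.dtype, correct[0])
```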

With 6 different ways to do trivial math, only col 'X' produced what I (and I'd guess most users) expected.

The obvious way to solve it would be to first convert the source column's dtype to int64, like

```python
df['x'] = df['A'].astype(np.int64) * 100000
```

but I don't see that as a good solution in my case: the data is already too big to create an even bigger temporary copy, and performance will also degrade with the redundant conversions.

So my question: how can I calculate this on the fly, without creating an int64 copy of the whole source data columns (I don't have the RAM for that), e.g. directly and correctly compute `newCol64 = srcCol8 * srcCol16 * srcCol16` on a 64-bit CPU?

Is there explicit type-conversion syntax available for `pandas.eval()` that could be applied on the fly while computing the result?
# Answer 1

**Score:** 1


If you multiply an array by a scalar value, the selected dtype will be that of the array. However, if you do a math operation between two arrays (like `X * A`), the operands will be "upcast":

```python
>>> df['A'] * c
0   -31392
1    14992
2   -14064
3     6944
4   -13904
Name: A, dtype: int16  # bad result, bad dtype
>>> df['A'] * [c]
0    196380000
1    288570000
2    317770000
3    168500000
4     76270000
Name: A, dtype: int64  # everything is ok
```

You can read the documentation about output type determination.
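The array-vs-array promotion can be checked directly in NumPy. Note that NumPy 2.0 (NEP 50) changed the promotion rules for *scalars*, so this sketch sticks to arrays, where both the old and new rules promote to the larger dtype; the explicit `dtype=np.int64` also sidesteps the platform-dependent default integer dtype discussed below:

```python
import numpy as np

a = np.array([26508], dtype=np.int16)
b64 = np.array([100000], dtype=np.int64)  # explicit dtype avoids the platform-default pitfall

res = a * b64  # int16 array * int64 array -> int64, no overflow
print(res.dtype, res[0])
```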

In your case, you can do:

```python
c = np.array([10000])
df.eval('R = @c * A', inplace=True)
```

Output:

```python
>>> df
       A          R
0  12399  123990000
1   6026   60260000
2  17133  171330000
3  30974  309740000
4  28216  282160000
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       5 non-null      int16
 1   R       5 non-null      int64
dtypes: int16(1), int64(1)
memory usage: 178.0 bytes
```

EDIT

> For some reason on my machine `df['A'] * [100000]` is calculated as int32, despite Python being 64-bit (and so the result is wrong). Using `c = np.array([100000], dtype=np.int64)` instead of `[100000]` works though.

Even if your system is 64-bit, it seems the default int type is 32-bit in your case (I don't know why). To check the default dtype, you can use:

```python
# My system
>>> np.int_
numpy.int64
# Your system
>>> np.int_
numpy.int32
```

In your case, if you do `df['A'] * [10000]`, `[10000]` will be converted to `np.array([10000], dtype=np.int32)`, which is why the result is wrong. You have to be explicit and create the array with `np.array([10000], dtype=np.int64)`.
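Putting this together for the original `newCol64 = srcCol8 * srcCol16 * srcCol16` question, a 1-element int64 array can be injected into `pandas.eval()` to force the whole expression into int64 without converting the source columns up front. A sketch; the column names `a8`/`b16`/`c16` are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a8':  np.array([100], dtype=np.int8),
    'b16': np.array([30000], dtype=np.int16),
    'c16': np.array([30000], dtype=np.int16),
})

# the broadcast int64 array upcasts every intermediate product
one64 = np.array([1], dtype=np.int64)
df.eval('newCol64 = @one64 * a8 * b16 * c16', inplace=True)

print(df['newCol64'].dtype, df['newCol64'].iloc[0])
```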

huangapple
  • Posted on 2023-06-15 16:59:20
  • Please keep the original link when reposting: https://go.coder-hub.com/76480831.html