2023-06-15 16:59:20


Unreported integer overflow in pandas arithmetic and pandas.eval() when using np.int16, np.int32

Question

<details>
<summary>English:</summary>
I need to process a fairly large DataFrame (~80M records), and memory consumption is an issue, so the numeric columns have been shrunk to the smallest possible dtype, such as np.int8, np.int16, or np.int32.
At some stage I need to compute a new column using some math on existing columns, and the result needs the capacity of int64. That is where most pandas arithmetic constructs fail. It took me a while to track down that the reason was integer overflow.
In simple words, calculations like

       newCol = col16*col16,
       newCol = col32*value16,

often produce incorrect results, even though newCol is created as int64.
Here is a simple explicit example: calculate `newCol = A * 100000`, which obviously for any `A=aaaaa` should compute to a value like `aaaaa00000`.
However, see below:

```python
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(1<<7,1<<15, size=(int(5))), columns=list('A'), dtype=np.int16)
df.eval('Q = A * 100000', inplace=True) # 1st naive approach from a head
df['W'] = df['A'] * 100000

# trying to use const c=int64() to force expr evaluator to use int64
c = np.int64(10000)
df.eval('R = @c * A', inplace=True)

# trying to create new int64 col 1st and use it in calc:
df['T']=0 # this creates new col 'T' dtype=int64 filled with 0
df.eval('T = 100000 * A', inplace=True)

df['S']=0
# trying to force int64 via 1st element 'S', which is int64
df['S'] = df['S'] + df['A'] * 100000

# here finally this approach works, calculation is using int64 instructions:
df['X']=1
df.eval('X = X * 100000 * A', inplace=True)

# just preformatting
pd.set_option('display.max_columns', None)
pd.options.display.width=222
df.index=[''] * len(df)

print(df)
df.info()
```

```
     A           Q           W           R           T           S           X
  3396   339600000   339600000   339600000   339600000   339600000   339600000
 26508 -1644167296 -1644167296 -1644167296 -1644167296 -1644167296  2650800000
 27942 -1500767296 -1500767296 -1500767296 -1500767296 -1500767296  2794200000
  3441   344100000   344100000   344100000   344100000   344100000   344100000
 27880 -1506967296 -1506967296 -1506967296 -1506967296 -1506967296  2788000000
<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, to
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       5 non-null      int16
 1   Q       5 non-null      int32
 2   W       5 non-null      int32
 3   R       5 non-null      int32
 4   T       5 non-null      int64
 5   S       5 non-null      int64
 6   X       5 non-null      int64
dtypes: int16(1), int32(3), int64(3)
memory usage: 230.0+ bytes
```


Of the 6 different ways to do this trivial math, only column 'X' produced what I (and I'd guess most users) expected.
The obvious solution would be to first convert the source column's dtype to int64 (doing it 'on the fly'), like

```python
df['x'] = df['A'].astype(np.int64) * 100000
```


But I don't see it as a good solution in my case, since the data is already too big to create an even bigger temporary copy, and performance will also degrade with redundant conversions.
So my question: how can I calculate this on the fly, without creating an int64 copy of the whole source columns (I don't have the RAM for that), e.g. directly and properly compute `newCol64 = srcCol8 * srcCol16 * srcCol16` on a 64-bit CPU?
Is there an explicit type-conversion syntax available for `pandas.eval()` that could be applied on the fly while computing the result row by row?
</details>
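
For reference, the negative values in the question's output are what you get when the true product is wrapped into a signed 32-bit integer. Here is a minimal pure-Python sketch of that arithmetic; `wrap_signed` is a hypothetical helper introduced only for illustration, not a pandas or NumPy function:

```python
def wrap_signed(value, bits):
    """Reduce an exact integer to what a signed two's-complement type of `bits` width would hold."""
    value &= (1 << bits) - 1          # keep only the low `bits` bits
    if value >= 1 << (bits - 1):      # top bit set means a negative number
        value -= 1 << bits
    return value


true_product = 26508 * 100000         # Python ints never overflow
print(true_product)                   # 2650800000
print(wrap_signed(true_product, 32))  # -1644167296, as in columns Q, W, R, T, S
print(wrap_signed(3396 * 100000, 32)) # 339600000, small enough to fit in int32
```

Rows whose product stays below 2**31 look correct, which is why the problem is easy to miss on small test values.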
# Answer 1
**Score**: 1

If you multiply an array by a scalar value, the resulting dtype will be that of the array. However, if you do a math operation between two arrays, as with `X` in `X * A`, the array will be "upcast":

```python
>>> df['A'] * c
0   -31392
1    14992
2   -14064
3     6944
4   -13904
Name: A, dtype: int16  # bad result, bad dtype
>>> df['A'] * [c]
0    196380000
1    288570000
2    317770000
3    168500000
4     76270000
Name: A, dtype: int64  # everything is ok
```

You can read the documentation about output type determination.
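
As a small illustration of the array-with-array case, the sketch below uses plain NumPy, with sample values taken from the question. It covers only the case where both operands are arrays with explicit dtypes, because the handling of a bare scalar operand has varied between NumPy versions:

```python
import numpy as np

a = np.array([3396, 26508, 27942], dtype=np.int16)

# For two arrays, the result dtype depends only on the dtypes involved.
print(np.result_type(np.int16, np.int32))  # int32
print(np.result_type(np.int16, np.int64))  # int64

# A one-element array with an explicit dtype broadcasts against `a`
# and pulls the whole computation up to int64.
factor = np.array([100000], dtype=np.int64)
result = a * factor
print(result.dtype)     # int64
print(result.tolist())  # [339600000, 2650800000, 2794200000]
```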

In your case, you can do:

```python
c = np.array([10000])
df.eval('R = @c * A', inplace=True)
```

Output:

```python
>>> df
       A          R
0  12399  123990000
1   6026   60260000
2  17133  171330000
3  30974  309740000
4  28216  282160000
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       5 non-null      int16
 1   R       5 non-null      int64
dtypes: int16(1), int64(1)
memory usage: 178.0 bytes
```

EDIT

> For some reason, on my machine `df['A'] * [100000]` is calculated as int32, even though Python is 64-bit (so the result is wrong). Using `c=np.array([100000], dtype=np.int64)` instead of `[100000]` works, though.

Even if your system is 64-bit, it seems the default int type is 32-bit in your case (I don't know why). To check the default dtype, you can use:

```python
# My system
>>> np.int_
numpy.int64
# Your system
>>> np.int_
numpy.int32
```

In your case, if you do `df['A'] * [10000]`, `[10000]` will be converted to `np.array([10000], dtype=np.int32)`, which is why the result is wrong. You have to be explicit and create the array with `np.array([10000], dtype=np.int64)`.
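
If going through the underlying NumPy arrays is acceptable, another possible way to avoid a separate `astype(np.int64)` copy is to ask the ufunc itself to compute in int64. `np.multiply` accepts `dtype=` and `out=` arguments, and NumPy's documentation describes dtype conversion for ufuncs as going through internal buffers, so the only full-size 64-bit array should be the result. This is a sketch under those assumptions, not a measured solution. `product_as_int64` is a hypothetical helper name, and peak memory is worth checking on your own data:

```python
import numpy as np
import pandas as pd


def product_as_int64(first, second, *others):
    """Multiply narrow integer arrays while accumulating in one int64 array."""
    # dtype=np.int64 selects the 64-bit multiply, so both inputs are widened
    # as part of the operation; the only full-size array created is the result.
    out = np.multiply(first, second, dtype=np.int64)
    for other in others:
        # `out` is already int64, so this multiplies in place without a new array.
        np.multiply(out, other, out=out)
    return out


df = pd.DataFrame({
    "A": np.array([3396, 26508, 27942, 3441, 27880], dtype=np.int16),
    "B": np.array([100, -50, 7, 1, 2], dtype=np.int8),
})

# The scalar factor is given as an explicit int64 so nothing depends on the default int size.
df["X"] = product_as_int64(df["A"].to_numpy(), np.int64(100000))

# Three narrow operands, as in newCol64 = srcCol8 * srcCol16 * srcCol16.
df["Y"] = product_as_int64(df["B"].to_numpy(), df["A"].to_numpy(), df["A"].to_numpy())

print(df.dtypes)
print(df["X"].tolist())  # [339600000, 2650800000, 2794200000, 344100000, 2788000000]
```

The source columns keep their narrow dtypes; only the new columns are int64.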
