Why is 0/1 faster than False/True for this sieve in PyPy?


Question

Similar to https://stackoverflow.com/questions/57838797/why-use-true-is-slower-than-use-1-in-python3, but I'm using pypy3 and not using the `sum` function.

```python
def sieve_num(n):
    nums = [0] * n
    for i in range(2, n):
        if i * i >= n: break
        if nums[i] == 0:
            for j in range(i*i, n, i):
                nums[j] = 1

    return [i for i in range(2, n) if nums[i] == 0]


def sieve_bool(n):
    nums = [False] * n
    for i in range(2, n):
        if i * i >= n: break
        if nums[i] == False:
            for j in range(i*i, n, i):
                nums[j] = True

    return [i for i in range(2, n) if nums[i] == False]
```

`sieve_num(10**8)` takes 2.55 s, but `sieve_bool(10**8)` takes 4.45 s, which is a noticeable difference.

My suspicion was that `[0]*n` is somehow smaller than `[False]*n` and fits into the cache better, but `sys.getsizeof` and vmprof line profiling are unsupported on PyPy. The only info I could get is that the `<listcomp>` for `sieve_num` took 116 ms (19% of total execution time), while the `<listcomp>` for `sieve_bool` took 450 ms (40% of total execution time).

Using PyPy 7.3.1 implementing Python 3.6.9, on an Intel i7-7700HQ with 24 GB RAM, on Ubuntu 20.04. With Python 3.8.10, `sieve_bool` is only slightly slower.
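For reference, a minimal timing harness along these lines reproduces the comparison when placed in the same file as the two sieves (the `bench` helper is only for illustration; absolute numbers depend on the machine and PyPy version):

```python
import time

def bench(fn, n=10**8):
    # Time a single call to fn(n) with a monotonic, high-resolution clock.
    start = time.perf_counter()
    fn(n)
    return time.perf_counter() - start

if __name__ == "__main__":
    print("sieve_num :", bench(sieve_num))
    print("sieve_bool:", bench(sieve_bool))
```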




# Answer 1
**Score**: 7

The reason is that PyPy uses a special implementation for "list of ints that fit in 64 bits". It has a few other special cases, like "list of floats", "list of strings that contain only ASCII", etc. The goal is primarily to save memory: a list of 64-bit integers is stored just like an `array.array('l')`, not as a list of pointers to actual integer objects. You save memory not in the size of the list itself, which doesn't change, but in the fact that you don't need a very large number of small additional integer objects all existing at once.

There is no special case for "list of booleans", because there are only ever two boolean objects in the first place. So there would be no memory-saving benefit in using a strategy like "list of 64-bit ints" in this case. Of course, we could do better and store that list with only one bit per entry, but that is not a really common pattern in Python; we just never got around to implementing it.
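On PyPy you can inspect which storage strategy a given list ends up with via the `__pypy__` module; a quick PyPy-only sketch (the exact strategy names are implementation details and may vary between PyPy versions, so the comments below only state the expected kind of strategy):

```python
# PyPy-only: __pypy__.strategy() reports the internal strategy of a list.
from __pypy__ import strategy  # raises ImportError on CPython

print(strategy([0] * 10))      # expected: an integer-specialized list strategy
print(strategy([False] * 10))  # expected: the generic "list of objects" strategy
print(strategy([0.0] * 10))    # expected: a float-specialized list strategy
```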

So why is it slower anyway? The reason is that in the "list of general objects" case, the JIT compiler needs to produce extra code to check the type of the object every time it reads an item from the list, and extra GC logic every time it puts an item into the list. This is not a lot of code, but in your case I guess it doubles the length of the (extremely short) generated assembly for the inner loop doing `nums[j] = 1`.

Right now, both on PyPy and CPython(*), the fastest is probably to use an `array.array('B')` instead of a list, which both avoids that PyPy-specific issue and also uses substantially less memory (always a performance win if your data structures contain 10**8 elements).
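A sketch of that variant (the name `sieve_bytes` is just for illustration), keeping the question's loop structure unchanged:

```python
from array import array

def sieve_bytes(n):
    # One unsigned byte per entry, all initialized to zero.
    nums = array('B', bytes(n))
    for i in range(2, n):
        if i * i >= n:
            break
        if nums[i] == 0:
            for j in range(i * i, n, i):
                nums[j] = 1
    return [i for i in range(2, n) if nums[i] == 0]
```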

EDIT: (*) no, it turns out that CPython is probably too slow for memory bandwidth to be the limit. On my machine, PyPy is maybe 30-35% faster when using bytes. See also the comments for a hack that speeds CPython up from 9x slower than PyPy to 3x slower, but which, as usual, slows down PyPy.



