Why is Cython so much slower than Numba for this simple loop?

I have a simple loop that just sums the second row of a numpy array. In numba I need only do:

from numba import njit

@njit('float64(float64[:, ::1])', fastmath=True)
def fast_sum_nb(array_2d):
    s = 0.0
    for i in range(array_2d.shape[1]):
        s += array_2d[1, i]
    return s

If I time the code I get:

In [3]: import numpy as np
In [4]: A = np.random.rand(2, 1000)
In [5]: %timeit fast_sum_nb(A)
305 ns ± 7.81 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

To do the same thing in cython I first need to make a setup.py file which has:

from setuptools import setup
from Cython.Build import cythonize
from setuptools.extension import Extension

ext_modules = [
    Extension(
        'test_sum',
        language='c',
        sources=['test.pyx'],  # list of source files
        extra_compile_args=['-Ofast', '-march=native'],  # example extra compiler arguments
    )
]

setup(
    name = "test module",
    ext_modules = cythonize(ext_modules, compiler_directives={'language_level' : "3"})  
)

I am compiling with aggressive optimization options (-Ofast, -march=native). The cython summation code is then:

#cython: language_level=3
from cython cimport boundscheck
from cython cimport wraparound
@boundscheck(False)
@wraparound(False)
def fast_sum(double[:, ::1] arr):
    cdef int i=0
    cdef double s=0.0
    for i in range(arr.shape[1]):
        s += arr[1, i]
    return s

I compile it with:

python setup.py build_ext --inplace

If I now time this I get:

In [2]: import numpy as np
In [3]: A = np.random.rand(2, 1000)
In [4]: %timeit fast_sum(A)
564 ns ± 1.25 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Why is the cython version so much slower?


The annotated C code from cython looks like this:

[screenshot: Cython's annotated C output for the loop]

The assembly produced by numba seems to be:

vaddpd	-96(%rsi,%rdx,8), %ymm0, %ymm0
vaddpd	-64(%rsi,%rdx,8), %ymm1, %ymm1
vaddpd	-32(%rsi,%rdx,8), %ymm2, %ymm2
vaddpd	(%rsi,%rdx,8), %ymm3, %ymm3
addq	$16, %rdx
cmpq	%rdx, %rcx
jne	.LBB0_5
vaddpd	%ymm0, %ymm1, %ymm0
vaddpd	%ymm0, %ymm2, %ymm0
vaddpd	%ymm0, %ymm3, %ymm0
vextractf128	$1, %ymm0, %xmm1
vaddpd	%xmm1, %xmm0, %xmm0
vpermilpd	$1, %xmm0, %xmm1
vaddsd	%xmm1, %xmm0, %xmm0

I don't know how to get the assembly for the cython code. The C file it produces is huge and the .so file disassembles to a large file as well.

This speed difference persists (in fact it increases) if I increase the number of columns in the 2d array, so it doesn't seem to be a calling-overhead issue.

I am using Cython version 0.29.35 and numba version 0.57.0 on Ubuntu.

Answer 1

Score: 8

It looks like

decorated py code  --Numba-->  LLVM IR  --LLVM-->  machine code 

is just producing better-optimized code than

cython code  --Cython-->  C  --gcc-or-clang-->  machine code

There are three things that conspire to give Cython worse performance in this benchmark.

  1. Cython overhead (minor)

  2. The loop Cython generates (major)

  3. Which compiler is being used (major)

Overhead (reason 1)

Cython apparently has some extra overhead. You can observe this by passing in an array of shape (1,0).
The Numba function is still much faster.
This isn't too surprising, given that Cython is the more general tool, and it tries to be extra careful with inputs, error handling, etc., even when it's overkill.
Unless you're calling this function a lot with very small inputs, this shouldn't matter much.
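
For instance, here is a minimal way to see that fixed per-call cost (a sketch, assuming both functions are built and imported as above; with zero columns the summation loop never runs, so only the call overhead is measured):

import numpy as np

empty = np.random.rand(1, 0)   # zero columns: the loop body is never executed

%timeit fast_sum_nb(empty)   # Numba call overhead only
%timeit fast_sum(empty)      # Cython call overhead only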

Loop Unrolling (reasons 2 & 3 together)

Based on the disassembly in the updated question (copied here), Numba + LLVM are creating a nicely unrolled loop.
Notice how it's using YMM0..YMM3, not just one vector register.

vaddpd  -96(%rsi,%rdx,8), %ymm0, %ymm0
vaddpd  -64(%rsi,%rdx,8), %ymm1, %ymm1
vaddpd  -32(%rsi,%rdx,8), %ymm2, %ymm2
vaddpd  (%rsi,%rdx,8), %ymm3, %ymm3
addq    $16, %rdx
cmpq    %rdx, %rcx
jne .LBB0_5
vaddpd  %ymm0, %ymm1, %ymm0
vaddpd  %ymm0, %ymm2, %ymm0
vaddpd  %ymm0, %ymm3, %ymm0
vextractf128    $1, %ymm0, %xmm1
vaddpd  %xmm1, %xmm0, %xmm0
vpermilpd   $1, %xmm0, %xmm1
vaddsd  %xmm1, %xmm0, %xmm0

In contrast, here's the core decompiled loop from Cython when using gcc. There's no unrolling here.

    do {
      uVar9 = uVar8 + 1;
      auVar15 = vaddpd_avx(auVar15,*(undefined (*) [32])(local_118 + local_d0 + uVar8 * 0x20));
      uVar8 = uVar9;
    } while (uVar9 < (ulong)local_108 >> 2);

The decompilation of the clang output is similar, but happens to perform even worse (see the benchmark results below). For some reason, clang doesn't want to unroll Cython's loop.

    do {
        auVar1._8_8_ = 0;
        auVar1._0_8_ = *(ulong *)(local_e8.data + lVar4 * 8 + local_e8.strides[0]);
        auVar5 = vaddsd_avx(auVar5,auVar1);
        lVar4 = (long)((int)lVar4 + 1);
    } while (lVar4 < local_e8.shape[1]);

Making Cython Fast

It can sometimes be tricky to make Cython generate ultra-fast code, but fortunately there's another option: use Cython only for glue between Python and C.

Try putting this in your .pyx file:

cdef extern from "impl.h":
    double fast_sum_c(double[] x, size_t n) nogil

def fast_sum_cyc(double[:, ::1] arr):
    # The pointer retrieval is only safe 
    # because of the "::1" constraint.
    return fast_sum_c(&arr[1, 0], arr.shape[1])

and then create an impl.h file with the following contents:

#pragma once

double fast_sum_c(double const *x, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        s += x[i];
    }
    return s;
}
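
impl.h only needs to sit next to test.pyx (otherwise add its directory via include_dirs), and the module is rebuilt with the same python setup.py build_ext --inplace command as before. A minimal usage sketch, assuming the extension is still named test_sum as in the setup.py above:

import numpy as np
from test_sum import fast_sum_cyc

A = np.random.rand(2, 1000)
%timeit fast_sum_cyc(A)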

On my machine, with a (2,1000) input array, here are the timeit runtimes:

Compiler    Numba  OrigCy  CyAndC
----------  -----  ------  ------
LLVM/Clang  240ns   900ns   250ns
gcc         n/a     380ns   380ns

Observations:

  • Numba has slightly lower overhead than Cython.

  • Cython + Clang performs exceptionally poorly on this benchmark.

  • ...but a thin Cython wrapper + hand-written C code + Clang is almost as good as Numba.

  • GCC doesn't seem as sensitive to the code Cython generates. We get the same speed with Cython's code and the hand-written code.

Here's the most important assembly snippet from the clang-compiled version of fast_sum_c.
Unsurprisingly, it's very similar to what Numba produced (since they both use LLVM as their backend):

58b0:       c5 fd 58 04 cf          vaddpd (%rdi,%rcx,8),%ymm0,%ymm0
58b5:       c5 f5 58 4c cf 20       vaddpd 0x20(%rdi,%rcx,8),%ymm1,%ymm1
58bb:       c5 ed 58 54 cf 40       vaddpd 0x40(%rdi,%rcx,8),%ymm2,%ymm2
58c1:       c5 e5 58 5c cf 60       vaddpd 0x60(%rdi,%rcx,8),%ymm3,%ymm3
58c7:       48 83 c1 10             add    $0x10,%rcx
58cb:       48 39 c8                cmp    %rcx,%rax
58ce:       75 e0                   jne    58b0 <fast_sum_c+0x40>
58d0:       c5 f5 58 c0             vaddpd %ymm0,%ymm1,%ymm0
58d4:       c5 ed 58 c0             vaddpd %ymm0,%ymm2,%ymm0
58d8:       c5 e5 58 c0             vaddpd %ymm0,%ymm3,%ymm0
58dc:       c4 e3 7d 19 c1 01       vextractf128 $0x1,%ymm0,%xmm1
58e2:       c5 f9 58 c1             vaddpd %xmm1,%xmm0,%xmm0
58e6:       c4 e3 79 05 c8 01       vpermilpd $0x1,%xmm0,%xmm1
58ec:       c5 fb 58 c1             vaddsd %xmm1,%xmm0,%xmm0

Notes

  • Compiler options to encourage more loop unrolling didn't help. In the "Making Cython Fast" section, I tried adding various pragmas and compiler flags to gcc to encourage it to do some unrolling; nothing helped. Additionally, clang never had trouble with my hand-written C code, but it never wanted to unroll the Cython-generated loop.

  • objdump -d test_sum.*.so produces a disassembly. Looking for the vaddpd instructions helps with locating the loop of interest (see the shell sketch after these notes).

  • Ghidra can be used to decompile the code. This helps a little with understanding it. Compiling the extension module with -g and -gdwarf-4 makes it so Ghidra's DWARF decoding works, injecting a bit more metadata in the decompilation.

  • For these tests, I used clang 14.0.0 and gcc 11.3.0.
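
For reference, a minimal shell sketch of the objdump approach mentioned above (the glob is kept because the exact .so name depends on the Python version and platform):

objdump -d test_sum.*.so > test_sum.asm
grep -n vaddpd test_sum.asm   # the hits point at the vectorized summation loop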

Answer 2

Score: 4

TL;DR: this answer adds additional details on top of the good answer of @MrFooz to explain why the Cython code is slow with both Clang and GCC. The performance issue comes from a combination of three missed optimizations: one in Clang, one in GCC and one in Cython...


Under the hood

First of all, Cython generates C code in which the stride is not known at compile time. This is a problem since compiler auto-vectorizers cannot easily vectorize the code using SIMD instructions, because the array could theoretically be non-contiguous (even though it always is in practice). As a result, the Clang auto-vectorizer fails to optimize the loop (it neither vectorizes nor unrolls it). The GCC optimizer is a bit more clever: it generates a specialized vectorized code path for a stride of 1 (i.e. contiguous arrays). Here is the generated Cython code:

  for (__pyx_t_3 = 0; __pyx_t_3 < __pyx_t_2; __pyx_t_3+=1) {
    __pyx_v_i = __pyx_t_3;
    __pyx_t_4 = 1;
    __pyx_t_5 = __pyx_v_i;
    __pyx_v_s = (__pyx_v_s + (*((double *) ( /* dim=1 */ ((char *) (((double *) ( /* dim=0 */ (__pyx_v_arr.data + __pyx_t_4 * __pyx_v_arr.strides[0]) )) + __pyx_t_5)) ))));
  }

Note the __pyx_v_arr.strides[0], which is not replaced by 1 at compile time even though Cython is supposed to know the array is contiguous. There is a workaround for this missed Cython optimization: using 1D memory views.

Here is the modified Cython code:

#cython: language_level=3
from cython cimport boundscheck
from cython cimport wraparound
@boundscheck(False)
@wraparound(False)
def fast_sum(double[:, ::1] arr):
    cdef int i=0
    cdef double s=0.0
    cdef double[::1] line = arr[1]
    for i in range(arr.shape[1]):
        s += line[i]
    return s

This code is unfortunately not faster because of two underlying compiler issues...

GCC does not yet unroll such a loop by default. This is a well-known, long-standing missed optimization; in fact, there is even an open issue for this specific C code. Using the compilation flags -funroll-loops -fvariable-expansion-in-unroller helps improve the resulting performance, though the generated code is still not perfect.
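
With the setup.py from the question, that just means extending extra_compile_args (a sketch; how much it helps depends on the GCC version):

ext_modules = [
    Extension(
        'test_sum',
        language='c',
        sources=['test.pyx'],
        extra_compile_args=[
            '-Ofast', '-march=native',
            # ask GCC to unroll the reduction loop and use several accumulators
            '-funroll-loops', '-fvariable-expansion-in-unroller',
        ],
    )
]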

When it comes to Clang, another missed optimization prevents the code from being fast. GCC and Clang have both had open issues in the past about missed auto-vectorization when types of different sizes are used in the loop to be vectorized (and even signed/unsigned mismatches for GCC). To fix this issue, you should use 64-bit integers when indexing double-precision arrays. Here is the final modified Cython code:

#cython: language_level=3
from cython cimport boundscheck
from cython cimport wraparound
@boundscheck(False)
@wraparound(False)
def fast_sum(double[:, ::1] arr):
    cdef long i=0
    cdef double s=0.0
    cdef double[::1] line = arr[1]
    for i in range(arr.shape[1]):
        s += line[i]
    return s

Note that Numba uses 64-bit integers by default (e.g. for loop iterators and indices), and Numba uses LLVM-Lite (based on LLVM, like Clang), so this problem does not happen there.


Benchmark

Here are the performance results on my machine with an i5-9600KF processor, GCC 12.2.0, Clang 14.0.6 and Python 3.11:

Initial code:
    Cython GCC:      389 ns
    Cython Clang:   1050 ns
    Numba:           232 ns

Modified code:
    Cython GCC:      276 ns
    Cython Clang:    242 ns

The very small remaining gap between Numba and Cython+Clang is due to different startup overheads. Generally, such a small time should not be an issue, since one should not be calling Cython/Numba functions from CPython a huge number of times. In such a pathological situation, the caller function should also be modified to use Cython/Numba. When this is not possible, neither Numba nor Cython will produce fast code, so a lower-level C extension should be preferred instead.
