Why is numba slicing so much faster than numpy slicing?

Question

```python
import numpy as np
from numba import njit

def test(x):
    k = x[1:2]
    l = x[0:3]
    m = x[0:1]

@njit
def test2(x):
    k = x[1:2]
    l = x[0:3]
    m = x[0:1]

x = np.arange(5)

test2(x)  # first call triggers JIT compilation, so it is excluded from the timings

%timeit test(x)
%timeit test2(x)
```
776 ns ± 1.83 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
280 ns ± 2.53 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

The gap between them widens as the number of slices increases:

```python
def test(x):
    k = x[1:2]
    l = x[0:3]
    m = x[0:1]
    n = x[1:3]
    o = x[2:3]

@njit
def test2(x):
    k = x[1:2]
    l = x[0:3]
    m = x[0:1]
    n = x[1:3]
    o = x[2:3]

test2(x)

%timeit test(x)
%timeit test2(x)
```
1.18 µs ± 1.82 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
279 ns ± 0.562 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

The numpy function seems to slow down linearly, while the numba function doesn't care how many times you slice (which is what I expected to happen in both cases).

EDIT:

After chrslg's answer, I decided to add a return statement to both functions, simply appending to each:

```python
return k, l, m, n, o
```

and the timings were:

1.23 µs ± 2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.61 µs ± 9.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

So the numba function now becomes slower, which suggests that the slicing was indeed just dead code. However, after seeing user jared's comment, I decided to test the same product operation with the slices he tried:

```python
def test5(x):
    k = x[1:2]
    l = x[0:3]
    m = x[0:1]
    n = x[1:4]
    o = x[2:3]
    return (k * l * m * n * o)

@njit
def test6(x):
    k = x[1:2]
    l = x[0:3]
    m = x[0:1]
    n = x[1:4]
    o = x[2:3]
    return (k * l * m * n * o)

test6(x)

%timeit test5(x)
%timeit test6(x)
```
5.79 µs ± 202 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
787 ns ± 1.52 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Now the numba function becomes faster than the version that simply returns the slices (!), and the speed gap widens again. I am honestly more confused now.

Answer 1

Score: 4


Because it is probably doing nothing.

I get the same timings, for numba, as for this function:

```python
def test3(x):
    pass
```

Note that test does almost nothing either. Those are just slices, with no action associated with them. So there is no data transfer or anything; just the creation of 3 variables, and some boundary tweaks.
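To make that concrete (my illustration, not part of the original answer): a basic numpy slice is a view over the same buffer, so no elements are copied, whatever the slice length:

```python
import numpy as np

x = np.arange(5)
k = x[1:3]          # basic slicing returns a view, not a copy
print(k.base is x)  # True: k shares x's buffer
x[1] = 99
print(k[0])         # 99: the view sees the change, since no data was transferred
```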

If the code were about an array of 5,000,000 elements, with slices of 1,000,000 elements taken from it, it would be no slower. That is, I suppose, why, when you wanted to scale things up to a "bigger" case, you decided not to increase the data size (because you probably knew that data size was not relevant here), but to increase the number of slicing lines.

But, well, test, even while doing almost nothing, is still performing these 3, then unused, slices.

Whereas numba compiles some generated C code. And the compiler, with its optimizer, has no reason to keep those slice variables that are never used afterward.

I am totally speculating here (I've never seen numba's generated code). But I imagine the code could look like this:

```c
void test2(double *x, int xstride, int xn){
    double *k = x + 1*xstride;
    int kstride = xstride;
    int kn = 1;
    double *l = x;
    int lstride = xstride;
    int ln = 3;
    double *m = x;
    int mstride = xstride;
    int mn = 1;
    // And then it would be possible, from there, to iterate over those slices.
    // For example k[:] = m[:] could generate this code:
    // for(int i=0; i<kn; i++) k[i*kstride] = m[i*mstride];
}
```

(Here I use stride as the size in double* arithmetic, when in reality strides are in bytes, but it doesn't matter; this is just pseudocode.)

My point is, if there were something afterward (like what I put in the comment), then this code, even if it is just a few arithmetic operations, would still be "almost nothing, but not nothing".

But there is nothing afterward. So it is just some local variable initialization, code with clearly no side effects. It is very easy for the compiler's optimizer to simply drop all of it, and to compile an empty function, which has exactly the same effect and result.

So, again, this is just speculation on my behalf. But any decent code generator + compiler should compile an empty function for test2. So test2 and test3 are the same thing.
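If you want to check this speculation yourself, numba's dispatcher objects can dump the optimized IR; a minimal sketch (what exactly you see depends on your numba/LLVM versions):

```python
import numpy as np
from numba import njit

@njit
def test2(x):
    k = x[1:2]
    l = x[0:3]
    m = x[0:1]

test2(np.arange(5))  # compile for this argument type first

# inspect_llvm() returns one IR dump per compiled signature; if dead-code
# elimination kicked in, the body is little more than a bare return.
for sig, ir in test2.inspect_llvm().items():
    print(sig)
    print(ir)
```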

An interpreter, on the other hand, does not usually perform this kind of optimization (first, it is harder to know in advance what is coming; and second, time spent optimizing is spent at runtime, so there is a tradeoff, whereas for a compiler, even if it takes 1 hour of compile time to save 1 ns of runtime, it is still worth it).
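By contrast (again my illustration), you can see that CPython keeps the dead slices: the compiled bytecode still contains a slice and a store for each unused variable:

```python
import dis

def test(x):
    k = x[1:2]
    l = x[0:3]
    m = x[0:1]

# On Python 3.12+ each slice shows up as BINARY_SLICE + STORE_FAST
# (BUILD_SLICE + BINARY_SUBSCR on older versions); none of them are
# optimized away, so the interpreter pays for every one at runtime.
dis.dis(test)
```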

Edit: some more experiments

The idea that jared and I both had, namely doing something, whatever it is, to force the slices to exist, and comparing what happens when numba has to do something and therefore really perform the slices, is natural. The problem is that as soon as you start doing something, anything, the timing of the slices themselves becomes negligible. Because slicing is nothing.

But, well, statistically you can remove that and still measure the "slicing" part somehow.

Here are some timing data.

Empty function

On my computer, an empty function costs 130 ns in pure Python, and 540 ns with numba.

Which is not surprising: it does nothing, but does so while crossing the Python/C boundary, which probably costs a bit, just for that crossing. Not much, though.
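A minimal way to reproduce that measurement (a sketch; absolute numbers will differ per machine):

```python
import numpy as np
from numba import njit

def empty_py(x):
    pass

@njit
def empty_nb(x):
    pass

x = np.arange(5)
empty_nb(x)  # warm-up call so compilation is not included in the timing

%timeit empty_py(x)  # roughly the pure-Python call overhead
%timeit empty_nb(x)  # adds the dispatcher's Python/C boundary crossing
```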

Time vs number of slices

The next experiment is the exact one you made (btw, your post contains its own proof of my answer: you already saw that in pure Python the time is O(n), n being the number of slices, while in numba it is O(1). That alone proves that no slicing at all is occurring. If the slicing were done, in numba as on any other non-quantum computer :D, the cost would have to be O(n). Now, of course, if t = 100 + 0.000001×n, it might be hard to distinguish O(n) from O(1). Hence the reason why I started by evaluating the "empty" case.)

In pure Python, slicing only, with an increasing number of slices, is obviously O(n), indeed:

[plot: pure-Python time vs number of slices, growing linearly]

A linear regression says this is roughly 138 + 274×n, in ns.

Which is consistent with the "empty" time.

For numba, on the other hand, we get

[plot: numba time vs number of slices, flat]

So, no need for a linear regression to prove that:

  1. It is indeed O(1)
  2. The timing is consistent with the 540 ns of the "empty" case

Note that this means that, for n=2 or more slices, on my computer, numba becomes competitive. Below that, it is not.
But, well, that is a competition at doing "nothing"...

With usage of slices

If we add code afterward to force the usage of the slices, of course, things change. The compiler can't just remove the slices.

But we have to be careful:

  1. to avoid having an O(n) addition in the operation itself
  2. to distinguish the timing of the operation from the probably negligible timing of the slicing

What I did, then, is compute some addition slice1[1] + slice2[2] + slice3[3] + ...

But whatever the number of slices, I have 1000 terms in this addition. So for n=2 (2 slices), that addition is slice1[1] + slice2[2] + slice1[3] + slice2[4] + ... with 1000 terms.

That should help remove the O(n) part due to the addition. And then, with big enough data, we can extract some signal from the variations around this, even though the variations are quite negligible next to the addition time itself (and therefore even next to the noise of that addition time; but with enough measurements, that noise becomes low enough to start seeing things).
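The answer does not show its harness, but here is a sketch of how such a benchmark could be generated (my reconstruction: make_bench, bench, and the exact slice pattern are made up for illustration):

```python
import timeit
import numpy as np
from numba import njit

def make_bench(n_slices, n_terms=1000, jit=False):
    # Generate a function with n_slices slice statements followed by one
    # sum of n_terms scalar reads cycling through the slices, so the
    # addition itself has the same size whatever n_slices is.
    lines = ["def bench(x):"]
    for i in range(n_slices):
        lines.append(f"    s{i} = x[{i}:{i + 2}]")
    terms = " + ".join(f"s{i % n_slices}[{i % 2}]" for i in range(n_terms))
    lines.append(f"    return {terms}")
    ns = {}
    exec("\n".join(lines), ns)
    f = ns["bench"]
    return njit(f) if jit else f

x = np.arange(100)
for n in (1, 2, 5, 10, 20):
    f = make_bench(n, jit=True)
    f(x)  # trigger compilation outside the timed region
    t = min(timeit.repeat(lambda: f(x), number=10_000, repeat=5)) / 10_000
    print(f"{n:3d} slices: {t * 1e9:8.1f} ns")
```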

In pure python

[plot: pure-Python time vs number of slices, with the 1000-term addition]

A linear regression gives 199000 + 279×n ns.

What we learn from this is that my experimental setup is OK: 279 is close enough to the previous 274 to say that, indeed, the addition part, huge as it is (200,000 ns), is O(1), since the O(n) part remained unchanged compared to slicing only. So we just have the same timing as before, plus a huge constant for the addition part.

With numba

All that was just the preamble to justify the experimental setup. Now comes the interesting part, the experiment itself:

[plot: numba time vs number of slices, with the 1000-term addition]

A linear regression gives 1147 + 1.3×n ns.

So, here, it is indeed O(n).

Conclusion

1. Slicing in numba does cost something: it is O(n). But without any usage of the slices, the compiler just removes them, and we get an O(1) operation. This proves that the reason was, indeed, that in your version the numba code is simply doing nothing.

2. The cost of whatever operation you perform on the slices to force them to be used (and to prevent the compiler from just removing them) is far bigger, and without statistical precautions it masks the O(n) part. Hence the feeling that "it is the same when we use the variables".

Anyway, numba is faster than numpy most of the time.

I mean, numpy is a good way to get "compiled-language speed" without using a compiled language. But it does not beat real compilation.
So, it is quite classical to have a naive algorithm in numba beat a very smart vectorization in numpy. (Classical, and very disappointing for someone like me, who made a living being the guy who knows how to vectorize things in numpy. Sometimes I feel that with numba, the most naive nested for loops are better.)
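A classical illustration of that point (my example, not from the answer): summing squares with a naive @njit loop versus numpy's vectorized form, which has to materialize the x * x temporary:

```python
import numpy as np
from numba import njit

def sumsq_np(x):
    # vectorized: fast, but allocates a temporary array for x * x
    return np.sum(x * x)

@njit
def sumsq_nb(x):
    # naive loop: compiled to machine code, no temporary allocation
    total = 0.0
    for i in range(x.size):
        total += x[i] * x[i]
    return total

x = np.random.rand(1_000_000)
sumsq_nb(x)  # warm-up compile

%timeit sumsq_np(x)
%timeit sumsq_nb(x)
```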

It stops being so, though, when:

  1. numpy makes use of several cores (you can do that with numba too, as sketched below, but not with just naive algorithms)
  2. you are doing operations for which a very smart algorithm exists. Numpy's algorithms have decades of optimization behind them; you can't beat that with 3 nested loops. Except that some tasks are so simple that they can't really be optimized further.
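On point 1, a minimal sketch of multi-core numba (sumsq_parallel is a hypothetical name): parallel=True plus prange distributes the iterations across threads:

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def sumsq_parallel(x):
    # prange splits the loop across threads; numba recognizes the
    # += on total as a reduction and combines the partial sums.
    total = 0.0
    for i in prange(x.size):
        total += x[i] * x[i]
    return total

x = np.random.rand(10_000_000)
sumsq_parallel(x)  # warm-up compile
```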

So, I still prefer numpy over numba. I prefer to rely on the decades of optimization behind numpy rather than reinvent the wheel in numba. Plus, sometimes it is preferable not to rely on a compiler.

But, well, it is classical to have numba beating numpy.

Just not with the ratios of your case. Because in your case, you were comparing (as I think I have proven now, and as you had proven yourself, by seeing that the numpy case was O(n) while the numba case was O(1)) "doing slices with numpy" vs "doing nothing with numba".
