How to make numpy clip run faster?
Question
I have a custom machine learning objective function, a kind of bounded linear function, that relies mainly on numpy.clip. During training, the objective function is evaluated a great many times, so the training time depends on how fast numpy.clip runs.
So my question is: is there anything I can do to make numpy.clip run faster?
So far I have tried Numba, but I get basically no improvement at all (example below).
import timeit
from numba import jit
import numpy as np

def clip(x, l, u):
    return x.clip(l, u)

@jit(nopython=True, parallel=True, fastmath=True)
def clip2(x, l, u):
    return x.clip(l, u)

x = np.random.rand(1000)
l = -np.random.rand(1000)
u = np.random.rand(1000)

timeit.timeit(lambda: clip(x, l, u))
# >>> 7.600710524711758
timeit.timeit(lambda: clip2(x, l, u))
# >>> 23.19934402871877
Is there anything wrong with the way I use Numba, or can it really not help in this case?
Is there any other approach worth trying?
Note that in my use case, the vectors x, l, and u passed to clip (defined above) mostly have a length of around 1000, so I really want to optimize for that particular case.
Thanks so much for your help.
Answer 1
Score: 4
As pointed out in the comments, Numba introduces some compilation overhead the first time the function is called (for a particular datatype signature). Whether that should be included in the benchmark is difficult to answer based on the limited information you've shared.
The NumPy functions supported by Numba are convenient and robust, but you can often gain a little extra performance by implementing a specific function for your application.
The parallel=True option doesn't do anything here, as shown by the warning.
With np.clip, you could perhaps gain a little by using the out= keyword, if you're willing to modify the input in place.
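As a minimal NumPy-only sketch of the out= idea: the first variant overwrites the input, as described above. The second variant, which writes into a preallocated buffer, is an assumption added for illustration and is not part of the original answer; the names x_inplace and buf are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to keep the sketch reproducible
x = rng.random(1000)
l = -rng.random(1000)
u = rng.random(1000)

# Variant 1: clip in place. No result array is allocated, but the
# input is overwritten (a copy is used here so x stays available).
x_inplace = x.copy()
result = np.clip(x_inplace, l, u, out=x_inplace)  # returns the out= array

# Variant 2 (assumption, not from the answer): write into a buffer that
# is allocated once and reused, so x itself is left untouched.
buf = np.empty_like(x)
np.clip(x, l, u, out=buf)
```

Whether either variant is measurably faster at a length of about 1000 would need to be timed on the target machine.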
Overall I get the best performance using numba.vectorize, as is often the case in my experience.
from numba import njit, vectorize
import numpy as np

def clip1(x, l, u):
    return x.clip(l, u)

@njit(fastmath=True)
def clip2(x, l, u):
    return x.clip(l, u)

@njit(fastmath=True)
def clip3(x, l, u):
    return np.clip(x, l, u, out=x)

@vectorize
def clip4(x, l, u):
    return max(min(x, u), l)
On my machine, with a warm-up (excluding compilation), this results in:
clip1: 7.19 µs ± 546 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
clip2: 2.88 µs ± 35.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
clip3: 2.54 µs ± 177 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
clip4: 1.2 µs ± 39.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)