How to make numpy clip run faster?

Question

I have a custom machine learning objective function, a kind of bounded linear function, that relies mainly on numpy.clip. During training the objective function is evaluated a great many times, so the training time depends on how fast numpy.clip runs.

So my question is: is there anything I can do to make numpy.clip run faster?

So far I have tried numba, but I get essentially no improvement (example below).

import timeit
from numba import jit
import numpy as np

def clip(x, l, u):
    return x.clip(l, u)

@jit(nopython=True, parallel=True, fastmath=True)
def clip2(x, l, u):
    return x.clip(l, u)

x = np.random.rand(1000)
l = -np.random.rand(1000)
u = np.random.rand(1000)

timeit.timeit(lambda: clip(x, l, u))
# >>> 7.600710524711758
timeit.timeit(lambda: clip2(x, l, u))
# >>> 23.19934402871877

Is there anything wrong with the way I use numba, or can it really not help in this case?

Is there any other approach worth trying?

One note: in my use case, the vectors x, l, and u passed to clip (defined above) mostly have a length of around 1000, so I really want to optimize for that particular case.

Thanks so much for your help.

Answer 1

Score: 4

As pointed out in the comments, Numba introduces some compilation overhead the first time the function is called (for a particular datatype signature). Whether that should be included in the benchmark is difficult to answer based on the limited information you've shared.

The NumPy functions supported by Numba are convenient and robust, but you can often gain a little extra performance by implementing a function tailored to your application.

The parallel=True option has no effect here, as the warning Numba emits indicates.

With np.clip you could perhaps gain a little by passing the out= keyword, if you're willing to modify the input in place.
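As a minimal sketch of the `out=` idea in plain NumPy (no Numba involved): passing `out=x` writes the clipped values back into `x` instead of allocating a new result array on every call. The array names mirror the question; the seed and value ranges are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
x = rng.random(1000) * 4 - 2    # values in [-2, 2), so clipping actually happens
l = -rng.random(1000)           # lower bounds in (-1, 0]
u = rng.random(1000)            # upper bounds in [0, 1)

# Regular call: allocates a new array and leaves x untouched.
expected = np.clip(x, l, u)

# In-place call: the result is written into x, overwriting its old values.
result = np.clip(x, l, u, out=x)

print(np.array_equal(x, expected))
```

This only makes sense if the unclipped values of `x` are not needed afterwards.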

Overall I get the best performance using numba.vectorize, as is often the case in my experience.

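from numba import njit, vectorize
import numpy as np

# Baseline: plain NumPy, no compilation.
def clip1(x, l, u):
    return x.clip(l, u)

# Numba-compiled call to ndarray.clip (without parallel=True).
@njit(fastmath=True)
def clip2(x, l, u):
    return x.clip(l, u)

# Same, but writes the result back into x instead of allocating a new array.
@njit(fastmath=True)
def clip3(x, l, u):
    return np.clip(x, l, u, out=x)

# Scalar kernel; numba.vectorize applies it element-wise over the arrays.
@vectorize
def clip4(x, l, u):
    return max(min(x, u), l)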

On my machine, with a warm-up (excluding compilation), this results in:

clip1: 7.19 µs ± 546   ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
clip2: 2.88 µs ±  35.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
clip3: 2.54 µs ± 177   ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
clip4: 1.2  µs ±  39.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

huangapple
  • Posted on February 24, 2023, 11:49:23
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75552452.html