February 24, 2023, 11:49:23

How to make numpy clip run faster?

Question

I have a custom machine learning objective function, a kind of bounded linear function, that relies mainly on numpy.clip. During training the objective function is evaluated many times, so the training time depends on how fast numpy.clip runs.

So my question is: is there anything I can do to make numpy.clip run faster?

So far I have tried Numba, but I see basically no improvement (example below).

import timeit
from numba import jit
import numpy as np

def clip(x, l, u):
    return x.clip(l, u)

@jit(nopython=True, parallel=True, fastmath=True)
def clip2(x, l, u):
    return x.clip(l, u)

x = np.random.rand(1000)
l = -np.random.rand(1000)
u = np.random.rand(1000)

# timeit.timeit runs the callable 1,000,000 times by default,
# so each result below is the total time in seconds for all calls.
timeit.timeit(lambda: clip(x, l, u))
# >>> 7.600710524711758
timeit.timeit(lambda: clip2(x, l, u))
# >>> 23.19934402871877

Is there anything wrong with the way I use Numba, or can it really not help in this case?

Is there any other approach worth a try?

One note: in my use case, the vectors x, l, and u passed to clip (defined above) mostly have a length of around 1000, so I really want to optimize for that particular case.

Thanks so much for your help.

Answer 1

Score: 4

As pointed out in the comments, Numba introduces some compilation overhead the first time the function is called (for a particular datatype signature). Whether that should be included in the benchmark is difficult to answer based on the limited information you've shared.
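One way to keep compilation out of the measurement is to call the function once before timing it. The sketch below is only an illustration: the helper name `bench` and its `number` and `repeat` defaults are made up for this example, and it times the plain NumPy `clip` from the question so that it runs without Numba. A Numba-jitted function could be passed in the same way. Note that `timeit.timeit` returns the total time for all calls (1,000,000 by default), so the figures in the question work out to roughly 7.6 µs and 23 µs per call.

```python
import timeit
import numpy as np

def bench(fn, *args, number=2_000, repeat=5):
    """Return the best per-call time in seconds, excluding the first call.

    `bench`, `number` and `repeat` are hypothetical names chosen for this sketch.
    """
    fn(*args)  # warm-up call: for a Numba-jitted fn this is where compilation happens
    totals = timeit.repeat(lambda: fn(*args), number=number, repeat=repeat)
    return min(totals) / number  # convert the best total into time per call

def clip(x, l, u):
    return x.clip(l, u)

rng = np.random.default_rng(0)
x = rng.random(1000)
l = -rng.random(1000)
u = rng.random(1000)

per_call = bench(clip, x, l, u)
print(f"clip: {per_call * 1e6:.2f} µs per call")
```

Taking the minimum over several repeats is a common choice because it is the run least disturbed by other activity on the machine; the mean and standard deviation reported by IPython's %timeit are an equally reasonable summary.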

The NumPy functions supported by Numba are convenient and robust, but you can often gain a little extra performance by implementing a specific function for your application.

The parallel=True option has no effect here, as the warning Numba emits shows.

With np.clip you could perhaps gain a little by using the out= keyword, if you're willing to modify the input in place.
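As a minimal sketch of the in-place variant, using plain NumPy and made-up test data (the array names mirror the question; the value range of `x` is an assumption chosen so that clipping actually happens):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1000) * 4 - 2   # values in [-2, 2), so some elements get clipped
l = -rng.random(1000)          # lower bounds
u = rng.random(1000)           # upper bounds

expected = np.clip(x, l, u)        # default: allocates a new result array
result = np.clip(x, l, u, out=x)   # writes the result into x itself

print(result is x)                     # the returned array is the input buffer
print(np.array_equal(x, expected))     # same values as the out-of-place call
```

The saving comes from skipping the allocation of a new 1000-element array on every call, so it is only an option when the unclipped values of `x` are not needed afterwards.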

Overall I get the best performance using numba.vectorize, as is often the case in my experience.

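from numba import njit, vectorize
import numpy as np

# Plain NumPy baseline
def clip1(x, l, u):
    return x.clip(l, u)

# Numba-compiled call to the array's clip method
@njit(fastmath=True)
def clip2(x, l, u):
    return x.clip(l, u)

# Same, but writes the result into x (modifies the input in place)
@njit(fastmath=True)
def clip3(x, l, u):
    return np.clip(x, l, u, out=x)

# Scalar kernel that Numba turns into an elementwise ufunc
@vectorize
def clip4(x, l, u):
    return max(min(x, u), l)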
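The scalar expression in clip4 applies the upper bound first and the lower bound second, whereas the NumPy documentation describes np.clip as equivalent to np.minimum(a_max, np.maximum(a, a_min)). The check below is a plain NumPy sketch (no Numba needed, test data made up) suggesting that the two orderings agree whenever every lower bound is at most the matching upper bound, as with the question's data, and differ only when the bounds cross:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1000) * 4 - 2   # values in [-2, 2)
l = -rng.random(1000)          # lower bounds in (-1, 0]
u = rng.random(1000)           # upper bounds in [0, 1), so l <= u everywhere

# Elementwise form of the scalar kernel max(min(x, u), l)
by_formula = np.maximum(np.minimum(x, u), l)
same = np.array_equal(by_formula, np.clip(x, l, u))
print(same)

# Crossed bounds (lower = 1.0, upper = 0.0): the two orderings disagree
crossed_formula = max(min(0.5, 0.0), 1.0)      # lower bound is applied last
crossed_clip = float(np.clip(0.5, 1.0, 0.0))   # upper bound is applied last
print(crossed_formula, crossed_clip)
```

If the bounds in an application can cross, the order of min and max in the kernel decides which bound takes precedence, so it is worth choosing deliberately.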

On my machine, with a warm-up (excluding compilation), this results in:

clip1: 7.19 µs ± 546   ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
clip2: 2.88 µs ±  35.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
clip3: 2.54 µs ± 177   ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
clip4: 1.2  µs ±  39.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
