2023-04-04 09:35:40

Speeding Up np.sum With multiprocessing

Question

If I have a numpy array of size 10^8 to 10^9, is it possible to compute its sum faster than np.sum?

I've tried using multiprocessing with fork, but it seems to be slower than just calling np.sum, regardless of the number of workers (1-4). I'm using Python 3.8 on a Mac with a 2 GHz Dual-Core Intel Core i5 processor. Not sure whether the results would be different if I had more CPUs.

My code:

import concurrent.futures
import multiprocessing as mp
import time
from concurrent.futures import ProcessPoolExecutor

import numpy as np


# based on: https://luis-sena.medium.com/sharing-big-numpy-arrays-across-python-processes-abf0dc2a0ab2
def np_sum_global(start, stop):
    # `data` is a module-level global that forked workers inherit from the parent
    return np.sum(data[start:stop])


def benchmark():
    st = time.time()
    ARRAY_SIZE = int(3e8)
    print("array size =", ARRAY_SIZE)
    global data
    data = np.random.random(ARRAY_SIZE)
    print("generated", time.time() - st)
    print("CPU Count =", mp.cpu_count())

    for trial in range(5):
        print("TRIAL =", trial)
        # Baseline: a single call to np.sum
        st = time.time()
        s = np.sum(data)
        print("method 1", time.time() - st, s)

        for NUM_WORKERS in range(1, 5):
            # The timing includes creating and shutting down the process pool
            st = time.time()
            futures = []
            with ProcessPoolExecutor(max_workers=NUM_WORKERS) as executor:
                for i in range(0, NUM_WORKERS):
                    futures.append(
                        executor.submit(
                            np_sum_global,
                            ARRAY_SIZE * i // NUM_WORKERS,
                            ARRAY_SIZE * (i + 1) // NUM_WORKERS,
                        )
                    )
            futures, _ = concurrent.futures.wait(futures)
            s = sum(future.result() for future in futures)
            print("workers =", NUM_WORKERS, time.time() - st, s)
        print()


if __name__ == "__main__":
    mp.set_start_method("fork")
    benchmark()

Output:

array size = 300000000
generated 5.1455769538879395
CPU Count = 4
TRIAL = 0
method 1 0.29593801498413086 150004049.39847052
workers = 1 1.8904719352722168 150004049.39847052
workers = 2 1.2082111835479736 150004049.39847034
workers = 3 1.2650330066680908 150004049.39847082
workers = 4 1.233708143234253 150004049.39847046
TRIAL = 1
method 1 0.5861320495605469 150004049.39847052
workers = 1 1.801928997039795 150004049.39847052
workers = 2 1.165492057800293 150004049.39847034
workers = 3 1.2669389247894287 150004049.39847082
workers = 4 1.2941789627075195 150004049.39847043
TRIAL = 2
method 1 0.44912219047546387 150004049.39847052
workers = 1 1.8038971424102783 150004049.39847052
workers = 2 1.1491520404815674 150004049.39847034
workers = 3 1.3324410915374756 150004049.39847082
workers = 4 1.4198641777038574 150004049.39847046
TRIAL = 3
method 1 0.5163640975952148 150004049.39847052
workers = 1 3.248213052749634 150004049.39847052
workers = 2 2.5148861408233643 150004049.39847034
workers = 3 1.0224149227142334 150004049.39847082
workers = 4 1.20924711227417 150004049.39847046
TRIAL = 4
method 1 1.2363107204437256 150004049.39847052
workers = 1 1.8627309799194336 150004049.39847052
workers = 2 1.233341932296753 150004049.39847034
workers = 3 1.3235111236572266 150004049.39847082
workers = 4 1.344843864440918 150004049.39847046
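
For reference, the slicing arithmetic in the code above can be checked on its own. This is only a small sketch: `chunk_bounds` is a hypothetical helper name (not part of the original code), and `n = 10` is a tiny stand-in for `ARRAY_SIZE`. It shows that the `(start, stop)` pairs tile the array with no gaps or overlaps, even when the size is not divisible by the worker count.

```python
def chunk_bounds(n, workers):
    """Return (start, stop) pairs using the same integer arithmetic as the benchmark."""
    return [(n * i // workers, n * (i + 1) // workers) for i in range(workers)]


n = 10  # tiny stand-in for ARRAY_SIZE
for workers in range(1, 5):
    bounds = chunk_bounds(n, workers)
    # The first chunk starts at 0 and the last one stops at n.
    assert bounds[0][0] == 0 and bounds[-1][1] == n
    # Each chunk starts where the previous one stopped.
    assert all(a[1] == b[0] for a, b in zip(bounds, bounds[1:]))
    print(workers, bounds)
```

Each worker count splits the array differently, so the partial sums are added in a different order. Floating-point addition is not associative, which is consistent with the totals in the output differing slightly in their last digits.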

Some links I've looked at:

Answer 1

Score: 1

Here's a benchmark with Numba. It first has to compile the code, which makes the first run a lot slower. Subsequent runs are about two to three times faster than NumPy. So whether Numba is worth it for you depends on how often you run the code.

import time

import numba
import numpy as np


# based on: https://luis-sena.medium.com/sharing-big-numpy-arrays-across-python-processes-abf0dc2a0ab2
# parallel=True asks Numba to parallelize supported operations such as this reduction;
# cache=True stores the compiled code on disk for later runs
@numba.jit(nopython=True, parallel=True, cache=True)
def numba_sum(data):
    return np.sum(data)


def benchmark():
    st = time.time()
    ARRAY_SIZE = int(3e8)
    print("array size =", ARRAY_SIZE)
    data = np.random.random(ARRAY_SIZE)
    print("generated", time.time() - st)

    for trial in range(5):
        print("TRIAL =", trial)
        # Baseline: plain NumPy
        st = time.time()
        s = np.sum(data)
        print("method 1", time.time() - st, s)

        # Numba version; in the first trial this call may also include compilation
        st = time.time()
        s = numba_sum(data)
        print("method 2", time.time() - st, s)


if __name__ == "__main__":
    benchmark()
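
Because the first call to a JIT-compiled function can include compilation, it may help to time that call separately from the later ones. The following is only a minimal sketch under assumptions: `time_first_and_rest` is a hypothetical helper, and the built-in `sum` over a list stands in for `numba_sum` so that the snippet runs without Numba installed.

```python
import time


def time_first_and_rest(fn, arg, repeats=5):
    """Time the first call separately from the following `repeats` calls.

    Hypothetical helper: returns (first_call_seconds, best_later_seconds).
    """
    start = time.perf_counter()
    fn(arg)  # for a JIT-compiled function, this call may include compilation
    first = time.perf_counter() - start

    later = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        later.append(time.perf_counter() - start)
    return first, min(later)


# Stand-in workload; with the code above you would pass numba_sum
# and the NumPy array instead.
first, best = time_first_and_rest(sum, list(range(100_000)))
print(f"first call: {first:.6f}s, best later call: {best:.6f}s")
```

With `cache=True`, as in the decorator above, Numba can reuse compiled code from an on-disk cache, so the first call in a later run of the script may be much cheaper than in the very first run.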
