Speeding Up np.sum With multiprocessing

Question

If I have a NumPy array with 10^8 to 10^9 elements, is it possible to compute its sum faster than np.sum does?

I've tried using multiprocessing with the fork start method, but it is slower than just calling np.sum, regardless of the number of workers (1-4). I'm using Python 3.8 on a Mac with a 2 GHz dual-core Intel Core i5 processor. I'm not sure whether the results would be different with more CPU cores.

My code:

import concurrent.futures
import multiprocessing as mp
import time
from concurrent.futures.process import ProcessPoolExecutor

import numpy as np

# based on: https://luis-sena.medium.com/sharing-big-numpy-arrays-across-python-processes-abf0dc2a0ab2


def np_sum_global(start, stop):
    return np.sum(data[start:stop])


def benchmark():
    st = time.time()
    ARRAY_SIZE = int(3e8)
    print("array size =", ARRAY_SIZE)
    global data
    data = np.random.random(ARRAY_SIZE)
    print("generated", time.time() - st)
    print("CPU Count =", mp.cpu_count())

    for trial in range(5):
        print("TRIAL =", trial)
        st = time.time()
        s = np.sum(data)
        print("method 1", time.time() - st, s)

        for NUM_WORKERS in range(1, 5):
            st = time.time()
            futures = []
            with ProcessPoolExecutor(max_workers=NUM_WORKERS) as executor:
                for i in range(0, NUM_WORKERS):
                    futures.append(
                        executor.submit(
                            np_sum_global,
                            ARRAY_SIZE * i // NUM_WORKERS,
                            ARRAY_SIZE * (i + 1) // NUM_WORKERS,
                        )
                    )
            futures, _ = concurrent.futures.wait(futures)
            s = sum(future.result() for future in futures)
            print("workers =", NUM_WORKERS, time.time() - st, s)
        print()


if __name__ == "__main__":
    mp.set_start_method("fork")
    benchmark()

Output:

array size = 300000000
generated 5.1455769538879395
CPU Count = 4
TRIAL = 0
method 1 0.29593801498413086 150004049.39847052
workers = 1 1.8904719352722168 150004049.39847052
workers = 2 1.2082111835479736 150004049.39847034
workers = 3 1.2650330066680908 150004049.39847082
workers = 4 1.233708143234253 150004049.39847046

TRIAL = 1
method 1 0.5861320495605469 150004049.39847052
workers = 1 1.801928997039795 150004049.39847052
workers = 2 1.165492057800293 150004049.39847034
workers = 3 1.2669389247894287 150004049.39847082
workers = 4 1.2941789627075195 150004049.39847043

TRIAL = 2
method 1 0.44912219047546387 150004049.39847052
workers = 1 1.8038971424102783 150004049.39847052
workers = 2 1.1491520404815674 150004049.39847034
workers = 3 1.3324410915374756 150004049.39847082
workers = 4 1.4198641777038574 150004049.39847046

TRIAL = 3
method 1 0.5163640975952148 150004049.39847052
workers = 1 3.248213052749634 150004049.39847052
workers = 2 2.5148861408233643 150004049.39847034
workers = 3 1.0224149227142334 150004049.39847082
workers = 4 1.20924711227417 150004049.39847046

TRIAL = 4
method 1 1.2363107204437256 150004049.39847052
workers = 1 1.8627309799194336 150004049.39847052
workers = 2 1.233341932296753 150004049.39847034
workers = 3 1.3235111236572266 150004049.39847082
workers = 4 1.344843864440918 150004049.39847046
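
The "method 1" timings above range from about 0.30 s to about 1.24 s across trials, so a single time.time() measurement per method is fairly noisy. A minimal sketch of one way to get steadier numbers, assuming a much smaller array than the 3e8 elements in the benchmark so that it runs quickly, is to repeat the measurement with the standard-library timeit module and look at the minimum:

```python
import timeit

import numpy as np

# Assumption: 1e6 elements instead of 3e8, purely to keep the sketch fast.
data = np.random.random(1_000_000)

# Run np.sum once per measurement, five measurements in total.
times = timeit.repeat(lambda: np.sum(data), number=1, repeat=5)

# The minimum is usually the least disturbed by other system activity.
best = min(times)
print("np.sum best of 5:", best)
```

The same pattern can wrap the multiprocessing variant, which makes it easier to see how much of its time is pool startup rather than summation.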

Some links I've looked at:

Answer 1

Score: 1

Here's a benchmark with Numba. Numba first has to compile the function, which makes the first call a lot slower. Subsequent calls are about two to three times faster than NumPy. So whether Numba is worth it depends on how often you run the code.

import numba
import numpy as np
import time

# based on: https://luis-sena.medium.com/sharing-big-numpy-arrays-across-python-processes-abf0dc2a0ab2

@numba.jit(nopython=True, parallel=True, cache=True)
def numba_sum(data):
    return np.sum(data)

def benchmark():
    st = time.time()
    ARRAY_SIZE = int(3e8)
    print("array size =", ARRAY_SIZE)
    global data
    data = np.random.random(ARRAY_SIZE)
    print("generated", time.time() - st)

    for trial in range(5):
        print("TRIAL =", trial)
        st = time.time()
        s = np.sum(data)
        print("method 1", time.time() - st, s)
        print("TRIAL =", trial)
        st = time.time()
        s = numba_sum(data)
        print("method 2", time.time() - st, s)


if __name__ == "__main__":
    benchmark()
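
As a further sketch that is not part of the benchmark above, the chunked approach from the question can also be tried with threads instead of processes, which avoids the cost of starting worker processes. This assumes that NumPy releases the GIL while summing a numeric array, so the chunks can be summed concurrently. The helper name threaded_sum and the small array size are illustrative choices, and since summation is largely limited by memory bandwidth, any speedup should be measured rather than assumed.

```python
import os
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def threaded_sum(data, num_workers=None):
    """Sum a 1-D array by summing contiguous chunks in worker threads.

    Hypothetical helper for illustration only.
    """
    if num_workers is None:
        num_workers = os.cpu_count() or 1
    n = data.shape[0]
    # Chunk boundaries computed the same way as in the question's benchmark.
    bounds = [n * i // num_workers for i in range(num_workers + 1)]
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        # Slices are views, so no array data is copied for the threads.
        partials = executor.map(
            lambda span: np.sum(data[span[0]:span[1]]),
            zip(bounds[:-1], bounds[1:]),
        )
        return float(sum(partials))


rng = np.random.default_rng(0)
sample = rng.random(1_000_000)  # small array so the sketch runs quickly
total = threaded_sum(sample, num_workers=4)
print(total, np.sum(sample))
```

The result can differ from np.sum in the last few digits because the partial sums are added in a different order, as the multiprocessing output in the question also shows.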

  • Posted on April 4, 2023, 09:35:40