Why didn't Python multi-processing reduce processing time to 1/4 on a 4-core CPU?

Question

Multi-threading in CPython cannot use more than one CPU core in parallel because of the global interpreter lock (GIL). To get around this limitation, we can use multiprocessing. I wrote some Python code to demonstrate that. Here is my code:

from math import sqrt
from time import time
from threading import Thread
from multiprocessing import Process


def time_recorder(job_name):
    """Record time consumption of running a function"""
    def deco(func):
        def wrapper(*args, **kwargs):
            print(f"Run {job_name}")
            start_epoch = time()
            func(*args, **kwargs)
            end_epoch = time()
            time_consume = end_epoch - start_epoch
            print(f"Time consumption of {job_name}: {time_consume}")
        return wrapper
    return deco


def calc_sqrt():
    """Consume the CPU"""
    i = 2147483647
    for j in range(20 * 1000 * 1000):
        i -= 1
        sqrt(i)


@time_recorder("one by one")
def one_by_one():
    for _ in range(8):
        calc_sqrt()


@time_recorder("multi-threading")
def multi_thread():
    t_list = list()
    for i in range(8):
        t = Thread(name=f'worker-{i}', target=calc_sqrt)
        t.start()
        t_list.append(t)
    for t in t_list:
        t.join()


@time_recorder("multi-processing")
def multi_process():
    p_list = list()
    for i in range(8):
        p = Process(name=f"worker-{i}", target=calc_sqrt)
        p.start()
        p_list.append(p)
    for p in p_list:
        p.join()


def main():
    one_by_one()
    print('-' * 40)
    multi_thread()
    print('-' * 40)
    multi_process()


if __name__ == '__main__':
    main()

The function "calc_sqrt()" is the CPU-bound job: it computes a square root 20 million times. The decorator "time_recorder" measures the running time of the decorated function. Three functions then run the CPU-bound job eight times each: one by one, in multiple threads, and in multiple processes.

By running the above code on my laptop, I got the following output:

Run one by one
Time consumption of one by one: 39.31295585632324
----------------------------------------
Run multi-threading
Time consumption of multi-threading: 39.36112403869629
----------------------------------------
Run multi-processing
Time consumption of multi-processing: 23.380358457565308

The times for "one_by_one()" and "multi_thread()" are almost the same, as expected. But the time for "multi_process()" is a bit confusing. My laptop has an Intel Core i5-7300U CPU, which has 2 cores and 4 threads. Task Manager shows 4 (logical) CPUs, and it also shows all 4 of them at 100% usage during the run. Yet the processing time dropped to only about 1/2 rather than 1/4. Why? The laptop runs Windows 10 64-bit.

Later, I ran the program on a Linux virtual machine and got the following output, which looks more reasonable:

Run one by one
Time consumption of one by one: 33.78603768348694
----------------------------------------
Run multi-threading
Time consumption of multi-threading: 34.396817684173584
----------------------------------------
Run multi-processing
Time consumption of multi-processing: 8.470374584197998

This time, the multi-processing run took about 1/4 of the multi-threading time. The host of this Linux VM has an Intel Xeon E5-2670, which has 8 cores and 16 threads. The host OS is CentOS 7. The VM is assigned 4 vCPUs and runs Debian 10.

The questions are:

  • Why didn't the processing time of the multi-processing job drop to 1/4, but only to about 1/2, on my laptop?
  • Is it a CPU issue, meaning that the 4 threads of the Intel Core i5-7300U are not "really parallel" and may affect each other, while the Intel Xeon E5-2670 doesn't have that issue?
  • Or is it an OS issue, meaning that Windows 10 doesn't support multi-processing well and processes may affect each other when running in parallel?
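
For reference, the speedups implied by the two outputs above can be computed directly. This is a small sketch that uses only the timings printed above; the helper name speedup and the dictionary labels are hypothetical, added for illustration.

```python
# Timings copied from the two outputs in the question (seconds):
# (multi-threading, multi-processing)
runs = {
    "Windows 10 laptop, i5-7300U (2 cores / 4 threads)": (
        39.36112403869629,
        23.380358457565308,
    ),
    "Debian 10 VM, 4 vCPUs": (
        34.396817684173584,
        8.470374584197998,
    ),
}


def speedup(baseline_seconds, parallel_seconds):
    """How many times faster the parallel run was than the baseline."""
    return baseline_seconds / parallel_seconds


for name, (threaded, multiproc) in runs.items():
    print(f"{name}: {speedup(threaded, multiproc):.2f}x")
```

The laptop comes out at about 1.68x and the VM at about 4.06x, which is the gap the questions above are about.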

Answer 1

Score: 3

As @Pingu said in the comments, the speed gain depends heavily on the number of cores in your machine. Your machine has only two physical cores (4 hardware threads, also called logical cores), and they are probably partly occupied by OS threads. A machine with more cores will likely do better at multiprocessing, and OS bookkeeping will also take up a smaller share of its total CPU, so it will have less impact on performance.

Here is a version of your test code that lets you change the number of threads/processes (N_WORKERS) used to execute N_TASKS calls to calc_sqrt:

from math import sqrt
from time import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor


N_WORKERS = 8
N_TASKS = 32

def time_recorder(job_name):
    """Record time consumption of running a function"""
    def deco(func):
        def wrapper(*args, **kwargs):
            print(f"Run {job_name}")
            start_epoch = time()
            out = func(*args, **kwargs)
            end_epoch = time()
            time_consume = end_epoch - start_epoch
            print(f"Time consumption of {job_name}: {time_consume:.6}s")
            return out
        return wrapper
    return deco


def calc_sqrt(_):
    i = 2147483647
    for _ in range(5 * 1000 * 1000):
        i -= 1
        sqrt(i)


@time_recorder("one by one")
def one_by_one():
    _ = [calc_sqrt(_) for _ in range(N_TASKS)]


@time_recorder("multi-threading")
def multi_thread():
    with ThreadPoolExecutor(max_workers=N_WORKERS) as e:
        _ = e.map(calc_sqrt, range(N_TASKS))


@time_recorder("multi-processing")
def multi_process():
    with ProcessPoolExecutor(max_workers=N_WORKERS) as e:
        _ = e.map(calc_sqrt, range(N_TASKS), chunksize=1)


def main():
    one_by_one()

    print('-' * 40)
    multi_thread()

    print('-' * 40)
    multi_process()


if __name__ == '__main__':
    main()

On my machine (M1 Pro MacBook Pro 14"), here are the approximate timings for different numbers of threads/processes:

Threads/processes   Sequential   Multithreading   Multiprocessing
1                   10s          10s              10s
2                   10s          10s              5.5s
4                   10s          10s              2.8s
6                   10s          10s              2.2s
8                   10s          10s              1.8s
10                  10s          10s              1.8s
12                  10s          10s              1.8s

As you can see, in the multiprocessing variant the speedup scales roughly with the number of workers until it levels off. This is roughly what you observed on your machines: a gain approaching 2x (about 1.7x) on your 2-core machine and almost 4x on the 4-vCPU one.

You can observe saturation at 8 processes (10 concurrent processes bring no further improvement), which suggests that my machine has 8 physical cores.
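
One way to produce a table like the one above is to sweep the worker count in a loop instead of editing N_WORKERS by hand. The following is a minimal sketch of that idea, not the code used for the timings above: the helper time_pool, the ITERATIONS constant, and the chosen worker counts are assumptions made for illustration, and the loop is kept short so the demo finishes quickly.

```python
import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from math import sqrt
from time import perf_counter

N_TASKS = 8           # hypothetical value, small to keep the demo fast
ITERATIONS = 500_000  # hypothetical value; raise it for steadier timings


def calc_sqrt(iterations):
    """CPU-bound task: the same loop as above, with a variable length."""
    i = 2147483647
    for _ in range(iterations):
        i -= 1
        sqrt(i)
    return iterations


def time_pool(executor_cls, n_workers, n_tasks=N_TASKS, iterations=ITERATIONS):
    """Return the wall-clock seconds needed to run n_tasks on n_workers."""
    start = perf_counter()
    with executor_cls(max_workers=n_workers) as pool:
        # list() collects every result; leaving the with-block also
        # waits for the workers to shut down before the timer stops.
        list(pool.map(calc_sqrt, [iterations] * n_tasks))
    return perf_counter() - start


if __name__ == "__main__":
    print(f"logical CPUs reported by the OS: {os.cpu_count()}")
    for n_workers in (1, 2, 4, 8):
        elapsed = time_pool(ProcessPoolExecutor, n_workers)
        print(f"{n_workers:>2} processes: {elapsed:.2f}s")
```

Note that os.cpu_count() reports logical CPUs, so on a 2-core / 4-thread machine it returns 4. With a workload this short, process start-up cost is a noticeable part of each measurement, so the numbers only become comparable to the table once ITERATIONS is increased.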

Note that a CPU's physical cores are not the same as its hardware threads (the logical cores exposed by hyper-threading). The Core i5-7300U has 4 hardware threads, but that is not equivalent to a machine with 4 physical cores. Hyper-threading can improve a CPU's multiprocessing throughput, but the gain is generally smaller than from adding physical cores. For instance, Intel claims a 15% to 30% performance gain from hyper-threading, far from the 2x gain you might imagine when reading "2 cores / 4 threads" on the CPU specs.
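
As a back-of-the-envelope illustration of that comparison, the sketch below assumes a deliberately simple model that is not part of the original answer: perfect scaling across physical cores, with the hyper-threading gain applied uniformly on top. The function name estimated_max_speedup is hypothetical.

```python
def estimated_max_speedup(physical_cores, smt_gain):
    """Ideal speedup over one thread: perfect scaling across physical
    cores, multiplied by the fractional gain from hyper-threading."""
    return physical_cores * (1.0 + smt_gain)


# The 15% and 30% figures are the range attributed to Intel above.
for gain in (0.15, 0.30):
    print(f"SMT gain {gain:.0%}: up to {estimated_max_speedup(2, gain):.1f}x")
```

Under these assumptions a 2-core / 4-thread CPU would top out at about 2.3x to 2.6x rather than 4x. The laptop in the question measured about 1.7x (39.4 s versus 23.4 s), so the real gain there was lower still.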

Answer 2

Score: 1

The parallelism between two logical cores sharing a physical core is complicated; see https://en.wikipedia.org/wiki/Simultaneous_multithreading (Intel brands its implementation of SMT as "hyperthreading"). There's only one set of execution units, and the front-end alternates cycles between threads when neither is stalled. Out-of-order exec in the back-end does happen on uops from both logical cores (confusingly called "hardware threads") at the same time.

If you wrote this in C, you'd get no benefit from hyperthreading for square roots, since the FP div/sqrt unit is slowish compared to everything else, so it's very easy for one thread to max it out (assuming it compiles to a loop doing cvtsi2sd and sqrtsd double-precision square root, which has plenty of precision). Unlike most instructions, division (and square root) aren't fully pipelined on modern CPUs: the execution unit can't start working on a new one every clock cycle. And there's only one such execution unit on your Kaby Lake CPU.

But this is Python; it's taking about 39.3 seconds for 160M square roots (8 calls of 20M each) on a single core, or about 0.25 microseconds per sqrt. At 3.5GHz, that's roughly 860 clock cycles per square root, vs. an average throughput of one per 4.5 cycles for the sqrtsd asm instruction on Kaby Lake (https://uops.info/, check the SSE or AVX instruction set).

So either interpreter overhead is large, or it's doing some fancy integer square root thing instead of taking advantage of double for small-enough integers. (Python integers are arbitrary precision.) So it's just a coincidence that hardware FPU sqrt throughput would be the bottleneck for a C program; Python is obviously not bottlenecked on that, or is doing so much else around it that the HW div/sqrt unit is busy for a trivial fraction of the time. Unless a lot of the time Python is spending is on integer division, which is also not fully pipelined.

Hyperthreading is useful when a single thread can't max out the execution resources of a single core, especially because of branch mispredictions, cache misses, or low instructions-per-cycle due to data dependencies (e.g. one long chain of FP adds or multiplies, so there's no instruction-level parallelism for the CPU to find).

Apparently the workload you picked isn't like that. Many do get some speedup. (Often not linear algebra stuff, though; good BLAS libraries can max out the FMA execution units with one thread for things like matmul, and having 2 threads per core competing for the same cache tends to make things worse.)

Related:

  • https://stackoverflow.com/questions/74152562/what-is-the-difference-between-hyperthreading-and-multithreading-does-amd-zen-u - very little difference, and my answer has a bunch of links with more computer-architecture detail about how it works.
  • https://stackoverflow.com/questions/23078766/is-hyperthreading-smt-a-flawed-concept - no, the answers explain when / why it's useful.
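
The per-iteration cost discussed in this answer can be worked out from the question's own numbers. This is a minimal sketch, assuming the 3.5 GHz clock mentioned above and counting all 8 calls to calc_sqrt (20 million iterations each) in the one-by-one run; the helper name cycles_per_iteration is hypothetical.

```python
def cycles_per_iteration(total_seconds, iterations, clock_hz):
    """Average clock cycles spent per loop iteration."""
    return total_seconds / iterations * clock_hz


# one_by_one() calls calc_sqrt() 8 times, 20 million iterations each.
iterations = 8 * 20_000_000
# 39.31... s is the one-by-one time from the laptop output in the question.
cycles = cycles_per_iteration(39.31295585632324, iterations, 3.5e9)

print(f"{cycles:.0f} cycles per iteration")
print(f"{cycles / 4.5:.0f}x the 4.5-cycle sqrtsd throughput figure")
```

Each iteration also includes the loop overhead and the i -= 1 subtraction, so this is an upper bound on the cost of the sqrt call itself. Either way it is far above the 4.5 cycles quoted for the hardware instruction.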

Posted by huangapple on February 27, 2023 at 14:52:32
Source link (please keep it when reposting): https://go.coder-hub.com/75577468.html