Python multiprocessing 2x slower than serial regardless of chunksize?


Question

In the primal_body_selector function, you want to run lines 146 to 150 in parallel. You tried multiprocessing to improve performance, but found that using a ThreadPool actually made the code slower than before. You suspect this may be related to serializing M and tuples, but M is a 32x32 matrix and tuples holds only 179 tuples of length 3 each, so the amount of data should not be large.

I cannot profile your code directly, but I can offer some suggestions:

  1. Consider whether the task is CPU-bound: if your mutual_information function really is dominated by matrix operations and the computation is heavy, multiprocessing can be a good fit because it can use multiple processor cores. Make sure the task really is CPU-bound, though.

  2. Watch out for the GIL: Python's Global Interpreter Lock (GIL) limits parallel execution across threads. If your task is constrained by the GIL, using multiple processes (multiprocessing) rather than threads (ThreadPool) may be more effective; see the sketch after this list.

  3. Memory usage: make sure memory is not the bottleneck. Tasks that need a lot of memory can slow down badly, although, based on your description, the size of M and tuples does not look like a problem.

  4. Tune the degree of parallelism: try different numbers of processes/threads to find the best-performing configuration. Too much concurrency can also hurt performance.

  5. Use a profiler: run a profiling tool such as cProfile to see which parts of your code take the most time. That will help pin down the actual cause of the performance problem.
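
As a minimal sketch of point 2 (the worker below is a made-up stand-in for a heavy, pure-Python computation, not the original infotuple code): on CPU-bound work a process pool can outrun a thread pool, and the worker should return its result rather than write into a shared list, because worker processes do not share the caller's memory:

    # Illustrative only: compare ThreadPool vs. Pool on a CPU-bound stand-in task.
    from multiprocessing import Pool
    from multiprocessing.pool import ThreadPool
    import time

    def cpu_bound_task(i, x):
        # stand-in for a heavy numeric computation such as mutual_information
        total = 0.0
        for k in range(200_000):
            total += (x * k) % 7
        return i, total  # return the result instead of mutating a shared list

    if __name__ == "__main__":
        args = [(i, float(i)) for i in range(64)]

        t0 = time.perf_counter()
        with ThreadPool(processes=8) as tp:   # threads share memory but contend for the GIL
            thread_results = tp.starmap(cpu_bound_task, args)
        t1 = time.perf_counter()

        with Pool(processes=8) as pp:         # separate processes, no GIL contention
            process_results = pp.starmap(cpu_bound_task, args)
        t2 = time.perf_counter()

        print(f"ThreadPool: {t1 - t0:.2f}s, Pool: {t2 - t1:.2f}s")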

Note that the best parallel strategy depends on the nature of the task and on your hardware. In the end, tuning performance usually takes some experimentation and measurement. I hope these suggestions help you locate the problem and speed up your code.

English:

I am trying to modify the code found here to use multiprocessing: https://github.com/Sensory-Information-Processing-Lab/infotuple/blob/master/body_metrics.py

In the primal_body_selector function, I want to run lines 146-150 in parallel:

    for i in range(len(tuples)):
        a = tuples[i][0]
        B = tuples[i][1:]

        infogains[i] = mutual_information(M, a, B, M.shape[0]/10, dist_std, mu)

I believe this could lead to significant performance gains because the mutual_information function (code here) is mainly just matrix math, so multiprocessing should really help.

However, when I try a simple pool = ThreadPool(processes=8) at the top of the file (the function is called from a separate main() method, so pool is initialized on import) and run the code below in place of the loop shown above:

    def infogains_task_function(i, infogains, M, tuples, dist_std, mu):
        a = tuples[i][0]
        B = tuples[i][1:]

        infogains[i] = mutual_information(M, a, B, M.shape[0], dist_std, mu)

................

    # inside primal_body_selector
    pool.starmap(infogains_task_function,
                 [(i, infogains, M, tuples, dist_std, mu) for i in range(len(tuples))],
                 chunksize=80)

This version is twice as slow as the loop (4 seconds vs. 2 seconds, measured with time.time()). Why is that? Regardless of which chunksize I pick (I tried 1, 20, 40, and 80), it is twice as slow.

I originally thought serializing M and tuples could be the reason, but M is a 32x32 matrix and tuples holds just 179 tuples of length 3 each, so it's really not that much data, right?
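
For reference, a rough sketch (made-up data with the shapes quoted above, and a hypothetical infogains list) of how the pickled size of one task's argument tuple could be measured:

    # Illustrative only: estimate how much data starmap must pickle per task.
    import pickle
    import numpy as np

    M = np.random.rand(32, 32)                                   # 32x32 matrix
    tuples = [tuple(np.random.randint(0, 32, size=3)) for _ in range(179)]
    infogains = [0.0] * len(tuples)
    dist_std, mu = 1.0, 0.0                                      # placeholder values

    one_task_args = (0, infogains, M, tuples, dist_std, mu)
    print(len(pickle.dumps(one_task_args)), "bytes per task")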

Any help would be greatly appreciated.

Answer 1

Score: 1


Neither multiprocessing nor multithreading is a magic silver bullet... You are right that multiprocessing is a nice tool for heavy computations on multi-processor systems (or multi-core processors, which is functionally the same).

The problem is that spreading operations over a number of threads or processes adds complexity: you have to share or copy some memory, gather the results at some point, and synchronize everything. So for simple tasks the overhead is higher than the gain.
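
For instance, a minimal sketch (a trivial made-up task, not the code from the question) of how the pool overhead can dominate when the per-call work is tiny:

    # Illustrative only: with tiny per-call work, pickling/dispatch overhead
    # makes a process pool slower than a plain loop.
    from multiprocessing import Pool
    import time

    def tiny_task(x):
        return x * x  # almost no work per call

    if __name__ == "__main__":
        data = list(range(10_000))

        t0 = time.perf_counter()
        serial = [tiny_task(x) for x in data]
        t1 = time.perf_counter()

        with Pool(processes=8) as pool:
            parallel = pool.map(tiny_task, data, chunksize=80)
        t2 = time.perf_counter()

        print(f"serial: {t1 - t0:.4f}s, pool: {t2 - t1:.4f}s")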

Worse, while carefully splitting your tasks by hand may reduce that overhead, a generic tool (even a nicely crafted one like the Python standard library) has to take care of many use cases and include a number of checks in its code... again with added complexity. And the manual way dramatically increases the development (and testing) cost...

What you should remember from this: use simple tools for simple tasks, and only go with multi-x solutions when they are really required. Some real use cases:

  • heavily loaded operational servers: the extra development cost is balanced by the ability to support heavy loads without crashing
  • really heavy computations (meteorological or oceanographic forecast models): when the length of a single run exceeds several hours, something has to be done
  • and most important: multi-x approaches are optimization tools. Optimization always has a cost, and you must think carefully about what really requires it and what can be done, and use benchmarks to make sure that the added complexity was worth it - nothing is ever obvious here....

BTW, for simple computation tasks like matrix operations, numpy/scipy are probably far better suited than raw Python processing...
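
As a rough illustration of that last point (made-up data and a made-up per-tuple formula, not the original mutual_information computation), one vectorized numpy expression over all the triplets can replace the Python-level loop entirely:

    # Illustrative only: vectorize a per-tuple computation instead of looping.
    import numpy as np

    rng = np.random.default_rng(0)
    M = rng.random((32, 32))
    tuples = rng.integers(0, 32, size=(179, 3))   # 179 triplets of indices

    # Python-level loop: one small computation per triplet
    loop_result = np.array([M[a, b] * M[a, c] for a, b, c in tuples])

    # Vectorized: index all triplets at once
    a, b, c = tuples.T
    vec_result = M[a, b] * M[a, c]

    assert np.allclose(loop_result, vec_result)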
