What is the difference between distribute-work-synchronize and fork-join parallel programming methodologies
Question
In this article on Wikipedia:
https://en.wikipedia.org/wiki/Go_(programming_language)#Suitability_for_parallel_programming
It is claimed that the Go expert used a distribute-work-synchronize pattern to organize his parallel programs, while the non-expert used fork-join:
https://upload.wikimedia.org/wikipedia/commons/thumb/f/f1/Fork_join.svg/2000px-Fork_join.svg.png
I am familiar with fork-join from school, but I was wondering what the distribute-work-synchronize pattern is, and what the differences are between it and the classical fork-join model that I'm familiar with?
When I do fork-join I typically run as many threads as I have cores, but in the paper they say the Go expert did this as well, and they briefly mention the overhead of creating new threads being one of the ways the Go expert optimized the code, but they don't seem to go into detail.
Answer 1
Score: 4
I would be very careful taking the statement you've quoted from https://en.wikipedia.org/wiki/Go_(programming_language)#Suitability_for_parallel_programming as a general truth for Go programming.
I assume that what's described in the study as distribute-work-synchronize is the approach of dividing a problem into subtasks that are determined mainly by the parallelism achievable in hardware, and less by the natural way the problem decomposes into smaller tasks.
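For concreteness, here is a minimal, hypothetical sketch of what such a structure could look like in Go; the slice, the `work` function and the static chunking are stand-ins of mine, not code from the study:

```go
// Hypothetical sketch: split the input into one contiguous chunk per core,
// do the work in exactly that many goroutines, and synchronize once at the end.
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// work stands in for whatever per-item computation the real problem needs.
func work(x int) int { return x * x }

func main() {
	items := make([]int, 1000)
	for i := range items {
		items[i] = i
	}
	results := make([]int, len(items))

	workers := runtime.NumCPU() // distribute: driven by the hardware, not by the problem's structure
	chunk := (len(items) + workers - 1) / workers

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		lo, hi := w*chunk, (w+1)*chunk
		if hi > len(items) {
			hi = len(items)
		}
		if lo >= hi {
			break
		}
		wg.Add(1)
		go func(lo, hi int) { // work: each goroutine owns one chunk, no further task creation
			defer wg.Done()
			for i := lo; i < hi; i++ {
				results[i] = work(items[i])
			}
		}(lo, hi)
	}
	wg.Wait() // synchronize: a single barrier when all chunks are done

	fmt.Println(results[0], results[len(results)-1])
}
```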
This approach may give you some performance benefits depending on your specific problem and your expertise, but it may not be trivial to apply even for embarrassingly parallel problems. In addition, this approach is more dependent on the specific hardware you're using (e.g. 2-32 vs 64-1024 cores, regular CAS vs LL/SC), on the specific problem size (e.g. fits in L1 vs barely fits in RAM), and most importantly, on your expertise with that approach and with the tool that you're using to solve your problem.
The above is standard "premature optimization is the root of all evil" / "simple, adequate and correct trumps complex, super-fast and with an insidious bug" advice, but the actual experiment cited in the paper also gives some examples of why you should use your own judgment about which approach to use.
- The study used Go 1.0.3. The significant improvements made since then on the scheduling, garbage collection and goroutine/channel performance fronts may have effectively made the results obsolete.
  - Ironically, the paper mentions how for one of the Chapel solutions, the expert version was ~68% slower when Chapel 1.6 (instead of 1.5) was used.
- The study does not claim to provide statistically significant results - for each of the 4 platforms, a single non-expert solved 6 synthetic problems that fit a specific blueprint, then rewrote his solutions according to the advice of a single expert.
- The expert should not be held responsible for applying his advice outside of the specific context. Distribute-work-synchronize was the better approach for these specific problems, if you are a Staff Software Engineer on the Golang team (Luuk van Dijk) and your alternative was using plain divide-and-conquer with Go 1.0.3.
> When I do fork join I typically run as many threads as I do cores, but in the paper they say the go expert did this as well, and they briefly mention the overhead of creating new threads being one of the ways the go expert optimized the code, but don't seem to go into detail.
I'm assuming the overhead of creating new threads is related to the proliferation of tasks when approaching the bottom of the recursion tree.
I would think that algorithms that fall into Case 2 and 3 of the Master Theorem will be particularly affected, but even algorithms that fall in Case 1 (where the work done on the leaf level of the tree is most significant, i.e. the overhead of the spawned threads is diluted the most) will be affected. For example, making a new logical task for the comparison of each pair of elements in a top-down merge sort is probably redundant.
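To make the proliferation concrete, here is a hypothetical sketch (not the paper's code) of a top-down merge sort that forks a goroutine for every recursive call; near the leaves each goroutine merges only a couple of elements, so the spawn-and-join overhead dominates the useful work:

```go
// Hypothetical sketch of the fork-join extreme: every recursive call forks.
package main

import (
	"fmt"
	"sync"
)

func mergeSort(a []int) []int {
	if len(a) <= 1 {
		return a
	}
	mid := len(a) / 2

	var left, right []int
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); left = mergeSort(a[:mid]) }()  // fork
	go func() { defer wg.Done(); right = mergeSort(a[mid:]) }() // fork
	wg.Wait()                                                   // join

	return merge(left, right)
}

// merge combines two sorted slices into one sorted slice.
func merge(l, r []int) []int {
	out := make([]int, 0, len(l)+len(r))
	for len(l) > 0 && len(r) > 0 {
		if l[0] <= r[0] {
			out = append(out, l[0])
			l = l[1:]
		} else {
			out = append(out, r[0])
			r = r[1:]
		}
	}
	return append(append(out, l...), r...)
}

func main() {
	fmt.Println(mergeSort([]int{5, 3, 8, 1, 9, 2, 7, 4}))
}
```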
IMHO running as many threads as you do cores, but dividing the work naturally at each recursion level, is a great compromise between my understanding of distribute-work-synchronize and a plain divide-and-conquer.
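A hypothetical sketch of that compromise: the recursion still splits the slice naturally at every level, but a buffered token channel caps the number of extra goroutines at roughly the number of cores, and when no token is free the call simply recurses sequentially in the current goroutine. The `merge` helper is the same as in the previous sketch.

```go
// Hypothetical sketch of the compromise: natural recursive splitting,
// but at most ~NumCPU extra goroutines in flight at any time.
package main

import (
	"fmt"
	"runtime"
)

// tokens holds one slot per core; holding a slot entitles a call to fork.
var tokens = make(chan struct{}, runtime.NumCPU())

func boundedMergeSort(a []int) []int {
	if len(a) <= 1 {
		return a
	}
	mid := len(a) / 2

	select {
	case tokens <- struct{}{}: // a slot is free: sort the left half in a new goroutine
		leftCh := make(chan []int, 1)
		go func() {
			leftCh <- boundedMergeSort(a[:mid])
			<-tokens // give the slot back once the left half is done
		}()
		right := boundedMergeSort(a[mid:]) // right half stays in the current goroutine
		return merge(<-leftCh, right)
	default: // all slots busy: recurse sequentially, spawn nothing
		return merge(boundedMergeSort(a[:mid]), boundedMergeSort(a[mid:]))
	}
}

// merge is the same helper as in the previous sketch.
func merge(l, r []int) []int {
	out := make([]int, 0, len(l)+len(r))
	for len(l) > 0 && len(r) > 0 {
		if l[0] <= r[0] {
			out = append(out, l[0])
			l = l[1:]
		} else {
			out = append(out, r[0])
			r = r[1:]
		}
	}
	return append(append(out, l...), r...)
}

func main() {
	fmt.Println(boundedMergeSort([]int{5, 3, 8, 1, 9, 2, 7, 4}))
}
```

The token channel is just one way to bound the fan-out; a fixed pool of worker goroutines fed through a channel of subtasks would be another.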
You are still paying some price in complexity (and probably, but not necessarily, in running time) for the scheduling of your N tasks on the K worker threads. It is possible for this price to cost you missed parallelization opportunities at runtime, for example because of cache thrashing, sub-optimal scheduling, thread or core affinity in play, etc.
However, this may be abstracted away from your program at the language platform level, at the Fork-Join framework level (in case you are using a framework), or maybe at the OS level.
In this case, your solution is not fully responsible for adapting to changes in your language platform, in your problem size, or in your machine hardware, and you should be able to benefit from optimizations in the underlying layers without touching your solution.
My bet is that the increased complexity and the reduced portability of a custom-tailored distribute-work-synchronize solution are worth it only if you can prove that your natural divide-and-conquer solution is not enough, and you are aware of the consequences.