Most efficient number of goroutines on this machine
Question
So I do have a concurrent quicksort implementation written by me. It looks like this:
```go
func Partition(A []int, p int, r int) int {
	index := MedianOf3(A, p, r)
	swapArray(A, index, r)
	x := A[r]
	j := p - 1
	i := p
	for i < r {
		if A[i] <= x {
			j++
			tmp := A[j]
			A[j] = A[i]
			A[i] = tmp
		}
		i++
	}
	swapArray(A, j+1, r)
	return j + 1
}

func ConcurrentQuicksort(A []int, p int, r int) {
	wg := sync.WaitGroup{}
	if p < r {
		q := Partition(A, p, r)
		select {
		case sem <- true:
			wg.Add(1)
			go func() {
				ConcurrentQuicksort(A, p, q-1)
				<-sem
				wg.Done()
			}()
		default:
			Quicksort(A, p, q-1)
		}
		select {
		case sem <- true:
			wg.Add(1)
			go func() {
				ConcurrentQuicksort(A, q+1, r)
				<-sem
				wg.Done()
			}()
		default:
			Quicksort(A, q+1, r)
		}
	}
	wg.Wait()
}

func Quicksort(A []int, p int, r int) {
	if p < r {
		q := Partition(A, p, r)
		Quicksort(A, p, q-1)
		Quicksort(A, q+1, r)
	}
}
```
I have a `sem` buffered channel, which I use to limit the number of goroutines running (if it reaches that number, I don't set up another goroutine, I just do the normal quicksort on the subarray). First I started with 100, then changed to 50, then 20. The benchmarks got slightly better each time. But after switching to 10 it started to go back, and times got bigger. So there is some arbitrary number, at least for my hardware, that makes the algorithm run most efficiently.

When I was implementing this, I actually saw an SO question about what number of goroutines would be best, and now I cannot find it (Chrome history unfortunately doesn't save every visited site). Do you know how to calculate such a number? It would be best if I didn't have to hardcode it, and could just let the program work it out itself.
P.S. I have a non-concurrent quicksort, which runs about 1.7x slower than this one. As you can see in my code, I fall back to `Quicksort` when the number of running goroutines exceeds the limit I set up earlier. I thought about calling `ConcurrentQuicksort` there instead, but without the `go` keyword, just calling it plainly; maybe once other goroutines finished their work, the `ConcurrentQuicksort` I called would start launching goroutines again and speed up the process (because, as you can see, `Quicksort` only launches recursive quicksorts, without goroutines). I tried that, and the time was actually about 10% slower than the regular `Quicksort`. Do you know why that would happen?
Answer 1

Score: 10
You have to experiment a bit with this stuff, but I don't think the main concern is goroutines running at once. As the answer @reticentroot linked to says, it's not necessarily a problem to run a lot of simultaneous goroutines.
I think your main concern should be total number of goroutine launches. The current implementation could theoretically start a goroutine to sort just a few items, and that goroutine would spend a lot more time on startup/coordination than actual sorting.
The ideal is you only start as many goroutines as you need to get good utilization of all your CPUs. If your work items are ~equal size and your cores are ~equally busy, then starting one task per core is perfect.
Here, tasks aren't evenly sized, so you might split the sort into somewhat more tasks than you have CPUs and distribute them. (In production you would typically use a worker pool to distribute work without starting a new goroutine for every task, but I think we can get away with skipping that here.)
To get a workable number of tasks--enough to keep all cores busy, but not so many that you create lots of overhead--you can set a minimum size (initial array size/100 or whatever), and only split off sorts of arrays larger than that.
In slightly more detail, there is a bit of cost every time you send a task off to the background. For starters:
- Each goroutine launch spends a little time setting up the stack and doing scheduler bookkeeping
- Each task switch spends some time in the scheduler and may incur cache misses when the two goroutines are looking at different code or data
- Your own coordination code (channel sends and `sync` ops) takes time
Other things can prevent ideal speedups from happening: you could hit a systemwide limit on e.g. memory bandwidth as Volker pointed out, some sync costs can increase as you add cores, and you can run into various trickier issues sometimes. But the setup, switching, and coordination costs are a good place to start.
The benefit that can outweigh the coordination costs is, of course, other CPUs getting work done when they'd otherwise sit idle.
I think, but haven't tested, that your problems at 50 goroutines are 1) you already reached nearly-full utilization long ago, so adding more tasks adds more coordination work without making things go faster, and 2) you're creating goroutines for tiny sorts, which may spend more of their time setting up and coordinating than they actually do sorting. And at 10 goroutines your problem might be that you're no longer achieving full CPU utilization.
If you wanted, you could test those theories by counting the number of total goroutine launches at various goroutine limits (in an atomic global counter) and measuring CPU utilization at various limits (e.g. by running your program under the Linux/UNIX `time` utility).
The approach I'd suggest for a divide-and-conquer problem like this is only fork off a goroutine for large enough subproblems (for quicksort, that means large enough subarrays). You can try different limits: maybe you only start goroutines for pieces that are more than 1/64th of the original array, or pieces above some static threshold like 1000 items.
You meant this sort routine as an exercise, I suspect, but there are various things you can do to make your sorts faster or more robust against weird inputs. The standard library sort falls back to insertion sort for small subarrays and uses heapsort for the unusual data patterns that cause quicksort problems.
You can also look at other algorithms like radix sort for all or part of the sorting, which I played with. That sorting library is also parallel. I wound up using a minimum cutoff of 127 items before I'd hand a subarray off for other goroutines to sort, and I used an arrangement with a fixed pool of goroutines and a buffered chan to pass tasks between them. That produced decent practical speedups at the time, though it was likely not the best approach at the time and I'm almost sure it's not on today's Go scheduler. Experimentation is fun!
Answer 2

Score: 0
If the operation is CPU-bound, my experiments show that the optimal number is the number of CPUs.