Worse performance when using goroutines in a parallel quicksort implementation
Question
Note: The "Go-lang parallel segment runs slower than series segment" question dealt with race conditions, this one has another issue, so imho it's not a duplicate.
I'm trying to find an explanation for the following situation:
Running parallel quicksort results in a significantly longer runtime when done using goroutines.
Benchmarks are after the code:
package c9sort

import (
	"time"
)

var runInParallel bool

func Quicksort(nums []int, parallel bool) ([]int, int) {
	started := time.Now()
	ch := make(chan int)
	runInParallel = parallel
	go quicksort(nums, ch)

	sorted := make([]int, len(nums))
	i := 0
	for next := range ch {
		sorted[i] = next
		i++
	}
	return sorted, int(time.Since(started).Nanoseconds() / 1000000) // elapsed ms
}

func quicksort(nums []int, ch chan int) {
	// Choose first number as pivot
	pivot := nums[0]

	// Prepare secondary slices
	smallerThanPivot := make([]int, 0)
	largerThanPivot := make([]int, 0)

	// Slice except pivot
	nums = nums[1:]

	// Go over slice and partition
	for _, i := range nums {
		switch {
		case i <= pivot:
			smallerThanPivot = append(smallerThanPivot, i)
		case i > pivot:
			largerThanPivot = append(largerThanPivot, i)
		}
	}

	var ch1 chan int
	var ch2 chan int

	// Now do the same for the two slices
	if len(smallerThanPivot) > 1 {
		ch1 = make(chan int, len(smallerThanPivot))
		if runInParallel {
			go quicksort(smallerThanPivot, ch1)
		} else {
			quicksort(smallerThanPivot, ch1)
		}
	}
	if len(largerThanPivot) > 1 {
		ch2 = make(chan int, len(largerThanPivot))
		if runInParallel {
			go quicksort(largerThanPivot, ch2)
		} else {
			quicksort(largerThanPivot, ch2)
		}
	}

	// Wait until the sorting finishes for the smaller slice
	if len(smallerThanPivot) > 1 {
		for i := range ch1 {
			ch <- i
		}
	} else if len(smallerThanPivot) == 1 {
		ch <- smallerThanPivot[0]
	}

	ch <- pivot

	if len(largerThanPivot) > 1 {
		for i := range ch2 {
			ch <- i
		}
	} else if len(largerThanPivot) == 1 {
		ch <- largerThanPivot[0]
	}

	close(ch)
}
Benchmarks for a random perm of 500000 integers:
Ran 100 times
Non parallel average - 1866ms
Parallel average - 2437ms
Any explanation would be appreciated. I know goroutines may not be best for this kind of parallelism, but I'm trying to understand the reason.
Thank you in advance.
Answer 1
Score: 6
The general answer is that inter-thread coordination has a cost, so sending a task off to another goroutine can only speed things up if the task is at least a certain size. So don't send single items.
For a divide-and-conquer algorithm like quicksort, the ways to parallelize can be interesting. Generally: when you recurse, you can start the "sort one half of the array" task in a goroutine if it's large enough. The "if it's large enough" part is how you reduce coordination overhead.
In more detail, that looks like:

- Write a non-parallel qsort(data []int).
- Change qsort to optionally take a *sync.WaitGroup, which we'll call wg, for signaling that it's done. Add if wg != nil { wg.Done() } before each of its returns.
- Where qsort recurses, have it check whether the half of the data it's going to sort is large-ish (say, over 500 items?), and:
  - if it's large, start another task: wg.Add(1); go qsort(half, wg)
  - if it's not, sort here and now: qsort(half, nil)
- Wrap qsort to hide the parallel machinery from the user: have quicksort(data) do wg := new(sync.WaitGroup), do an initial wg.Add(1), call qsort(data, wg), and do wg.Wait() to wait for all of the background sorts to complete.
This isn't an optimal strategy, because it keeps forking off new goroutines even after there are enough tasks to keep your cores busy. And there is a lot of tuning one could do around just how you hand off some tasks to the background. But the big point is, using another thread only for large subtasks is a way to parallelize quicksort without getting smashed by coordination overhead.
Here's a demo with an embarrassingly slapdash quicksort (you have to run it locally to get the timing)--good chance of bugs, and it's definitely quadratic on already-sorted data; the point is to get the idea of parallel divide-and-conquer across.
There is also a bottom-up strategy--sort pieces then merge--used in, for instance, this parallel sort package. The in-place merge code that package uses is tricky, though.
(And if you haven't set GOMAXPROCS, set it with something like runtime.GOMAXPROCS(runtime.NumCPU()) in main.)
Finally, look at Go's sort package. There's a lot of code, but it's pretty clear, and once you get it you can do your experiments with a "real", fleshed-out sort implementation.
Answer 2
Score: 2
Turns out it was very simple. As I'm on a new machine, the GOMAXPROCS variable wasn't set.
The new benchmark favors, as predicted, the parallel implementation:
Set to double the number of cores:
Using 16 goroutines
Ran 100 times
Non parallel average - 1980
Parallel average - 1133
Set to the number of cores:
Using 8 goroutines
Ran 100 times
Non parallel average - 2004
Parallel average - 1197
By the way, this is fairly consistent. The average for double the number of cores is always a bit better.
Benchmark for a larger collection (1000000):
Using 8 goroutines
Ran 100 times
Non parallel average - 3748
Parallel average - 2265
With double:
Using 16 goroutines
Ran 100 times
Non parallel average - 3817
Parallel average - 2012