英文:
Why does this program run faster when it's allocated fewer threads?
问题
我有一个相当简单的Go程序,用于计算随机斐波那契数,以测试我编写的工作池中观察到的一些奇怪行为。当我分配一个线程时,程序在1.78秒内完成。当我分配4个线程时,它在9.88秒内完成。
代码如下:
var workerWG sync.WaitGroup
func worker(fibNum chan int) {
for {
var tgt = <-fibNum
workerWG.Add(1)
var a, b float64 = 0, 1
for i := 0; i < tgt; i++ {
a, b = a+b, a
}
workerWG.Done()
}
}
func main() {
rand.Seed(time.Now().UnixNano())
runtime.GOMAXPROCS(1) // 有疑问的一行
var fibNum = make(chan int)
for i := 0; i < 4; i++ {
go worker(fibNum)
}
for i := 0; i < 500000; i++ {
fibNum <- rand.Intn(1000)
}
workerWG.Wait()
}
如果我将runtime.GOMAXPROCS(1)
替换为4
,程序运行时间会延长四倍。
这是怎么回事?为什么将更多可用的线程添加到工作池会减慢整个池的速度?
我个人的理论是,这与工作线程的处理时间小于线程管理的开销有关,但我不确定。我对以下测试的保留是由于以下代码:
for {
<-fibNum
time.Sleep(500 * time.Millisecond)
}
无论是一个可用线程还是四个可用线程都需要相同的时间。
英文:
I have a fairly simple Go program designed to compute random Fibonacci numbers to test some strange behavior I observed in a worker pool I wrote. When I allocate one thread, the program finishes in 1.78s. When I allocate 4, it finishes in 9.88s.
The code is as follows:
var workerWG sync.WaitGroup
func worker(fibNum chan int) {
for {
var tgt = <-fibNum
workerWG.Add(1)
var a, b float64 = 0, 1
for i := 0; i < tgt; i++ {
a, b = a+b, a
}
workerWG.Done()
}
}
func main() {
rand.Seed(time.Now().UnixNano())
runtime.GOMAXPROCS(1) // LINE IN QUESTION
var fibNum = make(chan int)
for i := 0; i < 4; i++ {
go worker(fibNum)
}
for i := 0; i < 500000; i++ {
fibNum <- rand.Intn(1000)
}
workerWG.Wait()
}
If I replace runtime.GOMAXPROCS(1)
with 4
, the program takes four times as long to run.
What's going on here? Why does adding more available threads to a worker pool slow the entire pool down?
My personal theory is that it has to do with the processing time of the worker being less than the overhead of thread management, but I'm not sure. My reservation is caused by the following test:
When I replace the worker
function with the following code:
for {
<-fibNum
time.Sleep(500 * time.Millisecond)
}
both one available thread and four available threads take the same amount of time.
答案1
得分: 5
我修改了你的程序,如下所示:
package main
import (
"math/rand"
"runtime"
"sync"
"time"
)
var workerWG sync.WaitGroup
func worker(fibNum chan int) {
for tgt := range fibNum {
var a, b float64 = 0, 1
for i := 0; i < tgt; i++ {
a, b = a+b, a
}
}
workerWG.Done()
}
func main() {
rand.Seed(time.Now().UnixNano())
runtime.GOMAXPROCS(1) // 有疑问的一行代码
var fibNum = make(chan int)
for i := 0; i < 4; i++ {
go worker(fibNum)
workerWG.Add(1)
}
for i := 0; i < 500000; i++ {
fibNum <- rand.Intn(100000)
}
close(fibNum)
workerWG.Wait()
}
我清理了等待组的使用方式。
我将 rand.Intn(1000)
改为 rand.Intn(100000)
。
在我的机器上运行结果如下:
$ time go run threading.go (GOMAXPROCS=1)
real 0m20.934s
user 0m20.932s
sys 0m0.012s
$ time go run threading.go (GOMAXPROCS=8)
real 0m10.634s
user 0m44.184s
sys 0m1.928s
这意味着在你的原始代码中,工作量与同步(通道读写)相比是微不足道的。减速是由于需要在线程之间进行同步,而不是只在之间执行非常少量的工作。
实际上,与计算斐波那契数列到1000相比,同步是昂贵的。这就是为什么人们倾向于不鼓励进行微基准测试的原因。增加这个数字可以得到更好的视角。但更好的做法是对实际工作进行基准测试,包括IO、系统调用、处理、计算、写输出、格式化等。
编辑:作为一个实验,我将工作线程数增加到8,并将GOMAXPROCS设置为8,结果如下:
$ time go run threading.go
real 0m4.971s
user 0m35.692s
sys 0m0.044s
英文:
I revised your program to look like the following:
package main
import (
"math/rand"
"runtime"
"sync"
"time"
)
var workerWG sync.WaitGroup
func worker(fibNum chan int) {
for tgt := range fibNum {
var a, b float64 = 0, 1
for i := 0; i < tgt; i++ {
a, b = a+b, a
}
}
workerWG.Done()
}
func main() {
rand.Seed(time.Now().UnixNano())
runtime.GOMAXPROCS(1) // LINE IN QUESTION
var fibNum = make(chan int)
for i := 0; i < 4; i++ {
go worker(fibNum)
workerWG.Add(1)
}
for i := 0; i < 500000; i++ {
fibNum <- rand.Intn(100000)
}
close(fibNum)
workerWG.Wait()
}
- I cleaned up the wait group usage.
- I changed
rand.Intn(1000)
torand.Intn(100000)
On my machine that produces:
$ time go run threading.go (GOMAXPROCS=1)
real 0m20.934s
user 0m20.932s
sys 0m0.012s
$ time go run threading.go (GOMAXPROCS=8)
real 0m10.634s
user 0m44.184s
sys 0m1.928s
This means that in your original code, the work performed vs synchronization (channel read/write) was negligible. The slowdown came from having to synchronize across threads instead of one and only perform a very small amount of work inbetween.
In essence, synchronization is expensive compared to calculating fibonacci numbers up to 1000. This is why people tend to discourage micro-benchmarks. Upping that number gives a better perspective. But an even better idea is to benchmark actual work being done i.e. including IO, syscalls, processing, crunching, writing output, formatting, etc.
Edit: As an experiment, I upped the number of workers to 8 with GOMAXPROCS set to 8 and the result was:
$ time go run threading.go
real 0m4.971s
user 0m35.692s
sys 0m0.044s
答案2
得分: 1
@thwd编写的代码是正确的,符合Go语言的习惯用法。
由于sync.WaitGroup的原子性质,你的代码被序列化了。无论是workerWG.Add(1)
还是workerWG.Done()
,都会阻塞直到能够原子性地更新内部计数器。
- 由于工作负载在0到1000个递归调用之间,单个核心的瓶颈足以将等待组计数器的数据竞争降到最低。
- 在多个核心上,处理器会花费大量时间旋转以修复等待组调用的冲突。再加上等待组计数器保持在一个核心上,现在就增加了核心之间的通信(占用更多的周期)。
简化代码的一些建议:
- 对于少量固定数量的goroutine,使用完整的通道(
chan struct{}
以避免分配)更加经济高效。 - 使用发送通道关闭作为goroutine的终止信号,并让它们发出已退出的信号(使用waitgroup或通道)。然后,关闭完整的通道以释放它们供垃圾回收使用。
- 如果需要waitgroup,请积极地将调用次数最小化。这些调用必须在内部进行序列化,因此额外的调用会强制添加同步。
英文:
The code written by @thwd is correct and idiomatic Go.
Your code was being serialized due to the atomic nature of sync.WaitGroup. Both workerWG.Add(1)
and workerWG.Done()
will block until they're able to atomically update the internal counter.
- Since the workload is between 0 and 1000 recursive calls, the bottleneck of a single core was enough to keep data races on the waitgroup counter to a minimum.
- On multiple cores, the processor spends a lot of time spinning to fix the collisions of waitgroup calls. Add that to the fact that the waitgroup counter is kept on one core and you now have added communication between cores (taking up even more cycles).
A couple hints for simplifying code:
- For a small, set number of goroutines, a complete channel (
chan struct{}
to avoid allocations) is cheaper to use. - Use the send channel close as a kill signal for goroutines and have them signal that they've exited (waitgroup or channel). Then, close to complete channel to free them up for the GC.
- If you need a waitgroup, aggressively minimize the number of calls to it. Those calls must be internally serialized, so extra calls forces added synchronization.
答案3
得分: -1
你的worker
中的主要计算例程不允许调度程序运行。
可以手动调用调度程序,例如:
for i := 0; i < tgt; i++ {
a, b = a+b, a
if i%300 == 0 {
runtime.Gosched()
}
}
当从一个线程切换到两个线程时,可以将墙上时钟减少30%。
这样的人工微基准测试很难做对。
英文:
Your main computation routine in worker
does not allow the scheduler to run.
Calling the scheduler manually like
for i := 0; i < tgt; i++ {
a, b = a+b, a
if i%300 == 0 {
runtime.Gosched()
}
}
Reduces wall clock by 30% when switching from one to two threads.
Such artificial microbenchmarks are really hard to get right.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论