英文:
Concurrent code slower than sequential code on parallel problem?
问题
我写了一些代码来执行蒙特卡洛模拟。我首先写的是这个顺序版本:
func simulationSequential(experiment func() bool, numTrials int) float64 {
ocurrencesEvent := 0
for trial := 0; trial < numTrials; trial++ {
eventHappend := experiment()
if eventHappend {
ocurrencesEvent++
}
}
return float64(ocurrencesEvent) / float64(numTrials)
}
然后,我想到可以并发地运行一些实验,并利用我的笔记本电脑的多个核心来更快地得到结果。所以,我写了以下版本:
func simulationConcurrent(experiment func() bool, numTrials, nGoroutines int) float64 {
ch := make(chan int)
var wg sync.WaitGroup
// 启动多个 goroutine 进行工作
for i := 0; i < nGoroutines; i++ {
wg.Add(1)
go func() {
localOcurrences := 0
for j := 0; j < numTrials/nGoroutines; j++ {
eventHappend := experiment()
if eventHappend {
localOcurrences++
}
}
ch <- localOcurrences
wg.Done()
}()
}
// 当所有 goroutine 完成时关闭通道
go func() {
wg.Wait()
close(ch)
}()
// 累积每个实验的结果
ocurrencesEvent := 0
for localOcurrences := range ch {
ocurrencesEvent += localOcurrences
}
return float64(ocurrencesEvent) / float64(numTrials)
}
令我惊讶的是,当我对这两个版本进行基准测试时,我发现顺序版本比并发版本更快,而且随着我减少 goroutine 的数量,并发版本的性能反而变得更好。为什么会这样呢?我原以为并发版本会更快,因为这是一个高度可并行化的问题。
这是我的基准测试代码:
func tossEqualToSix() bool {
// 模拟掷一个六面骰子
roll := rand.Intn(6) + 1
if roll != 6 {
return false
}
return true
}
const (
numsSimBenchmark = 1_000_000
numGoroutinesBenckmark = 10
)
func BenchmarkSimulationSequential(b *testing.B) {
for i := 0; i < b.N; i++ {
simulationSequential(tossEqualToSix, numsSimBenchmark)
}
}
func BenchmarkSimulationConcurrent(b *testing.B) {
for i := 0; i < b.N; i++ {
simulationConcurrent(tossEqualToSix, numsSimBenchmark, numGoroutinesBenckmark)
}
}
以下是结果:
goos: linux
goarch: amd64
pkg: github.com/jpuriol/montecarlo
cpu: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
BenchmarkSimulationSequential-8 36 30453588 ns/op
BenchmarkSimulationConcurrent-8 9 117462720 ns/op
PASS
ok github.com/jpuriol/montecarlo 2.478s
你可以从Github上下载我的代码。
英文:
I wrote some code to execute Monte Carlo simulations. The first thing I wrote was this sequential version:
func simulationSequential(experiment func() bool, numTrials int) float64 {
ocurrencesEvent := 0
for trial := 0; trial < numTrials; trial++ {
eventHappend := experiment()
if eventHappend {
ocurrencesEvent++
}
}
return float64(ocurrencesEvent) / float64(numTrials)
}
Then, I figured I could run some of the experiments concurrently and get a result faster using my laptop's multiple cores. So, I wrote the following version:
func simulationConcurrent(experiment func() bool, numTrials, nGoroutines int) float64 {
ch := make(chan int)
var wg sync.WaitGroup
// Launch work in multiple goroutines
for i := 0; i < nGoroutines; i++ {
wg.Add(1)
go func() {
localOcurrences := 0
for j := 0; j < numTrials/nGoroutines; j++ {
eventHappend := experiment()
if eventHappend {
localOcurrences++
}
}
ch <- localOcurrences
wg.Done()
}()
}
// Close the channel when all the goroutines are done
go func() {
wg.Wait()
close(ch)
}()
// Acummulate the results of each experiment
ocurrencesEvent := 0
for localOcurrences := range ch {
ocurrencesEvent += localOcurrences
}
return float64(ocurrencesEvent) / float64(numTrials)
}
To my surprise, when I run benchmarks on the two versions, I get that the sequential is faster than the concurrent one, with the concurrent version getting better as I decrease the number of goroutines. Why does this happen? I thought the concurrent version will be faster since this is a highly parallelizable problem.
Here is my benchmark's code:
func tossEqualToSix() bool {
// Simulate the toss of a six-sided die
roll := rand.Intn(6) + 1
if roll != 6 {
return false
}
return true
}
const (
numsSimBenchmark = 1_000_000
numGoroutinesBenckmark = 10
)
func BenchmarkSimulationSequential(b *testing.B) {
for i := 0; i < b.N; i++ {
simulationSequential(tossEqualToSix, numsSimBenchmark)
}
}
func BenchmarkSimulationConcurrent(b *testing.B) {
for i := 0; i < b.N; i++ {
simulationConcurrent(tossEqualToSix, numsSimBenchmark, numGoroutinesBenckmark)
}
}
And the results:
goos: linux
goarch: amd64
pkg: github.com/jpuriol/montecarlo
cpu: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
BenchmarkSimulationSequential-8 36 30453588 ns/op
BenchmarkSimulationConcurrent-8 9 117462720 ns/op
PASS
ok github.com/jpuriol/montecarlo 2.478s
You can download my code from Github.
答案1
得分: 3
我想详细说明一下我的评论,并将其与代码和基准结果一起发布。
Examine函数使用rand
包中的包级别rand函数。这些函数在底层使用rand.Rand
的globalRand
实例。例如func Intn(n int) int { return globalRand.Intn(n) }
。由于随机数生成器不是线程安全的,因此globalRand
以以下方式实例化:
/*
* Top-level convenience functions
*/
var globalRand = New(&lockedSource{src: NewSource(1).(*rngSource)})
type lockedSource struct {
lk sync.Mutex
src *rngSource
}
func (r *lockedSource) Int63() (n int64) {
r.lk.Lock()
n = r.src.Int63()
r.lk.Unlock()
return
}
...
这意味着所有对rand.Intn
的调用都受到全局锁的保护。结果是examine函数"按顺序工作",因为有锁的存在。更具体地说,每次调用rand.Intn
都不会在前一次调用完成之前开始生成随机数。
这是重新设计的代码。每个examine函数都有自己的随机生成器。假设单个examine函数由一个goroutine使用,因此不需要锁保护。
package main
import (
"math/rand"
"sync"
"testing"
"time"
)
func simulationSequential(experimentFuncFactory func() func() bool, numTrials int) float64 {
experiment := experimentFuncFactory()
ocurrencesEvent := 0
for trial := 0; trial < numTrials; trial++ {
eventHappend := experiment()
if eventHappend {
ocurrencesEvent++
}
}
return float64(ocurrencesEvent) / float64(numTrials)
}
func simulationConcurrent(experimentFuncFactory func() func() bool, numTrials, nGoroutines int) float64 {
ch := make(chan int)
var wg sync.WaitGroup
// Launch work in multiple goroutines
for i := 0; i < nGoroutines; i++ {
wg.Add(1)
go func() {
experiment := experimentFuncFactory()
localOcurrences := 0
for j := 0; j < numTrials/nGoroutines; j++ {
eventHappend := experiment()
if eventHappend {
localOcurrences++
}
}
ch <- localOcurrences
wg.Done()
}()
}
// Close the channel when all the goroutines are done
go func() {
wg.Wait()
close(ch)
}()
// Acummulate the results of each experiment
ocurrencesEvent := 0
for localOcurrences := range ch {
ocurrencesEvent += localOcurrences
}
return float64(ocurrencesEvent) / float64(numTrials)
}
func tossEqualToSix() func() bool {
prng := rand.New(rand.NewSource(time.Now().UnixNano()))
return func() bool {
// Simulate the toss of a six-sided die
roll := prng.Intn(6) + 1
if roll != 6 {
return false
}
return true
}
}
const (
numsSimBenchmark = 5_000_000
numGoroutinesBenchmark = 10
)
func BenchmarkSimulationSequential(b *testing.B) {
for i := 0; i < b.N; i++ {
simulationSequential(tossEqualToSix, numsSimBenchmark)
}
}
func BenchmarkSimulationConcurrent(b *testing.B) {
for i := 0; i < b.N; i++ {
simulationConcurrent(tossEqualToSix, numsSimBenchmark, numGoroutinesBenchmark)
}
}
基准测试结果如下:
goos: darwin
goarch: arm64
pkg: scratchpad
BenchmarkSimulationSequential-8 20 55142896 ns/op
BenchmarkSimulationConcurrent-8 82 12944360 ns/op
英文:
I thought I would elaborate on my comment and post it with code and benchmark result.
Examine function uses package level rand functions from rand
package. These functions under the hood uses globalRand
instance of rand.Rand
. For example func Intn(n int) int { return globalRand.Intn(n) }
. As the random number generator is not thread safe, the globalRand
is instantiated in following way:
/*
* Top-level convenience functions
*/
var globalRand = New(&lockedSource{src: NewSource(1).(*rngSource)})
type lockedSource struct {
lk sync.Mutex
src *rngSource
}
func (r *lockedSource) Int63() (n int64) {
r.lk.Lock()
n = r.src.Int63()
r.lk.Unlock()
return
}
...
This means that all invocations of rand.Intn
are guarded by the global lock. The consequence is that examine function "works sequentially", because of the lock. More specifically, each call to rand.Intn
will not start generating a random number before the previous call completes.
Here is redesigned code. Each examine function has its own random generator. The assumption is that single examine function is used by one goroutine, so it does not require lock protection.
package main
import (
"math/rand"
"sync"
"testing"
"time"
)
func simulationSequential(experimentFuncFactory func() func() bool, numTrials int) float64 {
experiment := experimentFuncFactory()
ocurrencesEvent := 0
for trial := 0; trial < numTrials; trial++ {
eventHappend := experiment()
if eventHappend {
ocurrencesEvent++
}
}
return float64(ocurrencesEvent) / float64(numTrials)
}
func simulationConcurrent(experimentFuncFactory func() func() bool, numTrials, nGoroutines int) float64 {
ch := make(chan int)
var wg sync.WaitGroup
// Launch work in multiple goroutines
for i := 0; i < nGoroutines; i++ {
wg.Add(1)
go func() {
experiment := experimentFuncFactory()
localOcurrences := 0
for j := 0; j < numTrials/nGoroutines; j++ {
eventHappend := experiment()
if eventHappend {
localOcurrences++
}
}
ch <- localOcurrences
wg.Done()
}()
}
// Close the channel when all the goroutines are done
go func() {
wg.Wait()
close(ch)
}()
// Acummulate the results of each experiment
ocurrencesEvent := 0
for localOcurrences := range ch {
ocurrencesEvent += localOcurrences
}
return float64(ocurrencesEvent) / float64(numTrials)
}
func tossEqualToSix() func() bool {
prng := rand.New(rand.NewSource(time.Now().UnixNano()))
return func() bool {
// Simulate the toss of a six-sided die
roll := prng.Intn(6) + 1
if roll != 6 {
return false
}
return true
}
}
const (
numsSimBenchmark = 5_000_000
numGoroutinesBenchmark = 10
)
func BenchmarkSimulationSequential(b *testing.B) {
for i := 0; i < b.N; i++ {
simulationSequential(tossEqualToSix, numsSimBenchmark)
}
}
func BenchmarkSimulationConcurrent(b *testing.B) {
for i := 0; i < b.N; i++ {
simulationConcurrent(tossEqualToSix, numsSimBenchmark, numGoroutinesBenchmark)
}
}
Benchmark result are as follows:
goos: darwin
goarch: arm64
pkg: scratchpad
BenchmarkSimulationSequential-8 20 55142896 ns/op
BenchmarkSimulationConcurrent-8 82 12944360 ns/op
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论