Go parallel segment runs slower than serial segment

Question

I have built an epidemic mathematics model in Go that is fairly computationally intensive. I'm now trying to build a set of systems to test my model, where I change an input and expect a different output. I built a serial version that slowly increases HIV prevalence to see the effect on HIV deaths. It takes ~200 milliseconds to run.

for q = 0.0; q < 1000; q++ {
	inputs.CountryProfile.HivPrevalenceAdultsByGroup[0] = inputs.CountryProfile.HivPrevalenceAdultsByGroup[0] * float32(math.Pow(1.00001, q))
	results := costAnalysisHandler(inputs)
	fmt.Println(results.HivDeaths[20])
}

Then I made a "parallel" version using channels, and it takes longer, ~400 milliseconds to run. These small differences matter because we will be running millions of runs with different inputs, so I would like to make it as efficient as possible. Here is the parallel version:

ch := make(chan ChData)
var q float64
for q = 0.0; q < 1000; q++ {
	go func(q float64, inputs *costanalysis.Inputs, ch chan ChData) {
		inputs.CountryProfile.HivPrevalenceAdultsByGroup[0] = inputs.CountryProfile.HivPrevalenceAdultsByGroup[0] * float32(math.Pow(1.00001, q))
		results := costAnalysisHandler(inputs)
		fmt.Println(results.HivDeaths[20])
		ch <- ChData{int(q), results.HivDeaths[20]}
	}(q, inputs, ch)
}
for q = 0.0; q < 1000; q++ {
	theResults := <-ch
	fmt.Println(theResults)
}

Any thoughts are very much appreciated.

Answer 1

Score: 4

There's overhead to starting and communicating with background tasks. If the program was taking 200ms, the time spent on your cost analyses is about equal to the cost of communication. If coordination cost ever does kill your app, a common approach is to hand off largish chunks of work at a time, e.g., make each goroutine do analyses for a range of 10 q values instead of just one. (Edit: And as @Innominate says, making a "worker pool" of goroutines that process a queue of job objects is another common approach.)
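The chunking idea can be sketched roughly as follows. This is only an illustration under assumptions: simulate is a hypothetical stand-in for costAnalysisHandler that takes just the q value, and runChunked and its parameters are names invented here, not part of the original code.

```go
package main

import (
	"fmt"
	"math"
	"sync"
)

// simulate is a hypothetical stand-in for one cost analysis run. It takes
// only the q value and returns one number, so it shares no mutable state.
func simulate(q float64) float64 {
	return 0.01 * math.Pow(1.00001, q)
}

// runChunked evaluates simulate for q = 0..n-1. Each goroutine handles a
// contiguous chunk of chunkSize values instead of a single value.
// chunkSize is assumed to be positive.
func runChunked(n, chunkSize int) []float64 {
	results := make([]float64, n)
	var wg sync.WaitGroup
	for start := 0; start < n; start += chunkSize {
		end := start + chunkSize
		if end > n {
			end = n
		}
		wg.Add(1)
		go func(start, end int) {
			defer wg.Done()
			// Each goroutine writes only to its own index range,
			// so no locking or channel traffic is needed.
			for i := start; i < end; i++ {
				results[i] = simulate(float64(i))
			}
		}(start, end)
	}
	wg.Wait()
	return results
}

func main() {
	results := runChunked(1000, 10)
	fmt.Println(len(results), results[20])
}
```

With 1000 values and a chunk size of 10, this starts 100 goroutines rather than 1000. The best chunk size depends on how long a single analysis takes, so it is worth measuring rather than guessing.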

Also, the code you pasted has a race condition. The contents of your Inputs struct don't get copied each time you spawn a goroutine, because you're passing your function a pointer. So goroutines running in parallel will read from and write to the same Inputs instance.

Simply making a brand new Inputs instance for each analysis, with its own arrays, etc. would avoid the race. If that ended up wasting tons of memory or causing lots of redundant copies, you could 1) recycle Inputs instances, 2) separate out read-only data that can safely be shared (maybe there's country data that's fixed, dunno), or 3) change some of the relatively big arrays to be local variables within costAnalysisHandler rather than stuff that needs to be passed around (maybe it could just take initial HIV prevalence and return HIV deaths at t=20, and everything else is local and on the stack).
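A minimal sketch of the "brand new Inputs instance per analysis" option is below. The real costanalysis.Inputs type is not shown in the question, so the Inputs and CountryProfile structs, the clone method, and the analyze placeholder here are all assumptions made for illustration. The sketch also scales the base prevalence independently for each run rather than compounding it across runs, which is a guess at the intent.

```go
package main

import (
	"fmt"
	"math"
	"sync"
)

// CountryProfile and Inputs are simplified, hypothetical versions of the
// question's types.
type CountryProfile struct {
	HivPrevalenceAdultsByGroup []float32
}

type Inputs struct {
	CountryProfile CountryProfile
}

// clone returns a deep copy: the slice gets its own backing array, so
// writes to the copy never touch the original.
func (in *Inputs) clone() *Inputs {
	out := *in
	out.CountryProfile.HivPrevalenceAdultsByGroup = append(
		[]float32(nil), in.CountryProfile.HivPrevalenceAdultsByGroup...)
	return &out
}

// analyze is a placeholder for costAnalysisHandler.
func analyze(in *Inputs) float32 {
	return in.CountryProfile.HivPrevalenceAdultsByGroup[0] * 100
}

// runAll runs n analyses concurrently, each on its own private copy.
func runAll(base *Inputs, n int) []float32 {
	results := make([]float32, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			local := base.clone() // base itself is only ever read
			local.CountryProfile.HivPrevalenceAdultsByGroup[0] *=
				float32(math.Pow(1.00001, float64(i)))
			results[i] = analyze(local)
		}(i)
	}
	wg.Wait()
	return results
}

func main() {
	base := &Inputs{CountryProfile: CountryProfile{
		HivPrevalenceAdultsByGroup: []float32{0.5, 0.25},
	}}
	results := runAll(base, 1000)
	fmt.Println(results[20], base.CountryProfile.HivPrevalenceAdultsByGroup[0])
}
```

Running the program with go run -race is a quick way to check whether a version like this still has a data race.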

This doesn't apply to Go today, but did when the question was originally posted: nothing is really running in parallel unless you call runtime.GOMAXPROCS() with your desired concurrency level, e.g., runtime.GOMAXPROCS(runtime.NumCPU()).

Finally, you should only worry about all of this if you're doing some larger analysis and actually have a performance problem; if 0.2 seconds of waiting is all that performance work can save you here, it's not worth it.

Answer 2

Score: 1

Parallelizing a computationally intensive set of calculations requires that the parallel computations can actually run in parallel on your machine. If they can't, the extra overhead of creating goroutines and channels and reading off the channel will make the program run slower.

I'm guessing that is the problem here.

Try setting the GOMAXPROCS environment variable to the number of CPUs you have before running your code. Or call runtime.GOMAXPROCS(runtime.NumCPU()) before you start the parallel computations.
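For reference, the call looks roughly like this. The helper name useAllCPUs is made up for the sketch. Since Go 1.5 the runtime already defaults GOMAXPROCS to the number of available CPUs, so on a current toolchain the explicit call is usually a no-op and is mainly useful for checking the setting.

```go
package main

import (
	"fmt"
	"runtime"
)

// useAllCPUs sets GOMAXPROCS to the number of logical CPUs and reports
// the previous and the new setting.
func useAllCPUs() (previous, current int) {
	// GOMAXPROCS returns the previous setting when it changes the value.
	previous = runtime.GOMAXPROCS(runtime.NumCPU())
	// Passing 0 only queries the current setting without changing it.
	current = runtime.GOMAXPROCS(0)
	return previous, current
}

func main() {
	previous, current := useAllCPUs()
	fmt.Println("NumCPU:", runtime.NumCPU())
	fmt.Println("GOMAXPROCS before:", previous, "after:", current)
}
```

The same value can be set without code changes through the GOMAXPROCS environment variable, as the answer notes.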

Answer 3

Score: 1

I see two issues related to parallel performance:

The first and more obvious one is that you must set GOMAXPROCS in order to get the Go runtime to use more than one CPU/core. Typically one would set it to the number of processors in the machine, but the ideal setting can vary.

The second problem is a bit trickier, which is that your code doesn't appear to be parallelizing very well. Simply starting a thousand goroutines and assuming they'll work it out isn't going to give good results. You should probably be using some kind of worker pool, running a limited number of simultaneous computations (a good starting number would be the same as GOMAXPROCS) rather than trying to do 1000 at once.

See: http://golang.org/doc/faq#Why_no_multi_CPU
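A worker pool along the lines described above might look like the sketch below. It is an illustration only: simulate is a hypothetical stand-in for costAnalysisHandler, and the result type and runPool function are names invented here. The number of workers is assumed to be at least 1.

```go
package main

import (
	"fmt"
	"math"
	"runtime"
	"sync"
)

// result pairs a q value with the output computed for it.
type result struct {
	q      int
	deaths float64
}

// simulate is a hypothetical stand-in for one cost analysis run.
func simulate(q int) float64 {
	return 0.01 * math.Pow(1.00001, float64(q))
}

// runPool computes simulate for q = 0..n-1 using a fixed number of
// worker goroutines that pull jobs from a shared channel.
func runPool(n, workers int) []result {
	// Both channels are buffered to n so neither side ever blocks.
	jobs := make(chan int, n)
	out := make(chan result, n)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for q := range jobs { // loop ends once jobs is closed and drained
				out <- result{q: q, deaths: simulate(q)}
			}
		}()
	}

	for q := 0; q < n; q++ {
		jobs <- q
	}
	close(jobs)
	wg.Wait()
	close(out)

	// Results arrive in arbitrary order; index by q to restore order.
	results := make([]result, n)
	for r := range out {
		results[r.q] = r
	}
	return results
}

func main() {
	results := runPool(1000, runtime.GOMAXPROCS(0))
	fmt.Println(len(results), results[20].deaths)
}
```

Only as many analyses run at once as there are workers, so scheduling and memory overhead stay bounded however many q values are queued.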

huangapple
  • Posted on 2013-12-19 06:57:32
  • Please keep this link when reposting: https://go.coder-hub.com/20670242.html