How to detect what is preventing multiple cores from being used in golang?
Question
I have a piece of concurrent code that is meant to run on every CPU/core.
There are two large vectors with input/output values:
    var (
        input  = make([]float64, rowCount)
        output = make([]float64, rowCount)
    )
These are filled, and I want to compute the distance (error) between each input-output pair. Since the pairs are independent, a possible concurrent version is the following:
    var d float64 // Error to be computed
    // Set up one worker for each CPU
    ch := make(chan float64)
    nw := runtime.NumCPU()
    for w := 0; w < nw; w++ {
        go func(id int) {
            var wd float64
            // e.g. nw = 4
            // worker0, i = 0, 4, 8, 12...
            // worker1, i = 1, 5, 9, 13...
            // worker2, i = 2, 6, 10, 14...
            // worker3, i = 3, 7, 11, 15...
            for i := id; i < rowCount; i += nw {
                res := compute(input[i])
                wd += distance(res, output[i])
            }
            ch <- wd
        }(w)
    }
    // Compute total distance
    for w := 0; w < nw; w++ {
        d += <-ch
    }
The idea is to have a single worker for each CPU/core, and each worker processes a subset of the rows.
The problem I'm having is that this code is no faster than the serial code.
I'm using Go 1.7, so runtime.GOMAXPROCS should already be set to runtime.NumCPU(), but even setting it explicitly does not improve performance.
- distance is just (a-b)*(a-b);
- compute is a bit more complex, but should be reentrant and use global data only for reading (it uses the math.Pow and math.Sqrt functions);
- no other goroutine is running.
So, besides accessing the global data (input/output) for reading, there are no locks/mutexes that I am aware of (I am not using math/rand, for example).
I also compiled with -race and nothing emerged.
My host has 4 virtual cores, but when I run this code, htop shows CPU usage at 102%. I expected something around 380%, as happened in the past with other Go code that used all the cores.
I would like to investigate, but I don't know how the runtime allocates threads and schedules goroutines.
How can I debug this kind of issue? Can pprof help me in this case? What about the runtime package?
Thanks in advance
Answer 1
Score: 1
Sorry, but in the end I got the measurement wrong. @JimB was right, and I had a minor leak, but not enough to justify a slowdown of this magnitude.
My expectations were too high: the function I was making concurrent was called only at the beginning of the program, so the overall performance improvement was only minor.
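A rough way to see why the gain was so small is Amdahl's law: if only a fraction p of the total runtime is parallelized over n workers, the overall speedup is 1 / ((1 - p) + p/n). The sketch below just evaluates that formula; the fractions are hypothetical, not measurements from my program, and amdahlSpeedup is a name made up for illustration.

```go
package main

import "fmt"

// amdahlSpeedup returns the overall speedup when a fraction p of the
// runtime is spread over n workers and the rest stays serial.
func amdahlSpeedup(p float64, n int) float64 {
	return 1 / ((1 - p) + p/float64(n))
}

func main() {
	// Hypothetical fractions of total runtime spent in the parallelized section.
	for _, p := range []float64{0.05, 0.5, 0.95} {
		fmt.Printf("p=%.2f n=4 speedup=%.2fx\n", p, amdahlSpeedup(p, 4))
	}
}
```

With 4 workers, a section that accounts for 5% of the runtime yields about a 1.04x overall speedup, while one that accounts for 95% yields about 3.48x.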
After applying the pattern to other sections of the program, I got the expected results. My mistake was in evaluating which section was the most important.
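Since the question asked about pprof: a CPU profile is one way to find out which section actually dominates before deciding what to parallelize. Below is a minimal sketch using runtime/pprof, assuming the profile is collected into memory; profileCPU and busy are hypothetical helpers, and busy merely stands in for the code under suspicion.

```go
package main

import (
	"bytes"
	"fmt"
	"math"
	"runtime/pprof"
)

// profileCPU runs work under the CPU profiler and returns the raw profile bytes.
func profileCPU(work func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err
	}
	work()
	pprof.StopCPUProfile() // flushes the profile into buf
	return buf.Bytes(), nil
}

// busy is a hypothetical stand-in for the section being investigated.
func busy() {
	var s float64
	for i := 0; i < 5000000; i++ {
		s += math.Sqrt(float64(i))
	}
	_ = s
}

func main() {
	prof, err := profileCPU(busy)
	if err != nil {
		fmt.Println("profiling failed:", err)
		return
	}
	fmt.Printf("captured %d bytes of profile data\n", len(prof))
}
```

In a real program the profile would more typically be written to a file (for example one opened with os.Create) wrapped around the whole run, and then inspected with go tool pprof to see where the time goes.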
Anyway, I learned a lot of interesting things in the meantime, so thanks a lot to everyone who tried to help!