英文:
Speedup problems with go
问题
我用Go语言编写了一个非常简单的程序来测试并行程序的性能。我编写了一个非常简单的程序,通过除法试验来分解一个大的半素数。由于没有涉及通信,我预期会有几乎完美的加速。然而,该程序似乎扩展性非常差。
我使用系统的time
命令在一个8核(真实的,而不是HT)计算机上,使用1、2、4和8个进程对程序进行了计时。我分解的数字是“28808539627864609”。以下是我的结果:
<pre>
核心 时间(秒) 加速比
1 60.0153 1
2 47.358 1.27
4 34.459 1.75
8 28.686 2.10
</pre>
如何解释这样糟糕的加速比?这是我的程序中的一个错误,还是Go运行时的问题?我如何获得更好的性能?我不是在谈论算法本身(我知道有更好的算法来分解半素数),而是关于我如何并行化它的方式。
这是我的程序的源代码:
package main
import (
"big"
"flag"
"fmt"
"runtime"
)
func factorize(n *big.Int, start int, step int, c chan *big.Int) {
var m big.Int
i := big.NewInt(int64(start))
s := big.NewInt(int64(step))
z := big.NewInt(0)
for {
m.Mod(n, i)
if m.Cmp(z) == 0{
c <- i
}
i.Add(i, s)
}
}
func main() {
var np *int = flag.Int("n", 1, "进程数")
flag.Parse()
runtime.GOMAXPROCS(*np)
var n big.Int
n.SetString(flag.Arg(0), 10) // 使用命令行给出的数字
c := make(chan *big.Int)
for i:=0; i<*np; i++ {
go factorize(&n, 2+i, *np, c)
}
fmt.Println(<-c)
}
编辑
问题似乎与Mod
函数有关。将其替换为Rem
可以获得更好但仍然不完美的性能和加速比。将其替换为QuoRem
可以获得3倍更快的性能和完美的加速比。结论:在Go语言中,内存分配似乎会影响并行性能。为什么?你有关于这个问题的参考资料吗?
英文:
I wrote a very simple program in go to test performances of a parallel program. I wrote a very simple program that factorizes a big semiprime number by division trials. Since no communications are involved, I expected an almost perfect speedup. However, the program seems to scale very badly.
I timed the program with 1, 2, 4, and 8 processes, running on a 8 (real, not HT) cores computer, using the system time
command. The number I factorized is "28808539627864609". Here are my results:
<pre>
cores time (sec) speedup
1 60.0153 1
2 47.358 1.27
4 34.459 1.75
8 28.686 2.10
</pre>
How to explain such bad speedups? Is it a bug in my program, or is it a problem with go runtime? How could I get better performances? I'm not talking about the algorithm by itself (I know there are better algorithms to factorize semiprime numbers), but about the way I parallelized it.
Here is the source code of my program:
package main
import (
"big"
"flag"
"fmt"
"runtime"
)
func factorize(n *big.Int, start int, step int, c chan *big.Int) {
var m big.Int
i := big.NewInt(int64(start))
s := big.NewInt(int64(step))
z := big.NewInt(0)
for {
m.Mod(n, i)
if m.Cmp(z) == 0{
c <- i
}
i.Add(i, s)
}
}
func main() {
var np *int = flag.Int("n", 1, "Number of processes")
flag.Parse()
runtime.GOMAXPROCS(*np)
var n big.Int
n.SetString(flag.Arg(0), 10) // Uses number given on command line
c := make(chan *big.Int)
for i:=0; i<*np; i++ {
go factorize(&n, 2+i, *np, c)
}
fmt.Println(<-c)
}
EDIT
Problem really seems to be related to Mod
function. Replacing it by Rem
gives better but still imperfect performances and speedups. Replacing it by QuoRem
gives 3 times faster performances, and perfect speedup. Conclusion: it seems memory allocation kills parallel performances in Go. Why? Do you have any references about this?
答案1
得分: 3
Big.Int方法通常需要分配内存,通常用于保存计算结果。问题在于只有一个堆,所有内存操作都是串行的。在这个程序中,数字相对较小,对于像Mod和Add这样的(可并行化)计算所需的时间相对较小,与重复分配所有小小的内存块的不可并行化操作相比。
至于加速它,显而易见的答案是如果不需要使用Big.Ints就不要使用。你的示例数字恰好适合64位。但是,如果你打算使用非常大的大数进行计算,问题本身将会消失。你将花费更多时间进行计算,而在堆中花费的时间相对较少。
顺便说一下,你的程序中有一个bug,虽然与性能无关。当你找到一个因子并在通道上返回结果时,你发送的是指向局部变量i的指针。这是可以的,但是你没有跳出循环。goroutine中的循环继续递增i,当主goroutine继续从通道中取出指针并跟随它时,该值几乎肯定是错误的。
英文:
Big.Int methods generally have to allocate memory, usually to hold the result of the computation. The problem is that there is just one heap and all memory operations are serialized. In this program, the numbers are fairly small and the (parallelizable) computation time needed for things like Mod and Add is small compared to the non-parallelizable operations of repeatedly allocating all the tiny little bits of memory.
As far as speeding it up, there is the obvious answer of don't use big.Ints if you don't have to. Your example number happens to fit in 64 bits. If you plan on working with really big big numbers though, the problem will kind of go away on its own. You will spend much more time doing computations, and the time spent in the heap will be relatively much less.
There is a bug in your program, by the way, although it's not related to performance. When you find a factor and return the result on the channel, you send a pointer to the local variable i. This is fine, except that you don't break out of the loop then. The loop in the goroutine continues incrementing i and by the time the main goroutine gets around to fishing the pointer out of the channel and following it, the value is almost certain to be wrong.
答案2
得分: 3
发送i
通过通道后,应该用新分配的big.Int
替换i
:
if m.Cmp(z) == 0 {
c <- i
i = new(big.Int).Set(i)
}
这是必要的,因为不能保证fmt.Println
在第fmt.Println(<-c)
行接收到整数时会处理它。fmt.Println
通常不会导致goroutine切换,所以如果i
不被新分配的big.Int
替换,并且运行时切换回在函数factorize
中执行的for循环,那么for循环将在打印之前覆盖i
- 在这种情况下,程序将不会打印出通过通道发送的第一个整数。
fmt.Println
可能导致goroutine切换的事实意味着函数factorize
中的for循环在main
goroutine从通道c
接收到整数之后和main
goroutine终止之前可能会消耗大量的CPU时间。类似这样:
运行factorize()
在main()中接收<-c
调用fmt.Println()
继续运行factorize() //消耗不必要的CPU时间
从fmt.Println()返回
从main()返回并终止程序
小型多核加速的另一个原因是内存分配。函数(*Int).Mod
在内部使用(*Int).QuoRem
,每次调用都会创建一个新的big.Int
。为了避免内存分配,直接使用QuoRem
:
func factorize(n *big.Int, start int, step int, c chan *big.Int) {
var q, r big.Int
i := big.NewInt(int64(start))
s := big.NewInt(int64(step))
z := big.NewInt(0)
for {
q.QuoRem(n, i, &r)
if r.Cmp(z) == 0 {
c <- i
i = new(big.Int).Set(i)
}
i.Add(i, s)
}
}
不幸的是,Go release r60.3
中的goroutine调度器存在一个bug,阻止了该代码使用所有CPU核心。当使用-n=2
(GOMAXPROCS=2)启动程序时,运行时只会使用1个线程。
Go的每周发布版具有更好的运行时,并且如果将n=2
传递给程序,它可以利用2个线程。这在我的机器上大约加速了1.9倍。
多核减速的另一个潜在因素已经在用户“High Performance Mark”的答案中提到。如果程序将工作分成多个子任务,并且结果仅来自1个子任务,这意味着其他子任务可能会做一些“额外的工作”。使用n>=2
运行程序可能总体上比使用n=1
运行程序消耗更多的CPU时间。
要了解正在做多少额外的工作,您可能希望(以某种方式)在程序退出函数main()
时打印出所有goroutine中所有i
的值。
英文:
After sending i
through the channel, i
should be replaced with a newly allocated big.Int
:
if m.Cmp(z) == 0 {
c <- i
i = new(big.Int).Set(i)
}
This is necessary because there is no guarantee when fmt.Println
will process the integer received on line fmt.Println(<-c)
. It isn't usual for fmt.Println
to cause goroutine switching, so if i
isn't replaced with a newly allocated big.Int
and the run-time switches back to executing the for-loop in function factorize
then the for-loop will overwrite i
before it is printed - in which case the program won't print out the 1st integer sent through the channel.
The fact that fmt.Println
can cause goroutine switching means that the for-loop in function factorize
may potentially consume a lot of CPU time between the moment when the main
goroutine receives from channel c
and the moment when the main
goroutine terminates. Something like this:
run factorize()
<-c in main()
call fmt.Println()
continue running factorize() // Unnecessary CPU time consumed
return from fmt.Println()
return from main() and terminate program
Another reason for the small multi-core speedup is memory allocation. The function (*Int).Mod
is internally using (*Int).QuoRem
and will create a new big.Int
each time it is called. To avoid the memory allocation, use QuoRem
directly:
func factorize(n *big.Int, start int, step int, c chan *big.Int) {
var q, r big.Int
i := big.NewInt(int64(start))
s := big.NewInt(int64(step))
z := big.NewInt(0)
for {
q.QuoRem(n, i, &r)
if r.Cmp(z) == 0 {
c <- i
i = new(big.Int).Set(i)
}
i.Add(i, s)
}
}
Unfortunately, the goroutine scheduler in Go release r60.3
contains a bug which prevents this code to use all CPU cores. When the program is started with -n=2
(GOMAXPROCS=2), the run-time will utilize only 1 thread.
Go weekly release has a better run-time and can utilize 2 threads if n=2
is passed to the program. This gives a speedup of approximately 1.9 on my machine.
Another potential contributing factor to multi-core slowdown has been mentioned in the answer by user "High Performance Mark". If the program is splitting the work into multiple sub-tasks and the result comes only from 1 sub-task, it means that the other sub-tasks may do some "extra work". Running the program with n>=2
may in total consume more CPU time than running the program with n=1
.
The learn how much extra work is being done, you may want to (somehow) print out values of all i
's in all goroutines at the moment the program is exiting the function main()
.
答案3
得分: 0
我不会读取go
,所以这可能是对一个问题的回答,而不是你所问的问题。如果是这样,请按照您的意愿进行投票或删除。
如果你要绘制“分解整数n所需的时间”与“n”的图表,你会得到一个上下波动相当随机的图表。对于你选择的任何n,都会有一个在1到n范围内的整数,在一个处理器上需要最长的时间来分解。如果你的并行化策略是将n个整数分布在p个处理器上,其中一个处理器将至少需要分解最难的整数的时间,然后再分解其余的负载。
也许你做过类似的事情?
英文:
I don't read go
so this is probably the answer to a question which is not what you asked. If so, downvote or delete as you wish.
If you were to make a plot of 'time to factorise integer n' against 'n' you would get a plot that goes up and down somewhat randomly. For any n you choose there will be an integer in the range 1..n that takes longest to factorise on one processor. If your parallelisation strategy is to distribute the n integers across p processors one of those processors will take at least the time to factorise the hardest integer, then the time to factorise the rest of its load.
Perhaps you've done something similar ?
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论