No speedup in multithread program
Question
I was playing with Go's concurrency and found something that is kind of opaque to me.
I wrote a parallel matrix multiplication: each task computes a single row of the product matrix, multiplying the corresponding row and columns of the source matrices.
Here is the Java program:
public static double[][] parallelMultiply(int nthreads, final double[][] m1, final double[][] m2) {
    final int n = m1.length, m = m1[0].length, l = m2[0].length;
    assert m1[0].length == m2.length;
    double[][] r = new double[n][];
    ExecutorService e = Executors.newFixedThreadPool(nthreads);
    List<Future<double[]>> results = new LinkedList<Future<double[]>>();
    for (int ii = 0; ii < n; ++ii) {
        final int i = ii;
        Future<double[]> result = e.submit(new Callable<double[]>() {
            public double[] call() throws Exception {
                double[] row = new double[l];
                for (int j = 0; j < l; ++j) {
                    for (int k = 0; k < m; ++k) {
                        row[j] += m1[i][k]*m2[k][j];
                    }
                }
                return row;
            }
        });
        results.add(result);
    }
    try {
        e.shutdown();
        e.awaitTermination(1, TimeUnit.HOURS);
        int i = 0;
        for (Future<double[]> result : results) {
            r[i] = result.get();
            ++i;
        }
    } catch (Exception ex) {
        ex.printStackTrace();
        return null;
    }
    return r;
}
And this is the Go program:
type Matrix struct {
    n, m int
    data [][]float64
}

func New(n, m int) *Matrix {
    data := make([][]float64, n)
    for i, _ := range data {
        data[i] = make([]float64, m)
    }
    return &Matrix{n, m, data}
}

func (m *Matrix) Get(i, j int) float64 {
    return m.data[i][j]
}

func (m *Matrix) Set(i, j int, v float64) {
    m.data[i][j] = v
}

func MultiplyParallel(m1, m2 *Matrix) *Matrix {
    r := New(m1.n, m2.m)
    c := make(chan interface{}, m1.n)
    for i := 0; i < m1.n; i++ {
        go func(i int) {
            innerLoop(r, m1, m2, i)
            c <- nil
        }(i)
    }
    for i := 0; i < m1.n; i++ {
        <-c
    }
    return r
}

func innerLoop(r, m1, m2 *Matrix, i int) {
    for j := 0; j < m2.m; j++ {
        s := 0.0
        for k := 0; k < m1.m; k++ {
            s = s + m1.Get(i, k) * m2.Get(k, j)
        }
        r.Set(i, j, s)
    }
}
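For comparison, an equivalent way to wait for the row workers is a sync.WaitGroup instead of a buffered channel; a minimal sketch that reuses the Matrix type, New and innerLoop above (and needs "sync" imported):

// Sketch: the same row-per-goroutine scheme, synchronized with a
// sync.WaitGroup instead of a buffered channel. Assumes the Matrix
// type, New and innerLoop defined above.
func MultiplyParallelWG(m1, m2 *Matrix) *Matrix {
    r := New(m1.n, m2.m)
    var wg sync.WaitGroup
    wg.Add(m1.n)
    for i := 0; i < m1.n; i++ {
        go func(i int) {
            defer wg.Done()         // signal that this row is finished
            innerLoop(r, m1, m2, i) // compute row i of the result
        }(i)
    }
    wg.Wait() // block until every row has been computed
    return r
}

Both approaches block until every row is done; the version in the question simply uses the buffered channel as a countdown.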
When I run the Java program with nthreads=1 and nthreads=2, there is nearly a twofold speedup on my dual-core N450 Atom netbook.
When I run the Go program with GOMAXPROCS=1 and GOMAXPROCS=2, there is no speedup at all!
Even though the Java code uses additional storage for the Futures and then collects their values into the result matrix, instead of updating the array directly in the worker code (which is what the Go version does), it performs much faster on multiple cores than the Go version.
What's especially funny is that the Go version with GOMAXPROCS=2 loads both cores (htop shows 100% load on both processors while the program runs), yet the computation time is the same as with GOMAXPROCS=1 (where htop shows 100% load on only one core).
Another concern is that the Java program is faster than the Go one even in simple single-threaded multiplication, but that is not exactly unexpected (taking the benchmarks from here into account) and should not affect the multicore performance multiplier.
What am I doing incorrectly here? Is there a way to speed up the Go program?
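For reference, GOMAXPROCS can also be set from inside the program rather than through the environment variable; a minimal sketch (relevant for toolchains from before Go 1.5, when the default value was 1):

// Sketch: set GOMAXPROCS programmatically instead of via the environment.
// Before Go 1.5 the default was 1, so this mattered for parallel speedup.
package main

import (
    "fmt"
    "runtime"
)

func main() {
    prev := runtime.GOMAXPROCS(runtime.NumCPU()) // use all available cores
    fmt.Println("GOMAXPROCS was", prev, "now", runtime.GOMAXPROCS(0))
}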
UPD:
It seems I found what I was doing incorrectly. I was timing the Java program using System.currentTimeMillis() and the Go program using the time shell command. I mistakenly took the 'user' time from the zsh output as the program's working time instead of the 'total' time. Now that I have double-checked the computation speed, it also gives me nearly a twofold speedup (though slightly less than Java's):
% time env GOMAXPROCS=2 ./4-2-go -n 500 -q
env GOMAXPROCS=2 ./4-2-go -n 500 -q 22,34s user 0,04s system 99% cpu 22,483 total
% time env GOMAXPROCS=2 ./4-2-go -n 500 -q -p
env GOMAXPROCS=2 ./4-2-go -n 500 -q -p 24,09s user 0,10s system 184% cpu 13,080 total
It seems I have to be more attentive.
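One way to avoid confusing the shell's 'user' and 'total' figures altogether is to measure the wall-clock time inside the program; a minimal sketch, assuming the Matrix type, New and MultiplyParallel from above live in the same package (the 500x500 size and fill values are just placeholders):

// Sketch: in-program wall-clock measurement of the multiplication, so
// the shell's "user" vs "total" distinction does not matter.
package main

import (
    "fmt"
    "time"
)

func main() {
    const n = 500
    a, b := New(n, n), New(n, n)
    for i := 0; i < n; i++ {
        for j := 0; j < n; j++ {
            a.Set(i, j, float64(i+j))
            b.Set(i, j, float64(i-j))
        }
    }

    start := time.Now()
    _ = MultiplyParallel(a, b)
    fmt.Println("wall-clock time:", time.Since(start))
}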
Still, the Java program's times are about five times lower in the same case. But that is a matter for another question, I think.
Answer 1
Score: 11
You are probably experiencing the effects of false sharing. In a nutshell, if two pieces of data happen to fall onto the same CPU cache line, modifying these two pieces of data from threads that execute on different CPU cores will trigger the expensive cache coherency protocol.
This kind of cache "ping-pong" is extremely hard to diagnose, and can happen on logically completely unrelated data, just because they happen to be placed close enough in memory. The 100% CPU load is typical of false sharing - your cores really are working 100%, they are just not working on your program - they are working on synchronizing their caches.
The fact that in the Java program you have thread-private data until the time comes to "integrate" it into the final result is what saves you from false sharing. I'm not familiar with Go, but judging by your own words, the threads are writing directly to the common array, which is exactly the kind of thing that could trigger false sharing. This is an example of how perfectly valid single-threaded reasoning does exactly the opposite in a multi-threaded environment!
For a more in-depth discussion of the topic, I warmly recommend Herb Sutter's article Eliminate False Sharing, or the lecture Machine Architecture: Things Your Programming Language Never Told You (and the associated PDF slides).
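To sketch what this would look like in the Go code from the question: each goroutine could compute its row into a private buffer and copy it into the shared matrix only once the row is finished, mirroring how the Java version keeps per-task data in a Future until the end. A minimal sketch, assuming the Matrix type, New and Get from the question:

// Sketch: compute each row into a goroutine-local slice first, and only
// copy it into the shared result when the row is done, so concurrent
// writes don't repeatedly touch shared cache lines. Assumes the Matrix
// type, New and Get from the question (same package).
func MultiplyParallelLocalRow(m1, m2 *Matrix) *Matrix {
    r := New(m1.n, m2.m)
    done := make(chan struct{}, m1.n)
    for i := 0; i < m1.n; i++ {
        go func(i int) {
            row := make([]float64, m2.m) // goroutine-private buffer
            for j := 0; j < m2.m; j++ {
                s := 0.0
                for k := 0; k < m1.m; k++ {
                    s += m1.Get(i, k) * m2.Get(k, j)
                }
                row[j] = s
            }
            copy(r.data[i], row) // one bulk write into the shared matrix
            done <- struct{}{}
        }(i)
    }
    for i := 0; i < m1.n; i++ {
        <-done
    }
    return r
}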
Answer 2
Score: 1
If you are able to run this code in a Linux environment, you can use perf to measure the false-sharing effect.
Answer 3
Score: 0
For Linux and for 32- and 64-bit Windows there are also AMD's CodeXL and CodeAnalyst. They will profile an application running on an AMD processor in much greater detail than a tool from Intel would, since the applicable performance registers are different.