What's costing Go a factor of 4 in performance in this array access microbenchmark (relative to GCC)?

Question

I wrote this microbenchmark to better understand Go's performance characteristics, so that I would be able to make intelligent choices as to when to use it.

I thought this would be the ideal scenario for Go, from the performance overhead point-of-view:

  • no allocations / deallocations inside the loop
  • array access clearly within bounds (bounds checks could be removed)

Still, I'm seeing an exactly 4-fold difference in speed relative to gcc -O3 on AMD64. Why is that?

(Timed using the shell. Each run takes a few seconds, so startup time is negligible.)
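A minimal sketch of that timing setup, assuming the sources are saved as bench.go and bench.c (the file names are placeholders, not from the original post):

go build -o bench_go bench.go && time ./bench_go
gcc -O3 -o bench_c bench.c && time ./bench_c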

package main

import "fmt"

func main() {
    fmt.Println("started")

    var n int32 = 1024 * 32

    a := make([]int32, n, n)
    b := make([]int32, n, n)

    var it, i, j int32

    for i = 0; i < n; i++ {
        a[i] =  i
        b[i] = -i
    }

    var r int32 = 10
    var sum int32 = 0

    for it = 0; it < r; it++ {
        for i = 0; i < n; i++ {
            for j = 0; j < n; j++ {
                sum += (a[i] + b[j]) * (it + 1)
            }
        }
    }
    fmt.Printf("n = %d, r = %d, sum = %d\n", n, r, sum)
}

The C version:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>  /* for int32_t */


int main() {
    printf("started\n");

    int32_t n = 1024 * 32;

    int32_t* a = malloc(sizeof(int32_t) * n);
    int32_t* b = malloc(sizeof(int32_t) * n);

    for(int32_t i = 0; i < n; ++i) {
        a[i] =  i;
        b[i] = -i;
    }

    int32_t r = 10;
    int32_t sum = 0;

    for(int32_t it = 0; it < r; ++it) {
        for(int32_t i = 0; i < n; ++i) {
            for(int32_t j = 0; j < n; ++j) {
                sum += (a[i] + b[j]) * (it + 1);
            }
        }
    }
    printf("n = %d, r = %d, sum = %d\n", n, r, sum);

    free(a);
    free(b);
}

Updates:

  • Using range, as suggested, speeds Go up by a factor of 2 (see the sketch after this list).
  • On the other hand, -march=native speeds C up by a factor of 2 in my tests. (And -mno-sse gives a compile error; it is apparently incompatible with -O3.)
  • GCCGO seems comparable to GCC here (and does not need range).
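
For reference, a minimal sketch of what the range-based variant might look like, pulled out into a helper function (rangeSum is a name introduced here for illustration; the setup of a, b, and r is unchanged from the original program):

func rangeSum(a, b []int32, r int32) int32 {
    var sum int32
    for it := int32(0); it < r; it++ {
        // Ranging over the slices avoids explicit indexing, so the compiler
        // does not have to emit bounds checks for a[i] or b[j].
        for _, x := range a {
            for _, y := range b {
                sum += (x + y) * (it + 1)
            }
        }
    }
    return sum
}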

Answer 1

Score: 2

Looking at the assembler output of the C program vs the Go program, at least on the versions of Go and GCC that I use (1.19.6 and 12.2.0, respectively), the immediate and obvious difference is that GCC has auto-vectorized the C program, whereas the Go compiler does not seem to have been capable of that.

That also explains fairly well why you're seeing exactly a four-fold difference in performance: when not targeting a specific architecture, GCC uses 128-bit SSE rather than 256-bit AVX, and a 128-bit vector holds four 32-bit lanes, i.e. four times the throughput of scalar 32-bit operations. In fact, adding -march=native gives another two-fold performance increase for me, since it makes GCC emit AVX code on my CPU.

I'm not intimate enough with Go to be able to tell you whether the Go compiler is intrinsically incapable of autovectorization or if it's just this particular program that trips it up for some reason, but nevertheless that seems to be the fundamental reason.
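
To reproduce this observation, the generated assembly can be inspected roughly as follows (the file names are placeholders); packed instructions such as paddd/vpaddd in the C output are the telltale sign of vectorization:

gcc -O3 -S -o bench_c.s bench.c
go build -gcflags=-S bench.go 2>&1 | less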
