2023-03-01 04:33:29 · go

What's costing Go a factor of 4 in performance in this array access microbenchmark (relative to GCC)?

Question

I wrote this microbenchmark to better understand Go's performance characteristics, so that I would be able to make intelligent choices as to when to use it.

I thought this would be the ideal scenario for Go, from the performance overhead point-of-view:

no allocations / deallocations inside the loop
array access clearly within bounds (bounds checks could be removed)

Still, I'm seeing an exactly 4-fold difference in speed relative to gcc -O3 on AMD64. Why is that?

(Timed using the shell. Each takes a few seconds, so the startup is negligible)

package main

import "fmt"

func main() {
    fmt.Println("started")

    var n int32 = 1024 * 32

    a := make([]int32, n)
    b := make([]int32, n)

    var it, i, j int32

    // Fill a with 0..n-1 and b with the negated values.
    for i = 0; i < n; i++ {
        a[i] = i
        b[i] = -i
    }

    var r int32 = 10
    var sum int32 = 0

    // Hot loop: r * n * n iterations, no allocation, every index below n.
    for it = 0; it < r; it++ {
        for i = 0; i < n; i++ {
            for j = 0; j < n; j++ {
                sum += (a[i] + b[j]) * (it + 1)
            }
        }
    }
    fmt.Printf("n = %d, r = %d, sum = %d\n", n, r, sum)
}

The C version:

#include <stdint.h> /* int32_t */
#include <stdio.h>
#include <stdlib.h>

int main() {
    printf("started\n");

    int32_t n = 1024 * 32;

    int32_t* a = malloc(sizeof(int32_t) * n);
    int32_t* b = malloc(sizeof(int32_t) * n);

    /* Fill a with 0..n-1 and b with the negated values. */
    for(int32_t i = 0; i < n; ++i) {
        a[i] =  i;
        b[i] = -i;
    }

    int32_t r = 10;
    int32_t sum = 0;

    /* Hot loop: r * n * n iterations, same arithmetic as the Go version. */
    for(int32_t it = 0; it < r; ++it) {
        for(int32_t i = 0; i < n; ++i) {
            for(int32_t j = 0; j < n; ++j) {
                sum += (a[i] + b[j]) * (it + 1);
            }
        }
    }
    printf("n = %d, r = %d, sum = %d\n", n, r, sum);

    free(a);
    free(b);
}

Updates:

Using range, as suggested, speeds Go up by a factor of 2.
On the other hand, -march=native speeds C up by a factor of 2, in my tests. (And -mno-sse gives a compile error, apparently incompatible with -O3)
GCCGO seems comparable to GCC here (and does not need range)
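
The question does not show the range-based variant it timed, so the following is only a guess at what "using range" looks like here. The function names sumIndexed and sumRange are made up for this sketch, and n is reduced so the comparison finishes instantly. With range, the loop body receives each element directly instead of indexing the slice with an int32 counter.

```go
package main

import "fmt"

// sumIndexed mirrors the question's loops: int32 counters and explicit indexing.
func sumIndexed(a, b []int32, r int32) int32 {
	var sum int32
	for it := int32(0); it < r; it++ {
		for i := int32(0); i < int32(len(a)); i++ {
			for j := int32(0); j < int32(len(b)); j++ {
				sum += (a[i] + b[j]) * (it + 1)
			}
		}
	}
	return sum
}

// sumRange does the same additions in the same order, but iterates with range,
// so the body never evaluates an index expression on a or b.
func sumRange(a, b []int32, r int32) int32 {
	var sum int32
	for it := int32(0); it < r; it++ {
		for _, x := range a {
			for _, y := range b {
				sum += (x + y) * (it + 1)
			}
		}
	}
	return sum
}

func main() {
	const n = 1024 // assumption: far smaller than the question's 1024 * 32
	a := make([]int32, n)
	b := make([]int32, n)
	for i := range a {
		a[i] = int32(i)
		b[i] = -int32(i)
	}
	// Reports whether both variants agree on the result.
	fmt.Println(sumIndexed(a, b, 10) == sumRange(a, b, 10))
}
```

Timing the two functions at the question's original n would be the way to check the reported factor of 2 on a given Go version; this sketch only establishes that they compute the same value.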

Answer 1

Score: 2

Looking at the assembler output of the C program vs the Go program, at least on the versions of Go and GCC that I use (1.19.6 and 12.2.0, respectively), the immediate and obvious difference is that GCC has auto-vectorized the C program, whereas the Go compiler does not seem to have been capable of that.

That also explains fairly well why you're seeing exactly a four-fold difference in performance: when not targeting a specific architecture, GCC uses SSE rather than AVX, and a 128-bit SSE register holds four 32-bit integers, so each vector instruction does the work of four scalar ones. In fact, adding -march=native gives another two-fold performance increase for me, since that makes GCC output AVX code on my CPU.

I'm not intimate enough with Go to be able to tell you whether the Go compiler is intrinsically incapable of autovectorization or if it's just this particular program that trips it up for some reason, but nevertheless that seems to be the fundamental reason.
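
To make the "four lanes" idea concrete, here is a rough scalar sketch of how a 4-wide vectorized inner loop arranges its work. It is not SIMD code and is not claimed to be faster. The names innerScalar and innerFourLanes are invented for this illustration, and the layout is a simplified picture rather than GCC's actual output.

```go
package main

import "fmt"

// innerScalar adds (x + b[j]) * k for every j, one element per step.
func innerScalar(x, k int32, b []int32) int32 {
	var sum int32
	for _, y := range b {
		sum += (x + y) * k
	}
	return sum
}

// innerFourLanes keeps four independent partial sums, the way a 128-bit
// register keeps four 32-bit lanes, and combines them once at the end.
func innerFourLanes(x, k int32, b []int32) int32 {
	var s0, s1, s2, s3 int32
	j := 0
	for ; j+4 <= len(b); j += 4 {
		// In real SIMD code these four lines are a single vector instruction sequence.
		s0 += (x + b[j]) * k
		s1 += (x + b[j+1]) * k
		s2 += (x + b[j+2]) * k
		s3 += (x + b[j+3]) * k
	}
	sum := s0 + s1 + s2 + s3 // horizontal add across the lanes
	for ; j < len(b); j++ {  // leftover elements when len(b) is not a multiple of 4
		sum += (x + b[j]) * k
	}
	return sum
}

func main() {
	b := []int32{1, 2, 3, 4, 5, 6, 7}
	// Reports whether the lane-wise version matches the plain loop.
	fmt.Println(innerScalar(2, 3, b) == innerFourLanes(2, 3, b))
}
```

Reordering the additions is safe because Go's int32 arithmetic wraps on overflow, so the sum is the same in any order.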
