Why does my benchmark show the same performance for ranging over a slice by value vs. by index?

Question

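package copy // package name assumed from the "main/copy" path in the output below

import "testing"

type Item struct {
    A int
    B [1024]byte
}

// BenchmarkRange1 ranges by value: v is a copy of each element.
func BenchmarkRange1(b *testing.B) {
    s := make([]Item, 1024)
    for i := 0; i < b.N; i++ {
        for _, v := range s {
            _ = v.A
        }
    }
}

// BenchmarkRange2 ranges by index and reads the field in place.
func BenchmarkRange2(b *testing.B) {
    s := make([]Item, 1024)
    for i := 0; i < b.N; i++ {
        for i := range s {
            _ = s[i].A
        }
    }
}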

Now, take a look at the benchmark results:

go test -bench=BenchmarkRange -benchmem main/copy
goos: darwin
goarch: amd64
pkg: main/copy
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkRange1-12       4577601               260.9 ns/op             0 B/op          0 allocs/op
BenchmarkRange2-12       4697178               254.9 ns/op             0 B/op          0 allocs/op
PASS
ok      main/copy       3.391s

Doesn't ranging over a slice by value copy each element? Why is the performance the same? What optimization does the compiler apply when we range over the slice by value?

When I disable compiler optimizations with the option "-gcflags=-N", I get the result I expected:

go test -bench=BenchmarkRange -benchmem -gcflags=-N main/copy
goos: darwin
goarch: amd64
pkg: main/copy
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkRange1-12         39004             29481 ns/op              27 B/op          0 allocs/op
BenchmarkRange2-12        777356              1572 ns/op               1 B/op          0 allocs/op
PASS
ok      main/copy       3.169s

Can anyone explain how the compiler optimizes this?

Answer 1

Score: 1

With the default optimizations, the inner loop in both BenchmarkRange1 and BenchmarkRange2 is compiled down to an empty loop with 1024 iterations, as if you had written it like this:

	for i := 0; i < 1024; i++ {

	}

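In both of your examples, the compiler is smart enough to recognize that nothing inside the inner loop has an observable effect (that is, the values of v, v.A, s[i], and s[i].A are never actually used), so it drops the element copy and the field load and keeps only the loop counter.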

go.godbolt.org is a great resource for looking at the assembly the Go compiler produces. For example, the inner loop in BenchmarkRange1 gets compiled down to the following (which zeros out AX, then loops 1024 times):

        XORL    AX, AX
Range1_pc39:
        INCQ    AX
        CMPQ    AX, $1024
        JLT     Range1_pc39

You can look at the complete output here, along with handy tooltips that (usually) explain the different assembly instructions:
https://go.godbolt.org/z/raTPjTrYG

(To make your example shorter, I dropped the testing package; the //go:nosplit comments aren't really needed, but slightly simplify the resulting assembly).
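Since the loop bodies are discarded because their results are never used, one common way to make a benchmark like this measure the iteration itself is to give the loop an observable result. The following is only a minimal sketch under that assumption: the `sink` variable, the `sumByValue`/`sumByIndex` helpers, the benchmark names, and the `main` package name are hypothetical and not part of the original question. Whether the by-value copy then shows up in the timings still depends on the compiler version, since the compiler remains free to read only the field that is used.

```go
package main

import "testing"

// Item mirrors the struct from the question.
type Item struct {
	A int
	B [1024]byte
}

// sink is a hypothetical package-level variable (not in the original code).
// Storing each result here gives the loops an observable effect, so the
// field reads can no longer be thrown away as unused.
var sink int

// sumByValue ranges by value; v is semantically a copy of each element.
func sumByValue(s []Item) int {
	total := 0
	for _, v := range s {
		total += v.A
	}
	return total
}

// sumByIndex ranges by index and reads the field in place.
func sumByIndex(s []Item) int {
	total := 0
	for i := range s {
		total += s[i].A
	}
	return total
}

func BenchmarkRangeValueSink(b *testing.B) {
	s := make([]Item, 1024)
	b.ResetTimer() // exclude the slice allocation from the measurement
	for i := 0; i < b.N; i++ {
		sink = sumByValue(s)
	}
}

func BenchmarkRangeIndexSink(b *testing.B) {
	s := make([]Item, 1024)
	b.ResetTimer() // exclude the slice allocation from the measurement
	for i := 0; i < b.N; i++ {
		sink = sumByIndex(s)
	}
}
```

Saved in a `_test.go` file, this can be run with the same `go test -bench` invocation used in the question, with and without `-gcflags=-N`, to compare against the original numbers.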

Posted by huangapple on March 17, 2022 at 13:01:22
When reposting, please keep the link to this article: https://go.coder-hub.com/71507307.html