Why does my benchmark show the same performance for ranging over a slice by value vs. by index?
Question
type Item struct {
	A int
	B [1024]byte
}

func BenchmarkRange1(b *testing.B) {
	s := make([]Item, 1024)
	for i := 0; i < b.N; i++ {
		for _, v := range s {
			_ = v.A
		}
	}
}

func BenchmarkRange2(b *testing.B) {
	s := make([]Item, 1024)
	for i := 0; i < b.N; i++ {
		for i := range s {
			_ = s[i].A
		}
	}
}
Now, take a look at the result of the benchmark.
go test -bench=BenchmarkRange -benchmem main/copy
goos: darwin
goarch: amd64
pkg: main/copy
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkRange1-12 4577601 260.9 ns/op 0 B/op 0 allocs/op
BenchmarkRange2-12 4697178 254.9 ns/op 0 B/op 0 allocs/op
PASS
ok main/copy 3.391s
Doesn't ranging over a slice by value copy each element? Why is the performance the same? What optimization does the compiler apply when we range over the slice by value?
When I disable compiler optimizations with the option "-gcflags=-N", I get the expected result:
go test -bench=BenchmarkRange -benchmem -gcflags=-N main/copy
goos: darwin
goarch: amd64
pkg: main/copy
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkRange1-12 39004 29481 ns/op 27 B/op 0 allocs/op
BenchmarkRange2-12 777356 1572 ns/op 1 B/op 0 allocs/op
PASS
ok main/copy 3.169s
Can anyone explain how the compiler optimizes this?
Answer 1
Score: 1
With the default optimizations, your inner loop in both BenchmarkRange1 and BenchmarkRange2 is being compiled down to an empty loop with 1024 iterations, as if you had written your inner loop like:
for i := 0; i < 1024; i++ {
}
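In both of your examples, the compiler is smart enough to recognize that you aren't doing anything inside the inner loop (that is, not making use of v, v.A, s[i], or s[i].A). Because the value copied into v is never read, the copy itself is dead code and is eliminated along with the rest of the loop body. With -gcflags=-N that elimination is turned off, which is why the copy cost shows up in your second run.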
go.godbolt.org is a great resource for looking at the assembly the Go compiler produces. For example, the inner loop in BenchmarkRange1 gets compiled down to the following (which zeros out AX, then loops 1024 times):
	XORL	AX, AX
Range1_pc39:
	INCQ	AX
	CMPQ	AX, $1024
	JLT	Range1_pc39
You can look at the complete output here, along with handy tooltips that (usually) explain the different assembly instructions:
https://go.godbolt.org/z/raTPjTrYG
(To make your example shorter, I dropped the testing package; the //go:nosplit comments aren't really needed, but slightly simplify the resulting assembly).
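If you want the benchmark to measure the loop body rather than an empty loop, one common approach is to make the result observable so the compiler cannot discard it. The following is only a minimal sketch under my own assumptions: the names sink, sumByValue, and sumByIndex are hypothetical and not from the question, and it uses testing.Benchmark from a plain program instead of go test. Whether the by-value version really copies each 1 KB element still depends on the compiler version, so no timings are claimed here; checking the generated assembly remains the reliable way to tell.

```go
package main

import (
	"fmt"
	"testing"
)

// Item mirrors the struct from the question.
type Item struct {
	A int
	B [1024]byte
}

// sink is a hypothetical package-level variable. Storing results in it
// makes them observable, so the loops that produce them are not dead code.
var sink int

// sumByValue ranges by value: each iteration nominally copies an Item into v.
func sumByValue(s []Item) int {
	sum := 0
	for _, v := range s {
		sum += v.A
	}
	return sum
}

// sumByIndex ranges by index and reads the field in place.
func sumByIndex(s []Item) int {
	sum := 0
	for i := range s {
		sum += s[i].A
	}
	return sum
}

func main() {
	s := make([]Item, 1024)
	for i := range s {
		s[i].A = i
	}

	// testing.Benchmark runs a benchmark function outside of "go test".
	byValue := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = sumByValue(s)
		}
	})
	byIndex := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = sumByIndex(s)
		}
	})

	fmt.Println("by value:", byValue)
	fmt.Println("by index:", byIndex)
}
```

Both functions return the same sum, so any difference in the reported ns/op comes from how the elements are accessed. To see what the compiler actually emitted on your machine, you can build with -gcflags=-S, which prints the assembly listing.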