Array vs Slice: accessing speed
Question
This question is about the speed of accessing elements of arrays and slices, not about the efficiency of passing them to functions as arguments.
I would expect arrays to be faster than slices in most cases because a slice is a data structure describing a contiguous section of an array and so there may be an extra step involved when accessing elements of a slice (indirectly the elements of its underlying array).
So I wrote a little test to benchmark a batch of simple operations. There are 4 benchmark functions, the first 2 test a global slice and a global array, the other 2 test a local slice and a local array:
```go
var gs = make([]byte, 1000) // Global slice
var ga [1000]byte           // Global array

func BenchmarkSliceGlobal(b *testing.B) {
	for i := 0; i < b.N; i++ {
		for j, v := range gs {
			gs[j]++; gs[j] = gs[j] + v + 10; gs[j] += v
		}
	}
}

func BenchmarkArrayGlobal(b *testing.B) {
	for i := 0; i < b.N; i++ {
		for j, v := range ga {
			ga[j]++; ga[j] = ga[j] + v + 10; ga[j] += v
		}
	}
}

func BenchmarkSliceLocal(b *testing.B) {
	var s = make([]byte, 1000)
	for i := 0; i < b.N; i++ {
		for j, v := range s {
			s[j]++; s[j] = s[j] + v + 10; s[j] += v
		}
	}
}

func BenchmarkArrayLocal(b *testing.B) {
	var a [1000]byte
	for i := 0; i < b.N; i++ {
		for j, v := range a {
			a[j]++; a[j] = a[j] + v + 10; a[j] += v
		}
	}
}
```
I ran the test multiple times, here is the typical output (`go test -bench .*`):

```
BenchmarkSliceGlobal      300000          4210 ns/op
BenchmarkArrayGlobal      300000          4123 ns/op
BenchmarkSliceLocal       500000          3090 ns/op
BenchmarkArrayLocal       500000          3768 ns/op
```
Analyzing the results:

Accessing the global slice is slightly slower than accessing the global array, which is what I expected: 4210 vs 4123 ns/op.

But accessing the local slice is significantly faster than accessing the local array: 3090 vs 3768 ns/op.
My question is: What is the reason for this?
Notes
I tried varying the following things but none changed the outcome:
- the size of the array/slice (tried 100, 1000, 10000)
- the order of the benchmark functions
- the element type of the array/slice (tried `byte` and `int`)
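To make the "extra step" in the question concrete: a slice header carries a pointer into a backing array, so slice indexing goes through that pointer. A minimal sketch (my own illustration, not part of the original question) of a slice describing a contiguous section of an array:

```go
package main

import "fmt"

func main() {
	var a [5]byte
	s := a[1:4] // slice header: pointer to &a[1], len 3, cap 4

	s[0] = 42 // indexes through the slice's pointer into a's memory

	fmt.Println(a[1])         // 42: slice and array share storage
	fmt.Println(len(s), cap(s)) // 3 4
}
```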
Answer 1
Score: 18
Comparing the amd64 assembly of both `BenchmarkArrayLocal` and `BenchmarkSliceLocal` (too long to fit in this post):

The array version loads the address of `a` from memory multiple times, practically on every array-access operation:

```
LEAQ    "".a+1000(SP),BX
```

Whereas the slice version is computing exclusively on registers after loading once from memory:

```
LEAQ    (DX)(SI*1),BX
```

This is not conclusive but probably the cause. Reason being that both methods are otherwise virtually identical. One other notable detail is that the array version calls into `runtime.duffcopy`, which is a quite long assembly routine, whereas the slice version doesn't.
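The `runtime.duffcopy` call is plausibly a consequence of Go's range semantics: ranging over an array value with two iteration variables iterates over a copy of the array, while ranging over a slice does not copy the elements. A small sketch of that copy semantics (my own illustration, not taken from the answer's assembly):

```go
package main

import "fmt"

func main() {
	a := [3]int{1, 2, 3}
	for i, v := range a { // the range expression is copied for arrays
		if i == 0 {
			a[1] = 100 // write to the original array...
		}
		fmt.Print(v, " ") // ...but v reads the copy: prints 1 2 3
	}
	fmt.Println("\na[1] =", a[1]) // a[1] = 100
}
```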
Answer 2
Score: 5
Go 1.8 can eliminate some bounds checks, so the difference got bigger:

```
BenchmarkSliceGlobal-4    500000          3220 ns/op
BenchmarkArrayGlobal-4   1000000          1287 ns/op
BenchmarkSliceLocal-4    1000000          1267 ns/op
BenchmarkArrayLocal-4    1000000          1301 ns/op
```

For arrays I'd recommend using power-of-two sizes and including a bitwise AND operation. That way you can be sure the compiler eliminates the bounds check. Thus `var ga [1024]byte` with `ga[j & 1023]`.
Answer 3
Score: 1
On Go 1.18 and an M1, the difference is much bigger. I was sure that arrays were faster than slices, but now I have proof that it's not always the case.

```
goos: darwin
goarch: arm64
BenchmarkSliceGlobal-8    926196        1257.0 ns/op        0 B/op        0 allocs/op
BenchmarkArrayGlobal-8   2110324         567.0 ns/op        0 B/op        0 allocs/op
BenchmarkSliceLocal-8    2275382         535.0 ns/op        0 B/op        0 allocs/op
BenchmarkArrayLocal-8    1802491         647.4 ns/op        0 B/op        0 allocs/op
```