数组与切片:访问速度

huangapple go评论84阅读模式
英文:

Array vs Slice: accessing speed

问题

这个问题是关于访问数组和切片元素的速度,而不是关于将它们作为参数传递给函数的效率。

我预期在大多数情况下,数组切片更快,因为切片是描述数组连续部分的数据结构,所以访问切片元素时可能涉及到额外的步骤(间接访问其底层数组的元素)。

所以我编写了一个小测试来对一批简单操作进行基准测试。有4个基准函数,前两个测试全局切片和全局数组,另外两个测试局部切片和局部数组:

var gs = make([]byte, 1000) // 全局切片
var ga [1000]byte           // 全局数组

func BenchmarkSliceGlobal(b *testing.B) {
    for i := 0; i < b.N; i++ {
        for j, v := range gs {
            gs[j]++; gs[j] = gs[j] + v + 10; gs[j] += v
        }
    }
}

func BenchmarkArrayGlobal(b *testing.B) {
    for i := 0; i < b.N; i++ {
        for j, v := range ga {
            ga[j]++; ga[j] = ga[j] + v + 10; ga[j] += v
        }
    }
}

func BenchmarkSliceLocal(b *testing.B) {
    var s = make([]byte, 1000)
    for i := 0; i < b.N; i++ {
        for j, v := range s {
            s[j]++; s[j] = s[j] + v + 10; s[j] += v
        }
    }
}

func BenchmarkArrayLocal(b *testing.B) {
    var a [1000]byte
    for i := 0; i < b.N; i++ {
        for j, v := range a {
            a[j]++; a[j] = a[j] + v + 10; a[j] += v
        }
    }
}

我多次运行了这个测试,这是典型的输出(go test -bench .*):

BenchmarkSliceGlobal      300000              4210 ns/op
BenchmarkArrayGlobal      300000              4123 ns/op
BenchmarkSliceLocal       500000              3090 ns/op
BenchmarkArrayLocal       500000              3768 ns/op

分析结果:

访问全局切片比访问全局数组稍微慢一些,这符合我的预期:
4210 vs 4123 ns/op

但是访问局部切片比访问局部数组明显更快:
3090 vs 3768 ns/op

我的问题是:这是什么原因?

注意:

我尝试改变了以下几个因素,但结果没有改变:

  • 数组/切片的大小(尝试了100、1000、10000)
  • 基准函数的顺序
  • 数组/切片的元素类型(尝试了byteint
英文:

This question is about the speed of accessing elements of arrays and slices, not about the efficiency of passing them to functions as arguments.

I would expect arrays to be faster than slices in most cases because a slice is a data structure describing a contiguous section of an array and so there may be an extra step involved when accessing elements of a slice (indirectly the elements of its underlying array).

So I wrote a little test to benchmark a batch of simple operations. There are 4 benchmark functions, the first 2 test a global slice and a global array, the other 2 test a local slice and a local array:

var gs = make([]byte, 1000) // Global slice
var ga [1000]byte           // Global array
func BenchmarkSliceGlobal(b *testing.B) {
for i := 0; i &lt; b.N; i++ {
for j, v := range gs {
gs[j]++; gs[j] = gs[j] + v + 10; gs[j] += v
}
}
}
func BenchmarkArrayGlobal(b *testing.B) {
for i := 0; i &lt; b.N; i++ {
for j, v := range ga {
ga[j]++; ga[j] = ga[j] + v + 10; ga[j] += v
}
}
}
func BenchmarkSliceLocal(b *testing.B) {
var s = make([]byte, 1000)
for i := 0; i &lt; b.N; i++ {
for j, v := range s {
s[j]++; s[j] = s[j] + v + 10; s[j] += v
}
}
}
func BenchmarkArrayLocal(b *testing.B) {
var a [1000]byte
for i := 0; i &lt; b.N; i++ {
for j, v := range a {
a[j]++; a[j] = a[j] + v + 10; a[j] += v
}
}
}

I ran the test multiple times, here is the typical output (go test -bench .*):

BenchmarkSliceGlobal      300000              4210 ns/op
BenchmarkArrayGlobal      300000              4123 ns/op
BenchmarkSliceLocal       500000              3090 ns/op
BenchmarkArrayLocal       500000              3768 ns/op

Analyzing the results:

Accessing the global slice is slightly slower than accessing the global array which is as I expected:
4210 vs 4123 ns/op

But accessing the local slice is significantly faster than accessing the local array:
3090 vs 3768 ns/op

My question is: What is the reason for this?

Notes

I tried varying the following things but none changed the outcome:

  • the size of the array/slice (tried 100, 1000, 10000)
  • the order of the benchmark functions
  • the element type of the array/slice (tried byte and int)

答案1

得分: 18

比较BenchmarkArrayLocalBenchmarkSliceLocalamd64汇编代码(太长无法在此贴出):

数组版本在每次数组访问操作中都要从内存中加载a的地址:

LEAQ	&quot;&quot;.a+1000(SP),BX

而切片版本在从内存中加载一次后,仅在寄存器上进行计算:

LEAQ	(DX)(SI*1),BX

这并不是最终结论,但可能是原因。原因是这两种方法在其他方面几乎完全相同。另一个值得注意的细节是,数组版本调用了runtime.duffcopy,这是一个相当长的汇编程序,而切片版本则没有。

英文:

Comparing the amd64 assembly of both BenchmarkArrayLocal and BenchmarkSliceLocal (too long to fit in this post):

The array version loads the address of a from memory multiple times, practically on every array-access operation:

LEAQ	&quot;&quot;.a+1000(SP),BX

Whereas the slice version is computing exclusively on registers after loading once from memory:

LEAQ	(DX)(SI*1),BX

This is not conclusive but probably the cause. Reason being that both methods are otherwise virtually identical. One other notable detail is that the array version calls into runtime.duffcopy, which is a quite long assembly routine, whereas the slice version doesn't.

答案2

得分: 5

Go版本1.8可以消除一些范围检查,所以差异变得更大。

BenchmarkSliceGlobal-4   	  500000	      3220 ns/op
BenchmarkArrayGlobal-4   	 1000000	      1287 ns/op
BenchmarkSliceLocal-4    	 1000000	      1267 ns/op
BenchmarkArrayLocal-4    	 1000000	      1301 ns/op

对于数组,我建议使用2的幂次方大小,并包括一个逻辑与操作。这样,你可以确保编译器消除了检查。因此,使用var ga [1024]bytega[j & 1023]

英文:

Go version 1.8 can eliminate some range checks so the difference got bigger.

BenchmarkSliceGlobal-4   	  500000	      3220 ns/op
BenchmarkArrayGlobal-4   	 1000000	      1287 ns/op
BenchmarkSliceLocal-4    	 1000000	      1267 ns/op
BenchmarkArrayLocal-4    	 1000000	      1301 ns/op

For arrays I'd recommend to use sizes from powers of two and include a logical and operation. In that way you're sure the compiler eliminates the check. Thus var ga [1024]byte with ga[j &amp; 1023].

答案3

得分: 1

在go1.18和M1上,差异更大了。我以前确信数组比切片更快,但现在我有证据表明情况并非总是如此。

goos: darwin
goarch: arm64
BenchmarkSliceGlobal-8   	  926196	      1257.0 ns/op	       0 B/op	       0 allocs/op
BenchmarkArrayGlobal-8   	 2110324	       567.0 ns/op	       0 B/op	       0 allocs/op
BenchmarkSliceLocal-8    	 2275382	       535.0 ns/op	       0 B/op	       0 allocs/op
BenchmarkArrayLocal-8    	 1802491	       647.4 ns/op	       0 B/op	       0 allocs/op
英文:

on go1.18 and M1 the difference much bigger
i was sure that array is faster then slice but now i have proof it's not always the case

goos: darwin
goarch: arm64
BenchmarkSliceGlobal-8   	  926196	      1257.0 ns/op	       0 B/op	       0 allocs/op
BenchmarkArrayGlobal-8   	 2110324	       567.0 ns/op	       0 B/op	       0 allocs/op
BenchmarkSliceLocal-8    	 2275382	       535.0 ns/op	       0 B/op	       0 allocs/op
BenchmarkArrayLocal-8    	 1802491	       647.4 ns/op	       0 B/op	       0 allocs/op```
</details>

huangapple
  • 本文由 发表于 2015年5月29日 16:49:35
  • 转载请务必保留本文链接:https://go.coder-hub.com/30525184.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定