Go: Unexpected performance when accessing an array through slice of slices (2D slice)


I was doing some performance experimentation in Go with matrix multiplication and ran into some unexpected results.

Version 1:

func newMatrix(n int) [][]int {
    m := make([][]int, n)
    buf := make([]int, n*n)

    for i := range m {
        m[i] = buf[i*n : (i+1)*n]
    }

    return m
}

func mult1(m1, m2, res [][]int) [][]int {
    for i := range m1 {
        for k := range m1[0] {
            for j := range m2[0] {
                res[i][j] += m1[i][k] * m2[k][j]
            }
        }
    }

    return res
}

From the linear array I create multiple slices that represent the matrix rows.
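A small, self-contained check (the value 42 is arbitrary) that rows built this way really alias the single buffer, i.e. both versions share the same memory layout:

```go
package main

import "fmt"

func main() {
	// Build the rows exactly as newMatrix does: each row is a
	// window into one contiguous buffer.
	n := 3
	buf := make([]int, n*n)
	m := make([][]int, n)
	for i := range m {
		m[i] = buf[i*n : (i+1)*n]
	}

	m[1][2] = 42            // write through the 2D view...
	fmt.Println(buf[1*n+2]) // ...is visible in the flat buffer: 42
}
```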

Version 2:

func mult2(m1, m2, res []int, n int) []int {
    for i := 0; i < n; i++ {
        for k := 0; k < n; k++ {
            for j := 0; j < n; j++ {
                res[i*n+j] += m1[i*n+k] * m2[k*n+j]
            }
        }
    }

    return res
}

In this version I simply use a linear array and index into it from the multiplication.

Multiplying two 2048x2048 matrices gives the following execution times:

 version 1: 35.550813801s
 version 2: 19.090223468s

Version 2 is almost twice as fast.

I used the approach below to take the measurements:

start := time.Now()
mult(m1, m2, m3)
stop := time.Now()
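A single `time.Now` pair is noisy; one way to get more stable numbers (a sketch, reusing `mult2` from version 2 with a much smaller `n` so the example finishes quickly) is to repeat the multiplication and average:

```go
package main

import (
	"fmt"
	"time"
)

// mult2 as in version 2: flat, row-major n*n matrices.
func mult2(m1, m2, res []int, n int) []int {
	for i := 0; i < n; i++ {
		for k := 0; k < n; k++ {
			for j := 0; j < n; j++ {
				res[i*n+j] += m1[i*n+k] * m2[k*n+j]
			}
		}
	}
	return res
}

func main() {
	const n, runs = 128, 5 // small n so the example finishes quickly
	m1 := make([]int, n*n)
	m2 := make([]int, n*n)
	res := make([]int, n*n)

	var total time.Duration
	for r := 0; r < runs; r++ {
		for i := range res { // reset the accumulator between runs
			res[i] = 0
		}
		start := time.Now()
		mult2(m1, m2, res, n)
		total += time.Since(start)
	}
	fmt.Printf("avg over %d runs: %v\n", runs, total/runs)
}
```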

I am aware that using slices adds another layer of indirection, which could impact cache performance, but I didn't expect such a big difference. Unfortunately, I haven't found any good tool that works on a Mac and can analyse cache efficiency in Go, so I can't say for sure whether this is what's causing the performance difference.

So I guess what I'm asking is: is this expected behavior, or is there something I'm missing?

Software and hardware:
Go version 1.4.2 darwin/amd64; OS X 10.10.3; 2 GHz quad-core i7.

Answer 1 (score: 6)



The main problem in your version 1 code seems to be indirect addressing. Even though the memory layout of the matrices is the same in both versions, indirect addressing can lead to:

  • More generated instructions for the same code. The compiler can have trouble determining when to use packed versions of SIMD instructions (e.g. SSE, AVX). You can verify this by dumping the assembly: look for XMM or YMM registers and check whether the instructions operating on them are packed.
  • Difficulty for the compiler in adding software prefetches. Because of the indirect addressing, it is hard for the compiler to detect where to insert them. You can look for prefetch instructions in the assembly code.
  • A less efficient hardware prefetcher, again due to indirect addressing. You first need to load the row's start address and only then its elements, so it is hard for the prefetcher to see that it should simply fetch consecutive addresses. That is only measurable through profiling, e.g. with perf.

So in version 1, indirect addressing is the main issue. I also recommend running both versions for multiple iterations to remove the cache-warming penalty, which might be higher for version 1 because of the points above.
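Not part of the answer above, but one way to act on its indirect-addressing point while keeping the `[][]int` API is to hoist the row slices out of the inner loop, so the hot loop indexes plain `[]int` slices much like version 2 does (a sketch; `multHoisted` is a hypothetical name):

```go
package main

import "fmt"

// multHoisted is like mult1, but the row slices m1[i], res[i] and m2[k]
// are loaded once per outer/middle iteration instead of on every
// inner-loop access, removing one level of indirection from the hot loop.
func multHoisted(m1, m2, res [][]int) [][]int {
	for i := range m1 {
		m1i, resi := m1[i], res[i]
		for k := range m1i {
			m2k, a := m2[k], m1i[k]
			for j := range m2k {
				resi[j] += a * m2k[j]
			}
		}
	}
	return res
}

func main() {
	// 2x2 sanity check: [[1,2],[3,4]] * [[5,6],[7,8]] = [[19,22],[43,50]]
	m1 := [][]int{{1, 2}, {3, 4}}
	m2 := [][]int{{5, 6}, {7, 8}}
	res := [][]int{{0, 0}, {0, 0}}
	fmt.Println(multHoisted(m1, m2, res)) // [[19 22] [43 50]]
}
```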

Answer 2 (score: -1)



Unfortunately, I do not have enough reputation to post this as a comment, but in addition to VAndrei's points it is worth noting that the two examples use their for-loops differently. How does the first example perform after s/i := range m1/i := 0; i < n; i++/?

It could also be useful to check what the "list mult1" and "list mult2" outputs look like in pprof.
There is a great tutorial for getting started with Go's pprof quickly: Profiling Go Programs by Russ Cox.

huangapple
  • Posted on 2015-05-11 00:57:22
  • Please retain this link when republishing: https://go.coder-hub.com/30154087.html