May 11, 2015, 00:57:22 · go

Go: Unexpected performance when accessing an array through slice of slices (2D slice)

Question

I was doing some performance experimentation in Go with matrix multiplication and ran into some unexpected results.

Version 1:

func newMatrix(n int) [][]int {
    m := make([][]int, n)
    buf := make([]int, n*n)
    for i := range m {
        m[i] = buf[i*n : (i+1)*n]
    }
    return m
}
func mult1(m1, m2, res [][]int) [][]int {
    for i := range m1 {
        for k := range m1[0] {
            for j := range m2[0] {
                res[i][j] += m1[i][k] * m2[k][j]
            }
        }
    }
    return res
}

From the linear array I create multiple slices that represent the matrix rows.

Version 2:

func mult2(m1, m2, res []int, n int) []int {
    for i := 0; i < n; i++ {
        for k := 0; k < n; k++ {
            for j := 0; j < n; j++ {
                res[i*n+j] += m1[i*n+k] * m2[k*n+j]
            }
        }
    }
    return res
}

In this version I simply use a linear array and index into it from the multiplication.

Multiplying two 2048x2048 matrices gives the following execution times:

 version 1: 35.550813801s
 version 2: 19.090223468s

Version 2 is almost twice as fast.

I used the approach below to take the measurements:

start := time.Now()
mult(m1, m2, m3)
stop := time.Now()

I am aware that using slices adds another layer of indirection, which could hurt cache performance, but I didn't expect such a big difference. Unfortunately, I haven't found a good tool that works on Mac and can analyse cache efficiency in Go, so I can't say for sure whether this is what causes the performance difference.

So I guess I'm asking: is this expected behavior, or is there something I'm missing?

Software and hardware:
Go version 1.4.2 darwin/amd64; OS X 10.10.3; 2 GHz quad-core i7.

Answer 1

Score: 6

The main problem in your version 1 code seems to be indirect addressing. Even though the layout in memory for the matrices in both versions is the same, using indirect addressing can lead to:

1. More instructions are generated for the same code. The compiler may have trouble determining when it can use packed SIMD instructions (e.g. SSE, AVX). You can verify this by dumping the assembly code, looking for XMM or YMM registers, and checking whether the instructions operating on those registers are packed.
2. It becomes harder for the compiler to add software prefetches, because with indirect addressing it is difficult to detect where they belong. You can look for prefetch instructions (e.g. PREFETCHT0) in the assembly code.
3. The hardware prefetcher is less efficient, also because of indirect addressing. The row start address has to be loaded first and the row elements accessed afterwards, so it is harder for the hardware prefetcher to recognise that it should simply fetch consecutive addresses. This can only be measured with a profiler such as perf.

So in the case of version 1, indirect addressing is the main issue. I also recommend running both versions over multiple iterations to remove the cache-warming penalty, which might be higher for version 1 for the reasons explained above.
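The advice to run both versions over several iterations could be sketched roughly as follows. This is only an illustration, not the asker's harness: the helper `bestOf`, the matrix size `n = 256`, and the run count of 5 are assumptions, and reporting the fastest run is just one way to reduce the weight of one-off warm-up costs.

```go
package main

import (
	"fmt"
	"time"
)

// newMatrix returns n row slices that all point into one contiguous buffer,
// as in the question.
func newMatrix(n int) [][]int {
	m := make([][]int, n)
	buf := make([]int, n*n)
	for i := range m {
		m[i] = buf[i*n : (i+1)*n]
	}
	return m
}

// mult1 is version 1 from the question (slice of slices).
func mult1(m1, m2, res [][]int) [][]int {
	for i := range m1 {
		for k := range m1[0] {
			for j := range m2[0] {
				res[i][j] += m1[i][k] * m2[k][j]
			}
		}
	}
	return res
}

// mult2 is version 2 from the question (flat slice, manual indexing).
func mult2(m1, m2, res []int, n int) []int {
	for i := 0; i < n; i++ {
		for k := 0; k < n; k++ {
			for j := 0; j < n; j++ {
				res[i*n+j] += m1[i*n+k] * m2[k*n+j]
			}
		}
	}
	return res
}

// bestOf is a hypothetical helper: it runs f the given number of times
// and returns the shortest elapsed time.
func bestOf(runs int, f func()) time.Duration {
	best := time.Duration(-1)
	for r := 0; r < runs; r++ {
		start := time.Now()
		f()
		if d := time.Since(start); best < 0 || d < best {
			best = d
		}
	}
	return best
}

func main() {
	const n, runs = 256, 5 // assumed values, small enough to finish quickly
	a, b, c := newMatrix(n), newMatrix(n), newMatrix(n)
	fa, fb, fc := make([]int, n*n), make([]int, n*n), make([]int, n*n)
	for i := 0; i < n*n; i++ {
		a[i/n][i%n], b[i/n][i%n] = i%7, i%5
		fa[i], fb[i] = i%7, i%5
	}
	// The result matrices keep accumulating across runs; that is harmless
	// here because only the elapsed time is of interest.
	fmt.Println("version 1:", bestOf(runs, func() { mult1(a, b, c) }))
	fmt.Println("version 2:", bestOf(runs, func() { mult2(fa, fb, fc, n) }))
}
```

The timings depend on the machine and Go version, so the numbers from the question should not be expected to reproduce exactly.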

Answer 2

Score: -1

Unfortunately, I do not have enough reputation to post this as a comment, but in addition to VAndrei's points it is worth noting that the two examples use for loops differently: version 1 ranges over the slices, while version 2 uses index loops. How does the first example perform after s/i := range m1/i := 0; i < n; i++/ ?
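Applied to version 1, that substitution might look like the sketch below. `mult1Indexed` is a hypothetical name, and the extra parameter `n` is an assumption, since the original `mult1` takes no size argument. `mult1Rows` is a further experiment that does not come from the thread: it loads each row slice once per outer iteration, so the inner loop no longer goes through the outer slice.

```go
package main

import "fmt"

// mult1Indexed is version 1 with the range loops replaced by index loops.
func mult1Indexed(m1, m2, res [][]int, n int) [][]int {
	for i := 0; i < n; i++ {
		for k := 0; k < n; k++ {
			for j := 0; j < n; j++ {
				res[i][j] += m1[i][k] * m2[k][j]
			}
		}
	}
	return res
}

// mult1Rows (not from the thread) hoists the row lookups out of the inner
// loop. It assumes res does not alias m1 or m2.
func mult1Rows(m1, m2, res [][]int, n int) [][]int {
	for i := 0; i < n; i++ {
		resRow, m1Row := res[i], m1[i]
		for k := 0; k < n; k++ {
			m2Row, x := m2[k], m1Row[k]
			for j := 0; j < n; j++ {
				resRow[j] += x * m2Row[j]
			}
		}
	}
	return res
}

func main() {
	a := [][]int{{1, 2}, {3, 4}}
	b := [][]int{{5, 6}, {7, 8}}
	fmt.Println(mult1Indexed(a, b, [][]int{{0, 0}, {0, 0}}, 2)) // [[19 22] [43 50]]
	fmt.Println(mult1Rows(a, b, [][]int{{0, 0}, {0, 0}}, 2))    // [[19 22] [43 50]]
}
```

Whether either variant closes the gap to version 2 would have to be measured; the sketch only shows that the rewrites compute the same product.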

It could also be useful to check what the "list mult1" and "list mult2" outputs look like in pprof.
There is a great tutorial for getting started with Go's pprof quickly: "Profiling Go Programs" by Russ Cox.
