Question

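In the RISC-V edition of 'Computer Organization and Design' by Patterson and Hennessy, 'FIGURE 3.21' annotates _mm256_load_pd(C + i + j * n) as C[i][j], which looked odd to me at first glance. (The code resembles Berkeley's dgemm_unroll code, which comes from an Intel article.)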

Code from the book:

void dgemm_avx256(const uint32_t n, const double* A, const double* B, double* C)
{

    for( uint32_t i = 0; i < n; i += 4 )
    {
        for( uint32_t j = 0; j < n; j++ )
        {
            __m256d c0 = _mm256_load_pd(C + i + j * n); /* c0 = C[i][j] */
            for( uint32_t k = 0; k < n; k++ )
            {
                c0 = _mm256_add_pd(c0, /* c0 += A[i][k]*B[k][j] */
                        _mm256_mul_pd( _mm256_load_pd(A + i + k * n), _mm256_broadcast_sd(B + k + j * n) ) );
            }

            _mm256_store_pd(C+i+j*n, c0); /* C[i][j] = c0 */
        }
    }
}

I then read Intel's online reference for _mm256_load_pd, which says the parameter is a '256-bit aligned memory location'.
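To make the alignment requirement concrete, here is a minimal sketch that checks whether every address C + i + j * n visited by the loops above falls on a 32-byte (256-bit) boundary. The helper names is_aligned_32 and all_loads_aligned are hypothetical and not part of any library; the sketch assumes C points to a buffer of at least n * n doubles.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if p sits on a 32-byte (256-bit) boundary. */
static int is_aligned_32(const void *p)
{
    return ((uintptr_t)p % 32u) == 0;
}

/* Hypothetical helper: returns 1 if every address C + i + j * n used by the
 * loops (i stepping by 4, j stepping by 1) is 32-byte aligned.
 * Assumes C points to at least n * n doubles. */
static int all_loads_aligned(const double *C, uint32_t n)
{
    for (uint32_t i = 0; i < n; i += 4) {
        for (uint32_t j = 0; j < n; j++) {
            if (!is_aligned_32(C + i + (size_t)j * n)) {
                return 0;
            }
        }
    }
    return 1;
}
```

Under these assumptions the loads stay aligned only when C itself is 32-byte aligned and n is a multiple of 4: i steps by 4 doubles (32 bytes), so j * n must also be a multiple of 4 doubles for every j.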

Q: So C + i + j * n should be C[j][i] rather than C[i][j]. Am I misunderstanding something?

I also tested with gdb: the second time _mm256_load_pd runs, the address it loads from is indeed C[1][0].
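The two readings of the same address can be compared with a small sketch. The function names row_major, col_major, and book_offset are hypothetical and introduced only for illustration; they compute flat offsets into an n x n array of doubles.

```c
#include <stddef.h>

/* Flat offset of element [row][col] in an n x n row-major (C-style) array. */
static size_t row_major(size_t row, size_t col, size_t n)
{
    return row * n + col;
}

/* Flat offset of element (row, col) in an n x n column-major
 * (Fortran-style) array. */
static size_t col_major(size_t row, size_t col, size_t n)
{
    return row + col * n;
}

/* Offset used by the code above for loop indices i and j. */
static size_t book_offset(size_t i, size_t j, size_t n)
{
    return i + j * n;
}
```

For the second load (i = 0, j = 1) the offset is n, which is element [1][0] if the array is read as row-major and element (0, 1) if it is read as column-major. The gdb observation matches the row-major reading.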

Below is a snippet of the disassembly (built temporarily with -nopie), with some comments from -fverbose-asm and some that I added myself:

   0x00000000004035b0 <+48>:    44 89 fe                mov    esi,r15d ; _60, i
   0x00000000004035b3 <+51>:    45 31 d2                xor    r10d,r10d ;ivtmp.20 is `j*n`
...
   0x00000000004035c5 <+69>:    48 8d 04 32             lea    rax,[rdx+rsi*1] ;i+j*n
   0x00000000004035c9 <+73>:    4d 8d 5c c5 00          lea    r11,[r13+rax*8+0x0] ; *8 bytes
   0x00000000004035ce <+78>:    49 8d 04 d4             lea    rax,[r12+rdx*8]
   0x00000000004035d2 <+82>:    4c 01 f2                add    rdx,r14
=> 0x00000000004035d5 <+85>:    c4 c1 7d 28 03          vmovapd ymm0,YMMWORD PTR [r11] ; r11 is C + i + j * n
...
   0x0000000000403602 <+130>:   41 01 fa                add    r10d,edi ; j+=n

Answer 1

Score: 0

Thanks for the comments above.

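The code uses Fortran-style (column-major) indexing, even though the COD book describes it as an 'Optimized C version of DGEMM using C intrinsics to generate the AVX subword-parallel instructions for the x86.' With column-major storage, C + i + j * n addresses row i, column j, which is what the book's C[i][j] comment means.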
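As an illustration of that convention, here is a minimal scalar sketch of C += A * B in which all three n x n matrices are stored column-major. The function name dgemm_colmajor is hypothetical; the sketch only mirrors the indexing of the AVX code in the question, one element at a time instead of four.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar C += A * B for n x n matrices stored column-major:
 * element (row, col) lives at offset row + col * n. */
static void dgemm_colmajor(uint32_t n, const double *A, const double *B,
                           double *C)
{
    for (uint32_t i = 0; i < n; i++) {
        for (uint32_t j = 0; j < n; j++) {
            double cij = C[i + (size_t)j * n];      /* cij = C(i, j) */
            for (uint32_t k = 0; k < n; k++) {
                /* cij += A(i, k) * B(k, j) */
                cij += A[i + (size_t)k * n] * B[k + (size_t)j * n];
            }
            C[i + (size_t)j * n] = cij;             /* C(i, j) = cij */
        }
    }
}
```

Read this way, the AVX version loads C(i, j) through C(i+3, j), four consecutive rows of column j, which are contiguous in memory only because the layout is column-major.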

Marking this Q&A as solved.
