Question about dgemm test: `_mm256_load_pd(C + i + j * n)`

Question


In the RISC-V edition of 'Computer Organization and Design' by Patterson and Hennessy, 'FIGURE 3.21' comments _mm256_load_pd(C + i + j * n) as loading C[i][j], which looked odd to me at first glance. (The code is similar to the Berkeley dgemm_unroll code, which comes from an Intel article.)

Code in the book:

void dgemm_avx256(const uint32_t n, const double* A, const double* B, double* C)
{

    for( uint32_t i = 0; i < n; i += 4 )
    {
        for( uint32_t j = 0; j < n; j++ ) 
        {
            __m256d c0 = _mm256_load_pd(C + i + j * n); /* c0 = C[i][j] */
            for( uint32_t k = 0; k < n; k++ )
            {
                c0 = _mm256_add_pd(c0, /* c0 += A[i][k]*B[k][j] */
                        _mm256_mul_pd( _mm256_load_pd(A + i + k * n), _mm256_broadcast_sd(B + k + j * n) ) );
            }

            _mm256_store_pd(C+i+j*n, c0); /* C[i][j] = c0 */
        }
    }
}

Then I read Intel's online reference for _mm256_load_pd; it describes the parameter as a '256-bit aligned memory location'.
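For context, a 256-bit alignment means the address must be a multiple of 32 bytes. The sketch below is not from the book; it only checks whether every address C + i + j * n that the loops would visit meets that requirement. The helper names is_aligned32 and all_loads_aligned are hypothetical, and it assumes the caller obtained C from something like aligned_alloc(32, ...).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if p sits on a 32-byte (256-bit) boundary, else 0. */
static int is_aligned32(const void *p)
{
    return ((uintptr_t)p % 32u) == 0;
}

/* Hypothetical helper: returns 1 if every address C + i + j * n visited by
   the book's loops (i stepping by 4) is 32-byte aligned, else 0.
   Only addresses are computed; nothing is dereferenced. */
static int all_loads_aligned(const double *C, uint32_t n)
{
    assert(C != NULL);
    for (uint32_t i = 0; i < n; i += 4)
        for (uint32_t j = 0; j < n; j++)
            if (!is_aligned32(C + i + (size_t)j * n))
                return 0;
    return 1;
}
```

With a 32-byte aligned base and n a multiple of 4, every offset i + j * n is a multiple of 4 doubles (32 bytes), so the check passes. With n = 6, the address for i = 0, j = 1 is 48 bytes past the base and fails the check.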


Q: So, read as a C array, C + i + j * n should be C[j][i] rather than C[i][j]. Am I getting something wrong?


I also tested with gdb: the second time _mm256_load_pd runs, the address corresponds to C[1][0].

Below is a snippet of the disassembly, built temporarily with -nopie; the comments come from -fverbose-asm plus a few I added myself:

   0x00000000004035b0 <+48>:    44 89 fe                mov    esi,r15d ; _60, i
   0x00000000004035b3 <+51>:    45 31 d2                xor    r10d,r10d ;ivtmp.20 is `j*n`
...
   0x00000000004035c5 <+69>:    48 8d 04 32             lea    rax,[rdx+rsi*1] ;i+j*n
   0x00000000004035c9 <+73>:    4d 8d 5c c5 00          lea    r11,[r13+rax*8+0x0] ; *8 bytes
   0x00000000004035ce <+78>:    49 8d 04 d4             lea    rax,[r12+rdx*8]
   0x00000000004035d2 <+82>:    4c 01 f2                add    rdx,r14
=> 0x00000000004035d5 <+85>:    c4 c1 7d 28 03          vmovapd ymm0,YMMWORD PTR [r11] ; r11 is C + i + j * n
...
   0x0000000000403602 <+130>:   41 01 fa                add    r10d,edi ; j+=n
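
The two lea instructions in the listing appear to build the load address in two steps: first the element index i + j * n, then the byte address, scaling by 8 because a double is 8 bytes. A minimal sketch of that arithmetic follows. The function name load_address is hypothetical, and the register roles (rsi holding i, rdx holding j * n, r13 holding the base of C) are read off the comments above rather than confirmed independently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper mirroring the two lea instructions:
     lea rax,[rdx+rsi*1]  ->  index   = j*n + i
     lea r11,[r13+rax*8]  ->  address = base of C + index * 8 bytes */
static uintptr_t load_address(uintptr_t c_base, uint64_t i, uint64_t j, uint64_t n)
{
    assert(sizeof(double) == 8);   /* the *8 scale assumes 8-byte doubles */
    uint64_t index = j * n + i;    /* element index i + j*n */
    return c_base + index * sizeof(double);
}
```

For the second execution of the load (i = 0, j = 1), this gives base + n * 8 bytes. Viewed as a row-major C array, that is the start of row 1, which matches the C[1][0] seen in gdb.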

Answer 1

Score: 0


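Thanks for the comments above.

The code is Fortran-style: the matrices are stored in column-major order, so C + i + j * n is the element in row i, column j, and the comment C[i][j] names that element rather than using C-style array indexing. This is despite the COD book's caption: 'Optimized C version of DGEMM using C intrinsics to generate the AVX subword-parallel instructions for the x86.'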
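
To make the difference concrete, here is a small sketch (mine, not from the book) of the two index conventions, plus a scalar DGEMM that uses column-major indexing and should compute the same sums as the AVX version one element at a time. The names idx_col_major, idx_row_major and dgemm_scalar_col_major are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Column-major (Fortran-style): offset of element (row i, column j). */
static size_t idx_col_major(uint32_t i, uint32_t j, uint32_t n)
{
    assert(i < n && j < n);
    return (size_t)i + (size_t)j * n;
}

/* Row-major (C-style 2D array): offset of element (row i, column j). */
static size_t idx_row_major(uint32_t i, uint32_t j, uint32_t n)
{
    assert(i < n && j < n);
    return (size_t)i * n + (size_t)j;
}

/* Scalar reference: C += A * B, with all three n x n matrices stored
   column-major, matching the A + i + k * n and B + k + j * n accesses. */
static void dgemm_scalar_col_major(uint32_t n, const double *A,
                                   const double *B, double *C)
{
    for (uint32_t i = 0; i < n; i++) {
        for (uint32_t j = 0; j < n; j++) {
            double cij = C[idx_col_major(i, j, n)];
            for (uint32_t k = 0; k < n; k++)
                cij += A[idx_col_major(i, k, n)] * B[idx_col_major(k, j, n)];
            C[idx_col_major(i, j, n)] = cij;
        }
    }
}
```

The same offset i + j * n is element (i, j) under the column-major convention and element (j, i) under the row-major one, which is why both readings in the question are consistent with the gdb result.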

Marking this Q&A as solved.

Posted by huangapple on June 5, 2023, 19:20:35
Original link: https://go.coder-hub.com/76405912.html