Question about the DGEMM code: does `_mm256_load_pd(C + i + j * n)` load C[i][j]?
Question
In the RISC-V edition of 'Computer Organization and Design' by Patterson and Hennessy, FIGURE 3.21 annotates `_mm256_load_pd(C + i + j * n)` as loading C[i][j], which looked odd to me at first glance. (The code is similar to the Berkeley `dgemm_unroll` code, which comes from an Intel article.)
Code from the book:
void dgemm_avx256(const uint32_t n, const double* A, const double* B, double* C)
{
    for( uint32_t i = 0; i < n; i += 4 )
    {
        for( uint32_t j = 0; j < n; j++ )
        {
            __m256d c0 = _mm256_load_pd(C + i + j * n); /* c0 = C[i][j] */
            for( uint32_t k = 0; k < n; k++ )
            {
                c0 = _mm256_add_pd(c0, /* c0 += A[i][k]*B[k][j] */
                                   _mm256_mul_pd( _mm256_load_pd(A + i + k * n), _mm256_broadcast_sd(B + k + j * n) ) );
            }
            _mm256_store_pd(C+i+j*n, c0); /* C[i][j] = c0 */
        }
    }
}
I then read Intel's online reference for `_mm256_load_pd`; it describes the parameter as a '256-bit aligned memory location'.
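For illustration, the alignment requirement can be checked without any intrinsics. The sketch below assumes the matrix is allocated with C11 `aligned_alloc` and that n is a multiple of 4; `is_aligned32` and `alloc_matrix32` are hypothetical helper names, not part of the book's code or Intel's API. Under those assumptions every address `C + i + j * n` with i stepping by 4 falls on a 32-byte boundary. The requirement constrains the address only, not how the indices are interpreted.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if p lies on a 32-byte (256-bit) boundary, 0 otherwise. */
static int is_aligned32(const void *p)
{
    return ((uintptr_t)p % 32) == 0;
}

/* Allocates an n x n matrix of doubles on a 32-byte boundary.
   Assumption for this sketch: n is a multiple of 4, so each column
   (n * 8 bytes) is a whole number of 32-byte blocks and the total size
   is a multiple of the alignment, as aligned_alloc expects. */
static double *alloc_matrix32(size_t n)
{
    assert(n % 4 == 0);
    return aligned_alloc(32, n * n * sizeof(double));
}
```

With a matrix obtained from `alloc_matrix32(n)`, the addresses used by the loop above (i a multiple of 4, any j) pass `is_aligned32`, while `C + 1` does not.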
Q: So shouldn't `C + i + j * n` be C[j][i] rather than C[i][j]? Am I misunderstanding something?
I also tested with gdb: the second time `_mm256_load_pd` runs, the address it loads from is that of C[1][0].
Below is an assembly snippet (built temporarily with -nopie), with comments generated by -fverbose-asm plus some I added myself:
0x00000000004035b0 <+48>: 44 89 fe mov esi,r15d ; _60, i
0x00000000004035b3 <+51>: 45 31 d2 xor r10d,r10d ;ivtmp.20 is `j*n`
...
0x00000000004035c5 <+69>: 48 8d 04 32 lea rax,[rdx+rsi*1] ;i+j*n
0x00000000004035c9 <+73>: 4d 8d 5c c5 00 lea r11,[r13+rax*8+0x0] ; *8 bytes
0x00000000004035ce <+78>: 49 8d 04 d4 lea rax,[r12+rdx*8]
0x00000000004035d2 <+82>: 4c 01 f2 add rdx,r14
=> 0x00000000004035d5 <+85>: c4 c1 7d 28 03 vmovapd ymm0,YMMWORD PTR [r11] ; r11 is C + i + j * n
...
0x0000000000403602 <+130>: 41 01 fa add r10d,edi ; j+=n
Answer 1
Score: 0
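Thanks for the comments above.
The code uses Fortran-style (column-major) indexing, so `C + i + j * n` is the address of element (i, j), even though the COD book describes the figure as an 'Optimized C version of DGEMM using C intrinsics to generate the AVX subword-parallel instructions for the x86.' Each `_mm256_load_pd(C + i + j * n)` therefore reads four consecutive rows, i through i+3, of column j.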
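As a minimal sketch of what 'Fortran-style' means here, the same offset can be read in two ways. The helper names `idx_col_major`, `idx_row_major`, and `dgemm_scalar` are hypothetical and used only for illustration; they are not from the book. With i = 0 and j = 1, the offset `i + j * n` equals n. That is element (0, 1) in column-major order, but it is also where C[1][0] would live if the buffer were viewed as a row-major C array, which matches the gdb observation in the question.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of element (row, col) in an n x n column-major (Fortran-order) matrix. */
static size_t idx_col_major(size_t row, size_t col, size_t n)
{
    return row + col * n;
}

/* Offset of element (row, col) in an n x n row-major (C 2D-array order) matrix. */
static size_t idx_row_major(size_t row, size_t col, size_t n)
{
    return row * n + col;
}

/* Scalar reference: C += A * B, with all three matrices stored column-major.
   One iteration of the i loop here corresponds to one of the four lanes
   that the AVX version processes at once. */
static void dgemm_scalar(size_t n, const double *A, const double *B, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double cij = C[idx_col_major(i, j, n)];   /* cij = C(i, j) */
            for (size_t k = 0; k < n; k++) {
                /* cij += A(i, k) * B(k, j) */
                cij += A[idx_col_major(i, k, n)] * B[idx_col_major(k, j, n)];
            }
            C[idx_col_major(i, j, n)] = cij;
        }
    }
}
```

Because consecutive rows of one column are adjacent in memory under this layout, a single 256-bit load can fetch C(i, j) through C(i+3, j), which is why the vectorized loop advances i by 4.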
Marking the Q&A as solved.