`_mm256_zeroall()` can't initialize register variables
Question
I want to zero all YMM registers like this:
#include <immintrin.h>
void fn(float *out) {
    register __m256 r0;
    _mm256_zeroall();
    _mm256_storeu_ps(out, r0);
}
But gcc/clang gives me a warning:
warning: 'r0' is used uninitialized [-Wuninitialized]
Using _mm256_setzero_ps() works, but both the source code and the generated assembly are ugly.
If I have 12 register variables defined, GCC is likely to generate 12 vmovaps instructions and Clang is likely to generate 12 vxorps instructions. In the worst case, GCC generates a memset call plus many vmovaps instructions.
I just want a single vzeroall instruction.
Is there any way to let the compiler know that _mm256_zeroall() zeroes all registers, without handwriting asm?
Edit 1:
In fact, I'm writing a matrix product program that needs to clear many registers at the beginning. To simplify the question, I used the simplest possible code.
I've confirmed that vzeroall is not slow compared to many vmovaps/vxorps instructions on Zen 3, and vzeroall has a smaller code size, which is more cache friendly.
Removing the register qualifier doesn't help on GCC/Clang. It generates the same assembly as before.
I've found that on GCC I can specify the register name to eliminate the warning, like this:
register __m256 r0 asm("ymm0");
But Clang doesn't honor the declaration and still emits the same warning.
Answer 1
Score: 2
The answer is that, despite its name, vzeroall only zeroes the first 16 vector registers (ymm0 to ymm15) and leaves the others (ymm16 to ymm31, present with AVX-512) unchanged. As a result, the register allocator may choose an upper register for your store, which leads to wrong behaviour.
Some more discussion:
Firstly, you are not actually programming in assembly; you are programming in C++ (albeit with x86 intrinsics). If you need a variable multiple times, you just use it multiple times, and the compiler will decide whether it needs to spill. Conversely, even if you write multiple _mm256_setzero_ps() calls, the compiler will fold them into a single variable.
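As a rough illustration of leaving register allocation to the compiler, here is a minimal sketch with two accumulators, each initialised with _mm256_setzero_ps(). The function name sum_two_acc and the requirement that n be a multiple of 16 are assumptions made for this example, not part of the question. The target attribute is only there so the sketch builds without -mavx on GCC/Clang.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper: sums n floats using two independent accumulators.
   Assumes n is a multiple of 16. */
__attribute__((target("avx")))
float sum_two_acc(const float *x, size_t n) {
    /* Plain variables: the compiler picks the registers and decides
       how (and whether) to materialise the zeros. */
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();

    for (size_t i = 0; i < n; i += 16) {
        acc0 = _mm256_add_ps(acc0, _mm256_loadu_ps(x + i));
        acc1 = _mm256_add_ps(acc1, _mm256_loadu_ps(x + i + 8));
    }

    /* Horizontal reduction of the combined accumulator. */
    float lanes[8];
    _mm256_storeu_ps(lanes, _mm256_add_ps(acc0, acc1));
    float total = 0.0f;
    for (int k = 0; k < 8; ++k) {
        total += lanes[k];
    }
    return total;
}
```

No register keyword or explicit register name is involved; which ymm registers hold acc0 and acc1 is up to the compiler.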
Secondly, why do you need multiple zero registers? I believe most AVX instructions are non-destructive, except merge-masking instructions, and a merge-masking operation onto zero is equivalent to a zero-masking operation. As you said, it is for multiple accumulators. Since I see that the compilers do not perform loop peeling here, you can peel the first iteration manually instead, which removes the excessive initialisation of zero registers (example).
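Manual peeling of the first iteration might look like the sketch below. It is not the code from the answer's linked example; the kernel name dot_peeled, the two-accumulator layout and the requirement that n be a nonzero multiple of 16 are assumptions for illustration. It uses a separate multiply and add so that only AVX is required.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical kernel: dot product of a and b.
   Assumes n is a nonzero multiple of 16. */
__attribute__((target("avx")))
float dot_peeled(const float *a, const float *b, size_t n) {
    /* Peeled first iteration: the accumulators start as the first
       products, so no zeroed registers are needed. */
    __m256 acc0 = _mm256_mul_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b));
    __m256 acc1 = _mm256_mul_ps(_mm256_loadu_ps(a + 8), _mm256_loadu_ps(b + 8));

    /* Remaining iterations accumulate as usual. */
    for (size_t i = 16; i < n; i += 16) {
        __m256 p0 = _mm256_mul_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i));
        __m256 p1 = _mm256_mul_ps(_mm256_loadu_ps(a + i + 8), _mm256_loadu_ps(b + i + 8));
        acc0 = _mm256_add_ps(acc0, p0);
        acc1 = _mm256_add_ps(acc1, p1);
    }

    /* Horizontal reduction of the combined accumulator. */
    float lanes[8];
    _mm256_storeu_ps(lanes, _mm256_add_ps(acc0, acc1));
    float total = 0.0f;
    for (int k = 0; k < 8; ++k) {
        total += lanes[k];
    }
    return total;
}
```

The trade-off is that the peeled version needs at least one full iteration of input, so an empty input has to be handled separately by the caller.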