`_mm256_zeroall()` can't initialize register variables

Question


I want to zero all YMM registers like this:

#include <immintrin.h>

void fn(float *out) {
    register __m256 r0;
    _mm256_zeroall();
    _mm256_storeu_ps(out, r0);
}

But gcc/clang gives me a warning:

warning: 'r0' is used uninitialized [-Wuninitialized]

Using _mm256_setzero_ps() works, but both the source code and the generated assembly are ugly.
If I define 12 register variables, gcc is likely to generate 12 vmovaps instructions and clang is likely to generate 12 vxorps instructions. In the worst case, gcc generates a memset call followed by many vmovaps instructions.
I just want a single vzeroall instruction.

Is there any way to let the compiler know that _mm256_zeroall() zeroes all registers, without handwriting asm?

Edit 1:
In fact, I'm writing a matrix product program, which needs to clear many registers at the beginning. To simplify the question, I used the simplest possible code.

I've confirmed that on Zen 3 vzeroall is not slower than many vmovaps/vxorps instructions, and vzeroall has a smaller code size, which is more cache friendly.

Removing the register qualifier doesn't help on GCC or Clang; it generates the same assembly as before.

I've found that on GCC I can specify the register name to eliminate the warning, like this:

register __m256 r0 asm("ymm0");

But clang doesn't honor this declaration and still emits the same warning.

Answer 1

Score: 2


The answer is that, although the instruction is named vzeroall, it only zeroes the first 16 vector registers (ymm0 through ymm15) and leaves the others (ymm16 through ymm31, which exist only with AVX-512) unchanged. As a result, the register allocator may choose an upper register for your store, which leads to wrong behaviour.
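A related point is that _mm256_zeroall() returns void and is not tied to any C variable, so the compiler has no reason to treat r0 as initialized. Below is a minimal sketch of a well-defined version of the function from the question. It assumes GCC or Clang on x86-64; the name store_zero and the target("avx") attribute (used so that no -mavx flag is required) are illustrative additions, not part of the original post.

```c
#include <immintrin.h>

/* Assumption: GCC or Clang on x86-64. The target attribute enables AVX for
   this one function so that the snippet builds without -mavx. */
__attribute__((target("avx")))
void store_zero(float *out) {
    /* Give the variable a defined value; the compiler picks the register. */
    __m256 r0 = _mm256_setzero_ps();
    /* Writes eight floats, all 0.0f, to a possibly unaligned address. */
    _mm256_storeu_ps(out, r0);
}
```

With this form the warning should disappear, because r0 now has a value the compiler can track, whichever physical register ends up holding it.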

Some more discussion:

Firstly, you are not actually programming in assembly; you are programming in C++ (albeit with x86 intrinsics). If you need a variable multiple times, just use it multiple times, and the compiler will spill it if necessary. Conversely, even if you write multiple _mm256_setzero_ps() calls, the compiler will merge them into a single value.
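As a rough illustration of using one value several times, the sketch below initializes four accumulators from a single zero variable and leaves register assignment to the compiler. The function sum_f32, its requirement that n be a multiple of 32, and the target("avx") attribute are assumptions made for the example; it targets GCC or Clang on x86-64.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper: sums n floats, where n must be a multiple of 32. */
__attribute__((target("avx")))
float sum_f32(const float *x, size_t n) {
    /* One zero value, reused for every accumulator. */
    const __m256 zero = _mm256_setzero_ps();
    __m256 acc0 = zero, acc1 = zero, acc2 = zero, acc3 = zero;

    for (size_t i = 0; i < n; i += 32) {
        acc0 = _mm256_add_ps(acc0, _mm256_loadu_ps(x + i));
        acc1 = _mm256_add_ps(acc1, _mm256_loadu_ps(x + i + 8));
        acc2 = _mm256_add_ps(acc2, _mm256_loadu_ps(x + i + 16));
        acc3 = _mm256_add_ps(acc3, _mm256_loadu_ps(x + i + 24));
    }

    /* Combine the four accumulators, then add up the eight lanes. */
    __m256 acc = _mm256_add_ps(_mm256_add_ps(acc0, acc1),
                               _mm256_add_ps(acc2, acc3));
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float total = 0.0f;
    for (int k = 0; k < 8; ++k) total += lanes[k];
    return total;
}
```

How the four accumulators get zeroed in the generated code (separate vxorps instructions, register copies, or something else) is up to the compiler and may differ between versions.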

Secondly, why do you need multiple zero registers? I believe most AVX instructions are non-destructive, except merge-masking instructions, and a merge-masking operation onto zero is equivalent to a zero-masking operation. Since you said the zeros are for multiple accumulators, and I see that the compilers do not perform loop peeling, you can peel the first iteration manually, which removes the excess zero-register initialisations (example).
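The manual peeling described above might look like the following sketch, which is not the example the answer originally linked. The accumulators start as the products of the first iteration, so no zeroed register is needed at all. The function dot_peeled, its requirement that n be a nonzero multiple of 16, and the target("avx") attribute are assumptions for illustration; it uses separate multiply and add intrinsics so that only AVX is required.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper: dot product of a and b.
   Assumption: n is a nonzero multiple of 16. */
__attribute__((target("avx")))
float dot_peeled(const float *a, const float *b, size_t n) {
    /* Peeled first iteration: initialize the accumulators with the first
       products instead of zero. */
    __m256 acc0 = _mm256_mul_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b));
    __m256 acc1 = _mm256_mul_ps(_mm256_loadu_ps(a + 8), _mm256_loadu_ps(b + 8));

    /* Remaining iterations accumulate as usual. */
    for (size_t i = 16; i < n; i += 16) {
        acc0 = _mm256_add_ps(acc0, _mm256_mul_ps(_mm256_loadu_ps(a + i),
                                                 _mm256_loadu_ps(b + i)));
        acc1 = _mm256_add_ps(acc1, _mm256_mul_ps(_mm256_loadu_ps(a + i + 8),
                                                 _mm256_loadu_ps(b + i + 8)));
    }

    /* Combine the accumulators, then add up the eight lanes. */
    float lanes[8];
    _mm256_storeu_ps(lanes, _mm256_add_ps(acc0, acc1));
    float total = 0.0f;
    for (int k = 0; k < 8; ++k) total += lanes[k];
    return total;
}
```

The trade-off is that the peeled version needs at least one full iteration of input, so a caller would have to handle shorter inputs separately.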

huangapple
  • Posted on March 7, 2023, 10:58:17
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75657654.html