Unable to return multiple SIMD vectors using vectorcall

Question

I am currently working on a program that processes large amounts of data in a tight loop. Blocks of data are loaded into YMM registers, from which 64-bit chunks are extracted to be actually worked on.
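
Concretely, the per-block pattern is roughly this (a minimal sketch; the function name and the actual processing are hypothetical):

```cpp
#include <immintrin.h>
#include <cstdint>

// Hypothetical sketch of the load/extract pattern described above.
void process_block(const std::uint64_t* p) {
    // Load a 256-bit block into a YMM register.
    __m256i block = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(p));

    // Pull out the four 64-bit chunks to actually work on.
    std::uint64_t c0 = _mm256_extract_epi64(block, 0);
    std::uint64_t c1 = _mm256_extract_epi64(block, 1);
    std::uint64_t c2 = _mm256_extract_epi64(block, 2);
    std::uint64_t c3 = _mm256_extract_epi64(block, 3);
    // ... real processing of c0..c3 would go here ...
    (void)c0; (void)c1; (void)c2; (void)c3;
}
```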

This loop is one of several that the program switches between, depending on the exact content of the data being processed. As such, each loop must be occasionally interrupted (sometimes frequently) in order to perform said switching. To make the whole system a bit easier to work on, each loop is contained within its own function.
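
The overall shape is something like the following sketch (hypothetical names; each inner loop runs until the data demands a switch, then returns which loop should run next):

```cpp
// Hypothetical outer dispatch loop; the real loops also carry vector state.
enum class Mode { A, B, Done };

Mode loop_a();  // tight loop; returns the next mode when interrupted
Mode loop_b();

void run() {
    Mode m = Mode::A;
    while (m != Mode::Done) {
        switch (m) {
            case Mode::A: m = loop_a(); break;
            case Mode::B: m = loop_b(); break;
            case Mode::Done: break;
        }
    }
}
```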

A fairly major annoyance I've run into (not for the first time), is that it is fairly difficult to preserve the 256-bit and 64-bit chunks across the function calls. Each loop processes the same data, so it doesn't make sense to discard these registers when one breaks, only to immediately load the exact same data back in. This doesn't really cause any major performance problems, but it is measurable, and just seems overall silly.

I've tried about a million different things, with not a single one giving me a proper solution. Of course, I could simply store the chunks within the outer switching loop, and pass them to the inner loops as references, but a quick check of the generated assembly shows that both GCC and Clang revert to pointers no matter what I try, defeating the entire point of the optimization.
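
That attempt looks something like this sketch (hypothetical names): the state struct decays to a pointer, so every inner loop starts by reloading the same vectors from memory.

```cpp
#include <immintrin.h>

struct State { __m256i data[4]; };

enum class Mode { A, B, Done };
Mode loop_a(State& s);  // s is passed as a pointer; the vectors round-trip
Mode loop_b(State& s);  // through memory at every interruption

void run(State s) {
    Mode m = Mode::A;
    while (m != Mode::Done)
        m = (m == Mode::A) ? loop_a(s) : loop_b(s);
}
```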

I could also just mark each loop as always_inline, turn on LTO, and call it a day, but I plan on adding a hand-written assembly version of one of the loops, and I don't want to be forced to write it inline. Really what I'd like is for the function's declaration to simply signal to callers that the vectors (and associated information) will be passed out of the function as return values, in proper registers, allowing me to reduce the overhead (without inlining) to at most a few register/register movs.
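
That fallback would look roughly like this (sketch, assuming GCC/Clang attribute syntax):

```cpp
#include <immintrin.h>

struct State { __m256i data[4]; };

// Forced inlining keeps the vectors in registers across the "call",
// but rules out swapping in a hand-written asm version of the loop.
__attribute__((always_inline)) inline int loop_a(State& s) {
    s.data[0] = _mm256_add_epi64(s.data[0], s.data[1]);  // placeholder work
    return 0;
}
```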

The closest thing I've found is the vectorcall calling convention, supported by MSVC, and at least partially by Clang and GCC.

For reference, I am currently using GCC, but would be willing to switch to Clang if it has a solution to this. If MSVC is the only compiler capable, I'll just go with the inlining option.

I created this simple example:

```cpp
#include <immintrin.h>

struct HVA4 {
    __m256i data[4];
};

HVA4 __vectorcall example(HVA4 x) {
    x.data[0] = _mm256_permute4x64_epi64(x.data[0], 0b11001001);
    x.data[2] = _mm256_permute4x64_epi64(x.data[2], 0b00111001);

    return x;
}
```

which compiles to

```asm
vpermq  ymm0, ymm0, 201
vpermq  ymm2, ymm2, 57
ret
```

under MSVC 19.35 using /O2 /GS- /arch:avx2.

This is actually exactly what I want: my vector parameters are passed in proper SIMD registers, and are returned as such. The registers used even line up! From reading the MSDN docs, it sounds like I should be able to extend this to non-homogeneous aggregates as well, though even if not, I can make this work.
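
For instance, something like this untested sketch (names hypothetical, MSVC syntax) is what I'd hope to extend it to:

```cpp
#include <immintrin.h>
#include <cstdint>

// Hypothetical non-homogeneous aggregate: three vectors plus the
// associated scalar information mentioned above.
struct LoopState {
    __m256i data[3];
    std::uint64_t pos;
};

// Declaration only; whether MSVC passes/returns this fully in registers
// is what the MSDN docs seem to suggest, but I haven't verified it.
LoopState __vectorcall step(LoopState s);
```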

Clang is another story, however. On 16.0.0 using -O3 -mavx2 it generates this absolute mess:

```asm
mov     rax, rcx
vpermpd ymm0, ymmword ptr [rdx], 201
vmovaps ymmword ptr [rdx], ymm0
vpermpd ymm0, ymmword ptr [rdx + 64], 57
vmovaps ymmword ptr [rdx + 64], ymm0
vmovaps ymm0, ymmword ptr [rdx + 32]
vmovaps ymm1, ymmword ptr [rdx + 96]
vmovaps ymmword ptr [rcx + 96], ymm1
vmovaps ymmword ptr [rcx + 32], ymm0
vmovaps ymm0, ymmword ptr [rdx + 64]
vmovaps ymmword ptr [rcx + 64], ymm0
vmovaps ymm0, ymmword ptr [rdx]
vmovaps ymmword ptr [rcx], ymm0
vzeroupper
ret
```

I'd show GCC's attempt, but it would probably double the size of this question.

The general idea with GCC is the same, however; both GCC and Clang completely refuse to use multiple registers for SIMD return values, and only sometimes do so for parameters (they fare a lot better if the vectors are removed from the struct). While this may be expected behavior for standard calling conventions (I suspect they're actually following the SysV ABI, at least for return value placement), vectorcall explicitly allows for it.

Of course, vectorcall is a non-standard attribute; just because two compilers have the same name for something doesn't mean they do the same thing, etc. But at least Clang specifically links to the MSDN docs, so I'd expect it to follow them.

Is this simply a bug in clang? Just an unimplemented feature? (Again, it does link to the MSDN docs)

Furthermore, is there any way to achieve the optimizations given by MSVC in code like the example above, in either GCC or Clang, be it via a calling convention or some compiler-specific flag? I'd be happy to try writing a custom convention into the compiler, but that's pretty heavily out of scope for this project.

Answer 1

Score: 2

"All the YMM registers are call-clobbered," so non-inline functions pose challenges for storing significant data in registers. The Windows x64 convention preserves xmm6..15 but clobbers the wider YMM registers. Many integer registers are also call-clobbered, especially in the x86-64 System V calling convention (non-Windows).

如果您的程序的重要状态仅包括这4个向量和一些整数寄存器,那么是的,MSVC的x64 vectorcall 可以将向量传递给非内联函数,并将它们全部作为返回值返回。

否则,其他状态将不得不在调用时被溢出/重新加载,因此手写汇编的唯一好选择是GNU C内联汇编。

x86-64 SysV将1个向量返回到x/y/zmm0

x86-64 System V calling convention最多可以返回2个矢量寄存器(xmm/ymm/zmm),就像整数参数可以在最多6个寄存器中传递,但只能在RDX:RAX中返回。

但是,仅在返回标量浮点或双精度值的聚合时才使用XMM1(其总大小不超过16字节,因此返回值位于XMM0和XMM1的低8字节中)。ABI文档的分类规则5(c)- 如果聚合的大小超过两个八字节,并且第一个八字节不是SSE或任何其他八字节不是SSEUP,那么整个参数都会传递到内存中。 - 结构体中的第二个__m128i矢量将有第二个SSE类的八字节。这就是为什么这样的结构体返回内存而不是XMM0,XMM1的原因。规则5c允许将单个大于16字节的向量返回到YMM0或ZMM0(其中所有后续的八字节都是SSEUP),而不允许其他情况。

测试确认了这一点。对于struct { __m256i v[2]; },GCC/clang将其返回到内存中,而不是YMM0/YMM1,参见下面的Godbolt链接。但对于struct { float v[3]; },我们看到v[4]作为XMM1的元素1返回(低64位的顶半部分=八字节的一部分):Godbolt

因此,AMD64 System V ABI的调用约定不适用于您的用例,即使它可以在向量寄存器中返回2个向量。

GCC或clang中的vectorcall:与MSVC不同,只返回1个矢量寄存器

您可以使用__attribute__((ms_abi))(gcc或clang)或__attribute__((vectorcall))(仅clang)为汇编函数声明原型,但实际上似乎不像您描述的MSVC工作方式那样工作:多于一个__m256i的结构体仍然通过隐藏指针以内存返回,即使使用了vectorcall。(Godbolt)

GCC错误报告(89485)中的Agner Fog的评论说,针对Windows的clang支持__vectorcall,但该错误报告仅请求GCC对其进行支持,而不讨论它是否支持多个矢量寄存器的结构返回。也许clang的__vectorcall实现与MSVC的不兼容,无法返回多个矢量寄存器的结构?

我没有可用于测试的Windows clang,或者针对与MSVC更兼容的clang-cl。

asm("call foo" : "+x"(v0), ...); 包装以不破坏其他寄存器

正如您在评论中建议的那样,您可以发明自己的调用约定,并通过内联汇编将其描述给编译器。只要它是纯函数,甚至可以避免`"

英文:

All the YMM registers are call-clobbered, so a non-inline function is kind of a showstopper for keeping any significant amount of data in registers. (The Windows x64 convention has call-preserved xmm6..15, but the wider YMM registers are still clobbered.) Quite a few integer registers are also call-clobbered, especially in the x86-64 System V calling convention (non-Windows).

If your program's valuable state is only those 4 vectors and a few integer registers, then yes, MSVC's x64 vectorcall can pass the vectors to non-inline functions and have them all returned as return values.

Otherwise, other state will have to get spilled/reloaded around the call, so the only good option for hand-written asm is GNU C inline asm.


x86-64 SysV returns 1 vector in x/y/zmm0

The x86-64 System V calling convention can return in at most 2 vector registers (xmm/ymm/zmm), like how integer args can be passed in up to 6 regs but only return in RDX:RAX.
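
For comparison, a 16-byte integer aggregate does come back in two registers (minimal sketch):

```cpp
// x86-64 SysV returns this 16-byte aggregate in RAX (a) and RDX (b);
// it compiles to a couple of register moves, no memory traffic.
struct Pair { long long a, b; };

Pair swap_pair(Pair p) { return {p.b, p.a}; }
```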

But XMM1 is only used when returning an aggregate of scalar float or double (with a total size not exceeding 16 bytes, so the return value is in the low eightbyte of each of XMM0 and XMM1). The ABI doc's classification rule 5(c), "If the size of the aggregate exceeds two eightbytes and the first eightbyte isn't SSE or any other eightbyte isn't SSEUP, the whole argument is passed in memory", applies here: a second __m128i vector in a struct will have a second SSE-classed eightbyte. That's why such a struct returns in memory, rather than XMM0, XMM1. Rule 5c allows returning in YMM0 or ZMM0 for a single vector wider than 16 bytes (where all the later eightbytes are SSEUP), not other cases.

Testing confirms this. With struct { __m256i v[2]; }, GCC/clang return it in memory, not YMM0 / YMM1; see the Godbolt link below. But with struct { float v[4]; } we see v[3] being returned in element 1 of XMM1 (the top half of the low 64 bits = an eightbyte): Godbolt
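
The test amounts to something like this (sketch of the Godbolt experiment described above; function names are placeholders):

```cpp
#include <immintrin.h>

struct TwoVecs { __m256i v[2]; };  // the fifth eightbyte (start of v[1]) is SSE, not SSEUP
struct Floats  { float v[4]; };    // two SSE eightbytes, 16 bytes total

TwoVecs ret_vecs(TwoVecs x)  { return x; }  // returned in memory via hidden pointer
Floats  ret_floats(Floats x) { return x; }  // v[0..1] in XMM0, v[2..3] in XMM1
```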

So the AMD64 System V ABI's calling convention is not suited for your use case, even if it could return 2 vectors in vector regs.


vectorcall in GCC or clang: Different from MSVC, only 1 vector reg

You could declare a prototype for your asm function with __attribute__((ms_abi)) (gcc or clang) or __attribute__((vectorcall)) (clang only), but that doesn't actually seem to work the way you describe MSVC working: a struct of more than one __m256i gets returned in memory, by hidden pointer, even with vectorcall. (Godbolt)
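
Such declarations look like this (sketch; the asm_loop_* names are placeholders):

```cpp
#include <immintrin.h>

struct HVA4 { __m256i data[4]; };

// Both of these still return HVA4 via hidden pointer on GCC/clang:
__attribute__((ms_abi)) HVA4 asm_loop_ms(HVA4 x);       // gcc or clang

#ifdef __clang__
__attribute__((vectorcall)) HVA4 asm_loop_vc(HVA4 x);   // clang only
#endif
```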

A comment from Agner Fog on a GCC bug report (89485) says that clang targeting Windows does support __vectorcall, but that bug was just requesting GCC support for it at all, not discussing whether it returned multiple vectors in registers. Perhaps clang's implementation of __vectorcall isn't ABI-compatible with MSVC's for struct returns of multiple vectors?

I don't have Windows clang available for testing, or clang-cl which aims for more compat with MSVC.


asm("call foo" : "+x"(v0), ...); wrapper to also not clobber other regs

As you suggested in comments, you could invent your own calling convention and describe it to the compiler via inline asm. As long as it's a pure function, you can even avoid a "memory" clobber.

You do need to stop the compiler from using the red zone in the caller because call pushes a return address. See https://stackoverflow.com/q/6380992
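
For a one-off call without -mno-red-zone, the asm template itself could step over the red zone first (sketch, assuming an asm_foo like the example further down):

```cpp
#include <immintrin.h>

// Sketch: skip the 128-byte red zone around the call so the pushed
// return address can't clobber the caller's locals below RSP.
__m256i call_asm_foo(__m256i v0) {
    asm("add $-128, %%rsp \n\t"
        "call asm_foo     \n\t"
        "sub $-128, %%rsp"
        : "+x"(v0));
    return v0;
}
```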

The compiler won't know it's a function call at all; the fact that your inline asm template happens to push/pop something on the stack is the important part, not that it jumps somewhere else before execution comes out the other side. The compiler doesn't parse the asm template string except to substitute %operands, like printf. It doesn't care if you reference an operand explicitly or not.

So you still have all the benefits and all the downsides of inline asm (https://gcc.gnu.org/wiki/DontUseInlineAsm), including having to precisely describe the outputs : inputs : clobbers to the compiler for the block of code you're running, like how you'd document in comments for hand-written asm helper functions.

Plus the overhead of a call and ret vs. writing your asm inside the asm statement itself. This seems very bad for something as cheap as two vpermq instructions. You could perhaps use asm(".include 'helper.s'" : "+x"(v0), ...); if you can split up your helpers one per file. (Or perhaps .set something that a .if can check for so you can ask for one block out of a file with multiple blocks? But that's probably harder to maintain.)
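
That might look like this (untested sketch; helper.s is the hypothetical per-helper file):

```cpp
#include <immintrin.h>

__m256i use_helper(__m256i v0, __m256i v1) {
    // The assembler pastes helper.s into this statement at build time:
    // no call/ret overhead, and the "+x" operands still describe the effects.
    asm(".include \"helper.s\"" : "+x"(v0), "+x"(v1));
    return _mm256_add_epi64(v0, v1);
}
```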

If you were using any "m" operands that might pick an addressing mode relative to RSP, that could also break as call pushes a return address. But you won't be in this case; you'll be forcing the compiler to pick specific registers for the operands instead of even giving it the choice of which YMM register to pick.

So it could perhaps look something like

```cpp
#include <immintrin.h>

auto bar(__m256i v0_in, __m256i v1_in, __m256i v2_in, __m256i v3_in){
    // clang does pass args in the right regs for vectorcall
    // (after taking into account that the first arg-reg slot is taken by the
    //  hidden pointer, because of disagreement about aggregate returns)
    register __m256i v0 asm("ymm0") = v0_in;  // force "x" constraints to pick a certain register for asm statements.
    register __m256i v1 asm("ymm1") = v1_in;
    register __m256i v2 asm("ymm2") = v2_in;
    register __m256i v3 asm("ymm3") = v3_in;

    v1 = _mm256_add_epi64(v1, v3);  // do something with the incoming args, just for example
    __m256i vlocal = _mm256_add_epi64(v0, v2);  // compiler can allocate this anywhere

    // declare some integer register clobbers if your function needs any;
    // the fewer the better, since the compiler can keep its own stuff in those regs otherwise
    asm("call asm_foo" : "+x"(v0), "+x"(v1), "+x"(v2), "+x"(v3) : : "rax", "rcx", "rdx");
    // if you don't compile with -mno-red-zone, then "add $-128, %%rsp ; call ; sub $-128, %%rsp".
    // But you don't want that around each call inside a loop, so just use -mno-red-zone
    return _mm256_add_epi64(vlocal, v2);
}
```

Godbolt gcc and clang compile this to:

```asm
# clang16 -O3 -march=skylake -mno-red-zone

bar(long long __vector(4), long long __vector(4), long long __vector(4), long long __vector(4)):
        vpaddq  ymm1, ymm3, ymm1
        vpaddq  ymm4, ymm2, ymm0      # compiler happened to pick ymm4 for vlocal, a reg not clobbered by the asm statement.
# inline asm starts here
        call    asm_foo
# inline asm ends here
  # if we just return v2, we get  vmovaps ymm0, ymm2
        vpaddq  ymm0, ymm4, ymm2     # use ymm4, which was *not* clobbered by the inline asm statement,
                                     # along with the v2 = ymm2 output of the asm
        ret
```

vs. GCC being bad as usual at dealing with hard-register constraints on its register allocation:

```asm
# gcc13 -O3 -march=skylake -mno-red-zone

bar(long long __vector(4), long long __vector(4), long long __vector(4), long long __vector(4)):
        vmovdqa ymm5, ymm2      # useless copies, silly compiler.
        vmovdqa ymm4, ymm0
        vpaddq  ymm1, ymm1, ymm3
        vpaddq  ymm4, ymm4, ymm5
        call    asm_foo
        vpaddq  ymm0, ymm4, ymm2
        ret
```

Whatever you were going to do in the asm_foo function, you could just as well have done it inside the asm template. And then you could use %0 instead of %%ymm0 to give the compiler a choice of registers. I lined up the variables with the incoming args to make it easy for the compilers.
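
For this example, that version might look like the following (sketch in AT&T syntax; $0xC9 and $0x39 are the 201 and 57 immediates from above):

```cpp
#include <immintrin.h>

struct HVA4 { __m256i data[4]; };

HVA4 example(HVA4 x) {
    // The same two shuffles, written inside the template; "%0" / "%1"
    // let the compiler choose which YMM registers to use.
    asm("vpermq $0xC9, %0, %0 \n\t"
        "vpermq $0x39, %1, %1"
        : "+x"(x.data[0]), "+x"(x.data[2]));
    return x;
}
```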

asm_foo is the function that has the special calling convention. bar() is just a normal function whose callers will assume it clobbers all the vector regs and half the integer regs, and can only return one vector by value.
