gcc c++ 协程运行 avx SIMD 代码,但导致 SIGSEGV。

huangapple go评论86阅读模式
英文:

gcc c++ coroutine runs avx SIMD code, but causes SIGSEGV

问题

#define AVX512 0
#define AVX2 1
#define SSE 0

HelloCoroutine hello(int& index, int id, int group_size) {
    unsigned res=0;
#if AVX512
// g++ simd.cpp -std=gnu++20 -fcoroutines -O2 -march=skylake -mavx512f
// segment fault
    for(auto i= index++; i< 20; i=index++)
    {
        std::cout <<"step 1" <<std::endl;
        __m512i v_offset = _mm512_set1_epi64(int64_t (i));
        std::cout <<"step 2" <<std::endl;
        __m512i v_size = _mm512_set1_epi64(int64_t(group_size));
        std::cout <<"step 3" <<std::endl;
        res = _mm512_cmpgt_epi64_mask(v_offset, v_size);
        cout <<i << " > " << group_size <<" ? " << (int)res<<endl;
        co_await std::suspend_always();
    }
#elif AVX2 
// g++ simd.cpp -std=gnu++20 -fcoroutines -O2 -march=skylake
// only specify `-O2 -march=skylake` and runs ok on local machine, otherwise segment fault (also on godbolt)
    for(auto i= index++; i< 20; i=index++)
    {
        std::cout <<"step 1" <<std::endl;
        __m256i v_offset = _mm256_set1_epi32(int32_t (i));
        std::cout <<"step 2" <<std::endl;
        __m256i v_size = _mm256_set1_epi32(int32_t(group_size));
        std::cout <<"step 3" <<std::endl;
        res = _mm256_movemask_epi8(_mm256_cmpgt_epi32(v_offset, v_size));
        cout <<i << " > " << group_size <<" ? " << (int)res<<endl;
        co_await std::suspend_always();
    }
#elif SSE
    for(auto i= index++; i< 20; i=index++)
    {
        __m128i v_offset = _mm_set1_epi32(int32_t (i));
        __m128i v_size = _mm_set1_epi32(int32_t(group_size));
        res = _mm_movemask_epi8(_mm_cmpgt_epi32(v_offset, v_size));
        cout <<i << " > " << group_size <<" ? " << res<<endl;
        co_await std::suspend_always();
    }    
#else
    for(auto i= index++; i< 20; i=index++)
    {
        res = i > group_size;
        cout <<i << " > " << group_size <<" ? " << res<<endl;
        co_await std::suspend_always();
    }
#endif
}
英文:

c++ coroutine runs avx SIMD code, but causes SIGSEGV for AVX2 and AVX512

#define AVX512 0
#define AVX2 1
#define SSE 0

HelloCoroutine hello(int& index, int id, int group_size) {
    unsigned res=0;
#if AVX512
// g++ simd.cpp -std=gnu++20 -fcoroutines -O2 -march=skylake -mavx512f
// segment fault
    for(auto i= index++; i< 20; i=index++)
    {
        std::cout <<"step 1" <<std::endl;
        __m512i v_offset = _mm512_set1_epi64(int64_t (i));
        std::cout <<"step 2" <<std::endl;
        __m512i v_size = _mm512_set1_epi64(int64_t(group_size));
        std::cout <<"step 3" <<std::endl;
        res = _mm512_cmpgt_epi64_mask(v_offset, v_size);
        cout <<i << " > " << group_size <<" ? " << (int)res<<endl;
        co_await std::suspend_always();
    }
#elif AVX2 
// g++ simd.cpp -std=gnu++20 -fcoroutines -O2 -march=skylake
// only specify `-O2 -march=skylake` and runs ok on local machine, otherwise segment fault (also on godbolt)
    for(auto i= index++; i< 20; i=index++)
    {
        std::cout <<"step 1" <<std::endl;
        __m256i v_offset = _mm256_set1_epi32(int32_t (i));
        std::cout <<"step 2" <<std::endl;
        __m256i v_size = _mm256_set1_epi32(int32_t(group_size));
        std::cout <<"step 3" <<std::endl;
        res = _mm256_movemask_epi8(_mm256_cmpgt_epi32(v_offset, v_size));
        cout <<i << " > " << group_size <<" ? " << (int)res<<endl;
        co_await std::suspend_always();
    }
#elif SSE
    for(auto i= index++; i< 20; i=index++)
    {
        __m128i v_offset = _mm_set1_epi32(int32_t (i));
        __m128i v_size = _mm_set1_epi32(int32_t(group_size));
        res = _mm_movemask_epi8(_mm_cmpgt_epi32(v_offset, v_size));
        cout <<i << " > " << group_size <<" ? " << res<<endl;
        co_await std::suspend_always();
    }    
#else
    for(auto i= index++; i< 20; i=index++)
    {
        res = i > group_size;
        cout <<i << " > " << group_size <<" ? " << res<<endl;
        co_await std::suspend_always();
    }
#endif
}

compile at https://godbolt.org/z/h3hej1ddq

-std=c++20 -fcoroutines -mbmi2 -mavx -mavx512f -mavx512pf -mavx512er -mavx512cd -mavx512vl

but result error for avx and avx512, only SSE works OK

Program returned: 139
Program terminated with signal: SIGSEGV
step 1

but it works on on clang-16 -std=gnu++20 -O2 -march=skylake -mavx512f https://godbolt.org/z/nMfbn8G9T

答案1

得分: 3

这似乎是GCC的一个错误,除非协程文档明确说明不支持具有alignof(T) > alignof(max_align_t)(例如__m256i__m512i)的局部变量。

您可以将此问题报告给https://gcc.gnu.org/bugzilla/(最好提供一个最小的AVX2测试用例)。

使用只需要AVX2而不是AVX-512的版本,我可以在我的桌面上进行测试,并查看它在需要32字节对齐的情况下是否出现“vmovdqa YMMWORD PTR [rbx+0x40],ymm0”的故障(存储“vpbroadcastd”的结果,初始化“__m256i v_offset = set1...”)(https://godbolt.org/z/8vfz3v5v1 修复了__m256i块,使用-std=gnu++20 -fcoroutines -O2 -march=skylake编译)。

我不知道为什么它在访问局部变量时使用RBX而不是RSP;我猜这是在函数的hello(hello(int&, int, int)::_Z5helloRiii.Frame*) [clone .actor]:版本中协程的工作方式。在协程版本中,GCC仍然只是使用and rsp, -32 / sub rsp, 192来对齐堆栈指针,但对于相对于RBX存储的东西没有帮助。

请注意,您的所有3个版本都需要AVX-512,只是具有不同的矢量宽度。像_mm_cmpgt_epi32_mask这样的比较-生成掩码操作总是需要AVX-512。

如果您想要在AVX2或SSE中使用整数掩码,您需要使用_mm_cmpgt_epi32_mm_movemask_epi8(每字节1位)或_mm_movemask_ps( _mm_castsi128_ps(cmp_result) )(每个int32 1位),或者使用_mm256的等效操作。

使用-march=native-march=skylake-avx512-march=znver4或其他类似选项。没有真正的CPU同时支持AVX512ER(Xeon Phi)和AVX512VL(其他所有CPU)。有关更多信息,请参阅https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512。

如果您的CPU不支持AVX-512,您将获得SIGILL(在所有3种情况下),而不是SIGSEGV。

英文:

This seems to be a GCC bug, unless coroutines are documented to not support local variables with alignof(T) > alignof(max_align_t) (Such as __m256i or __m512i).

You can report it (preferably with a minimal AVX2 test case) to https://gcc.gnu.org/bugzilla/

With a version that only requires AVX2 instead of AVX-512, I could test it on my desktop and see it faults on vmovdqa YMMWORD PTR [rbx+0x40],ymm0 which requires 32-byte alignment. (Storing the result of a vpbroadcastd, initializing __m256i v_offset = set1....) (https://godbolt.org/z/8vfz3v5v1 just fixes the __m256i block, compiles with -std=gnu++20 -fcoroutines -O2 -march=skylake)

IDK why it's using RBX to access locals instead of RSP; I guess that's how coroutines work in the hello(hello(int&, int, int)::_Z5helloRiii.Frame*) [clone .actor]: version of the function. In that coroutine version, GCC still just aligns the stack pointer with and rsp, -32 / sub rsp, 192, but that doesn't help for things stored relative to RBX.


Note that all 3 of your versions require AVX-512, just with different vector widths. Compare-into-mask like _mm_cmpgt_epi32_mask always requires AVX-512.

If you want an integer mask with AVX2 or SSE, you need _mm_cmpgt_epi32 and _mm_movemask_epi8 (1 bit per byte) or _mm_movemask_ps( _mm_castsi128_ps(cmp_result) ) (1 bit per int32), or the _mm256 equivalent.

Use -march=native or -march=skylake-avx512, -march=znver4, or whatever. No real CPUs have ever supported both AVX512ER (Xeon Phi) and AVX512VL (everything else). https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512

If your CPU didn't support AVX-512, you'd get SIGILL (on all 3), not SIGSEGV.

huangapple
  • 本文由 发表于 2023年7月20日 16:08:18
  • 转载请务必保留本文链接:https://go.coder-hub.com/76727854.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定