英文:
gcc c++ coroutine runs avx SIMD code, but causes SIGSEGV
问题
#define AVX512 0
#define AVX2 1
#define SSE 0
HelloCoroutine hello(int& index, int id, int group_size) {
unsigned res=0;
#if AVX512
// g++ simd.cpp -std=gnu++20 -fcoroutines -O2 -march=skylake -mavx512f
// segment fault
for(auto i= index++; i< 20; i=index++)
{
std::cout <<"step 1" <<std::endl;
__m512i v_offset = _mm512_set1_epi64(int64_t (i));
std::cout <<"step 2" <<std::endl;
__m512i v_size = _mm512_set1_epi64(int64_t(group_size));
std::cout <<"step 3" <<std::endl;
res = _mm512_cmpgt_epi64_mask(v_offset, v_size);
cout <<i << " > " << group_size <<" ? " << (int)res<<endl;
co_await std::suspend_always();
}
#elif AVX2
// g++ simd.cpp -std=gnu++20 -fcoroutines -O2 -march=skylake
// only specify `-O2 -march=skylake` and runs ok on local machine, otherwise segment fault (also on godbolt)
for(auto i= index++; i< 20; i=index++)
{
std::cout <<"step 1" <<std::endl;
__m256i v_offset = _mm256_set1_epi32(int32_t (i));
std::cout <<"step 2" <<std::endl;
__m256i v_size = _mm256_set1_epi32(int32_t(group_size));
std::cout <<"step 3" <<std::endl;
res = _mm256_movemask_epi8(_mm256_cmpgt_epi32(v_offset, v_size));
cout <<i << " > " << group_size <<" ? " << (int)res<<endl;
co_await std::suspend_always();
}
#elif SSE
for(auto i= index++; i< 20; i=index++)
{
__m128i v_offset = _mm_set1_epi32(int32_t (i));
__m128i v_size = _mm_set1_epi32(int32_t(group_size));
res = _mm_movemask_epi8(_mm_cmpgt_epi32(v_offset, v_size));
cout <<i << " > " << group_size <<" ? " << res<<endl;
co_await std::suspend_always();
}
#else
for(auto i= index++; i< 20; i=index++)
{
res = i > group_size;
cout <<i << " > " << group_size <<" ? " << res<<endl;
co_await std::suspend_always();
}
#endif
}
英文:
c++ coroutine runs avx SIMD code, but causes SIGSEGV for AVX2 and AVX512
#define AVX512 0
#define AVX2 1
#define SSE 0
HelloCoroutine hello(int& index, int id, int group_size) {
unsigned res=0;
#if AVX512
// g++ simd.cpp -std=gnu++20 -fcoroutines -O2 -march=skylake -mavx512f
// segment fault
for(auto i= index++; i< 20; i=index++)
{
std::cout <<"step 1" <<std::endl;
__m512i v_offset = _mm512_set1_epi64(int64_t (i));
std::cout <<"step 2" <<std::endl;
__m512i v_size = _mm512_set1_epi64(int64_t(group_size));
std::cout <<"step 3" <<std::endl;
res = _mm512_cmpgt_epi64_mask(v_offset, v_size);
cout <<i << " > " << group_size <<" ? " << (int)res<<endl;
co_await std::suspend_always();
}
#elif AVX2
// g++ simd.cpp -std=gnu++20 -fcoroutines -O2 -march=skylake
// only specify `-O2 -march=skylake` and runs ok on local machine, otherwise segment fault (also on godbolt)
for(auto i= index++; i< 20; i=index++)
{
std::cout <<"step 1" <<std::endl;
__m256i v_offset = _mm256_set1_epi32(int32_t (i));
std::cout <<"step 2" <<std::endl;
__m256i v_size = _mm256_set1_epi32(int32_t(group_size));
std::cout <<"step 3" <<std::endl;
res = _mm256_movemask_epi8(_mm256_cmpgt_epi32(v_offset, v_size));
cout <<i << " > " << group_size <<" ? " << (int)res<<endl;
co_await std::suspend_always();
}
#elif SSE
for(auto i= index++; i< 20; i=index++)
{
__m128i v_offset = _mm_set1_epi32(int32_t (i));
__m128i v_size = _mm_set1_epi32(int32_t(group_size));
res = _mm_movemask_epi8(_mm_cmpgt_epi32(v_offset, v_size));
cout <<i << " > " << group_size <<" ? " << res<<endl;
co_await std::suspend_always();
}
#else
for(auto i= index++; i< 20; i=index++)
{
res = i > group_size;
cout <<i << " > " << group_size <<" ? " << res<<endl;
co_await std::suspend_always();
}
#endif
}
compile at https://godbolt.org/z/h3hej1ddq
-std=c++20 -fcoroutines -mbmi2 -mavx -mavx512f -mavx512pf -mavx512er -mavx512cd -mavx512vl
but result error for avx and avx512, only SSE works OK
Program returned: 139
Program terminated with signal: SIGSEGV
step 1
but it works on on clang-16 -std=gnu++20 -O2 -march=skylake -mavx512f
https://godbolt.org/z/nMfbn8G9T
答案1
得分: 3
这似乎是GCC的一个错误,除非协程文档明确说明不支持具有alignof(T) > alignof(max_align_t)
(例如__m256i
或__m512i
)的局部变量。
您可以将此问题报告给https://gcc.gnu.org/bugzilla/(最好提供一个最小的AVX2测试用例)。
使用只需要AVX2而不是AVX-512的版本,我可以在我的桌面上进行测试,并查看它在需要32字节对齐的情况下是否出现“vmovdqa YMMWORD PTR [rbx+0x40],ymm0”的故障(存储“vpbroadcastd”的结果,初始化“__m256i v_offset = set1...”)(https://godbolt.org/z/8vfz3v5v1 修复了__m256i
块,使用-std=gnu++20 -fcoroutines -O2 -march=skylake
编译)。
我不知道为什么它在访问局部变量时使用RBX而不是RSP;我猜这是在函数的hello(hello(int&, int, int)::_Z5helloRiii.Frame*) [clone .actor]:
版本中协程的工作方式。在协程版本中,GCC仍然只是使用and rsp, -32
/ sub rsp, 192
来对齐堆栈指针,但对于相对于RBX存储的东西没有帮助。
请注意,您的所有3个版本都需要AVX-512,只是具有不同的矢量宽度。像_mm_cmpgt_epi32_mask
这样的比较-生成掩码操作总是需要AVX-512。
如果您想要在AVX2或SSE中使用整数掩码,您需要使用_mm_cmpgt_epi32
和_mm_movemask_epi8
(每字节1位)或_mm_movemask_ps( _mm_castsi128_ps(cmp_result) )
(每个int32 1位),或者使用_mm256
的等效操作。
使用-march=native
或-march=skylake-avx512
,-march=znver4
或其他类似选项。没有真正的CPU同时支持AVX512ER(Xeon Phi)和AVX512VL(其他所有CPU)。有关更多信息,请参阅https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512。
如果您的CPU不支持AVX-512,您将获得SIGILL(在所有3种情况下),而不是SIGSEGV。
英文:
This seems to be a GCC bug, unless coroutines are documented to not support local variables with alignof(T) > alignof(max_align_t)
(Such as __m256i
or __m512i
).
You can report it (preferably with a minimal AVX2 test case) to https://gcc.gnu.org/bugzilla/
With a version that only requires AVX2 instead of AVX-512, I could test it on my desktop and see it faults on vmovdqa YMMWORD PTR [rbx+0x40],ymm0
which requires 32-byte alignment. (Storing the result of a vpbroadcastd
, initializing __m256i v_offset = set1...
.) (https://godbolt.org/z/8vfz3v5v1 just fixes the __m256i
block, compiles with -std=gnu++20 -fcoroutines -O2 -march=skylake
)
IDK why it's using RBX to access locals instead of RSP; I guess that's how coroutines work in the hello(hello(int&, int, int)::_Z5helloRiii.Frame*) [clone .actor]:
version of the function. In that coroutine version, GCC still just aligns the stack pointer with and rsp, -32
/ sub rsp, 192
, but that doesn't help for things stored relative to RBX.
Note that all 3 of your versions require AVX-512, just with different vector widths. Compare-into-mask like _mm_cmpgt_epi32_mask
always requires AVX-512.
If you want an integer mask with AVX2 or SSE, you need _mm_cmpgt_epi32
and _mm_movemask_epi8
(1 bit per byte) or _mm_movemask_ps( _mm_castsi128_ps(cmp_result) )
(1 bit per int32), or the _mm256
equivalent.
Use -march=native
or -march=skylake-avx512
, -march=znver4
, or whatever. No real CPUs have ever supported both AVX512ER (Xeon Phi) and AVX512VL (everything else). https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512
If your CPU didn't support AVX-512, you'd get SIGILL (on all 3), not SIGSEGV.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论