2023年7月20日 16:08:18go评论86阅读模式

英文:

gcc c++ coroutine runs avx SIMD code, but causes SIGSEGV

问题

#define AVX512 0
#define AVX2 1
#define SSE 0

HelloCoroutine hello(int&amp; index, int id, int group_size) {
    unsigned res=0;
#if AVX512
// g++ simd.cpp -std=gnu++20 -fcoroutines -O2 -march=skylake -mavx512f
// segment fault
    for(auto i= index++; i&lt; 20; i=index++)
    {
        std::cout &lt;&lt;&quot;step 1&quot; &lt;&lt;std::endl;
        __m512i v_offset = _mm512_set1_epi64(int64_t (i));
        std::cout &lt;&lt;&quot;step 2&quot; &lt;&lt;std::endl;
        __m512i v_size = _mm512_set1_epi64(int64_t(group_size));
        std::cout &lt;&lt;&quot;step 3&quot; &lt;&lt;std::endl;
        res = _mm512_cmpgt_epi64_mask(v_offset, v_size);
        cout &lt;&lt;i &lt;&lt; &quot; &gt; &quot; &lt;&lt; group_size &lt;&lt;&quot; ? &quot; &lt;&lt; (int)res&lt;&lt;endl;
        co_await std::suspend_always();
    }
#elif AVX2 
// g++ simd.cpp -std=gnu++20 -fcoroutines -O2 -march=skylake
// only specify `-O2 -march=skylake` and runs ok on local machine, otherwise segment fault (also on godbolt)
    for(auto i= index++; i&lt; 20; i=index++)
    {
        std::cout &lt;&lt;&quot;step 1&quot; &lt;&lt;std::endl;
        __m256i v_offset = _mm256_set1_epi32(int32_t (i));
        std::cout &lt;&lt;&quot;step 2&quot; &lt;&lt;std::endl;
        __m256i v_size = _mm256_set1_epi32(int32_t(group_size));
        std::cout &lt;&lt;&quot;step 3&quot; &lt;&lt;std::endl;
        res = _mm256_movemask_epi8(_mm256_cmpgt_epi32(v_offset, v_size));
        cout &lt;&lt;i &lt;&lt; &quot; &gt; &quot; &lt;&lt; group_size &lt;&lt;&quot; ? &quot; &lt;&lt; (int)res&lt;&lt;endl;
        co_await std::suspend_always();
    }
#elif SSE
    for(auto i= index++; i&lt; 20; i=index++)
    {
        __m128i v_offset = _mm_set1_epi32(int32_t (i));
        __m128i v_size = _mm_set1_epi32(int32_t(group_size));
        res = _mm_movemask_epi8(_mm_cmpgt_epi32(v_offset, v_size));
        cout &lt;&lt;i &lt;&lt; &quot; &gt; &quot; &lt;&lt; group_size &lt;&lt;&quot; ? &quot; &lt;&lt; res&lt;&lt;endl;
        co_await std::suspend_always();
    }    
#else
    for(auto i= index++; i&lt; 20; i=index++)
    {
        res = i &gt; group_size;
        cout &lt;&lt;i &lt;&lt; &quot; &gt; &quot; &lt;&lt; group_size &lt;&lt;&quot; ? &quot; &lt;&lt; res&lt;&lt;endl;
        co_await std::suspend_always();
    }
#endif
}

英文:

c++ coroutine runs avx SIMD code, but causes SIGSEGV for AVX2 and AVX512

#define AVX512 0
#define AVX2 1
#define SSE 0

HelloCoroutine hello(int&amp; index, int id, int group_size) {
    unsigned res=0;
#if AVX512
// g++ simd.cpp -std=gnu++20 -fcoroutines -O2 -march=skylake -mavx512f
// segment fault
    for(auto i= index++; i&lt; 20; i=index++)
    {
        std::cout &lt;&lt;&quot;step 1&quot; &lt;&lt;std::endl;
        __m512i v_offset = _mm512_set1_epi64(int64_t (i));
        std::cout &lt;&lt;&quot;step 2&quot; &lt;&lt;std::endl;
        __m512i v_size = _mm512_set1_epi64(int64_t(group_size));
        std::cout &lt;&lt;&quot;step 3&quot; &lt;&lt;std::endl;
        res = _mm512_cmpgt_epi64_mask(v_offset, v_size);
        cout &lt;&lt;i &lt;&lt; &quot; &gt; &quot; &lt;&lt; group_size &lt;&lt;&quot; ? &quot; &lt;&lt; (int)res&lt;&lt;endl;
        co_await std::suspend_always();
    }
#elif AVX2 
// g++ simd.cpp -std=gnu++20 -fcoroutines -O2 -march=skylake
// only specify `-O2 -march=skylake` and runs ok on local machine, otherwise segment fault (also on godbolt)
    for(auto i= index++; i&lt; 20; i=index++)
    {
        std::cout &lt;&lt;&quot;step 1&quot; &lt;&lt;std::endl;
        __m256i v_offset = _mm256_set1_epi32(int32_t (i));
        std::cout &lt;&lt;&quot;step 2&quot; &lt;&lt;std::endl;
        __m256i v_size = _mm256_set1_epi32(int32_t(group_size));
        std::cout &lt;&lt;&quot;step 3&quot; &lt;&lt;std::endl;
        res = _mm256_movemask_epi8(_mm256_cmpgt_epi32(v_offset, v_size));
        cout &lt;&lt;i &lt;&lt; &quot; &gt; &quot; &lt;&lt; group_size &lt;&lt;&quot; ? &quot; &lt;&lt; (int)res&lt;&lt;endl;
        co_await std::suspend_always();
    }
#elif SSE
    for(auto i= index++; i&lt; 20; i=index++)
    {
        __m128i v_offset = _mm_set1_epi32(int32_t (i));
        __m128i v_size = _mm_set1_epi32(int32_t(group_size));
        res = _mm_movemask_epi8(_mm_cmpgt_epi32(v_offset, v_size));
        cout &lt;&lt;i &lt;&lt; &quot; &gt; &quot; &lt;&lt; group_size &lt;&lt;&quot; ? &quot; &lt;&lt; res&lt;&lt;endl;
        co_await std::suspend_always();
    }    
#else
    for(auto i= index++; i&lt; 20; i=index++)
    {
        res = i &gt; group_size;
        cout &lt;&lt;i &lt;&lt; &quot; &gt; &quot; &lt;&lt; group_size &lt;&lt;&quot; ? &quot; &lt;&lt; res&lt;&lt;endl;
        co_await std::suspend_always();
    }
#endif
}

compile at https://godbolt.org/z/h3hej1ddq

-std=c++20 -fcoroutines -mbmi2 -mavx -mavx512f -mavx512pf -mavx512er -mavx512cd -mavx512vl

but result error for avx and avx512, only SSE works OK

Program returned: 139
Program terminated with signal: SIGSEGV
step 1

but it works on on clang-16 -std=gnu++20 -O2 -march=skylake -mavx512f https://godbolt.org/z/nMfbn8G9T

答案1

得分: 3

这似乎是GCC的一个错误，除非协程文档明确说明不支持具有alignof(T) > alignof(max_align_t)（例如__m256i或__m512i）的局部变量。

您可以将此问题报告给https://gcc.gnu.org/bugzilla/（最好提供一个最小的AVX2测试用例）。

使用只需要AVX2而不是AVX-512的版本，我可以在我的桌面上进行测试，并查看它在需要32字节对齐的情况下是否出现“vmovdqa YMMWORD PTR [rbx+0x40]，ymm0”的故障（存储“vpbroadcastd”的结果，初始化“__m256i v_offset = set1...”）（https://godbolt.org/z/8vfz3v5v1 修复了__m256i块，使用-std=gnu++20 -fcoroutines -O2 -march=skylake编译）。

我不知道为什么它在访问局部变量时使用RBX而不是RSP；我猜这是在函数的hello(hello(int&, int, int)::_Z5helloRiii.Frame*) [clone .actor]:版本中协程的工作方式。在协程版本中，GCC仍然只是使用and rsp, -32 / sub rsp, 192来对齐堆栈指针，但对于相对于RBX存储的东西没有帮助。

请注意，您的所有3个版本都需要AVX-512，只是具有不同的矢量宽度。像_mm_cmpgt_epi32_mask这样的比较-生成掩码操作总是需要AVX-512。

如果您想要在AVX2或SSE中使用整数掩码，您需要使用_mm_cmpgt_epi32和_mm_movemask_epi8（每字节1位）或_mm_movemask_ps( _mm_castsi128_ps(cmp_result) )（每个int32 1位），或者使用_mm256的等效操作。

使用-march=native或-march=skylake-avx512，-march=znver4或其他类似选项。没有真正的CPU同时支持AVX512ER（Xeon Phi）和AVX512VL（其他所有CPU）。有关更多信息，请参阅https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512。

如果您的CPU不支持AVX-512，您将获得SIGILL（在所有3种情况下），而不是SIGSEGV。

英文:

This seems to be a GCC bug, unless coroutines are documented to not support local variables with alignof(T) > alignof(max_align_t) (Such as __m256i or __m512i).

You can report it (preferably with a minimal AVX2 test case) to https://gcc.gnu.org/bugzilla/

With a version that only requires AVX2 instead of AVX-512, I could test it on my desktop and see it faults on vmovdqa YMMWORD PTR [rbx+0x40],ymm0 which requires 32-byte alignment. (Storing the result of a vpbroadcastd, initializing __m256i v_offset = set1....) (https://godbolt.org/z/8vfz3v5v1 just fixes the __m256i block, compiles with -std=gnu++20 -fcoroutines -O2 -march=skylake)

IDK why it's using RBX to access locals instead of RSP; I guess that's how coroutines work in the hello(hello(int&, int, int)::_Z5helloRiii.Frame*) [clone .actor]: version of the function. In that coroutine version, GCC still just aligns the stack pointer with and rsp, -32 / sub rsp, 192, but that doesn't help for things stored relative to RBX.

Note that all 3 of your versions require AVX-512, just with different vector widths. Compare-into-mask like _mm_cmpgt_epi32_mask always requires AVX-512.

If you want an integer mask with AVX2 or SSE, you need _mm_cmpgt_epi32 and _mm_movemask_epi8 (1 bit per byte) or _mm_movemask_ps( _mm_castsi128_ps(cmp_result) ) (1 bit per int32), or the _mm256 equivalent.

Use -march=native or -march=skylake-avx512, -march=znver4, or whatever. No real CPUs have ever supported both AVX512ER (Xeon Phi) and AVX512VL (everything else). https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512

If your CPU didn't support AVX-512, you'd get SIGILL (on all 3), not SIGSEGV.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

gcc c++ 协程运行 avx SIMD 代码，但导致 SIGSEGV。

问题

答案1

为什么在C++中无效选择后’break’不起作用？

窗口句柄为何未正确存储？

如何在单线程系统上停止阻塞任务？

Compile-Time Topological Sort Exceeds Recursion Depth in C++

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论