C++ compilers give different signs of NaN for constant propagation of subtracting +-Infinity or +-NaN from itself in AVX SIMD code


Question

I'm investigating how to detect which lanes of a SIMD register hold floats that are +/- infinity or +/- NaN. After seeing some weird behavior at runtime, I decided to throw things into Godbolt to investigate, and things are weird: https://godbolt.org/z/TdnrK8rqd

#include <immintrin.h>
#include <cstdio>
#include <limits>
#include <cstdint>

static constexpr float inf = std::numeric_limits<float>::infinity();
static constexpr float qnan = std::numeric_limits<float>::quiet_NaN();
static constexpr float snan = std::numeric_limits<float>::signaling_NaN();

int main() {
    __m256 a = _mm256_setr_ps(0.0f, 1.0f, inf, -inf, qnan, -qnan, snan, -snan);

    __m256 mask = _mm256_sub_ps(a, a);

    // Extract masks as integers 
    int mask_bits = _mm256_movemask_ps(mask);

    std::printf("Mask for INFINITY or NaN: 0x%x\n", mask_bits);

    #define PRINT_ALL
    #ifdef PRINT_ALL
    float data_field[8];
    float mask_field[8];
    _mm256_storeu_ps(data_field, a);
    _mm256_storeu_ps(mask_field, mask);
    for (int i = 0; i < 8; ++i) {
        std::printf("isfinite(%f) = %x = %f\n", data_field[i], ((int32_t*)(char*)mask_field)[i], mask_field[i]);
    }
    #endif
    
    return 0;
}

Compilers give different results, and even a single compiler produces different results depending on the optimization level. Some compilers fully evaluate the code at compile time with (broken?) reasoning, so it all compiles down to hard-coded print statements with no actual calculation at runtime. Changing the optimization level triggers these (incorrect?) optimizations in some compilers.

Additionally, it seems I can influence what happens just by printing out all the results manually (the PRINT_ALL option).

The printed mask differs widely:

  • Without PRINT_ALL:
    • GCC 13.1 -O0: 0xac - new NaNs are -nan; preserve sign of input NaN.
    • GCC 13.1 -O1: 0x5c - new NaNs are -nan, flip sign of input NaNs.
    • Clang 16.0.0 -O0: 0xac
    • Clang 16.0.0 -O1: 0xa0 - new NaNs are +nan; preserve sign of input NaN.
    • ICX 2022.2.1 -O0: 0xac
    • ICX 2022.2.1 -O1: 0xa0
  • With PRINT_ALL, optimized GCC now matches what the hardware does, LLVM (clang and ICX) doesn't change.
    • GCC 13.1 -O0: 0xac
    • GCC 13.1 -O1: 0xac
    • Clang 16.0.0 -O0: 0xac
    • Clang 16.0.0 -O1: 0xa0
    • ICX 2022.2.1 -O0: 0xac
    • ICX 2022.2.1 -O1: 0xa0

A "new NaN" is inf - inf or -inf - -inf, where the result is NaN but neither input was NaN. These form the high 2 bits of the low hex digit, the 0x?C or 0x?0. The low 2 bits of that nibble come from the 0-0 and 1-1 elements, which produce +0.0 output as required for finite same-same with rounding modes other than towards -Inf (which isn't the default.)

ICX and clang seem to agree with each other, but still differ in results depending on the optimization level. I'm guessing 0xac is the correct result, as that is what happens in -O0 and all results are actually calculated by the CPU at runtime, without the compiler trying to be clever.

Bottom line, my question is: is this "expected behavior" according to some rules I am not aware of, or did I find a bug in three different compilers (GCC, Clang, and ICX)? (I couldn't test MSVC, as Godbolt doesn't support executing the code for those builds.)

(-fno-strict-aliasing doesn't affect the results, so ((int32_t*)(char*)mask_field)[i] wasn't causing this.)


Answer 1

Score: 3

The sign of a NaN result is specified only for abs, copysign, and unary minus. Otherwise, the sign is unspecified. When both operands of an SSE instruction are not NaN, x86 CPUs produce a negative NaN, but compilers are not obligated to simulate that when optimizing.

Therefore, only the low two bits of the mask_bits variable are predictable.

For instance, with gcc -O1 the compiler internally transforms _mm256_sub_ps(a, a) into a + b, where b is a constant vector with the same contents as a but all signs flipped. It then emits the vaddps instruction on those constant vectors in registers, and the bits in the high nibble of the result depend on the order of the operands (the CPU copies the NaN from one of them).

LLVM folds subtraction of infinities to positive NaN, where the CPU produces a negative NaN: https://godbolt.org/z/1hd69josr


huangapple
  • Published on 2023-05-07 22:04:47
  • Please keep this link when reposting: https://go.coder-hub.com/76194413.html