C++ compilers give different signs of NaN for constant propagation of subtracting +-Infinity or +-NaN from itself in AVX SIMD code


Question

I'm investigating how to detect which lanes of a SIMD register hold floats that are +/- infinity or +/- NaN. After seeing some weird behavior at runtime, I put the code on Godbolt to investigate, and the results are odd:

#include <immintrin.h>
#include <cstdio>
#include <limits>
#include <cstdint>

static constexpr float inf = std::numeric_limits<float>::infinity();
static constexpr float qnan = std::numeric_limits<float>::quiet_NaN();
static constexpr float snan = std::numeric_limits<float>::signaling_NaN();

int main() {
    __m256 a = _mm256_setr_ps(0.0f, 1.0f, inf, -inf, qnan, -qnan, snan, -snan);

    __m256 mask = _mm256_sub_ps(a, a);

    // Extract masks as integers 
    int mask_bits = _mm256_movemask_ps(mask);

    std::printf("Mask for INFINITY or NaN: 0x%x\n", mask_bits);

    #define PRINT_ALL
    #ifdef PRINT_ALL
    float data_field[8];
    float mask_field[8];
    _mm256_storeu_ps(data_field, a);
    _mm256_storeu_ps(mask_field, mask);
    for (int i = 0; i < 8; ++i) {
        std::printf("isfinite(%f) = %x = %f\n", data_field[i], ((int32_t*)(char*)mask_field)[i], mask_field[i]);
    }
    #endif
    
    return 0;
}

Compilers give different results, and the results even change with the optimization level. Some compilers evaluate the whole program at compile time with (broken?) reasoning, so it compiles down to hard-coded print statements with no actual calculation at runtime. Does changing the optimization level trigger these (incorrect?) optimizations?

Additionally, printing out all results manually (the PRINT_ALL option) seems to influence what happens.

The printed mask differs widely:

  • Without PRINT_ALL:
    • GCC 13.1 -O0: 0xac - new NaNs are -nan; preserve sign of input NaN.
    • GCC 13.1 -O1: 0x5c - new NaNs are -nan; flip sign of input NaN.
    • Clang 16.0.0 -O0: 0xac
    • Clang 16.0.0 -O1: 0xa0 - new NaNs are +nan; preserve sign of input NaN.
    • ICX 2022.2.1 -O0: 0xac
    • ICX 2022.2.1 -O1: 0xa0
  • With PRINT_ALL (optimized GCC now matches what the hardware does; LLVM, i.e. Clang and ICX, doesn't change):
    • GCC 13.1 -O0: 0xac
    • GCC 13.1 -O1: 0xac
    • Clang 16.0.0 -O0: 0xac
    • Clang 16.0.0 -O1: 0xa0
    • ICX 2022.2.1 -O0: 0xac
    • ICX 2022.2.1 -O1: 0xa0

A "new NaN" is inf - inf or -inf - -inf, where the result is NaN but neither input was NaN. These form the high 2 bits of the low hex digit, the 0x?C or 0x?0. The low 2 bits of that nibble come from the 0-0 and 1-1 elements, which produce +0.0, as required for x - x with finite x in every rounding mode other than toward -Inf (which isn't the default).

ICX and Clang agree with each other, but their results still differ depending on the optimization level. I'm guessing 0xac is the correct result, as that is what happens at -O0, where all results are actually calculated by the CPU at runtime without the compiler trying to be clever.

Bottom line, my question is: is this "expected behavior" according to some rules I am not aware of, or did I find a bug in three different compilers (GCC, Clang, and ICX)? (I couldn't test MSVC, as Godbolt doesn't support executing the code for those builds.)

(-fno-strict-aliasing doesn't affect the results, so ((int32_t*)(char*)mask_field)[i] wasn't causing this.)
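For inspecting the bit patterns without raising the aliasing question at all, a minimal sketch using std::memcpy could look like the following. The helper names float_bits and sign_bit_set are made up for this illustration and are not part of the original code.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: return the raw IEEE 754 bit pattern of a float.
// std::memcpy copies the object bytes, so no pointer cast is involved.
inline std::uint32_t float_bits(float f) {
    static_assert(sizeof(std::uint32_t) == sizeof(float),
                  "this sketch assumes a 32-bit float");
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Hypothetical helper: true when bit 31 (the sign bit) is set.
// This also works for NaN and for -0.0f.
inline bool sign_bit_set(float f) {
    return (float_bits(f) >> 31) != 0;
}
```

In the PRINT_ALL loop, float_bits(mask_field[i]) could then be printed with %x in place of the cast expression.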

Godbolt link for the question's code: https://godbolt.org/z/TdnrK8rqd

Answer 1

Score: 3

In IEEE 754, the sign of a NaN result is specified only for abs, copysign, and unary minus. For every other operation the sign is unspecified. When neither operand of an SSE/AVX instruction is NaN and the operation is invalid (such as inf - inf), x86 CPUs produce a negative NaN, but compilers are not obligated to simulate that when optimizing.
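Because copysign is one of the operations whose effect on a NaN's sign is specified, it can be used to make NaN signs deterministic before printing or comparing bit patterns. The following is only a sketch under that assumption; clear_nan_sign is a hypothetical name, and the code assumes default floating-point flags (no -ffast-math).

```cpp
#include <cmath>

// Hypothetical helper: leave ordinary values untouched, but give every NaN
// a cleared sign bit. std::copysign takes the magnitude (here, the NaN
// payload) from the first argument and the sign from the second.
inline float clear_nan_sign(float x) {
    return std::isnan(x) ? std::copysign(x, 1.0f) : x;
}
```

Note that this only normalizes the output. It does not help with detecting NaN lanes through the sign bit, since the information is not reliably in the sign to begin with.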

Therefore, only the low two bits of the mask_bits variable are predictable.
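If the goal is the one stated in the question (finding the lanes that hold +/- infinity or +/- NaN), one way to avoid depending on the NaN sign is to test whether x - x is NaN with an unordered compare instead of reading its sign bit. The sketch below is an illustration, not code from the question: nonfinite_mask8 and question_input_mask are hypothetical names, the AVX path is used only when the compiler defines __AVX__, and default floating-point flags (no -ffast-math) are assumed.

```cpp
#include <cstddef>
#include <limits>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical helper: bit i of the result is set when x[i] is +-inf or NaN.
// x - x is NaN exactly when x is not finite, and an unordered self-compare
// is true for any NaN regardless of its sign bit.
inline unsigned nonfinite_mask8(const float* x) {
#if defined(__AVX__)
    const __m256 v = _mm256_loadu_ps(x);
    const __m256 d = _mm256_sub_ps(v, v);
    const __m256 is_nan = _mm256_cmp_ps(d, d, _CMP_UNORD_Q);
    return static_cast<unsigned>(_mm256_movemask_ps(is_nan));
#else
    // Portable scalar fallback with the same semantics.
    unsigned mask = 0;
    for (std::size_t i = 0; i < 8; ++i) {
        const float d = x[i] - x[i];
        if (d != d) mask |= 1u << i;
    }
    return mask;
#endif
}

// The eight lanes from the question: two finite values, then six non-finite.
inline unsigned question_input_mask() {
    constexpr float inf = std::numeric_limits<float>::infinity();
    constexpr float qnan = std::numeric_limits<float>::quiet_NaN();
    constexpr float snan = std::numeric_limits<float>::signaling_NaN();
    const float lanes[8] = {0.0f, 1.0f, inf, -inf, qnan, -qnan, snan, -snan};
    return nonfinite_mask8(lanes);
}
```

For the question's input this yields 0xfc (lanes 2 through 7 set), whether the subtraction is folded at compile time or executed by the CPU, because only the NaN-ness of the result is inspected.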

For instance, in the case of GCC at -O1, the compiler internally transforms _mm256_sub_ps(a, a) into a + b, where b is a constant vector with the same contents as a but with all signs flipped. It then emits a vaddps instruction with those constant vectors in registers, and the bits in the high nibble of the result depend on the order of the operands (the CPU copies the NaN from one of the operands).

LLVM folds subtraction of infinities to positive NaN, where the CPU produces a negative NaN: https://godbolt.org/z/1hd69josr
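A small way to observe this difference locally is to compare a subtraction the compiler is free to fold with one that is forced to run on the CPU. This is a sketch with hypothetical function names; volatile is used here only to keep the compiler from folding the second subtraction, and whether the first one is actually folded depends on the compiler and flags.

```cpp
#include <cmath>
#include <limits>

// The operands are compile-time constants, so an optimizing compiler may
// evaluate this subtraction itself and pick the NaN sign it prefers.
inline float folded_inf_minus_inf() {
    constexpr float inf = std::numeric_limits<float>::infinity();
    return inf - inf;
}

// volatile forces both operands to be loaded at run time, so the
// subtraction is executed by the CPU instead of being folded.
inline float runtime_inf_minus_inf() {
    volatile float a = std::numeric_limits<float>::infinity();
    volatile float b = a;
    return a - b;
}
```

Both functions return a NaN; printing std::signbit of each result shows whether the compiler's folded NaN and the hardware's NaN carry the same sign on a given build.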


huangapple
  • Posted on May 7, 2023, 22:04:47
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76194413.html