C++ compilers give different signs of NaN for constant propagation of subtracting +-Infinity or +-NaN from itself in AVX SIMD code

Question
I'm investigating how to detect in which lanes of a SIMD register the floats are either +/- infinity or +/- NaN. After seeing some weird behavior at runtime, I decided to throw things into Godbolt to investigate, and things are weird: https://godbolt.org/z/TdnrK8rqd
#include <immintrin.h>
#include <cstdio>
#include <limits>
#include <cstdint>

static constexpr float inf = std::numeric_limits<float>::infinity();
static constexpr float qnan = std::numeric_limits<float>::quiet_NaN();
static constexpr float snan = std::numeric_limits<float>::signaling_NaN();

int main() {
    __m256 a = _mm256_setr_ps(0.0f, 1.0f, inf, -inf, qnan, -qnan, snan, -snan);
    __m256 mask = _mm256_sub_ps(a, a);
    // Extract masks as integers
    int mask_bits = _mm256_movemask_ps(mask);
    std::printf("Mask for INFINITY or NaN: 0x%x\n", mask_bits);

#define PRINT_ALL
#ifdef PRINT_ALL
    float data_field[8];
    float mask_field[8];
    _mm256_storeu_ps(data_field, a);
    _mm256_storeu_ps(mask_field, mask);
    for (int i = 0; i < 8; ++i) {
        std::printf("isfinite(%f) = %x = %f\n", data_field[i],
                    ((int32_t*)(char*)mask_field)[i], mask_field[i]);
    }
#endif
    return 0;
}
Compilers give different results, and even produce different results depending on the optimization level. Some compilers fully execute the code at compile time with (broken?) reasoning, so it all compiles down to hard-coded print statements, without actual calculations at runtime. Changing the optimization level causes some compilers to trigger these (incorrect?) optimizations.
Additionally, it seems I managed to influence what happens by printing out all results manually (the PRINT_ALL option).
The printed mask differs widely:

- Without PRINT_ALL:
  - GCC 13.1 -O0: 0xac - new NaNs are -nan; preserve sign of input NaN.
  - GCC 13.1 -O1: 0x5c - new NaNs are -nan; flip sign of input NaNs.
  - Clang 16.0.0 -O0: 0xac
  - Clang 16.0.0 -O1: 0xa0 - new NaNs are +nan; preserve sign of input NaN.
  - ICX 2022.2.1 -O0: 0xac
  - ICX 2022.2.1 -O1: 0xa0
- With PRINT_ALL, optimized GCC now matches what the hardware does; LLVM (clang and ICX) doesn't change.
  - GCC 13.1 -O0: 0xac
  - GCC 13.1 -O1: 0xac
  - Clang 16.0.0 -O0: 0xac
  - Clang 16.0.0 -O1: 0xa0
  - ICX 2022.2.1 -O0: 0xac
  - ICX 2022.2.1 -O1: 0xa0
A "new NaN" is inf - inf or -inf - -inf, where the result is NaN but neither input was NaN. These form the high 2 bits of the low hex digit, the 0x?C or 0x?0. The low 2 bits of that nibble come from the 0-0 and 1-1 elements, which produce +0.0 output as required for finite same-same subtraction with rounding modes other than towards -Inf (which isn't the default).
ICX and clang seem to agree with each other, but still differ in results depending on the optimization level. I'm guessing 0xac is the correct result, as that is what happens at -O0, where all results are actually calculated by the CPU at runtime, without the compiler trying to be clever.
Bottom line, my question is: is this "expected behavior" according to some rules I am not aware of, or did I find a bug in three different compilers (GCC, Clang, and ICX)? (I couldn't test MSVC, as Godbolt doesn't support executing the code for those builds.)
(-fno-strict-aliasing doesn't affect the results, so ((int32_t*)(char*)mask_field)[i] wasn't causing this.)
Answer 1
Score: 3
The sign of a NaN result is specified only for abs, copysign, and unary minus. Otherwise, the sign is unspecified. When both operands of an SSE instruction are not NaN, x86 CPUs produce a negative NaN, but compilers are not obligated to simulate that when optimizing.
Therefore, only the low two bits of the mask_bits variable are predictable.
For instance, in the case of gcc -O1, the compiler internally transforms _mm256_sub_ps(a, a) to a + b, where b is a constant vector with the same contents as a, but all signs flipped. After that, it emits the vaddps instruction with those constant vectors in registers, and the bits in the high nibble of the result depend on the order of operands (the CPU copies the NaN from one of the operands).
LLVM folds subtraction of infinities to positive NaN, where the CPU produces a negative NaN: https://godbolt.org/z/1hd69josr