What causes the different NaN behavior when compiling `_mm_ucomilt_ss` intrinsic?
Question
Can someone explain to me why the following code fails for GCC 8.5 with NaNs?

```cpp
#include <cmath>        // std::abs
#include <limits>       // std::numeric_limits
#include <xmmintrin.h>  // _mm_set_ss, _mm_ucomilt_ss

bool isfinite_sse42(float num)
{
    return _mm_ucomilt_ss(_mm_set_ss(std::abs(num)),
                          _mm_set_ss(std::numeric_limits<float>::infinity())) == 1;
}
```

My expectation for GCC 8.5 would be that it returns false for NaN.

The Intel Intrinsics Guide entry for `_mm_ucomilt_ss` says

```c++
RETURN ( a[31:0] != NaN AND b[31:0] != NaN AND a[31:0] < b[31:0] ) ? 1 : 0
```

i.e., if either `a` or `b` is NaN it returns 0. At the assembly level (Godbolt) one can see a `ucomiss abs(x), Infinity` followed by a `setb`.

```
# GCC8.5 -O2 doesn't match documented intrinsic behaviour for NaN
ucomiss xmm0, DWORD PTR .LC2[rip]
setb    al
```

Interestingly, newer GCCs and Clang swap the comparison from `a < b` to `b > a` and therefore use `seta`. But why does the code with `setb` return true for NaN, while the code with `seta` returns false for NaN?
Answer 1
Score: 1
GCC is buggy before GCC 13: it does not implement the documented semantics of the intrinsic for the NaN case. Those semantics require either checking PF separately, or doing the compare as `ucomiss Inf, abs` and testing `Inf > abs`, so that the CF=1 produced by the unordered case makes the condition false.

See https://www.felixcloutier.com/x86/ucomiss#operation or the nicer table in https://www.felixcloutier.com/x86/fcomi:fcomip:fucomi:fucomip . (All x86 scalar FP compares that set EFLAGS do it the same way, matching historical `fcom` / `fstsw` / `sahf`.)
| Comparison result | ZF | PF | CF |
|---|---|---|---|
| left > right | 0 | 0 | 0 |
| left < right | 0 | 0 | 1 |
| left = right | 1 | 0 | 0 |
| Unordered | 1 | 1 | 1 |
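Notice that CF is set for both the `left < right` and unordered cases, but not for the other two cases. `setb` tests only CF=1, so it is also true for unordered; `seta` requires CF=0 and ZF=0, so it is false for unordered.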
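To make the table concrete, here is a small sketch that models the flag results in plain C++ and applies the two condition codes to them. It does not touch real hardware flags; `Flags`, `ucomiss_model`, and the `lt_*` helpers are hypothetical names introduced only for illustration.

```cpp
#include <limits>

// Flags written by ucomiss/comiss, as listed in the table above.
struct Flags {
    bool zf, pf, cf;
};

// Software model of `ucomiss left, right` (illustration only, not real EFLAGS).
constexpr Flags ucomiss_model(float left, float right) {
    if (left != left || right != right) return {true, true, true};  // unordered (NaN)
    if (left > right) return {false, false, false};
    if (left < right) return {false, false, true};
    return {true, false, false};                                    // equal
}

constexpr bool setb(Flags f)  { return f.cf; }            // "below": CF == 1
constexpr bool seta(Flags f)  { return !f.cf && !f.zf; }  // "above": CF == 0 and ZF == 0
constexpr bool setnp(Flags f) { return !f.pf; }           // "no parity": PF == 0

constexpr float kInf = std::numeric_limits<float>::infinity();
constexpr float kNaN = std::numeric_limits<float>::quiet_NaN();

// Sequence from the question: ucomiss abs, inf ; setb
constexpr bool lt_setb(float abs_x) { return setb(ucomiss_model(abs_x, kInf)); }

// Swapped operands: ucomiss inf, abs ; seta
constexpr bool lt_seta(float abs_x) { return seta(ucomiss_model(kInf, abs_x)); }

// Original operand order, with the extra PF check: setb ; setnp ; and
constexpr bool lt_setb_np(float abs_x) {
    auto f = ucomiss_model(abs_x, kInf);
    return setb(f) && setnp(f);
}
```

In this model `lt_setb(kNaN)` is true, while `lt_seta(kNaN)` and `lt_setb_np(kNaN)` are false; all three agree for ordinary finite inputs and for infinity.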
If you can arrange things such that you can check for `>` or `>=`, you don't need to `setnp cl` / `and al, cl` to rule out Unordered. This is what clang 16 and GCC 13 do to get correct results from `ucomiss inf, abs` / `seta`.
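The same idea can be tried at the source level by swapping the operands yourself and using the greater-than intrinsic. This is only a sketch: it assumes the compiler lowers `_mm_ucomigt_ss` to `ucomiss` followed by `seta`, which I have not checked on every affected compiler version. The name `isfinite_swapped`, the macro `HAVE_SSE_UCOMI`, and the non-SSE fallback are hypothetical additions for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>    // SSE: _mm_set_ss, _mm_ucomigt_ss
#define HAVE_SSE_UCOMI 1  // hypothetical macro, only used in this sketch
#endif

// Tests inf > |num| instead of |num| < inf, so that an unordered compare
// (CF = 1) fails the "above" condition instead of passing the "below" one.
inline bool isfinite_swapped(float num)
{
    const float inf = std::numeric_limits<float>::infinity();
#ifdef HAVE_SSE_UCOMI
    return _mm_ucomigt_ss(_mm_set_ss(inf), _mm_set_ss(std::abs(num))) != 0;
#else
    return inf > std::abs(num);  // fallback for targets without SSE
#endif
}

// Small self-check; call it from main() or from a unit test.
inline void check_isfinite_swapped()
{
    assert(isfinite_swapped(1.5f));
    assert(!isfinite_swapped(std::numeric_limits<float>::infinity()));
    assert(!isfinite_swapped(std::numeric_limits<float>::quiet_NaN()));
}
```

It is worth looking at the generated assembly on Godbolt for the compiler you actually target before relying on this.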
GCC 8.5 does the right thing if you write `abs(x) < infinity`; it's only the scalar intrinsic that it doesn't implement properly. (With plain scalar code, it uses `comiss` instead of `ucomiss`, the only difference being that it will update the FP environment with an invalid-operation (#I) FP exception on QNaN as well as SNaN.)
The plain scalar version requires a separate `movss` load instead of a memory source. But it does let GCC avoid the useless SSE4.1 `insertps` instruction that zeros the high 3 elements of XMM0, which `ucomiss` doesn't read anyway. Clang sees that and optimizes away that part of `_mm_set_ss(num)`, but GCC doesn't. The lack of an efficient way to go from a scalar `float` to a `__m128` with don't-care upper elements is a persistent problem in Intel's intrinsics API that only some compilers manage to optimize around. (https://stackoverflow.com/questions/39318496/how-to-merge-a-scalar-into-a-vector-without-the-compiler-wasting-an-instruction) A `float` is just the low element of a `__m128`.
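For reference, the plain scalar form mentioned above could look like the following minimal sketch. The function name `isfinite_scalar` is hypothetical, and the code assumes default floating-point semantics (options such as `-ffast-math` allow the compiler to assume NaN and infinity never occur).

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Plain scalar version: no intrinsics, so the compiler chooses the compare
// instruction and the operand order itself.
inline bool isfinite_scalar(float num)
{
    return std::abs(num) < std::numeric_limits<float>::infinity();
}

// Small self-check; call it from main() or from a unit test.
inline void check_isfinite_scalar()
{
    assert(isfinite_scalar(42.0f));
    assert(!isfinite_scalar(-std::numeric_limits<float>::infinity()));
    assert(!isfinite_scalar(std::numeric_limits<float>::quiet_NaN()));
}
```

Because any ordered comparison with NaN is false in C++, this returns false for NaN without any explicit NaN check.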