What causes the different NaN behavior when compiling the `_mm_ucomilt_ss` intrinsic?

Question

Can someone explain why the following code fails for NaN inputs with GCC 8.5?

```cpp
#include <cmath>
#include <limits>
#include <xmmintrin.h>  // _mm_set_ss, _mm_ucomilt_ss

bool isfinite_sse42(float num)
{
    // Documented result: 1 if |num| < +Inf, 0 if either operand is NaN.
    return _mm_ucomilt_ss(_mm_set_ss(std::abs(num)),
                          _mm_set_ss(std::numeric_limits<float>::infinity())) == 1;
}
```

I would expect it to return false for NaN, but with GCC 8.5 it does not.

The Intel Intrinsics Guide describes `_mm_ucomilt_ss` as:

```
RETURN ( a[31:0] != NaN AND b[31:0] != NaN AND a[31:0] < b[31:0] ) ? 1 : 0
```

That is, if either a or b is NaN, it returns 0. At the assembly level (Godbolt), one can see a `ucomiss abs(x), Infinity` followed by a `setb`:

```
# GCC 8.5 -O2: doesn't match the documented intrinsic behaviour for NaN
        ucomiss xmm0, DWORD PTR .LC2[rip]
        setb    al
```

Interestingly, newer GCC versions and Clang swap the comparison from a < b to b > a and therefore use `seta`. But why does the code with `setb` return true for NaN, while the code with `seta` returns false?

Answer 1

Score: 1

GCC before version 13 is buggy here: it does not implement the documented semantics of the intrinsic for the NaN case. Getting them right requires either checking PF separately, or emitting the comparison as `ucomiss Inf, abs` with `seta`, so that the unordered case, which sets CF, produces 0.

See https://www.felixcloutier.com/x86/ucomiss#operation or the nicer table in https://www.felixcloutier.com/x86/fcomi:fcomip:fucomi:fucomip . (All x86 scalar FP compares that set EFLAGS do it the same way, matching historical fcom / fstsw / sahf.)

| Comparison result | ZF | PF | CF |
|-------------------|----|----|----|
| left > right      | 0  | 0  | 0  |
| left < right      | 0  | 0  | 1  |
| left = right      | 1  | 0  | 0  |
| Unordered         | 1  | 1  | 1  |

Notice that CF is set for both the left < right and unordered cases, but not for the other two cases.

If you can arrange things so that you check for > or >=, you don't need `setnp cl` / `and al, cl` to rule out Unordered. This is what Clang 16 and GCC 13 do: they get correct results from `ucomiss inf, abs` / `seta`.
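To make the flag logic concrete, here is a minimal software model of the table above, written in portable C++ rather than with the real intrinsics. The names `Flags`, `ucomiss_model`, `lt_setb_only`, `lt_setb_and_setnp` and `lt_swapped_seta` are made up for this sketch; it only mimics how `ucomiss` sets ZF/PF/CF and how `setb`, `seta` and `setnp` read them.

```cpp
#include <cmath>

// EFLAGS bits written by ucomiss/comiss, as listed in the table above.
struct Flags {
    bool zf, pf, cf;
};

// Hypothetical software model of `ucomiss left, right` (not a real API).
Flags ucomiss_model(float left, float right) {
    if (std::isnan(left) || std::isnan(right)) return {true, true, true};  // unordered
    if (left > right) return {false, false, false};
    if (left < right) return {false, false, true};
    return {true, false, false};                                           // equal
}

// Condition codes as the setcc instructions evaluate them.
bool setb(Flags f)  { return f.cf; }            // "below": CF = 1
bool seta(Flags f)  { return !f.cf && !f.zf; }  // "above": CF = 0 and ZF = 0
bool setnp(Flags f) { return !f.pf; }           // "no parity": PF = 0

// a < b the way GCC 8.5 lowers the intrinsic: ucomiss a, b / setb.
bool lt_setb_only(float a, float b) { return setb(ucomiss_model(a, b)); }

// a < b with the extra PF check: setb al / setnp cl / and al, cl.
bool lt_setb_and_setnp(float a, float b) {
    Flags f = ucomiss_model(a, b);
    return setb(f) && setnp(f);
}

// a < b with swapped operands: ucomiss b, a / seta.
bool lt_swapped_seta(float a, float b) { return seta(ucomiss_model(b, a)); }
```

In this model, `lt_setb_only(NAN, INFINITY)` is true because the unordered result sets CF, while both of the other variants return false for the same inputs. All three agree whenever neither operand is NaN.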

GCC 8.5 does the right thing if you write `abs(x) < infinity`; it's only the scalar intrinsic that it doesn't implement properly. (With plain scalar code it uses `comiss` instead of `ucomiss`; the only difference is that `comiss` raises the invalid-operation (#I) FP exception on QNaN as well as SNaN.)
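For reference, the plain scalar form mentioned here could look like the sketch below. The function name `isfinite_scalar` is made up for illustration, and the code assumes default IEEE semantics (no `-ffast-math`), under which any ordered comparison involving NaN is false.

```cpp
#include <cmath>
#include <limits>

// Finite check written as an ordinary float comparison, no intrinsics.
// If num is NaN, std::abs(num) is NaN too, and `<` evaluates to false.
bool isfinite_scalar(float num) {
    return std::abs(num) < std::numeric_limits<float>::infinity();
}
```

This leaves the choice of operand order and condition code to the compiler, so it doesn't depend on how a particular compiler version lowers `_mm_ucomilt_ss`.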

The scalar version requires a separate `movss` load instead of a memory source operand. But it does let GCC avoid the useless SSE4.1 `insertps` instruction that zeros the high 3 elements of XMM0, which `ucomiss` doesn't read anyway. Clang sees that and optimizes away that part of `_mm_set_ss(num)`, but GCC doesn't. The lack of an efficient way to go from a scalar float to a `__m128` with don't-care upper elements is a persistent problem in Intel's intrinsics API that only some compilers manage to optimize around. (https://stackoverflow.com/questions/39318496/how-to-merge-a-scalar-into-a-vector-without-the-compiler-wasting-an-instruction) A float is just the low element of a `__m128`.

huangapple
  • Posted on June 1, 2023, 21:35:56
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76382470.html