Does x86 have an equivalent of Arm FCVTNS (scalar)?

Question

Arm has the FCVTNS (scalar) instruction, which is documented as (emphasis added):
> Floating-point Convert to Signed integer, **rounding to nearest with ties to even** (scalar).

A simple question: Does x86 have an equivalent of Arm FCVTNS (scalar)?

I've already gone quickly through the list of x86 instructions, but couldn't find what I'm looking for. There is the usual CVTTSS2SI, but it rounds toward zero (when a conversion is inexact), which is not what I want.

Answer 1

Score: 4

The non-truncating cvtss2si uses the current rounding mode, which is usually nearest-even but can be changed (in the MXCSR, which is what fenv.h affects on x86-64). Same for packed conversions like cvtps2dq xmm,xmm.
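As a rough illustration of the difference, here is a minimal C sketch using the SSE intrinsics that map to these two instructions. The wrapper names `convert_current_mode` and `convert_truncate` are made up for this example, and the expected results assume MXCSR still holds its default round-to-nearest-even setting.

```c
#include <xmmintrin.h>  /* SSE: _mm_set_ss, _mm_cvtss_si32, _mm_cvttss_si32 */

/* Hypothetical wrapper: cvtss2si, which rounds according to the current
 * MXCSR rounding mode (nearest-even unless something changed it). */
static int convert_current_mode(float x)
{
    return _mm_cvtss_si32(_mm_set_ss(x));
}

/* Hypothetical wrapper: cvttss2si, which always truncates toward zero,
 * matching the C cast (int)x for in-range values. */
static int convert_truncate(float x)
{
    return _mm_cvttss_si32(_mm_set_ss(x));
}
```

With the default rounding mode, `convert_current_mode(2.5f)` should give 2 and `convert_current_mode(3.5f)` should give 4 (ties go to the even neighbor), while `convert_truncate(2.7f)` gives 2.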

The truncating versions exist because C specifies that (int)my_float uses truncation. With legacy x87 (before SSE3 fisttp), compilers had to change the x87 rounding mode to truncation and back around every conversion, which sucked a lot.


If you need round-to-nearest-even in code that will run with a different rounding mode in MXCSR, you could use AVX-512 vcvtss2si eax, xmm0, {rn-sae} (NASM syntax) to override the rounding mode for that instruction.

Without AVX-512, you could have different rounding modes set in MXCSR and the x87 control word, if you need different rounding in the same loop. (A movss store / fld dword reload / fistp conversion to integer with the current x87 rounding mode is probably more efficient than running ldmxcsr twice per iteration, from two saved values generated with stmxcsr, unless you unroll a lot.)

(ldmxcsr is 4 uops on Skylake / Alder Lake, but only 1 on Zen. Its throughput is a bit lower than you'd expect from the uop count and execution ports, though. See https://uops.info/)
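For reference, the "ldmxcsr twice" baseline mentioned above might look roughly like the following C sketch. The helper name `convert_nearest_even` is hypothetical. The `_MM_SET_ROUNDING_MODE` macro re-reads MXCSR each time, so a tuned loop would instead precompute the two MXCSR values once and only reload them. The volatile temporaries are an assumption about how to discourage the compiler from moving the conversion across the MXCSR writes; compilers do not model MXCSR as a dependency, so check the generated asm.

```c
#include <xmmintrin.h>  /* _MM_GET/SET_ROUNDING_MODE, _mm_cvtss_si32 */

/* Hypothetical helper: convert with round-to-nearest-even even when the
 * caller runs with a different rounding mode in MXCSR. */
static int convert_nearest_even(float x)
{
    /* Volatile accesses are intended to keep the conversion between
     * the two MXCSR updates (an assumption, not a guarantee). */
    volatile float in = x;
    volatile int out;

    unsigned int saved = _MM_GET_ROUNDING_MODE();  /* stmxcsr, keep RC bits */
    _MM_SET_ROUNDING_MODE(_MM_ROUND_NEAREST);      /* first ldmxcsr */
    out = _mm_cvtss_si32(_mm_set_ss(in));          /* cvtss2si */
    _MM_SET_ROUNDING_MODE(saved);                  /* second ldmxcsr */
    return out;
}
```

Calling it while MXCSR is set to truncation, for example after `_MM_SET_ROUNDING_MODE(_MM_ROUND_TOWARD_ZERO)`, should still return 3 for 2.7f and leave the caller's rounding mode unchanged.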

huangapple
  • Posted on August 8, 2023, 22:02:03
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76860305.html