Does x86 have an equivalent of Arm FCVTNS (scalar)?
Question
Arm has the `FCVTNS` (scalar) instruction, which is (emphasis added):
> Floating-point Convert to Signed integer, rounding to nearest with ties to even (scalar).
A simple question: does x86 have an equivalent of Arm's `FCVTNS` (scalar)?
I've already quickly gone through the list of x86 instructions, but couldn't find what I'm looking for. There is the usual `CVTTSS2SI`, which rounds toward zero (when a conversion is inexact), which is not what I'm looking for.
Answer 1
Score: 4
The non-truncating `cvtss2si` uses the current rounding mode, which is usually nearest-even but can be changed (in the MXCSR, which is what `fenv.h` affects on x86-64). The same goes for packed conversions like `cvtps2dq xmm,xmm`.
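As a rough illustration of the difference between the two scalar conversions, here is a minimal C sketch using the SSE intrinsics `_mm_cvtss_si32` (which compilers normally emit as `cvtss2si`) and `_mm_cvttss_si32` (`cvttss2si`). The helper names `convert_current_mode` and `convert_truncate` are made up for this example, and the comments assume MXCSR is still in its default round-to-nearest-even mode.

```c
#include <xmmintrin.h>  /* SSE: _mm_set_ss, _mm_cvtss_si32, _mm_cvttss_si32 */

/* Hypothetical helper: converts using whatever rounding mode MXCSR
 * currently holds (normally compiled to cvtss2si). With the default
 * mode, ties go to the nearest even integer. */
static int convert_current_mode(float x)
{
    return _mm_cvtss_si32(_mm_set_ss(x));
}

/* Hypothetical helper: always truncates toward zero (normally compiled
 * to cvttss2si), matching what a C cast (int)x does. */
static int convert_truncate(float x)
{
    return _mm_cvttss_si32(_mm_set_ss(x));
}
```

Assuming the default rounding mode, `convert_current_mode(2.5f)` gives 2 and `convert_current_mode(3.5f)` gives 4, while `convert_truncate(3.5f)` gives 3.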
The truncating versions exist because C specifies that `(int)my_float` uses truncation. With legacy x87 (before SSE3 `fisttp`), compilers had to change the x87 rounding mode to truncation and back around every conversion, which sucked a lot.
If you need round-to-nearest-even in code that will run with a different rounding mode in MXCSR, you could use AVX-512 `vcvtss2si eax, xmm0, {rn-sae}` (NASM syntax) to override the rounding mode for that instruction.
Without AVX-512, you could have different rounding modes set in MXCSR and the x87 control word, if you need different rounding in the same loop. (`movss` store / `fld dword` reload / `fistp` conversion to integer with the current x87 rounding mode is probably more efficient than `ldmxcsr` twice per iteration without a lot of unrolling, loading from two saved values generated with `stmxcsr`.)
(`ldmxcsr` is 4 uops on Skylake / Alder Lake, but only 1 on Zen. Its throughput is a bit lower than you'd expect from the uop count and execution ports, though. See https://uops.info/)
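For comparison, the straightforward "switch MXCSR, convert, switch back" approach mentioned above might look like the following C sketch. It uses the `_mm_getcsr` / `_mm_setcsr` intrinsics (which correspond to `stmxcsr` / `ldmxcsr`); the function name `convert_nearest_even` is invented for this example. This is only a sketch: it pays for two `ldmxcsr` per call, and it assumes the compiler does not move the conversion across the MXCSR writes.

```c
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr, _MM_ROUND_* constants */

/* Hypothetical helper: converts with round-to-nearest-even regardless of
 * the caller's MXCSR rounding mode, then restores the caller's MXCSR. */
static int convert_nearest_even(float x)
{
    unsigned int saved = _mm_getcsr();                /* stmxcsr */
    unsigned int nearest =
        (saved & ~(unsigned int)_MM_ROUND_MASK) | _MM_ROUND_NEAREST;

    _mm_setcsr(nearest);                              /* ldmxcsr */
    int result = _mm_cvtss_si32(_mm_set_ss(x));       /* cvtss2si */
    _mm_setcsr(saved);                                /* ldmxcsr */
    return result;
}
```

In a hot loop you would probably hoist the two MXCSR values out of the loop, or prefer the x87 or AVX-512 alternatives described in the answer.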