问题

Arm有FCVTNS（标量）指令，它是（重点在于）：

浮点数转换为有符号整数，四舍五入到最近的偶数（标量）。

一个简单的问题：x86有与Arm的FCVTNS（标量）等效的指令吗？

我已经快速浏览了x86指令列表，但没有找到我要找的内容。有一个常见的CVTTSS2SI，它向零舍入（当转换不精确时），这不是我要找的。

英文:

Arm has FCVTNS (scalar) instruction, which is (emphasis added):
> Floating-point Convert to Signed integer, rounding to nearest with ties to even (scalar).

A simple question: Does x86 have equivalent of Arm FCVTNS (scalar)?

I've already quickly went through the list of x86 instructions, but couldn't find what I'm looking for. There is a usual CVTTSS2SI, which rounds toward zero (when a conversion is inexact), which is not what I'm looking for.

答案1

得分: 4

cvtss2si是非截断转换指令，它使用当前的舍入模式，通常是最近偶数舍入，但可以通过MXCSR寄存器进行更改（fenv.h在x86-64上影响的就是MXCSR）。对于像cvtps2dq xmm,xmm这样的打包转换指令也是一样的。

存在截断版本是因为C语言规定(int)my_float使用截断。在传统的x87（在SSE3 fisttp之前），编译器必须在每次转换前后更改x87的舍入模式，这非常麻烦。

如果你需要在具有不同舍入模式的MXCSR中运行的代码中使用最近偶数舍入，你可以使用AVX-512的vcvtss2si eax, xmm0, {rn-sae}（NASM语法）来覆盖该指令的舍入模式。

如果没有AVX-512，如果你需要在同一个循环中使用不同的舍入模式，可以在MXCSR和x87控制字中设置不同的舍入模式。使用movss存储/fld dword重新加载/fistp转换为整数与当前x87的舍入模式相比，可能更高效，而不需要大量展开循环。（使用stmxcsr生成的两个保存值）

（ldmxcsr在Skylake / Alder Lake上是4个微操作，但在Zen上只有1个。然而，它的吞吐量比微操作计数和执行端口预期的要低一些。参见https://uops.info/）

英文:

The non-truncating cvtss2si uses the current rounding mode, which is usually nearest-even but can be changed (in the MXCSR, which is what fenv.h affects on x86-64). Same for packed conversions like cvtps2dq xmm,xmm.

The truncating versions exist because C specifies that (int)my_float uses truncation. With legacy x87 (before SSE3 fisttp), compilers had to change the x87 rounding mode to truncation and back around every conversion, which sucked a lot.

If you need round-to-nearest-even in code that will run with a different rounding mode in MXCSR, you could use AVX-512 vcvtss2si eax, xmm0, {rn-sae} (NASM syntax) to override the rounding mode for that instruction.

Without AVX-512, you could have different rounding modes set in MXCSR and the x87 control word, if you need different rounding in the same loop. (movss store / fld dword reload / fistp conversion to integer with the current x87 rounding mode is probably more efficient than ldmxcsr twice per iteration without a lot of unrolling. (From two saved values generated with stmxcsr.)

(ldmxcsr is 4 uops on Skylake / Alder Lake, but only 1 on Zen. Its throughput is a bit lower than you'd expect from the uop count and execution ports, though. See https://uops.info/)

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

x86是否有与Arm FCVTNS（标量）等效的功能？

问题

答案1

从Golang类型中提取未导出字段的正确方法是什么？

Azure Data Studio – 我的结果被四舍五入 – SQL

如果浮点数具有6位精度，为什么我们可以使用printf显示超过6位的浮点数？

如何在Golang中将字符串转换为Primitive.ObjectID？

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论