Why does C++ rounding behavior (for compile-time constants) change if math is moved to an inline function?

Question

Consider the following functions:

#include <limits>

static inline float Eps(const float x) {
  const float eps = std::numeric_limits<float>::epsilon();
  return (1.0f + eps) * x - x;
}

float Eps1() {
  return Eps(0xFFFFFFp-24f);
}

float Eps2() {
  const float eps = std::numeric_limits<float>::epsilon();
  const float x = 0xFFFFFFp-24f;
  return (1.0f + eps) * x - x;
}

At -O2 with -std=c++20, both of these functions compile down to a single movss followed by a ret using clang 16.0.0 targeting x86, and to a mov followed by a bx with gcc 11.2.1 targeting ARM. The assembly generated for ARM is consistent with a returned value of ~5.96e-8 from both functions, but the assembly generated for x86 is not: Eps1() (using the inline function) returns ~1.19e-7 while Eps2() returns ~5.96e-8. [Compiler Explorer / Godbolt]

.LCPI0_0:
  .long 0x33ffffff # float 1.19209282E-7
Eps1(): # @Eps1()
  movss xmm0, dword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero,zero,zero
  ret
.LCPI1_0:
  .long 0x33800000 # float 5.96046448E-8
Eps2(): # @Eps2()
  movss xmm0, dword ptr [rip + .LCPI1_0] # xmm0 = mem[0],zero,zero,zero
  ret

I can sort of understand the compiler choosing either option. With x = 0xFFFFFFp-24f (i.e. the next representable value below 1.0f), both compilers consistently round (1.0f + eps) * x to 1.0f, which means that (1.0f + eps) * x - x will give the smaller value. However, the machine precision (the spacing between adjacent floats) at 1.0f is twice that at 0xFFFFFFp-24f, so something like a multiply-add instruction that preserves extra precision would have an intermediate value of roughly 1.0 + 0.5 * eps, which will yield the larger value.
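The two candidate results can be worked out by hand. This is a sketch of the arithmetic, assuming IEEE 754 binary32 with round-to-nearest-even, eps = 2^-23, and x = 1 - 2^-24 (fl(.) denotes rounding to float, a notation introduced here for illustration):

```latex
\begin{align*}
(1 + \varepsilon)\,x &= (1 + 2^{-23})(1 - 2^{-24}) = 1 + 2^{-24} - 2^{-47} \\
\operatorname{fl}\bigl((1 + \varepsilon)\,x\bigr) &= 1
  \quad \text{(the exact product lies just below the midpoint } 1 + 2^{-24}
  \text{ between } 1 \text{ and } 1 + 2^{-23}\text{)} \\
\text{rounding after the multiply:}\quad 1 - x &= 2^{-24} \approx 5.96 \times 10^{-8} \\
\text{single rounding at the end:}\quad (1 + 2^{-24} - 2^{-47}) - (1 - 2^{-24}) &= 2^{-23} - 2^{-47} \approx 1.19 \times 10^{-7}
\end{align*}
```

Both final values are exactly representable as floats, and they match the two constants 0x33800000 and 0x33ffffff in the clang output above.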

The thing I don't understand is why the answer changes depending on whether the math is in an inline function or directly invoked. Is there somewhere in the standard that rationalizes this, is this undefined behavior, or is this a Clang bug?

Answer 1

Score: 8

With clang 16's default of -ffp-contract=on (like #pragma STDC FP_CONTRACT ON), ISO C++ allows the compiler to keep infinite precision for FP temporaries or not, at its choice, including on a case-by-case basis. Notably, this permits contracting a*b+c into fma(a,b,c), including when doing constant propagation at compile time. ISO C++ allows either default for the pragma.

If you use -ffp-contract=off (or, surprisingly, -ffp-contract=fast), both functions return 5.96046448E-8. I haven't checked, but the two different results probably correspond to fma((1.0f + eps), x, -x) vs. rounding after the *x step.
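One way to check that guess at run time is sketched below. The names EpsSeparate and EpsFused are hypothetical helpers introduced here, not part of the question's code. The sketch assumes IEEE 754 floats with round-to-nearest and a correctly rounded std::fma; the volatile store is only there to force the product to be rounded to float whatever the contraction setting is.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: rounds the product to float before subtracting.
// The volatile store forces the intermediate to be materialized as a float,
// so the compiler cannot contract the multiply and subtract into one FMA.
float EpsSeparate(float x) {
  const float eps = std::numeric_limits<float>::epsilon();
  volatile float prod = (1.0f + eps) * x;
  return prod - x;
}

// Hypothetical helper: one rounding at the very end, which is what a
// contracted a*b+c (an FMA) computes.
float EpsFused(float x) {
  const float eps = std::numeric_limits<float>::epsilon();
  return std::fma(1.0f + eps, x, -x);
}
```

Called with 0xFFFFFFp-24f, EpsSeparate should give 2^-24 (the 5.96046448E-8 constant) and EpsFused should give 2^-23 - 2^-47 (the 1.19209282E-7 constant), which is consistent with the FMA explanation.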

It's a strange quirk that compile-time evaluation rounds differently depending on whether or not the math is in an inline function.

With a runtime-variable float x function arg, it makes the same asm for both Eps1 and Eps2, contracting both to vfmsub132ss, when you compile with -march=haswell or -march=x86-64-v3 to make FMA available.

If you want to write source that definitely does round between steps (except with fp-contract=fast), don't make multiple things part of the same expression. https://godbolt.org/z/15T5jcdfv shows an Eps3 using separate statements giving a return value that matches Eps2 but not Eps1 (with -ffp-contract=on).
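The code behind that link is not reproduced here, so the following is only a guess at what such an Eps3 might look like: the same math as Eps(), with the multiply and the subtract split into separate statements.

```cpp
#include <limits>

// Hypothetical reconstruction of an "Eps3"; the body is an assumption, not
// copied from the linked Godbolt example.
float Eps3(float x) {
  const float eps = std::numeric_limits<float>::epsilon();
  // Separate statement: with -ffp-contract=on or =off this product is
  // rounded to float before the subtraction below.
  const float prod = (1.0f + eps) * x;
  // With -ffp-contract=fast the compiler may still fuse across statements.
  return prod - x;
}
```

With x = 0xFFFFFFp-24f this would be expected to return the smaller value (about 5.96e-8) whenever the product really is rounded on its own, and the larger value (about 1.19e-7) only if the two statements get fused anyway.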


Theory about the reason for a difference with an inline function

Probably clang / LLVM's internals contracted the inline function into an FMA before inlining it, so constant propagation happened through an FMA for Eps1.

But in Eps2, the constant was available right away, so constant propagation could be done first. Plugging in numbers one operation at a time is cheaper for the compiler than optimizing the abstract operations, so it does exactly that, without looking for the opportunity to use a single FMA.

https://godbolt.org/z/7c9EYK8fb shows versions with float x function args, with contract on/off, confirming that Eps1 and Eps2 do compile to the same asm using FMA when FP contraction is enabled. (And Eps3 doesn't contract, as expected. Unless you use -ffp-contract=fast.)

As for why -ffp-contract=fast gives the same constant-propagation result as -ffp-contract=off, perhaps when it doesn't have to keep track of whether operations were part of the same statement in the source, it can defer the optimization pass that looks for contraction. Deferring it until after inlining and constant propagation would explain the fact that constant propagation was done in a way that gave the same result as with contraction disabled.

huangapple
  • Posted on May 10, 2023 at 11:49:51
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76214764.html