Can clang be convinced to optimize this almost-leaf function
Question
Consider the following almost-leaf function:

    int x_was_negative();  // assumed declaration; defined in another translation unit

    int almost_leaf(int* x) {
        if (__builtin_expect(*x >= 0, true)) {
            return *x;
        }
        return x_was_negative() + 1;
    }

It is almost-leaf in the sense that it is not strictly a leaf function (it may call x_was_negative if *x is negative), but the __builtin_expect hints to the compiler that the return *x branch, which involves no calls, is usually taken.
clang-16 compiles it like this:

    almost_leaf(int*):                      # @almost_leaf(int*)
            push    rax
            mov     eax, dword ptr [rdi]
            test    eax, eax
            js      .LBB0_1
            pop     rcx
            ret
    .LBB0_1:
            call    x_was_negative()
            inc     eax
            pop     rcx
            ret

The push and pop on the fast (expected) path (the part up to the first ret) are unnecessary: the fast path does not use the stack and makes no calls, so nothing there needs the 16-byte stack alignment that the ABI requires at a call.
It would be better to align the stack only on the slow path, where x_was_negative() is called, as gcc does:

    almost_leaf(int*):
            mov     eax, DWORD PTR [rdi]
            test    eax, eax
            js      .L8
            ret
    .L8:
            sub     rsp, 8
            call    x_was_negative()
            add     rsp, 8
            inc     eax
            ret

Can clang be convinced to compile this almost-leaf function efficiently?
<sup>Note that clang can compile some almost-leaf functions without aligning the stack on the fast path: e.g., if x is an int instead of an int* it works, and if x_was_negative can be compiled as a tail call it also works (but trivially, since no alignment is needed in that case at all).</sup>
Answer 1
Score: 9
This optimization is now performed by the development version of Clang (trunk), so it should be available in the next release (certainly Clang 17.0). This can be seen on Godbolt. Here is the newly generated code:

    almost_leaf(int*):
            mov     eax, dword ptr [rdi]
            test    eax, eax
            js      .LBB0_1
            ret
    .LBB0_1:
            push    rax
            call    x_was_negative()@PLT
            inc     eax
            add     rsp, 8
            ret

As we can see, the fast path is now the same as in the gcc output: there is no push or pop before the first ret.
Note that the initial unoptimized LLVM-IR is about the same for the two versions of Clang, but the last low-level optimization steps produce slightly different results for this code. More specifically, a deep analysis of the optimization steps performed shows that Clang 16 missed an optimization during the "Prologue/Epilogue Insertion & Frame Finalization" step (a.k.a. "prologepilog"). The development version of Clang performs this optimization even at -O1. This optimization step can be seen on Godbolt.
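
For compilers that lack this fix, the question's own note suggests a possible workaround: if the slow path is reached through a tail call, no stack adjustment is needed in almost_leaf at all. The sketch below moves the slow path into a separate out-of-line helper so that the call sits in tail position. The helper name almost_leaf_slow and the body of x_was_negative are hypothetical stand-ins added only to make the example self-contained, and the resulting assembly has not been checked against clang-16 here, so treat it as something to try rather than a guaranteed fix.

```cpp
// Hypothetical stand-in for the external function from the question.
// In the real setting this would only be declared and defined elsewhere.
[[gnu::noinline]] int x_was_negative() { return 41; }

// Hypothetical helper: the slow path is kept out of line (noinline) and
// marked cold, so almost_leaf can hand control over to it with a tail call.
[[gnu::noinline, gnu::cold]] static int almost_leaf_slow() {
    return x_was_negative() + 1;
}

int almost_leaf(int* x) {
    if (__builtin_expect(*x >= 0, true)) {
        return *x;              // fast path: no calls, no stack use
    }
    return almost_leaf_slow();  // tail position: nothing runs after the call
}
```

The stack adjustment does not disappear; it moves into almost_leaf_slow, which only runs on the unexpected path. The cost is one extra jump and an extra function on that path.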