Can clang be convinced to optimize this almost-leaf function?

Question

Consider the following almost leaf function:

int x_was_negative();  // external function; its definition is not visible here

int almost_leaf(int* x) {
    if (__builtin_expect(*x >= 0, true)) {
        return *x;
    }
    return x_was_negative() + 1;
}

It is almost leaf in the sense that it is not strictly a leaf function (it may call x_was_negative if *x is negative), but the __builtin_expect hints to the compiler that the return *x branch is usually taken, which involves no calls.

clang-16 compiles it like this:

almost_leaf(int*):                      # @almost_leaf(int*)
        push    rax
        mov     eax, dword ptr [rdi]
        test    eax, eax
        js      .LBB0_1
        pop     rcx
        ret
.LBB0_1:
        call    x_was_negative()
        inc     eax
        pop     rcx
        ret

The push and pop on the fast (expected) path (the part up to the first ret) are totally unnecessary here: the stack is unused, and that path makes no calls that would require the ABI-mandated stack alignment.
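The alignment argument can be made concrete with a little arithmetic. The following is only a sketch, assuming the x86-64 System V ABI (rsp must be a multiple of 16 immediately before a call instruction); the helper names and the sample stack address are hypothetical and exist purely for illustration.

```cpp
#include <cstdint>

// Assumption: x86-64 System V ABI, where rsp must be a multiple of 16
// immediately before every call instruction.
constexpr std::uint64_t kCallAlignment = 16;

constexpr bool aligned_for_call(std::uint64_t rsp) {
    return rsp % kCallAlignment == 0;
}

// The call instruction pushes an 8-byte return address, so the callee
// starts with rsp 8 bytes below the caller's aligned value.
constexpr std::uint64_t rsp_at_entry(std::uint64_t rsp_before_call) {
    return rsp_before_call - 8;
}

// Arbitrary 16-byte-aligned value standing in for the caller's rsp.
constexpr std::uint64_t kCallerRsp = 0x7ffc0000;

static_assert(aligned_for_call(kCallerRsp),
              "caller is aligned before the call");
// On entry the stack is off by 8, which is harmless if no call follows.
static_assert(!aligned_for_call(rsp_at_entry(kCallerRsp)),
              "callee entry is misaligned by 8");
// push rax (or sub rsp, 8) moves rsp down 8 more bytes and restores alignment.
static_assert(aligned_for_call(rsp_at_entry(kCallerRsp) - 8),
              "one 8-byte adjustment realigns the stack");
```

Only a path that actually reaches a call needs the 8-byte adjustment, which is why it can be confined to the slow path.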

It would be better to just align the stack on the slow path where x_was_negative() is called, like gcc does:

almost_leaf(int*):
        mov     eax, DWORD PTR [rdi]
        test    eax, eax
        js      .L8
        ret
.L8:
        sub     rsp, 8
        call    x_was_negative()
        add     rsp, 8
        inc     eax
        ret

Can clang be convinced to compile this almost leaf function efficiently?

Note that clang can compile almost leaf functions without aligning the stack: e.g., if x is an int instead of an int* it works, and if x_was_negative can be compiled as a tail call it also works (but trivially, since no alignment is needed in that case at all).
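Building on the tail-call observation in the note, one possible workaround is to move the call and the + 1 into a separate out-of-line helper, so that the only call left in almost_leaf sits in tail position. This is a sketch under assumptions, not something verified against clang-16 output: almost_leaf_slow is a hypothetical helper that does not appear in the question, and x_was_negative is given a stub body here only so the example links on its own.

```cpp
// Hypothetical stub so the sketch links on its own; in the question,
// x_was_negative() is an external function defined elsewhere.
[[gnu::noinline]] int x_was_negative() { return 41; }

// Hypothetical helper (not in the original): holds the call and the +1,
// and is kept out of line so almost_leaf can reach it with a tail call.
[[gnu::noinline, gnu::cold]] static int almost_leaf_slow() {
    return x_was_negative() + 1;
}

int almost_leaf(int* x) {
    if (__builtin_expect(*x >= 0, true)) {
        return *x;              // fast path: no call, so no stack adjustment is required
    }
    return almost_leaf_slow();  // call in tail position; a compiler may emit a plain jmp
}
```

The trade-off is an extra jump on the slow path and a helper function that the compiler must be told not to inline, so it is worth checking the generated assembly before relying on it.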

Answer 1

Score: 9

This optimization is now performed by the development version of Clang (trunk), so it should be available in the next release (certainly Clang 17.0). This can be seen on Godbolt. Here is the newly generated code:

almost_leaf(int*):
        mov     eax, dword ptr [rdi]
        test    eax, eax
        js      .LBB0_1
        ret
.LBB0_1:
        push    rax
        call    x_was_negative()@PLT
        inc     eax
        add     rsp, 8
        ret

As we can see, the fast path now matches GCC's: it contains no stack adjustment.

Note that the initial unoptimized LLVM IR is about the same for the two versions of Clang, but the last low-level optimization steps produce slightly different results for this code. More specifically, a deep analysis of the optimization steps performed shows that Clang 16 missed an optimization during the "Prologue/Epilogue Insertion & Frame Finalization" step (a.k.a. "prologepilog"). The development version of Clang performs this optimization even at -O1. This optimization step can be seen on Godbolt.

huangapple
  • Posted on June 15, 2023, 13:40:23
  • Please keep this link when reposting: https://go.coder-hub.com/76479415.html