June 15, 2023 13:40:23

Can clang be convinced to optimize this almost-leaf function?

Question

Consider the following almost leaf function:

// Defined in another translation unit; declared here so the snippet compiles.
int x_was_negative();

int almost_leaf(int* x) {
    if (__builtin_expect(*x >= 0, true)) {
        return *x;
    }
    return x_was_negative() + 1;
}

It is almost leaf in the sense that it is not strictly a leaf function: it may call x_was_negative if *x is negative. However, the __builtin_expect hints to the compiler that the return *x branch is usually taken, and that branch involves no calls.
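
As a minimal sketch of what the hint does and does not do: __builtin_expect(expr, expected) evaluates to expr itself, so it can only influence how GCC and Clang arrange the branches, never the result. The body of x_was_negative below is a hypothetical stand-in (the question never shows it), and almost_leaf_unhinted is an illustrative name that does not appear in the original.

```cpp
// Hypothetical stand-in: the question never shows the body of x_was_negative.
int x_was_negative() { return 41; }

// The function from the question: the hint marks `*x >= 0` as the common case.
int almost_leaf(int* x) {
    if (__builtin_expect(*x >= 0, true)) {
        return *x;
    }
    return x_was_negative() + 1;
}

// Illustrative unhinted twin: same observable behavior; only the
// compiler's idea of which branch is hot may differ.
int almost_leaf_unhinted(int* x) {
    if (*x >= 0) {
        return *x;
    }
    return x_was_negative() + 1;
}
```

Both functions return the same value for every input, so the hint is purely a code-layout matter.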

clang-16 compiles it like this:

almost_leaf(int*):                      # @almost_leaf(int*)
        push    rax
        mov     eax, dword ptr [rdi]
        test    eax, eax
        js      .LBB0_1
        pop     rcx
        ret
.LBB0_1:
        call    x_was_negative()
        inc     eax
        pop     rcx
        ret

The push and pop on the fast (expected) path (the part up to the first ret) are totally unnecessary here: the stack is unused, and no call is made on that path, so the ABI requirement to align the stack before a call does not apply.

It would be better to just align the stack on the slow path where x_was_negative() is called, like gcc does:

almost_leaf(int*):
        mov     eax, DWORD PTR [rdi]
        test    eax, eax
        js      .L8
        ret
.L8:
        sub     rsp, 8
        call    x_was_negative()
        add     rsp, 8
        inc     eax
        ret
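
The sub rsp, 8 / add rsp, 8 pair on GCC's slow path follows from the x86-64 System V ABI, which requires rsp to be 16-byte aligned at the point of a call. Because the call that entered almost_leaf pushed an 8-byte return address, rsp is off by 8 on entry, and moving it by 8 more bytes restores alignment. The arithmetic can be sketched as follows; the starting address and all helper names are made up for illustration.

```cpp
#include <cstdint>

// Model of rsp bookkeeping on x86-64 System V; the values are illustrative.
constexpr std::uint64_t kCallAlignment = 16;  // required alignment at a call
constexpr std::uint64_t kSlotSize = 8;        // one push or one return address

// rsp as seen on entry to a callee, given the caller's aligned rsp.
constexpr std::uint64_t rsp_on_entry(std::uint64_t aligned_caller_rsp) {
    return aligned_caller_rsp - kSlotSize;  // `call` pushes the return address
}

// True when a `call` may be issued with this rsp.
constexpr bool aligned_for_call(std::uint64_t rsp) {
    return rsp % kCallAlignment == 0;
}

// Hypothetical caller rsp, chosen to be 16-byte aligned.
constexpr std::uint64_t kCallerRsp = 0x7ffc00001000ULL;
constexpr std::uint64_t kEntryRsp = rsp_on_entry(kCallerRsp);

static_assert(!aligned_for_call(kEntryRsp), "off by 8 on entry");
static_assert(aligned_for_call(kEntryRsp - kSlotSize),
              "`sub rsp, 8` or `push rax` restores alignment");
```

Clang's push rax makes the same 8-byte adjustment as GCC's sub rsp, 8; the two compilers differ only in where they place it.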

Can clang be convinced to compile this almost leaf function efficiently?

Note that clang can compile some almost-leaf functions without aligning the stack on the fast path: e.g., if x is an int instead of an int* it works, and if x_was_negative can be compiled as a tail call it also works (but trivially, since no alignment is needed in that case at all).
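
The two variants mentioned in the note might look like the sketch below. The function names and the stub body of x_was_negative are assumptions made so the block is self-contained. To inspect the generated assembly, the stub would need to be replaced by a bare declaration, since otherwise the compiler can inline the call. The sketch does not itself verify the code generation reported in the note.

```cpp
// Hypothetical stand-in for the external function used in the question.
int x_was_negative() { return 41; }

// Variant 1: x is passed by value instead of through a pointer.
int almost_leaf_by_value(int x) {
    if (__builtin_expect(x >= 0, true)) {
        return x;
    }
    return x_was_negative() + 1;
}

// Variant 2: the call result is returned unchanged, so the call sits in
// tail position and may be emitted as a jump instead of a call.
int almost_leaf_tail(int* x) {
    if (__builtin_expect(*x >= 0, true)) {
        return *x;
    }
    return x_was_negative();
}
```

Variant 2 changes the result (it drops the + 1), which is why it is only a trivial case: once the call is a jump, the function never issues a call of its own, so there is nothing to align.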

Answer 1

Score: 9

This optimization is now performed by the development version of Clang (trunk), so it should be available in upcoming releases, starting with Clang 17.0. This can be seen on Godbolt. Here is the newly generated code:

almost_leaf(int*):
        mov     eax, dword ptr [rdi]
        test    eax, eax
        js      .LBB0_1
        ret
.LBB0_1:
        push    rax
        call    x_was_negative()@PLT
        inc     eax
        add     rsp, 8
        ret

As we can see, the fast path is now the same as in GCC's output: a load, a test, a branch, and a ret, with no stack adjustment.

Note that the initial unoptimized LLVM IR is about the same for the two versions of Clang; the difference comes from the last low-level optimization steps. More specifically, stepping through the optimization passes shows that Clang 16 misses this optimization during the "Prologue/Epilogue Insertion & Frame Finalization" pass (a.k.a. "prologepilog"), whereas the development version of Clang performs it even at -O1. The output of this pass can also be inspected on Godbolt.
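
For Clang releases that predate this change, one possible workaround follows from the question's own note about tail calls: move the slow path into a separate function so that the call in almost_leaf ends up in tail position. This is only a sketch under assumptions. The name almost_leaf_slow and the stub body of x_was_negative are hypothetical, and whether a given compiler version actually emits a jump here has not been verified.

```cpp
// Hypothetical stand-in for the external function used in the question.
int x_was_negative() { return 41; }

// Slow path outlined into its own function. `noinline` keeps it out of the
// caller; `cold` tells the compiler it is rarely executed.
__attribute__((noinline, cold))
static int almost_leaf_slow() {
    return x_was_negative() + 1;
}

int almost_leaf(int* x) {
    if (__builtin_expect(*x >= 0, true)) {
        return *x;
    }
    return almost_leaf_slow();  // tail position: eligible to become a jump
}
```

The stack adjustment then lives in almost_leaf_slow, which is only entered on the rare path, at the cost of one extra jump when *x is negative.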
