Question

I can see from Clang's optimization report that my hot loop has register spills in the regalloc pass. Are there any general programming techniques for avoiding these spills, or a way to suggest to the compiler which variable it should prefer to spill? I searched online but couldn't find anything.

Answer 1

Score: 1

In C and C++, the register keyword is specifically designed for that:
> register: automatic storage duration. Also hints to the compiler to place the object in the processor's register.

That being said, the keyword was deprecated in C++11 and removed in C++17 (it is still valid in C). See the C and C++ documentation on storage-class specifiers for details.

AFAIK, modern compilers tend to simply ignore this keyword, since they generally do a very good job of register allocation on their own. Even when they fail, a register spill is not that critical on modern mainstream x64 processors, which have 16 general-purpose registers and generally at least 2 load units (recent ones also tend to have two store units). That being said, spilling can still be an issue on embedded processors and in some very critical loops.
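For reference, here is a minimal sketch of what the keyword looks like in C. The function name `sum_array` is made up for illustration. The hint itself is likely to be ignored by an optimizing compiler; the one effect C does enforce is that the address of a `register` object cannot be taken.

```c
#include <stddef.h>

/* Hypothetical example: sums n integers. `register` is only a hint,
   and an optimizing compiler is free to ignore it. */
long sum_array(const int *data, size_t n)
{
    register long acc = 0;   /* hint: keep the accumulator in a register */

    for (register size_t i = 0; i < n; ++i) {
        acc += data[i];
    }

    /* long *p = &acc;   would be rejected in C: the address of a
                         register object cannot be taken */
    return acc;
}
```

In practice, whether `acc` ends up in a register is decided by the compiler's allocator, with or without the keyword.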

One solution is simply to avoid using many variables in critical loops and to reduce their scope as much as possible (though optimizing compilers should be able to reorder instructions very well nowadays). For this reason, it can sometimes be beneficial to split a loop into two or more loops so as to reduce register pressure. This method is especially useful on modern processors, where basic loops have a relatively small overhead thanks to instruction-level parallelism and out-of-order execution. When a register tiling optimization is performed, the tile size must be carefully tuned for the target architecture to avoid register spilling, which can completely defeat the optimization. In some desperate situations, or in very critical loops, a solution is simply to write the code directly in assembly. AFAIK, some projects like Openh264 (fast video decoding) and GOTO BLAS (fast basic linear algebra) do that.
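As a rough sketch of two of the ideas above (loop splitting and register tiling), consider the toy code below. All function names are hypothetical, the loops are far too small to spill on their own, and the tiled kernel assumes square row-major matrices with an even dimension. Whether either transformation helps in a real loop has to be checked against the compiler's report and a benchmark.

```c
#include <stddef.h>

/* Fused: both cubic polynomials are evaluated in one loop body, so all
   eight coefficients plus the loop state are live at the same time. */
void eval_fused(const float *a, const float *b, float *out_a, float *out_b,
                size_t n, const float c[4], const float k[4])
{
    for (size_t i = 0; i < n; ++i) {
        out_a[i] = ((c[0] * a[i] + c[1]) * a[i] + c[2]) * a[i] + c[3];
        out_b[i] = ((k[0] * b[i] + k[1]) * b[i] + k[2]) * b[i] + k[3];
    }
}

/* Split: each loop only needs four coefficients live, at the cost of
   running the loop control twice. */
void eval_split(const float *a, const float *b, float *out_a, float *out_b,
                size_t n, const float c[4], const float k[4])
{
    for (size_t i = 0; i < n; ++i) {
        out_a[i] = ((c[0] * a[i] + c[1]) * a[i] + c[2]) * a[i] + c[3];
    }
    for (size_t i = 0; i < n; ++i) {
        out_b[i] = ((k[0] * b[i] + k[1]) * b[i] + k[2]) * b[i] + k[3];
    }
}

/* Register tiling with a 2x2 tile: four accumulators live in local
   variables across the k loop. A larger tile reuses loaded values more,
   but needs more registers; too large a tile forces spills.
   Assumes n is even and the matrices are n x n, row-major. */
void matmul_tiled_2x2(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i += 2) {
        for (size_t j = 0; j < n; j += 2) {
            float c00 = 0.0f, c01 = 0.0f, c10 = 0.0f, c11 = 0.0f;
            for (size_t k = 0; k < n; ++k) {
                float a0 = a[i * n + k];
                float a1 = a[(i + 1) * n + k];
                float b0 = b[k * n + j];
                float b1 = b[k * n + j + 1];
                c00 += a0 * b0;
                c01 += a0 * b1;
                c10 += a1 * b0;
                c11 += a1 * b1;
            }
            c[i * n + j] = c00;
            c[i * n + j + 1] = c01;
            c[(i + 1) * n + j] = c10;
            c[(i + 1) * n + j + 1] = c11;
        }
    }
}
```

The tile size here (2x2) is an arbitrary choice for readability; the number of accumulators that fit without spilling depends on the target's register file and on whether the compiler vectorizes the kernel.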
