Fixing Register Spilling in Your Program
Question
Clang's optimization report shows that my hot loop has register spills in the `regalloc` pass. Are there any general programming techniques for avoiding these spills, or a way to suggest to the compiler which variable it should prefer to spill? I searched online but couldn't find anything.
Answer 1
Score: 1
In C and C++, the `register` keyword was designed specifically for that:
> `register`: automatic storage duration. Also hints to the compiler to place the object in the processor's register.
That said, in C++ the keyword was deprecated in C++11 and has been unused (though still reserved) since C++17; it remains valid in C. The C and C++ documentation on storage-class specifiers covers the details.
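As a minimal sketch of what the hint looks like in C (the function name `sum_array` is made up for illustration), `register` can be applied to local variables. The one effect a C compiler has to enforce is that the address of such an object cannot be taken; whether the variable actually ends up in a register is still the allocator's decision.

```c
#include <stddef.h>

/* Hypothetical example: sums n integers.
   `register` is only a hint and the compiler is free to ignore it.
   In C, writing &acc or &i here would be rejected at compile time. */
long sum_array(const int *data, size_t n)
{
    register long acc = 0;
    for (register size_t i = 0; i < n; ++i) {
        acc += data[i];
    }
    return acc;
}
```

Compiling this with and without the keyword at `-O2` and comparing the generated assembly is a quick way to see whether a given compiler pays any attention to the hint.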
AFAIK, modern compilers tend to simply ignore this keyword, since they usually do a very good job of register allocation on their own. Even when they fail, register spilling is not so critical on modern mainstream x86-64 processors, which have 16 general-purpose registers and generally at least 2 load units (recent ones also tend to have two store units). That said, it can still be an issue on embedded processors and in some very critical loops.
One solution is simply to avoid using many variables in critical loops and to reduce their scope as much as possible (though optimizing compilers should be able to reorder instructions very well nowadays). For this reason, it can sometimes be beneficial to split a loop into two or more loops so as to reduce register pressure. This method is especially useful on modern processors, where basic loops have a relatively small overhead thanks to instruction-level parallelism and out-of-order execution.
When a register tiling optimization is performed, the tile size must be carefully tuned for the target architecture to avoid register spilling, which can completely defeat the optimization.
In some desperate situations, or in very critical loops, a solution is simply to write the code directly in assembly. AFAIK, some projects like OpenH264 (fast video decoding) and GotoBLAS (fast basic linear algebra) do that.
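To make the loop-splitting idea concrete, here is a hedged sketch in C. The function names and the computation are invented for illustration. In the fused version, both sets of coefficients are live across the whole loop body; in the split version, each loop needs only one set. This toy kernel is too small to spill on x86-64, so it only shows the shape of the transformation. Whether splitting helps a real loop depends on the target and the compiler, so it is worth re-checking the optimization report afterwards.

```c
#include <stddef.h>

/* Hypothetical kernel: two independent outputs computed in one loop.
   All six coefficients, three pointers and the index are live at once.
   `restrict` states the assumption that the arrays do not overlap. */
void fused_kernel(const float *restrict x,
                  float *restrict out_a, float *restrict out_b, size_t n,
                  float a0, float a1, float a2,
                  float b0, float b1, float b2)
{
    for (size_t i = 0; i < n; ++i) {
        float v = x[i];
        out_a[i] = (a2 * v + a1) * v + a0;
        out_b[i] = (b2 * v + b1) * v + b0;
    }
}

/* Same results after splitting the loop: each loop keeps only one set of
   coefficients and one output pointer live, at the cost of reading x twice. */
void split_kernel(const float *restrict x,
                  float *restrict out_a, float *restrict out_b, size_t n,
                  float a0, float a1, float a2,
                  float b0, float b1, float b2)
{
    for (size_t i = 0; i < n; ++i) {
        float v = x[i];
        out_a[i] = (a2 * v + a1) * v + a0;
    }
    for (size_t i = 0; i < n; ++i) {
        float v = x[i];
        out_b[i] = (b2 * v + b1) * v + b0;
    }
}
```

The trade-off is extra loop overhead and a second pass over `x`, which is usually cheap when the data stays in cache but should be measured rather than assumed.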