Why does gcc use variables on the stack when registers are available?

Question

Writing some very basic C code for a Cortex-M0 device, I'm surprised to see the disassembly:

void delay(void) {
    for (int x=0;x<0xffff;x++) ;
}

This becomes:

for (int x=0;x<0xffff;x++) ;
    2300        movs r3, #0
    9301        str r3, [sp, #4]
    E002        b 0x0800026E
    9B01        ldr r3, [sp, #4]      //0x08000268
    3301        adds r3, #1
    9301        str r3, [sp, #4]
    9B01        ldr r3, [sp, #4]      //0x0800026E
    4A03        ldr r2, =0x0000FFFE
    4293        cmp r3, r2
    DDF8        ble 0x08000268
--- main.c -- 8 --------------------------------------------
}
    46C0        nop
    46C0        nop
    B002        add sp, sp, #8
    4770        bx lr
    46C0        nop
    0000FFFE    .word 0x0000FFFE

Now this seems awfully wasteful. I know my purpose was to 'waste time' with the simple delay function, but it seems like gcc uses only two registers to access variables on the stack.

This is stock Rowley Crossworks 4.10 with all default settings using the GCC compiler that came with it. The debug configuration adds no optimization flags.

Wouldn't something like this be significantly better?

# Counter reset
  movs r0, #0x0
  ldr r1, =0xffff

loopone:
  adds r0,#0x1
  cmp r0,r1
  bne loopone

It seems like default unoptimized gcc output prefers stack variables over registers. But we have 4 registers available as per AAPCS which lets us bypass any stack pushes and pops above the usual. This function was also not inlined, which could possibly explain this, but just saving the original values to stack and recovering them would still be better than repeatedly using the stack like this.

Why does gcc prefer the stack over available registers?

Answer 1

Score: 1

Compilers typically work in two stages: they first translate your code simplistically but very correctly, and then, only if you ask for it, optimize that translation through detailed analysis of the simplistically generated code.

These optimizations can be very expensive in compile time (and for some projects, build time is critical), and they make debugging more difficult (variables can disappear, for example), so optimization is optional.

This means that in the simplistic, very correct initial translation, every variable gets a memory location. The compiler knows it can't go wrong with that: the translation will be correct, and it can never run out of registers, something it could otherwise rule out only with expensive analysis.
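As a rough mental model (not a description of GCC's internals), the unoptimized translation behaves as if the loop counter were declared `volatile`: every use is a load from its slot and every update is a store back. The sketch below writes both shapes in C; the function names are made up for illustration.

```c
#include <stdint.h>

/* Models the -O0 shape: `volatile` forces a real load and store for every
 * access to x, much like the str/ldr pairs on [sp, #4] in the listing. */
uint32_t count_in_memory(void)
{
    volatile uint32_t x;
    for (x = 0; x < 0xffff; x++) {
        /* empty body */
    }
    return x;
}

/* Plain counter: an optimizing build is free to keep x in a register, or to
 * skip the loop and just produce the final value, because nothing observes
 * the intermediate states. */
uint32_t count_in_register(void)
{
    uint32_t x;
    for (x = 0; x < 0xffff; x++) {
        /* empty body */
    }
    return x;
}
```

Both functions return 0xffff; only the generated code differs. With optimization enabled, the plain version may not take any measurable time at all, which is why busy-wait delays are often written with a `volatile` counter.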

For a variety of reasons, the compiler already has a general capability for removing loads and stores and relocating values into registers. During (optional) optimization it applies this broadly to the generated code, not just to declared variables. Since the mechanism is there anyway, there's little merit in handling declared variables specially (i.e. putting them in registers right away in the simplistic translation).
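To illustrate why this capability is not tied to declared variables, here is a small sketch in which the redundant load comes from an expression rather than from a named local. The struct and function names are hypothetical, and the second function only shows by hand what an optimizer would effectively do.

```c
struct point {
    int x;
    int y;
};

/* p->x appears twice; a simplistic translation loads it from memory twice,
 * even though no variable was declared for it. */
int sum_twice(const struct point *p)
{
    return p->x + p->y + p->x;
}

/* Hand-written equivalent of the optimized form: load p->x once and reuse
 * the value, which can then live in a register. */
int sum_twice_hoisted(const struct point *p)
{
    int x = p->x;
    return x + p->y + x;
}
```

The two functions compute the same result; the optimizer's load elimination gets from the first shape to the second without needing any special rule for local variables.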

In short, there's lots of room for improvement in the unoptimized code, but of course, that's what the optimizer is for. Optimizing the simplistic translation through detailed analysis catches improvements of a general and obvious nature (like putting variables in registers) as well as hidden improvements that are very specific to the exact patterns found in the statements and expressions of the input and in their lower-level translations. Relocating a variable from memory to a register is not always a win (for example, when the variable is used only once and is live across a call), and the detailed analysis can determine, by some measure, where it is a win and where it is not. This methodical approach is more effective than simply trying to generate good code in the first place.
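The "used only once and live across a call" case can be sketched as follows. Under AAPCS, r0-r3 may be overwritten by any callee, so a value that must survive a call has to sit either in a callee-saved register (which the function must itself save and restore) or in a stack slot. The `helper` function here is a hypothetical stand-in, defined only so the example is self-contained.

```c
/* Hypothetical callee, defined only so the example compiles on its own. */
static int helper(int v)
{
    return v * 2;
}

int use_once_after_call(int a)
{
    int saved = a + 1;     /* computed before the call ...            */
    int r = helper(a);     /* ... must survive this call ...          */
    return r + saved;      /* ... and is read exactly once afterwards */
}
```

Keeping `saved` in a callee-saved register costs a save and a restore of that register, which is roughly the same memory traffic as one store to and one load from a stack slot, so moving it to a register gains little here.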

huangapple
  • Posted on May 7, 2023, 14:42:54
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76192541.html