Is it better to use the returned value of a function directly and not store it in a variable?
Question
I'm having a discussion with a colleague about how to structure the return code checks in our code base. In a lot of places the code looks like this:
logOnError(functionWithLotsOfParameters(a, b, c, d, etc),
"Error Message", module_name, some_more_stuff);
I think this kind of buries what is actually happening here, which is the function call of the inner function. I would like to structure the code like this:
ReturnType ret;
ret = functionWithLotsOfParameters(a, b, c, d, etc);
logOnError(ret, "Error Message", module_name, some_more_stuff);
ReturnType in this context is basically just an unsigned int. My colleague argues that this increases the stack size of the function unnecessarily, which could be an issue, because this runs on an embedded system and we're somewhat memory constrained.
My counter argument is that even in the first case the return value has to be somewhere in memory, so I don't think that the footprint would increase.
Who's right?
Answer 1 (score: 5)
It is generally good practice to split complex expressions into several statements, across several lines. Readable, maintainable code is almost always more important than stack usage, or even execution speed. Also, micro-optimizations like these are often not worth the effort - assume that the compiler's optimizer will take care of it unless you have good reasons to believe otherwise.
Notably, if stack usage were important, you shouldn't use such a heavy API for the function in the first place, but pass a single pointer to a struct instead. This is actually a stack-use optimization technique sometimes used in some very limited systems like low-end 8-bit microcontrollers. PIC, for example, is infamous for its dysfunctional stack implementation, where in addition to size you also had to keep track of call stack depth.
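To make that concrete, here is a minimal sketch of the pointer-to-struct idea (the struct layout, field types and names below are made up for illustration, not taken from the question's code): the caller fills one parameter block and passes only its address, so the argument list stays a single pointer no matter how many fields are added.

#include <stdint.h>

/* Hypothetical parameter block standing in for a, b, c, d, etc. */
typedef struct {
    uint8_t  a;
    uint8_t  b;
    uint16_t c;
    uint32_t d;
} FuncParams;

unsigned int functionWithParamBlock(const FuncParams *p)
{
    /* placeholder body - the real work would go here */
    return (unsigned int)(p->a + p->b + p->c + p->d);
}

void caller(void)
{
    FuncParams params = { .a = 1, .b = 2, .c = 3, .d = 4 };
    unsigned int ret = functionWithParamBlock(&params); /* one pointer argument, regardless of field count */
    (void)ret;
}

Whether this actually saves anything depends on the ABI - on many targets a short argument list travels entirely in registers anyway - so the extra indirection is only worth it on genuinely stack- or register-starved parts.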
You are correct in assuming that returned values cannot be stored in thin air - every value used by the program has to be allocated somewhere, no matter whether it is stored in a named variable or an anonymous temporary location. Not necessarily on the stack though; it could just as well be in a register, depending on the ABI and calling convention.
Testing your artificial code in an even more artificial context https://godbolt.org/z/bYsE1WT7e...
- gcc x86_64, first version actually somewhat slower
- clang x86_64, identical code for both versions.
- gcc ARM32, identical code for both versions.
- gcc AVR, first version actually somewhat slower
The reason the first version was slower on some systems is the ABI and how the specific compiler optimizes according to it. The ABI states where the return code has to be stored and where the first parameter has to be stored: different places. That in turn can mean that the compiler has to shuffle the data around a bit between function calls.
The lesson learnt is that overthinking/over-engineering various micro-optimizations, and writing less readable code for the sake of performance, can have the opposite effect. The general best practices are:
- Write code as simple and readable as possible. Readable, simple code often has the best performance.
- Don't optimize manually unless you have actually found a performance problem.
- Don't optimize manually unless you actually have somewhat in-depth knowledge of the target CPU and ABI.
For example I was about to make a cocksure statement that these snippets would surely be optimized identically, but then it turned out they weren't on some targets.
Answer 2 (score: 3)
In a symbolic debugger (which you really should be using), it can be cumbersome to inspect or infer the return value if it is not assigned to a variable. Also, "step-into" operations work at statement level, so in your example a step-into would enter functionWithLotsOfParameters() first, which can be cumbersome if you only intended to step into logOnError().
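As a sketch of how that plays out, reusing the names from the question (the function bodies below are hypothetical stand-ins, written only so the example is self-contained):

#include <stdio.h>

typedef unsigned int ReturnType; /* the question says ReturnType is basically an unsigned int */

/* Hypothetical stand-ins for the question's functions, just so the sketch compiles. */
static ReturnType functionWithLotsOfParameters(int a, int b, int c, int d)
{
    return (ReturnType)(a + b + c + d);
}

static void logOnError(ReturnType ret, const char *msg, const char *module)
{
    if (ret != 0)
        printf("%s: %s (code %u)\n", module, msg, ret);
}

void check_and_log(int a, int b, int c, int d)
{
    ReturnType ret;
    ret = functionWithLotsOfParameters(a, b, c, d); /* step over this line and ret shows up in the watch window */
    logOnError(ret, "Error Message", "module_name"); /* a plain step-into now lands here, not in the inner call */
}

With the intermediate variable you can set a breakpoint on the logOnError() line and read ret before the call is made, which is exactly what the nested form makes awkward.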
Style, practice and readability arguments are a matter of opinion, while performance arguments are questionable; the debugger experience is a matter of fact. Habitually coding in a style that supports debug is clearly a good idea.
Moreover, the argument holds regardless of the number of parameters the function has, which is an irrelevance (although its advisability is a separate issue). The presentation of code with long lists of any kind is better dealt with through clear whitespace layout (i.e. newlines, indentation), and is even more so a matter of opinion.
Answer 3 (score: 1)
It depends on the optimisation level you use.
Let's try and see what the compiler gives us. For this test I used the following code:
#include <stdio.h>
#include <stdlib.h>

int __attribute__ ((noinline)) square(int num) {
    return num * num;
}

void __attribute__ ((noinline)) log_val(int a, int b, int c, int d)
{
    printf("LOG: %i %i %i %i", a, b, c, d);
}

void test_0(int val)
{
    int retval;
    retval = square(val);
    log_val(0, 1, 2, retval);
}

void test_1(int val)
{
    log_val(0, 1, 2, square(val));
}
Using GCC for ARM (as I am more familiar with ARM assembly) with the -O0 flag, we get:
test_0:
    push {r7, lr}
    sub sp, sp, #16
    add r7, sp, #0
    str r0, [r7, #4]
    ldr r0, [r7, #4]
    bl square
    str r0, [r7, #12]
    ldr r3, [r7, #12]
    movs r2, #2
    movs r1, #1
    movs r0, #0
    bl log_val
    nop
    adds r7, r7, #16
    mov sp, r7
    pop {r7, pc}

test_1:
    push {r7, lr}
    sub sp, sp, #8
    add r7, sp, #0
    str r0, [r7, #4]
    ldr r0, [r7, #4]
    bl square
    mov r3, r0
    movs r2, #2
    movs r1, #1
    movs r0, #0
    bl log_val
    nop
    adds r7, r7, #8
    mov sp, r7
    pop {r7, pc}
We can see that the first version reserves 16 bytes of stack (sub sp, sp, #16), while the second reserves only 8 (sub sp, sp, #8). The stack footprint is, in fact, bigger with a separate variable.
But this is with no optimisation at all. Now let's see what happens when we enable the lowest optimisation level, -O1:
test_0:
    push {r3, lr}
    bl square
    mov r3, r0
    movs r2, #2
    movs r1, #1
    movs r0, #0
    bl log_val
    pop {r3, pc}

test_1:
    push {r3, lr}
    bl square
    mov r3, r0
    movs r2, #2
    movs r1, #1
    movs r0, #0
    bl log_val
    pop {r3, pc}
Here we see that both functions are absolutely identical and don't even use the stack for the intermediate value. This means that with -O1 the resulting assembly is independent of the coding style.
So your colleague is right if you don't use optimisation, but even with the lowest optimisation enabled the results are identical.
Use whichever method is the most readable, easiest to maintain, etc... But the stack size is a non-issue.