Storage of String Literals in Memory (C++)
Question
I read that string literals are always stored in read-only memory, and it makes sense as to why.
However, if I initialize a character array using a string literal, it still stores the string literal in read-only memory and then copies it into the memory location of the character array.
My question is: in this scenario, why bother storing the string literal in read-only memory in the first place? Why not store it directly in the memory location of the character array?
Answer 1
Score: 2
> I read that string literals are always stored in read only memory and it makes sense as to why.
Where string literals are stored is up to the implementation; the C++ standard only requires that they have static storage duration. If a compiler decides to emit a large string literal, it will usually be located in a read-only section of static memory, such as `.rodata`.
However, whether this is even necessary is up to the compiler. Compilers are allowed to optimize your code according to the as-if rule, so if the behavior of the program is the same with the literal being stored elsewhere, or nowhere at all, that is also allowed.
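To make concrete what the as-if rule has to preserve here, a minimal sketch follows. The function name `array_is_independent_copy` is hypothetical and not part of the question. It checks only observable behavior: the array is a separate, writable object holding its own copy of the characters, and writing to it leaves the literal alone. How the characters get into the array is left entirely to the compiler.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper (not from the original post): checks what the language
// guarantees about `char arr[] = "abc";`, independent of where the compiler
// chooses to keep the literal.
bool array_is_independent_copy() {
    char arr[] = "abc";  // array of 4 chars: 'a', 'b', 'c', '\0'
    static_assert(sizeof(arr) == 4, "size includes the null terminator");
    assert(std::strcmp(arr, "abc") == 0);

    arr[0] = 'X';  // fine: arr is an ordinary writable array

    const char* lit = "abc";  // a separate use of the same literal text
    return std::strcmp(arr, "Xbc") == 0   // the array changed...
        && std::strcmp(lit, "abc") == 0;  // ...the literal did not
}
```

Any code generation strategy that makes this function return `true` is acceptable to the standard, which is why the three examples below can look so different.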
Example 1
int sum() {
    char arr[] = "ab";
    return arr[0] + arr[1];
}
With the following assembly output:
sum():
        mov     eax, 195
        ret
In this case, because everything is a compile-time constant, there is no string literal or array at all. The compiler optimized it away and turned our code into `return 195;` by summing the two ASCII characters `a` (97) and `b` (98).
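The same folding can be requested explicitly. The sketch below is an illustration, not what the compiler does internally: `sum_at_compile_time` and `check_runtime_matches` are hypothetical names, the code needs C++14 or later, and it assumes an ASCII-compatible execution character set.

```cpp
#include <cassert>

// Hypothetical constexpr variant of sum() from Example 1 (C++14 or later).
constexpr int sum_at_compile_time() {
    char arr[] = "ab";
    return arr[0] + arr[1];
}

// Assumes an ASCII-compatible character set: 'a' == 97 and 'b' == 98.
static_assert(sum_at_compile_time() == 97 + 98, "evaluated by the compiler");
static_assert(sum_at_compile_time() == 195, "same constant as in the assembly");

// The original run-time function from Example 1, unchanged.
int sum() {
    char arr[] = "ab";
    return arr[0] + arr[1];
}

// The run-time and compile-time versions must agree.
void check_runtime_matches() {
    assert(sum() == sum_at_compile_time());
}
```

Because the `static_assert`s pass without the program ever running, neither the literal nor the array needs to exist in the final binary for this function.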
Example 2
void consume(const char*);

void short_string() {
    char arr[] = "short str";
    consume(arr);
}
short_string():
        sub     rsp, 24
        movabs  rax, 8391086215229565043
        mov     qword ptr [rsp + 8], rax
        mov     word ptr [rsp + 16], 114
        lea     rdi, [rsp + 8]
        call    consume(char const*)@PLT
        add     rsp, 24
        ret
Once again, no code was emitted that would keep the string in read-only memory, but it also wasn't optimized away completely. The compiler sees that the string `short str` is very short, so it treats the ASCII bytes of its first eight characters as the number `8391086215229565043` and directly `mov`s that value onto the stack; the remaining `r` and the null terminator follow as the 16-bit value `114`. `consume()` is called with a pointer to stack memory.
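The two immediates in the assembly can be reproduced by hand. The helpers below (`pack_le64`, `pack_le16`, `check_short_string_constants`) are hypothetical names introduced for illustration. They pack bytes in little-endian order, which is how x86-64 lays out an integer in memory, and they assume an ASCII-compatible character set.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: packs the first 8 bytes of s into a 64-bit integer,
// first byte in the lowest 8 bits (little-endian, as on x86-64).
std::uint64_t pack_le64(const char* s) {
    std::uint64_t value = 0;
    for (int i = 0; i < 8; ++i) {
        value |= static_cast<std::uint64_t>(static_cast<unsigned char>(s[i]))
                 << (8 * i);
    }
    return value;
}

// Hypothetical helper: packs the first 2 bytes of s the same way into 16 bits.
std::uint16_t pack_le16(const char* s) {
    return static_cast<std::uint16_t>(
        static_cast<unsigned char>(s[0]) |
        (static_cast<unsigned char>(s[1]) << 8));
}

void check_short_string_constants() {
    const char text[] = "short str";  // 9 characters + '\0' = 10 bytes
    static_assert(sizeof(text) == 10, "8 bytes in rax, 2 bytes in the word");
    assert(pack_le64(text) == 8391086215229565043ULL);  // "short st"
    assert(pack_le16(text + 8) == 114);                 // "r" + '\0'
}
```

In hexadecimal the 64-bit value is `0x74732074726f6873`, which read from the lowest byte upward spells `s h o r t ␠ s t`.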
Example 3
void long_string() {
    char arr[] = "Lorem ipsum dolor [...] est laborum.";
    consume(arr);
}
long_string():
        push    rbx
        sub     rsp, 448
        lea     rsi, [rip + .L__const.long_string().arr]
        mov     rbx, rsp
        mov     edx, 446
        mov     rdi, rbx
        call    memcpy@PLT
        mov     rdi, rbx
        call    consume(char const*)@PLT
        add     rsp, 448
        pop     rbx
        ret
.L__const.long_string().arr:
.asciz "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
Our string is now much too long to be treated as a number or two. The entire string will now be emitted into static memory, most likely into the `.rodata` section after linking. It is still helpful for it to exist, because the compiler can use `memcpy` to copy it from static memory onto the stack when initializing `arr`.
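Written out by hand, the generated code corresponds roughly to the sketch below. It is an illustration under assumptions rather than the compiler's actual output: `k_initializer` is a hypothetical stand-in for the compiler-generated constant `.L__const.long_string().arr`, the text is shortened, and the function names are made up.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for the compiler-generated constant (shortened text).
static const char k_initializer[] = "Lorem ipsum dolor sit amet";

// Fills dst with one memcpy from static storage, null terminator included.
// The caller must provide at least sizeof(k_initializer) bytes.
void init_from_static(char* dst) {
    assert(dst != nullptr);
    std::memcpy(dst, k_initializer, sizeof(k_initializer));
}

bool manual_init_matches_language_init() {
    char by_language[] = "Lorem ipsum dolor sit amet";  // what the source says
    char by_hand[sizeof(k_initializer)];                // raw stack buffer
    init_from_static(by_hand);                          // what the assembly does
    static_assert(sizeof(by_language) == sizeof(by_hand), "same array size");
    return std::memcmp(by_language, by_hand, sizeof(by_hand)) == 0;
}
```

The static copy is what makes a single `memcpy` possible; without it, the compiler would have to encode all 446 bytes as immediates in the instruction stream, as it did for the short string.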
Conclusion
If you're worried about compilers doing something wasteful here, don't be. Modern compilers are very good at optimizing and deciding which symbols go where, and if they emit a string literal, this is usually because it must exist for some other code to work, or because it makes initialization of an array easier.