Storage of string literals in memory in C++

Question

I read that string literals are always stored in read-only memory, and it makes sense why.

However, if I initialize a character array using a string literal, the literal is still stored in read-only memory and then copied into the memory location of the character array.

My question is: in this scenario, why bother storing the string literal in read-only memory in the first place? Why not store it directly in the memory location of the character array?

Answer 1

Score: 2

> I read that string literals are always stored in read only memory and it makes sense as to why.

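Where string literals are stored is up to the implementation; the C++ standard only requires that they have static storage duration and makes modifying them undefined behavior. If a compiler decides to emit a large string literal, it will usually be placed in a read-only section of static memory, such as .rodata.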

However, whether this is even necessary is up to the compiler. Compilers are allowed to optimize your code according to the as-if rule, so if the behavior of the program is the same with the literal being stored elsewhere, or nowhere at all, that is also allowed.

Example 1

    int sum() {
        char arr[] = "ab";
        return arr[0] + arr[1];
    }

With the following assembly output:

    sum():
        mov eax, 195
        ret

In this case, because everything is a compile-time constant, neither the string literal nor the array exists in the output at all. The compiler folded the whole function into return 195; by adding the ASCII codes of the two characters a (97) and b (98).
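The arithmetic behind that constant can be checked at compile time. The following is a minimal sketch, assuming C++14 or later and an ASCII-compatible character set; sum_ab is a hypothetical name, not part of the original example.

```cpp
// Assumption: ASCII-compatible character set, so 'a' == 97 and 'b' == 98.
// sum_ab is a hypothetical constexpr variant of sum() from Example 1.
constexpr int sum_ab() {
    char arr[] = "ab";       // local array initialized from the literal
    return arr[0] + arr[1];  // 97 + 98
}

// Checked by the compiler itself; nothing has to exist at run time for this.
static_assert(sum_ab() == 195, "expected 'a' + 'b' == 195 under ASCII");
```

If the static_assert compiles, the result is known before the program ever runs, which is the same reasoning the optimizer applies to sum().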

Example 2

    void consume(const char*);

    void short_string() {
        char arr[] = "short str";
        consume(arr);
    }

With the following assembly output:

    short_string():
        sub rsp, 24
        movabs rax, 8391086215229565043    # the bytes of "short st"
        mov qword ptr [rsp + 8], rax
        mov word ptr [rsp + 16], 114       # 'r' followed by the null terminator
        lea rdi, [rsp + 8]
        call consume(char const*)@PLT
        add rsp, 24
        ret

Once again, no code was emitted that would keep the string in read-only memory, but it also wasn't optimized away completely. The compiler sees that the string short str is very short (10 bytes including the null terminator), so it encodes the first eight bytes as the 64-bit immediate 8391086215229565043 and the last two as the 16-bit immediate 114, and movs both directly onto the stack. consume() is then called with a pointer to that stack memory.
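To see where those two immediates come from, the bytes can be packed by hand. This is only an illustrative sketch, assuming C++14 or later and an ASCII-compatible character set; pack_le64 is a hypothetical helper, not something the compiler or the original answer uses.

```cpp
#include <cstdint>

// Hypothetical helper: interprets the first 8 bytes of s as a little-endian
// 64-bit integer, which is how an x86-64 store lays them out in memory.
// Assumes s points to at least 8 readable bytes.
constexpr std::uint64_t pack_le64(const char* s) {
    std::uint64_t value = 0;
    for (int i = 0; i < 8; ++i) {
        // Byte i ends up in bits [8*i, 8*i + 7], so s[0] is the lowest byte.
        value |= static_cast<std::uint64_t>(static_cast<unsigned char>(s[i])) << (8 * i);
    }
    return value;
}

// "short st" gives the movabs immediate from the assembly above.
static_assert(pack_le64("short str") == 8391086215229565043ULL, "first 8 bytes");
// The remaining "r\0" gives the 16-bit immediate: 'r' in the low byte, 0 in the high byte.
static_assert(('r' | ('\0' << 8)) == 114, "last 2 bytes");
```

The shifts make the result independent of the host's byte order, so the check describes the x86-64 layout even when compiled elsewhere.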

Example 3

    // consume() is declared as in Example 2
    void long_string() {
        char arr[] = "Lorem ipsum dolor [...] est laborum.";
        consume(arr);
    }

With the following assembly output:

    long_string():
        push rbx
        sub rsp, 448
        lea rsi, [rip + .L__const.long_string().arr]    # source: the constant in static memory
        mov rbx, rsp
        mov edx, 446                                    # number of bytes to copy
        mov rdi, rbx                                    # destination: arr on the stack
        call memcpy@PLT
        mov rdi, rbx
        call consume(char const*)@PLT
        add rsp, 448
        pop rbx
        ret
    .L__const.long_string().arr:
        .asciz "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

Our string is now far too long to be encoded as one or two immediates. The entire string is emitted into static memory, most likely .rodata after linking. Having it there is still useful, because the compiler can initialize arr with a single memcpy of 446 bytes from static memory onto the stack.
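Written out by hand, the generated code behaves roughly like the sketch below. The names kInit and copy_then_modify are hypothetical, and the short string only stands in for the long one; this is an approximation of what the compiler emits, not its actual output.

```cpp
#include <cstring>

// Hypothetical stand-in for the constant the compiler places in static memory.
static const char kInit[] = "a long string that lives in static storage";

// Copies the constant into a local array, mirroring the memcpy in the
// assembly, then modifies the copy. Returns true if the constant is untouched
// and the rest of the copy still matches it.
bool copy_then_modify() {
    char arr[sizeof kInit];                 // stack array, like arr in Example 3
    std::memcpy(arr, kInit, sizeof kInit);  // copy includes the null terminator
    arr[0] = 'A';                           // fine: arr is our own writable copy
    return kInit[0] == 'a' && arr[0] == 'A' &&
           std::strcmp(arr + 1, kInit + 1) == 0;
}
```

The array is a separate, writable object on every call, which is why the constant data has to be kept somewhere to copy from.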

Conclusion

If you're worried about compilers doing something wasteful here, don't be. Modern compilers are very good at optimizing and deciding which symbols go where, and if they emit a string literal, this is usually because it must exist for some other code to work, or because it makes initialization of an array easier.


See live examples with Compiler Explorer

huangapple
  • Published on June 12, 2023, 20:01:12
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76456460-2.html