How to make sure Box::new() really does heap allocation?

Question

I'm trying to measure the performance of Box::new():

    use std::time::Instant;

    fn main() {
        // Baseline: 100,000 integer additions.
        let start = Instant::now();
        let mut sum = 0;
        for _ in 0..100000 {
            sum += 42;
        }
        println!("Simple sum: {:?}", start.elapsed());

        // 100,000 heap allocations, leaked so they are never freed.
        let start2 = Instant::now();
        for _ in 0..100000 {
            let b = Box::new(42);
            Box::leak(b);
        }
        println!("Many heap calls: {:?}", start2.elapsed());
    }

I'm getting:

    Simple sum: 1.413291ms
    Many heap calls: 6.9935ms

Obviously, the data doesn't look right: Box::new() must be a much heavier operation than just 5x the cost of +=. Where does the optimization kick in, and how do I disable it?

Answer 1

Score: 3

Make sure to run benchmark code with --release; otherwise the results will be pretty much meaningless.

In your case, if I run it with --release, I get:

    Simple sum: 100ns
    Many heap calls: 100ns

This means that the compiler completely optimized away everything, because your loops had zero side effects. If (apart from the time it would take) an operation has no effect, the compiler is allowed to simply remove it.

Note that the compiler even warns:

    warning: variable `sum` is assigned to, but never used
     --> src\main.rs:5:13
      |
    5 |     let mut sum = 0;
      |             ^^^
      |
      = note: consider using `_sum` instead
      = note: `#[warn(unused_variables)]` on by default

That said, there are situations where you want to keep operations even though they have no side effects, such as benchmarking. For that, Rust provides std::hint::black_box, a function that returns exactly what you give it, but looks to the compiler as if some fancy calculation took place, so the compiler can no longer prove that the output equals the input. That prevents the compiler from optimizing away the call and, with it, everything that feeds into it.
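As a small illustration of that description, the sketch below passes every intermediate sum through black_box and still returns the plain arithmetic result. The function name observed_sum and the u32 type are assumptions made for this example, not part of the original code.

```rust
use std::hint::black_box;

/// Adds 42 to a running sum `n` times (hypothetical helper for illustration).
/// Each intermediate value goes through `black_box`, so the optimizer has to
/// treat it as used, while the value itself is left unchanged.
pub fn observed_sum(n: u32) -> u32 {
    let mut sum = 0u32;
    for _ in 0..n {
        sum += 42;
        // black_box returns its argument as-is; the return value is ignored here.
        black_box(sum);
    }
    sum
}
```

Calling observed_sum(100_000) returns 4_200_000, the same value the loop without black_box would compute; only the compiler's freedom to delete the loop changes.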

In your case, this is one example of how you could prevent Rust from optimizing away your loop:

    use std::time::Instant;

    fn main() {
        let start = Instant::now();
        let mut sum = 0;
        for _ in 0..100000 {
            sum += 42;
            // Make the compiler treat every intermediate sum as used.
            std::hint::black_box(sum);
        }
        println!("Simple sum: {:?}", start.elapsed());

        let start2 = Instant::now();
        for _ in 0..100000 {
            // Keep both the allocation and the leaked reference observable.
            let b = std::hint::black_box(Box::new(42));
            std::hint::black_box(Box::leak(b));
        }
        println!("Many heap calls: {:?}", start2.elapsed());
    }

With this, I get:

    Simple sum: 27.2µs
    Many heap calls: 2.8956ms

Now those numbers make more sense.
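To put those totals into per-operation terms, here is a minimal sketch of a timing helper. The names time_loop, nanos_per_iter and time_heap_loop are hypothetical; only Instant, Duration and black_box come from the standard library.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `body` `iters` times and returns the total wall-clock time.
pub fn time_loop<F: FnMut()>(iters: u32, mut body: F) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        body();
    }
    start.elapsed()
}

/// Average cost of one iteration, in nanoseconds.
pub fn nanos_per_iter(total: Duration, iters: u32) -> f64 {
    total.as_nanos() as f64 / f64::from(iters)
}

/// Times the heap loop from the answer: allocate, then leak, `iters` times.
pub fn time_heap_loop(iters: u32) -> Duration {
    time_loop(iters, || {
        let b = black_box(Box::new(42));
        black_box(Box::leak(b));
    })
}
```

Applied to the figures above, 2.8956ms over 100000 iterations is about 29ns per allocation, and 27.2µs over 100000 iterations is about 0.27ns per addition, so the allocation comes out roughly 100 times more expensive in this run. Timings from a single run like this are noisy, so treat the ratio as a rough indication only.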

To be 100% sure that it didn't optimize away anything important, you can always check the disassembly. Because the assembly surrounding the println!() calls is hard to read, it makes sense to extract the two loops into their own functions.

Be sure to make those functions pub so that they show up in the final assembly; otherwise they might disappear due to inlining.

Here is how this would look:

    use std::time::Instant;

    pub fn simple_sum() {
        let mut sum = 0;
        for _ in 0..100000 {
            sum += 42;
            std::hint::black_box(sum);
        }
    }

    pub fn many_heap_calls() {
        for _ in 0..100000 {
            let b = std::hint::black_box(Box::new(42));
            std::hint::black_box(Box::leak(b));
        }
    }

    fn main() {
        let start = Instant::now();
        simple_sum();
        println!("Simple sum: {:?}", start.elapsed());

        let start2 = Instant::now();
        many_heap_calls();
        println!("Many heap calls: {:?}", start2.elapsed());
    }

This compiles to:

    example::simple_sum:
            sub     rsp, 4
            mov     eax, 42
            mov     rcx, rsp
    .LBB0_1:
            mov     dword ptr [rsp], eax
            add     eax, 42
            cmp     eax, 4200042
            jne     .LBB0_1
            add     rsp, 4
            ret

    example::many_heap_calls:
            push    r15
            push    r14
            push    rbx
            sub     rsp, 16
            mov     ebx, 100000
            mov     r14, qword ptr [rip + __rust_alloc@GOTPCREL]
            lea     r15, [rsp + 8]
    .LBB1_1:
            mov     edi, 4
            mov     esi, 4
            call    r14
            test    rax, rax
            je      .LBB1_4
            mov     dword ptr [rax], 42
            mov     qword ptr [rsp + 8], rax
            mov     rax, qword ptr [rsp + 8]
            mov     qword ptr [rsp + 8], rax
            dec     ebx
            jne     .LBB1_1
            add     rsp, 16
            pop     rbx
            pop     r14
            pop     r15
            ret
    .LBB1_4:
            mov     edi, 4
            mov     esi, 4
            call    qword ptr [rip + alloc::alloc::handle_alloc_error@GOTPCREL]
            ud2

The important parts to notice here are the labels .LBB0_1: and .LBB1_1: and the jumps jne .LBB0_1 and jne .LBB1_1, which form the two for loops. This shows that the loops did not get optimized away.

Also note the mov r14, qword ptr [rip + __rust_alloc@GOTPCREL] and call r14, which is the actual call that does the heap allocation. So this one also didn't get optimized away.
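Reading the assembly is one way to confirm the call; another is to count allocator calls at run time. The following is a minimal sketch of that idea, assuming a hypothetical wrapper type CountingAlloc around the standard System allocator, installed with #[global_allocator]. The names CountingAlloc, ALLOC_CALLS, allocations_during and count_box_allocations are invented for this example, and the boxes are dropped rather than leaked to keep the sketch tidy.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::hint::black_box;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Number of `alloc` calls seen so far, across the whole process.
static ALLOC_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical allocator that forwards to `System` and counts calls.
pub struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_CALLS.fetch_add(1, Ordering::Relaxed);
        // Forward the request unchanged to the system allocator.
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Returns how many allocations happened while `f` ran.
/// The counter is process-wide, so other threads can inflate the result.
pub fn allocations_during<F: FnOnce()>(f: F) -> usize {
    let before = ALLOC_CALLS.load(Ordering::Relaxed);
    f();
    ALLOC_CALLS.load(Ordering::Relaxed) - before
}

/// Boxes an integer `n` times and reports the number of allocations observed.
pub fn count_box_allocations(n: usize) -> usize {
    allocations_during(|| {
        for _ in 0..n {
            // black_box keeps the optimizer from removing the allocation.
            let b = black_box(Box::new(42));
            drop(b);
        }
    })
}
```

If count_box_allocations(100_000) reports at least 100000, every Box::new() in the loop reached the allocator. A much smaller number in a --release build would suggest that the allocations were optimized away.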

Also, notice the interesting-looking cmp eax, 4200042. It shows that the compiler reworked the first loop; instead of doing:

    let mut sum = 0;
    for _ in 0..100000 {
        sum += 42;
        std::hint::black_box(sum);
    }

it optimized it to something like

    let mut sum = 42;
    while sum != 4200042 {
        std::hint::black_box(sum);
        sum += 42;
    }

which passes the same 100000 values to black_box and reuses the sum variable as the loop counter.
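To check that the compare-against-4200042 form really does the same work, the sketch below records the values each loop shape would expose and lets you compare them. Both function names are hypothetical. The second function follows the assembly literally: the register starts at 42, is stored, then incremented, and the loop exits when it reaches 4200042.

```rust
/// Values exposed by the source-level loop: the sum after each addition.
pub fn observed_by_for_loop() -> Vec<i32> {
    let mut seen = Vec::new();
    let mut sum = 0;
    for _ in 0..100_000 {
        sum += 42;
        seen.push(sum);
    }
    seen
}

/// Values exposed by the loop as the assembly expresses it.
pub fn observed_by_compare_loop() -> Vec<i32> {
    let mut seen = Vec::new();
    let mut sum = 42; // mov eax, 42
    loop {
        seen.push(sum); // mov dword ptr [rsp], eax
        sum += 42; // add eax, 42
        if sum == 4_200_042 {
            // cmp eax, 4200042 followed by jne .LBB0_1 falling through
            break;
        }
    }
    seen
}
```

Both functions produce the same 100000 values, from 42 up to 4200000, so no separate loop counter is needed.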

Now compared to how it was before:

    use std::time::Instant;

    pub fn simple_sum() {
        let mut sum = 0;
        for _ in 0..100000 {
            sum += 42;
        }
    }

    pub fn many_heap_calls() {
        for _ in 0..100000 {
            let b = Box::new(42);
            Box::leak(b);
        }
    }

    fn main() {
        let start = Instant::now();
        simple_sum();
        println!("Simple sum: {:?}", start.elapsed());

        let start2 = Instant::now();
        many_heap_calls();
        println!("Many heap calls: {:?}", start2.elapsed());
    }

which compiles to:

    example::simple_sum:
            ret

    example::many_heap_calls:
            ret

I don't think this one requires further explanation.

huangapple
  • Posted on 2023-03-31 18:34:17
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75897535.html