How to make sure Box::new() really does heap allocation?
Question
I'm trying to measure the performance of Box::new():
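use std::time::Instant;

fn main() {
    let start = Instant::now();
    let mut sum = 0;
    // Baseline: 100000 integer additions.
    for _ in 0..100000 {
        sum += 42;
    }
    println!("Simple sum: {:?}", start.elapsed());

    let start2 = Instant::now();
    // 100000 heap allocations, leaked so that nothing is freed inside the loop.
    for _ in 0..100000 {
        let b = Box::new(42);
        Box::leak(b);
    }
    println!("Many heap calls: {:?}", start2.elapsed());
}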
I'm getting:
Simple sum: 1.413291ms
Many heap calls: 6.9935ms
Obviously, the data doesn't look right: Box::new() must be far more than 5x as expensive as +=. Where does the optimization kick in, and how do I disable it?
Answer 1
Score: 3
Make sure to run benchmark code with --release; otherwise the results will be pretty much meaningless.
In your case, if I run it with --release, I get:
Simple sum: 100ns
Many heap calls: 100ns
This means that the compiler completely optimized away everything, because your loops had zero side effects. If (apart from the time it would take) an operation has no effect, the compiler is allowed to simply remove it.
Note that the compiler even warns:
warning: variable `sum` is assigned to, but never used
 --> src\main.rs:5:13
  |
5 |     let mut sum = 0;
  |             ^^^
  |
  = note: consider using `_sum` instead
  = note: `#[warn(unused_variables)]` on by default
That said, there are situations where you want to keep operations even though they have no side effects, such as benchmarking. For that, Rust provides std::hint::black_box, a function that returns exactly what you give it, but looks to the compiler as if some fancy calculation took place, so the compiler can no longer prove that the output equals the input. That prevents the compiler from optimizing this function away, and with it everything that feeds into it.
In your case, this is one example of how you could prevent Rust from optimizing away your loop:
use std::time::Instant;

fn main() {
    let start = Instant::now();
    let mut sum = 0;
    for _ in 0..100000 {
        sum += 42;
        std::hint::black_box(sum);
    }
    println!("Simple sum: {:?}", start.elapsed());

    let start2 = Instant::now();
    for _ in 0..100000 {
        let b = std::hint::black_box(Box::new(42));
        std::hint::black_box(Box::leak(b));
    }
    println!("Many heap calls: {:?}", start2.elapsed());
}
Simple sum: 27.2µs
Many heap calls: 2.8956ms
Now those numbers make more sense.
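To put those two timings on a per-iteration footing, here is a small sketch that divides each measured duration by the 100000 iterations. The helper name per_iteration_ns is made up for this illustration, and the durations are simply the figures quoted above, so they will differ on other machines.

```rust
use std::time::Duration;

/// Hypothetical helper: average cost of one loop iteration in nanoseconds.
fn per_iteration_ns(total: Duration, iterations: u32) -> f64 {
    total.as_nanos() as f64 / f64::from(iterations)
}

fn main() {
    let iterations = 100_000;
    // Timings quoted above (27.2µs and 2.8956ms); they are machine-dependent.
    let simple_sum = Duration::from_nanos(27_200);
    let many_heap_calls = Duration::from_nanos(2_895_600);

    let add_ns = per_iteration_ns(simple_sum, iterations);
    let alloc_ns = per_iteration_ns(many_heap_calls, iterations);

    println!("per addition:   {add_ns:.3} ns");
    println!("per allocation: {alloc_ns:.3} ns");
    println!("ratio:          {:.0}x", alloc_ns / add_ns);
}
```

With these particular figures, an addition costs well under a nanosecond and an allocation a few tens of nanoseconds, a gap of roughly two orders of magnitude rather than 5x.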
To be 100% sure that it didn't optimize away anything important, you can always check the disassembly. Because the asm surrounding the println!() calls is hard to read, it makes sense to extract the loops into their own functions.
Be sure to make those functions pub so that they show up in the final assembly; otherwise they might disappear due to inlining.
Here is how this would look:
use std::time::Instant;

pub fn simple_sum() {
    let mut sum = 0;
    for _ in 0..100000 {
        sum += 42;
        std::hint::black_box(sum);
    }
}

pub fn many_heap_calls() {
    for _ in 0..100000 {
        let b = std::hint::black_box(Box::new(42));
        std::hint::black_box(Box::leak(b));
    }
}

fn main() {
    let start = Instant::now();
    simple_sum();
    println!("Simple sum: {:?}", start.elapsed());

    let start2 = Instant::now();
    many_heap_calls();
    println!("Many heap calls: {:?}", start2.elapsed());
}
example::simple_sum:
    sub rsp, 4
    mov eax, 42
    mov rcx, rsp
.LBB0_1:
    mov dword ptr [rsp], eax
    add eax, 42
    cmp eax, 4200042
    jne .LBB0_1
    add rsp, 4
    ret

example::many_heap_calls:
    push r15
    push r14
    push rbx
    sub rsp, 16
    mov ebx, 100000
    mov r14, qword ptr [rip + __rust_alloc@GOTPCREL]
    lea r15, [rsp + 8]
.LBB1_1:
    mov edi, 4
    mov esi, 4
    call r14
    test rax, rax
    je .LBB1_4
    mov dword ptr [rax], 42
    mov qword ptr [rsp + 8], rax
    mov rax, qword ptr [rsp + 8]
    mov qword ptr [rsp + 8], rax
    dec ebx
    jne .LBB1_1
    add rsp, 16
    pop rbx
    pop r14
    pop r15
    ret
.LBB1_4:
    mov edi, 4
    mov esi, 4
    call qword ptr [rip + alloc::alloc::handle_alloc_error@GOTPCREL]
    ud2
The important parts to notice here are the labels .LBB0_1: and .LBB1_1: and the jumps jne .LBB0_1 and jne .LBB1_1, which are the two for loops. This shows that the loops did not get optimized away.
Also note the mov r14, qword ptr [rip + __rust_alloc@GOTPCREL] and call r14, which is the actual call that does the heap allocation. So this one didn't get optimized away either.
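Reading the assembly is one way to confirm the allocation; another is to count allocator calls at run time. The following is a minimal sketch, not something from the answer above: it assumes you are free to install a custom global allocator, and the names CountingAllocator, ALLOCATIONS and count_allocations are hypothetical. It wraps the standard System allocator and increments a counter on every allocation request.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::hint::black_box;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Number of allocation requests seen so far (process-wide).
static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical wrapper that forwards to the system allocator and counts calls.
struct CountingAllocator;

unsafe impl GlobalAlloc for CountingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAllocator = CountingAllocator;

/// Runs `f` and returns how many allocations were requested while it ran.
fn count_allocations(f: impl FnOnce()) -> usize {
    let before = ALLOCATIONS.load(Ordering::Relaxed);
    f();
    ALLOCATIONS.load(Ordering::Relaxed) - before
}

fn main() {
    let n = count_allocations(|| {
        for _ in 0..100000 {
            let b = black_box(Box::new(42));
            black_box(Box::leak(b));
        }
    });
    println!("allocations observed: {n}");
}
```

The counter is process-wide, so allocations made by other threads during the measurement are included; treat the result as a lower bound check (at least 100000 here) rather than an exact figure. The extra atomic increment also adds overhead, so this is for verification, not for timing.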
Also, notice the interesting-looking cmp eax, 4200042. It shows that the compiler reworked the first loop; instead of doing:
    let mut sum = 0;
    for _ in 0..100000 {
        sum += 42;
    }
it optimized it to roughly
    let mut sum = 42;
    while sum != 4200042 {
        std::hint::black_box(sum);
        sum += 42;
    }
which hands the same 100000 values to black_box and reuses the sum variable as the loop counter.
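That rewrite can be checked at source level. The sketch below is an illustration under my own reading of the assembly (eax starts at 42, is stored, then incremented and compared with 4200042); the function names are hypothetical, and a Vec stands in for black_box so the observed values can be compared.

```rust
/// The loop as written in the source: a separate counter plus a running sum.
fn counted_loop() -> Vec<u32> {
    let mut seen = Vec::new();
    let mut sum = 0;
    for _ in 0..100000 {
        sum += 42;
        seen.push(sum); // stands in for std::hint::black_box(sum)
    }
    seen
}

/// Model of the assembly: a single register that also serves as the exit test.
fn single_register_loop() -> Vec<u32> {
    let mut seen = Vec::new();
    let mut eax: u32 = 42; // mov eax, 42
    loop {
        seen.push(eax); // mov dword ptr [rsp], eax
        eax += 42; // add eax, 42
        if eax == 4200042 {
            // cmp eax, 4200042 followed by jne .LBB0_1 falling through
            break;
        }
    }
    seen
}

fn main() {
    let a = counted_loop();
    let b = single_register_loop();
    assert_eq!(a, b);
    println!("both loops produce {} values, last = {}", a.len(), a[a.len() - 1]);
}
```

Both functions yield the same sequence, which is why the compiler can drop the separate loop counter without changing what black_box observes.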
Now compare that with the original version, without black_box:
use std::time::Instant;

pub fn simple_sum() {
    let mut sum = 0;
    for _ in 0..100000 {
        sum += 42;
    }
}

pub fn many_heap_calls() {
    for _ in 0..100000 {
        let b = Box::new(42);
        Box::leak(b);
    }
}

fn main() {
    let start = Instant::now();
    simple_sum();
    println!("Simple sum: {:?}", start.elapsed());

    let start2 = Instant::now();
    many_heap_calls();
    println!("Many heap calls: {:?}", start2.elapsed());
}
example::simple_sum:
    ret

example::many_heap_calls:
    ret
I don't think this one requires further explanation: both functions compile down to a single ret.