How to make sure Box::new() really does heap allocation?

Question

I'm trying to measure the performance of Box::new():

use std::time::Instant;

fn main() {
    let start = Instant::now();
    let mut sum = 0;
    for _ in 0..100000 {
        sum += 42;
    }
    println!("Simple sum: {:?}", start.elapsed());
    let start2 = Instant::now();
    for _ in 0..100000 {
        let b = Box::new(42);
        Box::leak(b);
    }
    println!("Many heap calls: {:?}", start2.elapsed());
}

I'm getting:

Simple sum: 1.413291ms
Many heap calls: 6.9935ms

These numbers don't look right: Box::new() should be far more than 5x as expensive as +=. Where does the optimization kick in, and how can I disable it?

Answer 1

Score: 3

Make sure to run benchmark code with --release; otherwise the results will be pretty much meaningless.

In your case, if I run it with --release, I get:

Simple sum: 100ns
Many heap calls: 100ns

This means that the compiler completely optimized away everything, because your loops had zero side effects. If (apart from the time it would take) an operation has no effect, the compiler is allowed to simply remove it.

Note that the compiler even warns:

warning: variable `sum` is assigned to, but never used
 --> src\main.rs:5:13
  |
5 |     let mut sum = 0;
  |             ^^^
  |
  = note: consider using `_sum` instead
  = note: `#[warn(unused_variables)]` on by default

That said, there are situations where you want to keep operations even though they have no side effects, such as benchmarking. For that, Rust provides std::hint::black_box, a function that returns exactly what you give it but appears to the compiler as if it might perform some arbitrary computation, so the compiler can no longer prove that the output equals the input. This prevents the compiler from optimizing the call away, and with it everything that feeds into it. Note that the standard library documents black_box as best-effort only, so it is a strong hint rather than a guarantee.

In your case, this is one example of how you could prevent Rust from optimizing away your loop:

use std::time::Instant;

fn main() {
    let start = Instant::now();
    let mut sum = 0;
    for _ in 0..100000 {
        sum += 42;
        std::hint::black_box(sum);
    }
    println!("Simple sum: {:?}", start.elapsed());
    let start2 = Instant::now();
    for _ in 0..100000 {
        let b = std::hint::black_box(Box::new(42));
        std::hint::black_box(Box::leak(b));
    }
    println!("Many heap calls: {:?}", start2.elapsed());
}

Output:

Simple sum: 27.2µs
Many heap calls: 2.8956ms

Now those numbers make more sense.
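
The same pattern can be wrapped in a small helper. This is only a sketch: time_it is a hypothetical name that does not appear in the original answer, and unlike the code above it drops each Box instead of leaking it, so the second measurement covers allocation plus deallocation.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical helper (not from the original answer): runs `f` `iters` times
/// and passes every result through `black_box` so the work stays observable.
fn time_it<T>(iters: u32, mut f: impl FnMut() -> T) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        black_box(f());
    }
    start.elapsed()
}

fn main() {
    let mut sum = 0u32;
    // Each call adds 42 and returns the running sum to `black_box`.
    let add = time_it(100_000, || {
        sum += 42;
        sum
    });
    // The Box is dropped after `black_box` sees it, so this times alloc + free.
    let alloc = time_it(100_000, || Box::new(black_box(42)));
    println!("Simple sum: {:?}", add);
    println!("Many heap calls: {:?}", alloc);
}
```

As with the original code, the timings are only meaningful when built with --release.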

To be 100% sure that it didn't optimize away anything important, you can always check the disassembly. Because the assembly around println!() calls is hard to read, it makes sense to extract the loops into their own functions.

Be sure to make those functions pub so that they show up in the final assembly; otherwise they might disappear due to inlining.

Here is how this would look:

use std::time::Instant;

pub fn simple_sum() {
    let mut sum = 0;
    for _ in 0..100000 {
        sum += 42;
        std::hint::black_box(sum);
    }
}

pub fn many_heap_calls() {
    for _ in 0..100000 {
        let b = std::hint::black_box(Box::new(42));
        std::hint::black_box(Box::leak(b));
    }
}

fn main() {
    let start = Instant::now();
    simple_sum();
    println!("Simple sum: {:?}", start.elapsed());
    let start2 = Instant::now();
    many_heap_calls();
    println!("Many heap calls: {:?}", start2.elapsed());
}

Generated assembly:

example::simple_sum:
        sub     rsp, 4
        mov     eax, 42
        mov     rcx, rsp
.LBB0_1:
        mov     dword ptr [rsp], eax
        add     eax, 42
        cmp     eax, 4200042
        jne     .LBB0_1
        add     rsp, 4
        ret

example::many_heap_calls:
        push    r15
        push    r14
        push    rbx
        sub     rsp, 16
        mov     ebx, 100000
        mov     r14, qword ptr [rip + __rust_alloc@GOTPCREL]
        lea     r15, [rsp + 8]
.LBB1_1:
        mov     edi, 4
        mov     esi, 4
        call    r14
        test    rax, rax
        je      .LBB1_4
        mov     dword ptr [rax], 42
        mov     qword ptr [rsp + 8], rax
        mov     rax, qword ptr [rsp + 8]
        mov     qword ptr [rsp + 8], rax
        dec     ebx
        jne     .LBB1_1
        add     rsp, 16
        pop     rbx
        pop     r14
        pop     r15
        ret
.LBB1_4:
        mov     edi, 4
        mov     esi, 4
        call    qword ptr [rip + alloc::alloc::handle_alloc_error@GOTPCREL]
        ud2

The important parts to notice here are the labels .LBB0_1: and .LBB1_1: together with the jumps jne .LBB0_1 and jne .LBB1_1, which form the two for loops. This shows that the loops did not get optimized away.

Also note the mov r14, qword ptr [rip + __rust_alloc@GOTPCREL] and the call r14, which together make up the actual call that does the heap allocation. So the allocation did not get optimized away either.

Finally, notice the interesting-looking cmp eax, 4200042. It shows that the compiler reworked the first loop; instead of doing:

    let mut sum = 0;
    for _ in 0..100000 {
        sum += 42;
    }

it optimized it to roughly

    let mut next = 42; // value of sum after the first iteration
    while next != 4200042 {
        std::hint::black_box(next);
        next += 42;
    }

which passes the same 100000 values to black_box and reuses the running sum as the loop counter.
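
To check that claim, here is a minimal sketch that records the values each loop shape would hand to black_box and compares them. The function names are made up for illustration, and the second loop mirrors the assembly above (start at 42, store, add 42, stop at 4200042) rather than anything the compiler literally emits as Rust.

```rust
/// Values the original counted `for` loop would pass to `black_box`.
fn counted_loop() -> Vec<u32> {
    let mut seen = Vec::new();
    let mut sum = 0u32;
    for _ in 0..100_000 {
        sum += 42;
        seen.push(sum);
    }
    seen
}

/// Same work in the shape of the generated assembly: the running sum doubles
/// as the loop counter and the loop exits once it reaches 4200042.
fn sum_as_counter_loop() -> Vec<u32> {
    let mut seen = Vec::new();
    let mut next = 42u32;
    while next != 4_200_042 {
        seen.push(next);
        next += 42;
    }
    seen
}

fn main() {
    let a = counted_loop();
    let b = sum_as_counter_loop();
    assert_eq!(a, b);
    // prints "100000 values, last = 4200000"
    println!("{} values, last = {}", a.len(), a[a.len() - 1]);
}
```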

Now compare that to the original code without black_box:

use std::time::Instant;

pub fn simple_sum() {
    let mut sum = 0;
    for _ in 0..100000 {
        sum += 42;
    }
}

pub fn many_heap_calls() {
    for _ in 0..100000 {
        let b = Box::new(42);
        Box::leak(b);
    }
}

fn main() {
    let start = Instant::now();
    simple_sum();
    println!("Simple sum: {:?}", start.elapsed());
    let start2 = Instant::now();
    many_heap_calls();
    println!("Many heap calls: {:?}", start2.elapsed());
}

Generated assembly:

example::simple_sum:
        ret

example::many_heap_calls:
        ret

Both functions compile down to a single ret, so I don't think this one requires further explanation.
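
Besides reading the assembly, one way to confirm at run time that Box::new() reaches the allocator is to install a counting global allocator. The following is a sketch under stated assumptions: CountingAllocator, ALLOCATIONS and count_box_allocations are hypothetical names that are not part of the original answer, and the wrapper simply forwards to std::alloc::System.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::hint::black_box;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Number of calls to `alloc` seen so far.
static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical wrapper (not from the original answer) that forwards to the
/// system allocator and counts each allocation request.
struct CountingAllocator;

unsafe impl GlobalAlloc for CountingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAllocator = CountingAllocator;

/// Allocates `n` boxes and returns how many allocations were counted meanwhile.
fn count_box_allocations(n: usize) -> usize {
    let before = ALLOCATIONS.load(Ordering::Relaxed);
    for _ in 0..n {
        // `black_box` keeps the compiler from proving the Box is unused.
        black_box(Box::new(black_box(42)));
    }
    ALLOCATIONS.load(Ordering::Relaxed) - before
}

fn main() {
    let counted = count_box_allocations(100_000);
    println!("allocations counted: {counted}");
}
```

If the count stays far below the number of iterations in a --release build, the allocations were optimized away; removing the black_box calls is an easy way to see the difference.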

huangapple
  • Published on March 31, 2023, 18:34:17
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75897535.html