How is Rust --release build slower than Go?
Question
I'm trying to learn about Rust's concurrency and parallel computing and threw together a small script that iterates over a vector of vectors as if it were an image's pixels. Since at first I was trying to see how much faster iter vs par_iter gets, I threw in a basic timer -- which is probably not amazingly accurate. However, I was getting crazy high numbers. So, I thought I would put together a similar piece of code in Go, which allows for easy concurrency, and the performance was ~585% faster!
Rust was tested with --release.
I also tried using a native thread pool, but the results were the same. I looked at how many threads I was using and messed around with that for a bit as well, to no avail.
What am I doing wrong?
(Don't mind the definitely-not-performant way of creating a random-value-filled vector of vectors.)
Rust code (~140ms)
use rand::Rng;
use std::time::Instant;
use rayon::prelude::*;

fn normalise(value: u16, min: u16, max: u16) -> f32 {
    (value - min) as f32 / (max - min) as f32
}

fn main() {
    let pixel_size = 9_000_000;
    let fake_image: Vec<Vec<u16>> = (0..pixel_size).map(|_| {
        (0..4).map(|_| {
            rand::thread_rng().gen_range(0..=u16::MAX)
        }).collect()
    }).collect();

    // Time starts now.
    let now = Instant::now();
    let chunk_size = 300_000;
    let _normalised_image: Vec<Vec<Vec<f32>>> = fake_image.par_chunks(chunk_size).map(|chunk| {
        let normalised_chunk: Vec<Vec<f32>> = chunk.iter().map(|i| {
            let r = normalise(i[0], 0, u16::MAX);
            let g = normalise(i[1], 0, u16::MAX);
            let b = normalise(i[2], 0, u16::MAX);
            let a = normalise(i[3], 0, u16::MAX);
            vec![r, g, b, a]
        }).collect();
        normalised_chunk
    }).collect();
    // Timer ends.
    let elapsed = now.elapsed();
    println!("Time elapsed: {:.2?}", elapsed);
}
Go code (~24ms)
package main

import (
    "fmt"
    "math/rand"
    "sync"
    "time"
)

func normalise(value uint16, min uint16, max uint16) float32 {
    return float32(value-min) / float32(max-min)
}

func main() {
    const pixelSize = 9000000
    var fakeImage [][]uint16

    // Create a new random number generator
    src := rand.NewSource(time.Now().UnixNano())
    rng := rand.New(src)

    for i := 0; i < pixelSize; i++ {
        var pixel []uint16
        for j := 0; j < 4; j++ {
            pixel = append(pixel, uint16(rng.Intn(1<<16)))
        }
        fakeImage = append(fakeImage, pixel)
    }

    normalised_image := make([][4]float32, pixelSize)
    var wg sync.WaitGroup

    // Time starts now
    now := time.Now()
    chunkSize := 300_000
    numChunks := pixelSize / chunkSize
    if pixelSize%chunkSize != 0 {
        numChunks++
    }
    for i := 0; i < numChunks; i++ {
        wg.Add(1)
        go func(i int) {
            // Loop through the pixels in the chunk
            for j := i * chunkSize; j < (i+1)*chunkSize && j < pixelSize; j++ {
                // Normalise the pixel values
                _r := normalise(fakeImage[j][0], 0, ^uint16(0))
                _g := normalise(fakeImage[j][1], 0, ^uint16(0))
                _b := normalise(fakeImage[j][2], 0, ^uint16(0))
                _a := normalise(fakeImage[j][3], 0, ^uint16(0))
                // Set the pixel values
                normalised_image[j][0] = _r
                normalised_image[j][1] = _g
                normalised_image[j][2] = _b
                normalised_image[j][3] = _a
            }
            wg.Done()
        }(i)
    }
    wg.Wait()
    elapsed := time.Since(now)
    fmt.Println("Time taken:", elapsed)
}
Answer 1
Score: 4
The most important initial change for speeding up your Rust code is using the correct type. In Go, you use a [4]float32 to represent an RGBA quadruple, while in Rust you use a Vec<f32>. The correct type to use for performance is [f32; 4], an array known to contain exactly 4 floats. An array with known size need not be heap-allocated, while a Vec is always heap-allocated. This improves your performance drastically - on my machine, it's a factor of 8 difference.
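As a minimal sketch of why this matters (my illustration, not part of the original answer; the exact size of Vec is an implementation detail, though pointer + capacity + length is how the standard library documents it):

use std::mem::size_of;

fn main() {
    // [f32; 4] is 16 bytes stored inline wherever it is used -- no heap allocation.
    assert_eq!(size_of::<[f32; 4]>(), 16);
    // Vec<f32> is a (pointer, capacity, length) triple -- 24 bytes on 64-bit
    // targets -- and its four floats live in a separate heap allocation.
    assert_eq!(size_of::<Vec<f32>>(), 3 * size_of::<usize>());

    let inline: [f32; 4] = [0.0; 4]; // no allocation
    let heap: Vec<f32> = vec![0.0; 4]; // one heap allocation -- per pixel, at scale
    assert_eq!(inline.len(), heap.len());
}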
Original snippet:
let fake_image: Vec<Vec<u16>> = (0..pixel_size).map(|_| {
    (0..4).map(|_| {
        rand::thread_rng().gen_range(0..=u16::MAX)
    }).collect()
}).collect();

...

let _normalised_image: Vec<Vec<Vec<f32>>> = fake_image.par_chunks(chunk_size).map(|chunk| {
    let normalised_chunk: Vec<Vec<f32>> = chunk.iter().map(|i| {
        let r = normalise(i[0], 0, u16::MAX);
        let g = normalise(i[1], 0, u16::MAX);
        let b = normalise(i[2], 0, u16::MAX);
        let a = normalise(i[3], 0, u16::MAX);
        vec![r, g, b, a]
    }).collect();
    normalised_chunk
}).collect();
New snippet:
let fake_image: Vec<[u16; 4]> = (0..pixel_size).map(|_| {
    let mut result: [u16; 4] = Default::default();
    result.fill_with(|| rand::thread_rng().gen_range(0..=u16::MAX));
    result
}).collect();

...

let _normalised_image: Vec<Vec<[f32; 4]>> = fake_image.par_chunks(chunk_size).map(|chunk| {
    let normalised_chunk: Vec<[f32; 4]> = chunk.iter().map(|i| {
        let r = normalise(i[0], 0, u16::MAX);
        let g = normalise(i[1], 0, u16::MAX);
        let b = normalise(i[2], 0, u16::MAX);
        let a = normalise(i[3], 0, u16::MAX);
        [r, g, b, a]
    }).collect();
    normalised_chunk
}).collect();
On my machine, this results in a roughly 7.7x speedup, bringing Rust and Go roughly to parity. The overhead of doing a heap allocation for every single quadruple slowed Rust down drastically and drowned out everything else; eliminating this puts Rust and Go on more even footing.
Second, there is a slight error in your Go code. In your Rust code, you calculate a normalized r, g, b, and a, while in your Go code, you only calculate _r, _g, and _b. I don't have Go installed on my machine, but I imagine this gives Go a slight unfair advantage over Rust, since you're doing less work.
Third, you are still not quite doing the same thing in Rust and Go. In Rust, you split the original image into chunks and, for each chunk, produce a Vec<[f32; 4]>. This means you still have a bunch of chunks sitting around in memory that you'll later have to combine into a single final image. In Go, you split the original image into chunks and write each chunk into a common array. We can rewrite your Rust code further to perfectly mimic the Go code. Here is what this looks like in Rust:
let _normalized_image: Vec<[f32; 4]> = {
    let mut destination = vec![[0.0f32; 4]; pixel_size];
    fake_image
        .par_chunks(chunk_size)
        // The "zip" function allows us to iterate over a chunk of the input
        // array together with a chunk of the destination array.
        .zip(destination.par_chunks_mut(chunk_size))
        .for_each(|(i_chunk, d_chunk)| {
            // Sanity check: the chunks should be of equal length.
            assert!(i_chunk.len() == d_chunk.len());
            for (i, d) in i_chunk.iter().zip(d_chunk) {
                let r = normalise(i[0], 0, u16::MAX);
                let g = normalise(i[1], 0, u16::MAX);
                let b = normalise(i[2], 0, u16::MAX);
                let a = normalise(i[3], 0, u16::MAX);
                *d = [r, g, b, a];
                // Alternately, we could do the following loop:
                // for j in 0..4 {
                //     d[j] = normalise(i[j], 0, u16::MAX);
                // }
            }
        });
    destination
};
Now your Rust code and your Go code are truly doing the same thing. I suspect you'll find the Rust code is slightly faster.
Finally, if you were doing this in real life, the first thing you should try would be using map as follows:
let _normalized_image = fake_image.par_iter().map(|&[r, g, b, a]| {
    [
        normalise(r, 0, u16::MAX),
        normalise(g, 0, u16::MAX),
        normalise(b, 0, u16::MAX),
        normalise(a, 0, u16::MAX),
    ]
}).collect::<Vec<_>>();
This is just as fast as manually chunking on my machine.
Answer 2
Score: 1
use rand::Rng;
use std::time::Instant;
use rayon::prelude::*;

fn normalise(value: u16, min: u16, max: u16) -> f32 {
    (value - min) as f32 / (max - min) as f32
}

type PixelU16 = (u16, u16, u16, u16);
type PixelF32 = (f32, f32, f32, f32);

fn main() {
    let pixel_size = 9_000_000;
    let fake_image: Vec<PixelU16> = (0..pixel_size).map(|_| {
        let mut rng = rand::thread_rng();
        (
            rng.gen_range(0..=u16::MAX),
            rng.gen_range(0..=u16::MAX),
            rng.gen_range(0..=u16::MAX),
            rng.gen_range(0..=u16::MAX),
        )
    }).collect();

    // Time starts now.
    let now = Instant::now();
    let chunk_size = 300_000;
    let _normalised_image: Vec<Vec<PixelF32>> = fake_image.par_chunks(chunk_size).map(|chunk| {
        let normalised_chunk: Vec<PixelF32> = chunk.iter().map(|i| {
            let r = normalise(i.0, 0, u16::MAX);
            let g = normalise(i.1, 0, u16::MAX);
            let b = normalise(i.2, 0, u16::MAX);
            let a = normalise(i.3, 0, u16::MAX);
            (r, g, b, a)
        }).collect::<Vec<_>>();
        normalised_chunk
    }).collect();
    // Timer ends.
    let elapsed = now.elapsed();
    println!("Time elapsed: {:.2?}", elapsed);
}
I have switched from arrays to tuples, and on my machine this solution is already 10 times faster than the one you provided. Speed could maybe be increased even further by cutting out the intermediate Vecs and using an Arc<Mutex<Vec<Pixel>>> or some mpsc channel, reducing the number of heap allocations.
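A minimal sketch of that Arc<Mutex<Vec<Pixel>>> idea (my addition, not the answerer's code), using the standard library's scoped threads; the placeholder input and the choice to lock once per chunk rather than per pixel are assumptions:

use std::sync::{Arc, Mutex};
use std::thread;

type PixelU16 = (u16, u16, u16, u16);
type PixelF32 = (f32, f32, f32, f32);

fn normalise(value: u16, min: u16, max: u16) -> f32 {
    (value - min) as f32 / (max - min) as f32
}

fn main() {
    // Placeholder input; the real code would fill this with random pixels.
    let fake_image: Vec<PixelU16> = vec![(0, 1, 2, 3); 1_000_000];
    let chunk_size = 250_000;
    let output: Arc<Mutex<Vec<PixelF32>>> =
        Arc::new(Mutex::new(vec![(0.0, 0.0, 0.0, 0.0); fake_image.len()]));

    thread::scope(|s| {
        for (idx, chunk) in fake_image.chunks(chunk_size).enumerate() {
            let output = Arc::clone(&output);
            s.spawn(move || {
                // Normalise the chunk locally, then take the lock once to
                // copy the results into the shared vector.
                let normalised: Vec<PixelF32> = chunk
                    .iter()
                    .map(|&(r, g, b, a)| {
                        (
                            normalise(r, 0, u16::MAX),
                            normalise(g, 0, u16::MAX),
                            normalise(b, 0, u16::MAX),
                            normalise(a, 0, u16::MAX),
                        )
                    })
                    .collect();
                let start = idx * chunk_size;
                output.lock().unwrap()[start..start + normalised.len()]
                    .copy_from_slice(&normalised);
            });
        }
    });
}

Note that the mutex serialises the final copies, so the lock-free par_chunks_mut approach from the first answer is still preferable when a shared output buffer is the goal.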
Answer 3
Score: 1
Vec<Vec<T>> is usually not recommended because it's not very cache-friendly; since you have Vec<Vec<Vec<T>>>, the situation is even worse. The process of memory allocation also costs a lot of time.
A slight improvement is to change the type to Vec<Vec<[T; N]>>, since the innermost Vec<T> should be a fixed-size array of 4 u16s or f32s. This reduced the processing time on my PC from ~110ms down to 11ms.
fn rev1() {
    let pixel_size = 9_000_000;
    let chunk_size = 300_000;
    let fake_image: Vec<[u16; 4]> = (0..pixel_size)
        .map(|_| core::array::from_fn(|_| rand::thread_rng().gen_range(0..=u16::MAX)))
        .collect();

    // Time starts now.
    let now = Instant::now();
    let _normalized_image: Vec<Vec<[f32; 4]>> = fake_image
        .par_chunks(chunk_size)
        .map(|chunk| {
            chunk
                .iter()
                .map(|rgba: &[u16; 4]| rgba.map(|v| normalise(v, 0, u16::MAX)))
                .collect()
        })
        .collect();
    // Timer ends.
    let elapsed = now.elapsed();
    println!("Time elapsed (r1): {:.2?}", elapsed);
}
However, this still requires a lot of allocation and copying. If a new vector is not needed, in-place mutation can be even faster: ~5ms.
pub fn rev2() {
    let pixel_size = 9_000_000;
    let chunk_size = 300_000;
    let mut fake_image: Vec<Vec<[f32; 4]>> = (0..pixel_size / chunk_size)
        .map(|_| {
            (0..chunk_size)
                .map(|_| {
                    core::array::from_fn(|_| {
                        rand::thread_rng().gen_range(0..=u16::MAX) as f32
                    })
                })
                .collect()
        })
        .collect();

    // Time starts now.
    let now = Instant::now();
    fake_image.par_iter_mut().for_each(|chunk| {
        chunk.iter_mut().for_each(|rgba: &mut [f32; 4]| {
            rgba.iter_mut()
                .for_each(|v: &mut _| *v = normalise_f32(*v, 0f32, u16::MAX as f32))
        })
    });
    // Timer ends.
    let elapsed = now.elapsed();
    println!("Time elapsed (r2): {:.2?}", elapsed);
}
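Note that rev2 above and rev4 below call a normalise_f32 helper the answer never defines; a straightforward f32 analogue of the original normalise is presumably intended:

fn normalise_f32(value: f32, min: f32, max: f32) -> f32 {
    (value - min) / (max - min)
}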
Here the Vec<Vec<T>> is still not ideal, while flattening it doesn't produce a significant performance improvement in this particular situation. Accessing an element in this nested array structure will be slower than in a flat array.
/// Create a new flat Vec from fake_image
pub fn rev3() {
    let pixel_size = 9_000_000;
    let _chunk_size = 300_000;
    let fake_image: Vec<[u16; 4]> = (0..pixel_size)
        .map(|_| core::array::from_fn(|_| rand::thread_rng().gen_range(0..=u16::MAX)))
        .collect();

    // Time starts now.
    let now = Instant::now();
    let _normalized_image: Vec<[f32; 4]> = fake_image
        .par_iter()
        .map(|rgba: &[u16; 4]| rgba.map(|v| normalise(v, 0, u16::MAX)))
        .collect();
    // Timer ends.
    let elapsed = now.elapsed();
    println!("Time elapsed (r3): {:.2?}", elapsed);
}

/// In-place mutation of a flat Vec
pub fn rev4() {
    let pixel_size = 9_000_000;
    let _chunk_size = 300_000;
    let mut fake_image: Vec<[f32; 4]> = (0..pixel_size)
        .map(|_| {
            core::array::from_fn(|_| {
                rand::thread_rng().gen_range(0..=u16::MAX) as f32
            })
        })
        .collect();

    // Time starts now.
    let now = Instant::now();
    fake_image.par_iter_mut().for_each(|rgba: &mut [f32; 4]| {
        rgba.iter_mut()
            .for_each(|v: &mut _| *v = normalise_f32(*v, 0f32, u16::MAX as f32))
    });
    // Timer ends.
    let elapsed = now.elapsed();
    println!("Time elapsed (r4): {:.2?}", elapsed);
}
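For completeness, here is a sketch of the fully flattened layout discussed above (my addition, not part of the answer): a single contiguous Vec<u16> of length 4 * pixel_size, processed one RGBA pixel at a time with rayon's par_chunks_exact:

/// Hypothetical rev5: one fully flat Vec, no per-pixel arrays at all.
pub fn rev5() {
    let pixel_size = 9_000_000;
    let fake_image: Vec<u16> = (0..pixel_size * 4)
        .map(|_| rand::thread_rng().gen_range(0..=u16::MAX))
        .collect();

    // Time starts now.
    let now = Instant::now();
    // Each exact chunk of 4 is one RGBA pixel; flat_map_iter keeps the output
    // as a single flat Vec<f32> of length 4 * pixel_size.
    let _normalized_image: Vec<f32> = fake_image
        .par_chunks_exact(4)
        .flat_map_iter(|rgba| rgba.iter().map(|&v| normalise(v, 0, u16::MAX)))
        .collect();
    // Timer ends.
    let elapsed = now.elapsed();
    println!("Time elapsed (r5): {:.2?}", elapsed);
}

As the answer notes, don't expect a large win over rev3/rev4 from this alone; Vec<[u16; 4]> is already contiguous in memory.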