Why is this parallelized code taking the same amount of time as non-parallelized code?


Question

I'm still relatively new to coding in Rust and picked up a college project to parallelize the Newton-Raphson method for a system of non-linear equations, hoping to see performance improvements. After reading Chapter 16 of the official Rust book, Fearless Concurrency, I gave it a try. But as soon as I ran the code, I noticed that it took the same amount of time as the non-parallelized version to display the results* in the terminal. My questions are:

  1. Why is it running as slowly as the non-parallelized code?
  2. What did I do wrong to make it run as slowly as non-parallelized code?
  3. What can I do to really see some performance gains?
    (*: the results produced are completely off scale - they start with some wild values like 2e31 and scale down to 0.5, for example)

Here is the code:

teste.rs

use std::sync::{Arc, Mutex};
use std::thread;

pub fn derivada_parcial(funcao: impl Fn(&[f32]) -> f32, indice: usize) -> impl Fn(&[f32]) -> f32 {
    let h: f32 = 0.001;
    move |x: &[f32]| {
        let mut x_mais_h = x.to_vec();
        let mut x_menos_h = x.to_vec();
        x_mais_h[indice] += h;
        x_menos_h[indice] -= h;
        (funcao(&x_mais_h) - funcao(&x_menos_h)) / (2.0 * h)
    }
}

pub fn par_elim_gauss_og(tam: usize, a: &mut [Vec<f32>], b: &mut [f32]) -> Vec<f32> {
    // Partial pivoting: swap in the row with the largest pivot element
    for i in 0..tam {
        let mut max_linha = i;
        for j in (i + 1)..tam {
            if a[j][i].abs() > a[max_linha][i].abs() {
                max_linha = j;
            }
        }
        if max_linha != i {
            a.swap(i, max_linha);
            b.swap(i, max_linha);
        }
    }

    // Parallelization?
    let atom_a = Arc::new(Mutex::new(a.to_vec()));
    let atom_b = Arc::new(Mutex::new(b.to_vec()));
    let mut handles = vec![];
    let total_threads = 24;

    // Resolution
    for num_thread in 0..total_threads {
        let atom_a = Arc::clone(&atom_a);
        let atom_b = Arc::clone(&atom_b);
        let handle = thread::spawn(move || {
            let mut a_cop = atom_a.lock().unwrap();
            let mut b_cop = atom_b.lock().unwrap();
            let inicio = (num_thread * tam) / total_threads;
            let fim = ((num_thread + 1) * tam) / total_threads;
            for i in inicio..fim {
                for j in (i + 1)..tam {
                    let m = a_cop[j][i] / a_cop[i][i];
                    for k in i..tam {
                        a_cop[j][k] -= m * a_cop[i][k];
                    }
                    b_cop[j] -= m * b_cop[i];
                }
            }
        });
        handles.push(handle);
    }

    for handle in handles {
        handle.join().unwrap();
    }
    // Substitution
    let mut x: Vec<f32> = vec![0.0; tam];
    for i in (0..tam).rev() {
        let mut soma = 0.0;
        for j in (i + 1)..tam {
            soma += a[i][j] * x[j];
        }
        x[i] = (b[i] - soma) / a[i][i];
    }
    x
}

main.rs

use rand::Rng;

use crate::teste::{derivada_parcial, par_elim_gauss_og};

mod teste;

fn main() {
    const TAM: usize = 1000;
    let a: &mut[Vec<f32>] = &mut vec![vec![0.0; TAM]; TAM];
    let x_0: &[f32] = &[1.0; TAM];
    let b: &mut [f32] = &mut [0.0; TAM];

    for i in 0..TAM {
        for j in 0..TAM {
            // Random variables to accompany the x_i variables
            let potencia: f32 = rand::thread_rng().gen_range(1.0..=6.0);
            let constante: f32 = rand::thread_rng().gen_range(1.0..=10.0);
            let aleatorio_i: i32 = rand::thread_rng().gen_range(0..=1);

            // Creation of functions and their derivatives for i
            let funcao = |x: &[f32]| (-1.0_f32).powi(aleatorio_i) * constante * x[j].powf(potencia);
            let derivada = derivada_parcial(funcao, j);

            // Filling the A matrix and B vector (for Ax = B)
            let valor = derivada(x_0);
            a[i][j] += valor;
            b[i] += funcao(x_0) * (-1.0);
        }
    }

    let resultados = par_elim_gauss_og(TAM, a, b);
    println!("Resultado: {:#?}", resultados);
}

Side note: in main.rs, I am populating the Jacobian matrix with random constants attached to each variable (I am trying to simulate a system of n non-linear functions in n variables). Variable names are in Brazilian Portuguese.

I expected the "parallelized" code to run faster, but there was no performance gain. The book states that when using Arc and Mutex, the priority is thread safety rather than performance; but seeing no performance gain at all when running the code on multiple threads seems off. I suspect part of the problem is that I'm not actually running different chunks of the work on different threads, but I really don't know.


Answer 1

Score: 2


> What did I do wrong to make it run as slowly as non-parallelized code?

        let handle = thread::spawn(move || {
            let mut a_cop = atom_a.lock().unwrap();
            let mut b_cop = atom_b.lock().unwrap();

You share the a and b vectors between all the threads, and the first thing each thread does is take a lock on the vectors, which it does not release until it finishes. Therefore, there can be at most one thread doing work at a time, since one thread holds the locks and the other threads are all waiting for the same locks.
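To see how completely this serializes the work, here is a minimal, self-contained sketch of the same pattern (this is not the question's code, and the per-thread busy-work is arbitrary): each thread holds the guard for its entire body, so the eight threads run one after another and the elapsed time grows with the thread count instead of shrinking.

use std::hint::black_box;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Instant;

fn main() {
    let data = Arc::new(Mutex::new(vec![0u64; 8]));
    let start = Instant::now();

    let handles: Vec<_> = (0..8)
        .map(|i| {
            let data = Arc::clone(&data);
            thread::spawn(move || {
                // The guard is acquired once and held until the closure ends,
                // so only one thread is ever inside this block at a time.
                let mut guard = data.lock().unwrap();
                // Stand-in for real work; black_box keeps it from being
                // constant-folded away in release builds.
                guard[i] = (0..black_box(20_000_000u64)).sum();
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    // With real parallelism this would take roughly the time of one sum;
    // serialized behind the lock it takes roughly eight times that.
    println!("took {:?}", start.elapsed());
}

A separate issue worth noting: the threads in the question operate on the copies created by a.to_vec() and b.to_vec() inside the Arc<Mutex<...>>, and those eliminated copies are never written back into a and b, so the final back-substitution runs on the pivoted but not eliminated matrix. That would also explain the off-scale results mentioned in the question.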

> What can I do to really see some performance gains?

First, you need to change your algorithm to one that can be parallelized. Your algorithm seems to be both reading and writing different parts of a, but not in a way that is sliceable as written; you will have to structure your code so that, at the very least, the writes go into disjoint slices of the vector.

Then, divide your vector into chunks and give each chunk to a different thread. Here is a simple example of doing that:

fn main() {
    let mut data: Vec<usize> = vec![0; 100];

    let chunk_size = 10;
    let chunk_iter = data.chunks_mut(chunk_size).enumerate();
    std::thread::scope(|s| {
        for (i, chunk) in chunk_iter {
            s.spawn(move || {
                // chunk is a &mut [usize]: a disjoint slice this thread owns exclusively
                for (j, d) in chunk.iter_mut().enumerate() {
                    *d = i * 100 + j;
                }
            });
        }
    });

    println!("{data:?}");
}

Notice that there are no locks, so no thread ever waits for another — this is a key element of getting high parallelism, because every microsecond spent waiting for a lock to be available is a microsecond where you're not doing any useful computation. You also don't need to use Arc, because these are scoped threads which can borrow data; Arc is mostly needed for non-scoped threads which need to own all the data they use.

In this example, i is the thread number and j is the index within the chunk. You could compute the index in the vector as a whole (if this is needed for your algorithm) as i * chunk_size + j.

Note that from within each thread, you cannot read from elements in the vector outside of the chunk. This is because the other threads have exclusive access to their own chunks; this is a key part of how Rust provides safe concurrency.
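There is, however, a standard way to combine exclusive write access to one part of the data with shared read access to another part: split_at_mut. Below is a sketch of how the question's elimination step could be restructured around it. It is only a sketch under assumptions: the helper name eliminate_step and the n_threads parameter are made up for illustration, it keeps the question's Vec<Vec<f32>> row layout, and the outer loop over pivots stays sequential. For a fixed pivot i, the pivot row becomes a shared read-only borrow while the rows below it are handed to scoped threads as disjoint mutable chunks, so no locks are needed.

use std::thread;

// One elimination step for pivot `i` (hypothetical helper, not from the question).
fn eliminate_step(a: &mut [Vec<f32>], b: &mut [f32], i: usize, n_threads: usize) {
    let tam = a.len();
    // Rows 0..=i are only read from here on; rows below i are written.
    let (upper_a, lower_a) = a.split_at_mut(i + 1);
    let pivot_row: &[f32] = &upper_a[i];
    let (upper_b, lower_b) = b.split_at_mut(i + 1);
    let pivot_b = upper_b[i];

    let chunk_size = (lower_a.len() / n_threads).max(1);
    thread::scope(|s| {
        for (rows, b_rows) in lower_a
            .chunks_mut(chunk_size)
            .zip(lower_b.chunks_mut(chunk_size))
        {
            // Each thread gets exclusive access to its own rows plus a shared
            // view of the pivot row, so no synchronization is needed.
            s.spawn(move || {
                for (row, b_j) in rows.iter_mut().zip(b_rows.iter_mut()) {
                    let m = row[i] / pivot_row[i];
                    for k in i..tam {
                        row[k] -= m * pivot_row[k];
                    }
                    *b_j -= m * pivot_b;
                }
            });
        }
    });
}

Spawning fresh threads inside every pivot iteration has its own cost, so for a 1000x1000 system it may be better to reuse a thread pool, for example via rayon as suggested further down.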

If you need to read data while also writing, you may wish to use a “double buffering” strategy: make two copies of the data (two vectors), read from one and write to the other, then when appropriate, swap the roles of the two vectors so that you're reading from the newer one and writing to the older one.
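A minimal sketch of that double-buffering shape, with a made-up step() update rule standing in for whatever each iteration actually computes from the previous state:

use std::thread;

// Hypothetical update rule, purely for illustration: each new value is an
// average of two values from the previous state.
fn step(prev: &[f32], idx: usize) -> f32 {
    0.5 * prev[idx] + 0.5 * prev[(idx + 1) % prev.len()]
}

fn main() {
    let mut front: Vec<f32> = (0..8).map(|i| i as f32).collect();
    let mut back: Vec<f32> = vec![0.0; front.len()];
    let chunk_size = 2;

    for _ in 0..10 {
        let prev = &front; // read-only view of the current state
        let chunk_iter = back.chunks_mut(chunk_size).enumerate();
        thread::scope(|s| {
            for (c, chunk) in chunk_iter {
                s.spawn(move || {
                    for (j, out) in chunk.iter_mut().enumerate() {
                        // Reads may range over all of `prev`; writes stay inside `chunk`.
                        *out = step(prev, c * chunk_size + j);
                    }
                });
            }
        });
        // The freshly written buffer becomes the one read in the next iteration.
        std::mem::swap(&mut front, &mut back);
    }
    println!("{front:?}");
}

std::mem::swap only exchanges the two Vec values (pointer, length, capacity), so switching the roles of the buffers costs nothing per iteration.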

That may or may not be the best option; as you write, you have a “project to parallelize the Newton-Raphson method”, and that means you will have to figure out what algorithm to use (whether it is novel or found in the literature), within the constraints of Rust's memory and concurrency model.


I also strongly recommend that you try out the rayon library. It provides “parallel iterators” that automatically execute on multiple threads. It is intelligent in that it automatically splits the work into as many parts as suits the available cores, and the parallel iteration API often makes problems easy to express. However, it does have some overhead; if you know exactly what you're doing, you may be able to beat it by planning your own threads since you know the exact shape of the problem.
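As a sketch of what that looks like, here is the chunked example from above rewritten with rayon's parallel iterators (this assumes rayon has been added as a dependency in Cargo.toml):

use rayon::prelude::*;

fn main() {
    let mut data: Vec<usize> = vec![0; 100];

    // Same work as the scoped-thread example above; rayon's thread pool
    // decides how the chunks are distributed over the available cores.
    data.par_chunks_mut(10)
        .enumerate()
        .for_each(|(i, chunk)| {
            for (j, d) in chunk.iter_mut().enumerate() {
                *d = i * 100 + j;
            }
        });

    println!("{data:?}");
}

The same idea applies to the elimination step: the split_at_mut structure sketched earlier works with par_chunks_mut or par_iter_mut in place of the hand-rolled scoped threads, and rayon's persistent thread pool avoids spawning new OS threads on every pivot.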
