
Rust FFI and CUDA C Performance Discrepancy

Question

I have a matrix multiplication kernel that, when timed from CUDA C, runs about 10x faster than when the same functions are called through Rust FFI.

I know I could use cuBLAS, but I am using this as an exercise to learn more advanced CUDA optimizations.

Looking at NVIDIA's Nsight Systems, I see the kernel itself taking extremely long in the Rust version, with essentially no overhead in either test. This is confusing: how can the same kernel take longer when invoked from Rust?

This makes me think it must be an issue with how I compiled the library for Rust, especially since the timings of both tests are identical when using cuBLAS.

Here is the build.rs for the matrix library:

use std::env;

fn main() {
  println!("cargo:rerun-if-changed=cuda_kernels/cuda_kernels.cu");

  cc::Build::new()
    .cuda(true)
    .cudart("static")
    .file("cuda_kernels/cuda_kernels.cu")
    .compile("cuda_kernels");

  if let Ok(cuda_path) = env::var("CUDA_HOME") {
    println!("cargo:rustc-link-search=native={}/lib64", cuda_path);
  } else {
    println!("cargo:rustc-link-search=native=/usr/local/cuda/lib64");
  }
  println!("cargo:rustc-link-lib=dylib=cudart");
  println!("cargo:rustc-link-lib=dylib=cublas");
}

And here is the code for my tests.

The CUDA header for my library:

#include <cublas_v2.h>
#include <cuda.h>

// Make sure the bindings are not name-mangled for Rust
extern "C" {

// Misc
void cuda_synchronize();

// Matrix setup API
size_t register_matrix(float* data, size_t rows, size_t cols);
void unregister_matrix(size_t mat_id);
void get_matrix_data(size_t mat_id, int rows, int cols, float* data_buffer);

// Matrix operation API
size_t cuda_matrix_multiply(size_t mat1_id, size_t mat1_rows, size_t mat1_cols, size_t mat2_id, size_t mat2_rows, size_t mat2_cols);
}
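
For context, the implementation of register_matrix is not shown in the question; the sketch below assumes it copies the host buffer to device memory and returns an opaque handle, which is consistent with how the tests use it (the host buffer is only borrowed for the duration of the call). All names and details here are hypothetical:

#include <cuda_runtime.h>
#include <unordered_map>

// Hypothetical implementation sketch; not the question's actual code.
static std::unordered_map<size_t, float*> g_matrices;
static size_t g_next_id = 1;

extern "C" size_t register_matrix(float* data, size_t rows, size_t cols) {
    float* device_ptr = nullptr;
    cudaMalloc(&device_ptr, rows * cols * sizeof(float));
    // Synchronous copy: callers may safely pass a temporary host buffer.
    cudaMemcpy(device_ptr, data, rows * cols * sizeof(float),
               cudaMemcpyHostToDevice);
    size_t id = g_next_id++;
    g_matrices[id] = device_ptr;
    return id;
}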

The CUDA C test:

#include <chrono>
#include <iostream>
#include <vector>
using namespace std::chrono;
#include "../cuda_kernels.cuh"

int main() {
    // This is just for timing; it assumes the results are correct.
    int dim = 4096;
    std::vector<float> data;
    for (int i = 0; i < dim * dim; i++) {
        data.push_back(23.47);
    }

    // Register the data as two 4096 x 4096 matrices
    size_t mat1 = register_matrix(&data[0], dim, dim);
    size_t mat2 = register_matrix(&data[0], dim, dim);

    auto start_host = high_resolution_clock::now();

    cudaEvent_t start;
    cudaEvent_t end;
    cudaEventCreate(&start);
    cudaEventCreate(&end);
    cudaEventRecord(start);

    int num_iter = 100;
    for (int i = 0; i < num_iter; i++) {
        // Perform the multiplication
        size_t result_id = cuda_matrix_multiply(mat1, dim, dim, mat2, dim, dim);
        cuda_synchronize();
        unregister_matrix(result_id);
    }

    float gpu_time = 0;
    cudaEventRecord(end);
    cudaEventSynchronize(end);
    cudaEventElapsedTime(&gpu_time, start, end);

    auto end_host = high_resolution_clock::now();
    auto cpu_time = duration_cast<milliseconds>(end_host - start_host);

    std::cout << "Average gpu function time was: " << gpu_time / num_iter << " ms" << std::endl;
    std::cout << "Including overhead was: " << (float)cpu_time.count() / num_iter << " ms" << std::endl;

    // Okay, something is wrong with the overhead in the Rust benchmark.
    // Something taking 184.3 ms here takes 1.3 seconds there.
}

Now on the Rust side, here are the bindings for the same CUDA functions:

Bindings.rs:

use std::ffi::c_float;

extern "C" {
  pub fn cuda_synchronize();
  pub fn register_matrix(data: *const c_float, rows: usize, cols: usize) -> usize;
  // Matches the C signature, which returns void
  pub fn unregister_matrix(mat_id: usize);
  pub fn cuda_matrix_multiply(
    mat1_id: usize,
    mat1_rows: usize,
    mat1_cols: usize,
    mat2_id: usize,
    mat2_rows: usize,
    mat2_cols: usize,
  ) -> usize;
}

And here is the Rust benchmark, which mirrors the CUDA C version:

rust_bench_test.rs:

// See why there is a 13x discrepancy between the Rust FFI and C++ benchmarks

use std::time::Instant;

use matrix_lib::bindings::{
  cuda_matrix_multiply, cuda_synchronize, register_matrix, unregister_matrix,
};

#[test]
fn mat_mult_benchmark() {
  let mat_dim = 4096;

  let id_1;
  let id_2;
  unsafe {
    id_1 = register_matrix(vec![0.0; mat_dim * mat_dim].as_ptr(), mat_dim, mat_dim);
    id_2 = register_matrix(vec![0.0; mat_dim * mat_dim].as_ptr(), mat_dim, mat_dim);
  }

  let num_iterations = 100;
  let start = Instant::now();

  let mut result_id = 0;
  for _ in 0..num_iterations {
    unsafe {
      result_id = cuda_matrix_multiply(id_1, mat_dim, mat_dim, id_2, mat_dim, mat_dim);
      cuda_synchronize();
      unregister_matrix(result_id);
    }
  }
  unsafe { cuda_synchronize() }
  let elapsed = start.elapsed();
  println!(
    "\n=================================\nTime per iteration: {} ms\n=================================",
    elapsed.as_millis() as f64 / num_iterations as f64
  );

  print!("{}", result_id);

  // Deliberately fail so `cargo test` prints the output above
  assert!(false);
}

Answer 1

Score: 4


You wrote your benchmark as a #[test]. cargo test uses the test profile, which is by default the same as the dev profile, which has no optimization and enables debug assertions. This default behavior is because tests often need to be debugged (so optimizations are disabled), should check more edge cases (so debug assertions are enabled), and should give clear stack traces (so optimizations, particularly inlining, are disabled).

You can run cargo test --release, but it is better to get the right default by making your benchmark an actual benchmark target. cargo bench instead uses the bench profile, which is by default the same as the release profile, which has opt-level = 3 and no debug assertions.
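
If you do want optimized #[test] builds without creating a separate target, one option is a profile override in Cargo.toml. This is a minimal sketch using standard Cargo profile keys; it is not something the question's project already contains:

[profile.test]
opt-level = 3
debug-assertions = false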

The currently recommended tool for running benchmarks in Rust is criterion; if you use it as its documentation describes, you get a benchmark target and optimized builds of it by default. You can also skip criterion and continue using your own measurement code (sketches of both approaches follow below); the main thing is that you will need to move the file to benches/rust_bench_test.rs, declare the benchmark target like this

[[bench]]
name = "rust_bench_test"
harness = false

and write a fn main() instead of a #[test] function.
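
For the non-criterion route, a minimal sketch of what benches/rust_bench_test.rs could look like with harness = false, reusing the question's bindings as-is:

use std::time::Instant;

use matrix_lib::bindings::{
  cuda_matrix_multiply, cuda_synchronize, register_matrix, unregister_matrix,
};

fn main() {
  let mat_dim = 4096;
  let data = vec![0.0f32; mat_dim * mat_dim];

  // Register two 4096 x 4096 matrices; the host buffer is only read during the call
  let (id_1, id_2) = unsafe {
    (
      register_matrix(data.as_ptr(), mat_dim, mat_dim),
      register_matrix(data.as_ptr(), mat_dim, mat_dim),
    )
  };

  let num_iterations = 100;
  let start = Instant::now();
  for _ in 0..num_iterations {
    unsafe {
      let result_id = cuda_matrix_multiply(id_1, mat_dim, mat_dim, id_2, mat_dim, mat_dim);
      cuda_synchronize();
      unregister_matrix(result_id);
    }
  }
  let elapsed = start.elapsed();

  println!(
    "Time per iteration: {} ms",
    elapsed.as_millis() as f64 / num_iterations as f64
  );

  unsafe {
    unregister_matrix(id_1);
    unregister_matrix(id_2);
  }
}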
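
And a rough criterion version of the same benchmark, assuming criterion has been added as a dev-dependency; the benchmark name is illustrative. Note that criterion benchmarks also use a [[bench]] target with harness = false:

use criterion::{criterion_group, criterion_main, Criterion};
use matrix_lib::bindings::{
  cuda_matrix_multiply, cuda_synchronize, register_matrix, unregister_matrix,
};

fn mat_mult_benchmark(c: &mut Criterion) {
  let mat_dim = 4096;
  let data = vec![0.0f32; mat_dim * mat_dim];
  let (id_1, id_2) = unsafe {
    (
      register_matrix(data.as_ptr(), mat_dim, mat_dim),
      register_matrix(data.as_ptr(), mat_dim, mat_dim),
    )
  };

  // criterion chooses iteration counts and reports statistics itself
  c.bench_function("cuda_matrix_multiply 4096x4096", |b| {
    b.iter(|| unsafe {
      let result_id = cuda_matrix_multiply(id_1, mat_dim, mat_dim, id_2, mat_dim, mat_dim);
      cuda_synchronize();
      unregister_matrix(result_id);
    })
  });
}

criterion_group!(benches, mat_mult_benchmark);
criterion_main!(benches);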
