
Method to make this for-loop faster in C++, possibly with NVidia?

Question


I want to make a C++ function faster. I am asking you about potential ways to do it.

I can use up to 32 OMP threads.

I can use an NVidia GPU.

An MWE for the function is:

#define _USE_MATH_DEFINES // must come before <cmath> for M_PI on some compilers
#include <iostream>
#include <complex>
#include <cmath>
#include <limits>

typedef std::numeric_limits<double> dbl;

#include <omp.h>

const std::complex<double> I(0.0, 1.0); // imaginary unit, I*I = -1
std::complex<double> zero_imag (0.0, 0.0);

const int N_rs = 1500;
const int l_max = 70;
const int lmax = 70;
const int N_thetas = l_max + 1;
const int N_phis = 2 * l_max + 2;
const int N_ps = 600;
const int nphi = 2 * l_max + 2;
const double sqrt_of_2_over_pi = sqrt( 2.0 / M_PI );


void rtop(std::complex<double> * Psi_outer_spec,
          std::complex<double> * Psi_outer_spec_plm,
          double * BJ,
          double * wrk,
          std::complex<double> * wrk2,
          double * ris_without_ends,
          double * r_primes_without_ends,
          double * weights_Lobatto_without_ends
         )
{


    int l, kk, kkk, m;
    long int idx, idxx, idxxx;

    // #pragma omp parallel for firstprivate (wrk2) private(l, kkk, idx, m, kk, idxx, idxxx) schedule(static)
    // #pragma omp target teams distribute parallel for firstprivate(wrk2) private(l, kkk, idx, m, kk, idxx, idxxx)
    for (int i = 0; i <= (N_ps - 1); i++) { // THIS IS THE BOTTLENECK !!!
       
        std::complex<double> sum1 = std::complex<double> (0.0, 0.0); // each thread creates a sum1 on its own

        for (l = 0; l <= lmax; l++) {

            for (kkk = 0; kkk <= (N_rs-1); kkk++) {
                idx = i * (N_rs*(l_max+1)) + kkk * (l_max+1) + l;
                wrk2[kkk] = pow(-I, l) * BJ[idx] * wrk[kkk];
            }

            for (m = 0; m <= (nphi-1); m++) {

                sum1 = zero_imag;
                for (kk = 0; kk <= (N_rs-1); kk++) {
                    idxx = kk * (N_thetas*N_phis) + l * N_phis + m;
                    sum1 += Psi_outer_spec[idxx] * wrk2[kk];

                }

                idxxx = i * (N_thetas*N_phis) + l * N_phis + m;
                Psi_outer_spec_plm[idxxx] = sum1 * sqrt_of_2_over_pi;
                                       
            }
            // END for m loop
        }
        // END for l loop
    }    
    // END for i loop
}

int main() {

    double * wrk = new double [N_rs];
    std::complex<double> * wrk2 = new std::complex<double> [N_rs];

    double * ris_without_ends = new double [N_rs];
    double * r_primes_without_ends = new double [N_rs];
    double * weights_Lobatto_without_ends = new double [N_rs];

    double * BJ = new double [N_ps * N_rs * (l_max+1)];

    std::complex<double> * Psi_outer_spec = new std::complex<double> [N_rs * N_thetas * N_phis];
    std::complex<double> * Psi_outer_spec_plm = new std::complex<double> [N_ps * N_thetas * N_phis];

    rtop(Psi_outer_spec, Psi_outer_spec_plm, BJ, wrk, wrk2, ris_without_ends, r_primes_without_ends, weights_Lobatto_without_ends);
   
    return 0;
}

The associated CMakeLists.txt is:

cmake_minimum_required(VERSION 3.0 FATAL_ERROR)

set(CMAKE_VERBOSE_MAKEFILE ON)

set(CMAKE_C_COMPILER "gcc")
set(CMAKE_CXX_COMPILER "g++")

project(trial)

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -pedantic -Wall")

find_package(OpenMP)

add_executable(trial trial.cpp)

if(OpenMP_CXX_FOUND)
target_link_libraries(trial PUBLIC OpenMP::OpenMP_CXX)
endif()

set_property(TARGET trial PROPERTY CXX_STANDARD 17)

Compile as: $ cmake .. then $ cmake --build . --config Release.

My output is:

-- The C compiler identification is GNU 11.3.0
-- The CXX compiler identification is GNU 11.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /apps20/sw/eb/software/GCCcore/11.3.0/bin/gcc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /apps20/sw/eb/software/GCCcore/11.3.0/bin/g++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- Configuring done
-- Generating done
-- Build files have been written to: /work4/clf/ouatu/trial_for_SO/build 

Then for the build:

[ 50%] Building CXX object CMakeFiles/trial.dir/trial.cpp.o
[100%] Linking CXX executable trial
[100%] Built target trial

What I have tried:

  • With OpenMP parallel for, I do get speedup.

  • I fail with OpenMP GPU offloading; it seems my compiler flags do not enable
    the offloading (those flags are omitted from the CMakeLists.txt shown for this
    MWE; see the example flags after this list).

  • I am open to any other suggestions.
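
For reference: GCC typically needs something like the following for nvptx offloading. This is only an illustration, assuming a GCC build configured with the nvptx offload backend; these exact flags are an assumption, not the ones used for this question.

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -fopenmp -foffload=nvptx-none")

If g++ -v does not list nvptx-none among its offload targets, the target pragmas typically compile but fall back to running on the host.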

For example, would rtop benefit from being a CUDA kernel? Is it hard to make it that way?

Thank you!

Answer 1

Score: 6


I suggest an OpenMP version with some optimizations and tweaks. A quick review of some changes and what to look out for:

The whole business with wrk2[kkk] = pow(-I, l) * ... is doubly redundant. For one, pow(-I, l) is an elegant but expensive way of expressing just 4 different values. Second, it's only used as a factor in a dot product. You can fold the whole thing into the final multiplication sum1 * sqrt_of_2_over_pi. That also allows wrk2 to be real-valued, which also turns the innermost loop from a complex-complex dot product to a complex-real dot product.
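
To spell that out, (-I)^l only cycles through four values, so the pow call could be replaced by a cheap lookup; a minimal sketch (the helper name is mine, not from the answer):

// (-I)^l has period 4: 1, -I, -1, I
std::complex<double> minus_i_pow(int l) {
    switch (l & 3) {
        case 0:  return { 1.0,  0.0};
        case 1:  return { 0.0, -1.0};
        case 2:  return {-1.0,  0.0};
        default: return { 0.0,  1.0};
    }
}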

The multi-dimensional index calculations like idx = i * (N_rs*(l_max+1)) + kkk * (l_max+1) + l should be done following the Horner method to avoid redundant multiplications. More of a nitpick, but it is also cleaner. For example, here idx = (i * N_rs + kkk) * (l_max+1) + l. While we're at it, be careful with your index variables. They are all int. Especially the 3-dimensional arrays can quickly grow to multiple GiB in size, at which point you will experience integer overflows. Switch to std::ptrdiff_t if you fear that this may become a problem.
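
As a sketch of both points, an index helper in Horner form using a wide signed type (the helper name is mine; N_rs and l_max are the global constants from the MWE, and std::ptrdiff_t comes from <cstddef>):

#include <cstddef>

// Horner-form index into BJ; std::ptrdiff_t avoids int overflow for very large arrays
inline std::ptrdiff_t bj_index(std::ptrdiff_t i, std::ptrdiff_t kkk, std::ptrdiff_t l) {
    return (i * N_rs + kkk) * (l_max + 1) + l;
}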

The order of iterations over BJ and Psi_outer_spec_plm aren't ideal. If possible, BJ should swap the two inner dimensions for better locality of data, which would also allow for vectorization of the loop initializing wrk2. Psi_outer_spec is even worse because you iterate along the outer dimension in your innermost loop. However, I assume this order was picked so that it is the same as with Psi_outer_spec_plm and for that it is good. In any case, this higher stride prevents vectorization.
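
To illustrate the BJ point: if you can also change the code that fills BJ, swapping its two inner dimensions makes the kkk loop read contiguously. This is a sketch under that assumption, with i and l taken from the surrounding loops; every other snippet in this answer keeps the original layout.

// original layout: stride of (l_max + 1) doubles between consecutive kkk
//     idx = (i * N_rs + kkk) * (l_max + 1) + l;
// swapped inner dimensions: kkk becomes the fastest-moving index, stride 1
for (int kkk = 0; kkk <= N_rs - 1; kkk++) {
    int idx = (i * (l_max + 1) + l) * N_rs + kkk; // contiguous reads, vectorizable
    wrk2[kkk] = BJ[idx] * wrk[kkk];
}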

I don't see a reason why you declare counter and index variables outside the scope where they are used. Even modern C standards allow declaring them inside for loops, let alone C++. For parallelization, you want to limit the number of shared or accidentally shared variables.

Speaking of shared data, as far as I can see, the only shared memory in which threads may overlap is the wrk2 array. That can simply be allocated per thread, which brings us to the final implementation.

#   pragma omp parallel
    {
        auto wrk2 = std::make_unique<double[]>(N_rs);
#       pragma omp for collapse(2) nowait
        for (int i = 0; i <= (N_ps - 1); i++) {
            for (int l = 0; l <= lmax; l++) {
                for (int kkk = 0; kkk <= (N_rs-1); kkk++) {
                    int idx = (i * N_rs + kkk) * (lmax + 1) + l;
                    wrk2[kkk] = BJ[idx] * wrk[kkk];
                }
                constexpr std::complex<double> I(0., 1.);
                std::complex<double> factor(-sqrt_of_2_over_pi);
                if(l & 1)
                    factor *= I;
                if(l & 2)
                    factor = -factor;
                for (int m = 0; m <= (N_phis-1); m++) {
                    std::complex<double> sum1;
                    for (int kk = 0; kk <= (N_rs-1); kk++) {
                        int idx = (kk * N_thetas + l) * N_phis + m;
                        sum1 += Psi_outer_spec[idx] * wrk2[kk];
                    }
                    int idx = (i * N_thetas + l) * N_phis + m;
                    Psi_outer_spec_plm[idx] = sum1 * factor;
                }
            }
        }
    }

Note how the usual pragma omp parallel for is split into an omp parallel and a separate omp for to allow for allocating the temporary memory. The collapse(2) means that both outer loops are parallelized.

Other things to consider:

  • The inner dot product may be computed faster by an accelerated BLAS library or something similar. I think Eigen should work well here, but one might need to coerce it a bit into working with this memory layout.
  • It looks somewhat like we could change the m loop into a matrix-vector product, which might solve some of our vectorization / memory access issues via a BLAS library.
  • Since you asked about compile options, -march=native or whatever baseline architecture you want should be worthwhile here. -mavx2 -mfma may be a good compromise to handle all relatively recent CPUs without specializing the binary too much.
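
Applied to the CMakeLists.txt from the question, that could look roughly like this (an illustrative tweak, not part of the original answer):

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -pedantic -Wall -march=native")
# or, to stay portable across recent x86-64 CPUs:
# set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -pedantic -Wall -mavx2 -mfma")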

Edit: Matrix-vector product

Coming back to the idea to offload the loop over m into a matrix vector product, we have to reinterpret the slice of Psi_outer_spec that we use as a matrix. I choose a column-major matrix since I want to use Eigen3 for this step.

  • The number of rows is N_phis (loop counter m)
  • The number of columns is N_rs (loop counter kk)
  • From one column to the next we have a stride (a.k.a. leading dimension) of N_phis * N_thetas
  • The offset of the top left corner is l * N_phis

Assuming this is correct, we can map our arrays into Eigen vectors and matrices and let it handle the transposed accesses. This turns everything below the wrk2 initialization into this code

using MatrixMap = Eigen::Map<const Eigen::MatrixXcd,
        Eigen::Unaligned, Eigen::OuterStride<>>;
MatrixMap Psi_slice(
        Psi_outer_spec + l * N_phis /*top left corner*/,
        N_phis /*rows*/, N_rs /*cols*/,
        Eigen::OuterStride<>(N_phis * N_thetas));
const auto wrk2_mapped = Eigen::VectorXd::Map(wrk2.get(), N_rs);
auto Psi_plm_mapped = Eigen::VectorXcd::Map(
        Psi_outer_spec_plm + (i * N_thetas + l) * N_phis, N_phis);
Psi_plm_mapped.noalias() = Psi_slice * wrk2_mapped * factor;

And now this step obviously raises the question of whether we can turn the whole thing into a matrix-matrix product with some pre- or post-processing, which might take care of the entire parallelization and potential offloading to GPUs. This is why I asked for a mathematical description instead of doing this wild goose chase through the code.

Edit 2: Matrix-matrix product

It is indeed possible to rewrite it as a matrix-matrix product. The trick is the observation that Psi_outer_spec is independent from i. Therefore if we switch the two outer loops, we can compute all values for one l over all i in one operation.

While doing so, I switch back to wrk2 being complex and including the factor. This technically requires more compute time and memory, but with a matrix-matrix product you may want to dispatch to a BLAS backend, either directly (for example OpenBLAS), via Eigen's backends, or even to GPU acceleration such as CuBLAS. And for that you need a complex-complex multiplication.

Eigen::MatrixXcd wrk2mat(N_rs, N_ps);
for (int l = 0; l <= lmax; l++) {
    std::complex<double> factor(-sqrt_of_2_over_pi);
    if(l & 1)
        factor *= I;
    if(l & 2)
        factor = -factor;
#   pragma omp parallel for
    for (int i = 0; i <= N_ps - 1; i++) {
        for (int k = 0; k <= N_rs - 1; ++k) {
            int idx = (i * N_rs + k) * (lmax + 1) + l;
            wrk2mat(k, i) = BJ[idx] * wrk[k] * factor;
        }
    }
    using ConstMatrixMap = Eigen::Map<const Eigen::MatrixXcd,
            Eigen::Unaligned, Eigen::OuterStride<>>;
    ConstMatrixMap Psi_slice(
            Psi_outer_spec + l * N_phis /*top left corner*/,
            N_phis /*rows*/, N_rs /*cols*/,
            Eigen::OuterStride<>(N_phis * N_thetas));
    using MatrixMap = Eigen::Map<Eigen::MatrixXcd,
            Eigen::Unaligned, Eigen::OuterStride<>>;
    MatrixMap Psi_plm_mapped(
            Psi_outer_spec_plm + l * N_phis,
            N_phis, N_ps,
            Eigen::OuterStride<>((lmax + 1) * N_phis));
    Psi_plm_mapped.noalias() = Psi_slice * wrk2mat;
}

The matrix-matrix product should be internally parallelized as long as the matrices are large enough. If this is not always the case, you can wrap the whole thing into a runtime-optional parallel block. Roughly like this:

bool small_matrices = ...;
#pragma omp parallel if(small_matrices)
{
    Eigen::MatrixXcd wrk2mat(N_rs, N_ps);
#   pragma omp for nowait
    for (int l = 0; l <= lmax; l++) {
        ...
    }
}

Since OpenMP normally deactivates nested parallelization, this will automatically deactivate all the inner parallel sections and run them sequentially.
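
If you prefer to make that explicit rather than rely on the default, the nesting depth can be pinned via the OpenMP runtime API; a small sketch (the code above simply relies on the default behaviour):

#include <omp.h>

// allow at most one active level of parallelism: the outer omp parallel stays
// active, while inner OpenMP parallel regions (e.g. an OpenMP-backed BLAS) run serially
omp_set_max_active_levels(1);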

Answer 2

Score: 3


10x faster solution using the GPU and CuBLAS. This also fixes a wrong-result bug in the currently accepted answer.

Test computer specs:

  • Ryzen 2950X, quad channel RAM 2133MHz
  • 2080ti (terrible at FP64)
  • PCIE 3.0 x16 lanes
  • Ubuntu 20.04, CUDA 12.1, Eigen 3.3.9
  • I don't know what backend Eigen is using because it's not my PC; I think it's MKL

The time cost for each implementation using this PC:

  • Original code: ~75s
  • Eigen code OMP_NUM_THREADS=1: 8s
  • Eigen code OMP_NUM_THREADS=16: 1.1s
  • CUDA + CuBLAS: 0.26s.

I measure each segment by using MyTimer and cudaDeviceSynchronize() after each CUDA call (not shown in the code below). The total run time when not measuring is slightly lower, because some transfer/compute is being overlapped (I think).

  • BJ, d_wrk, Psi_outer_spec Host to Device transfer: ~50ms
  • compute_wrk2mat: ~15ms
  • cublasZgemmStridedBatched: ~180ms
  • Psi_outer_spec_plm Device to Host transfer: ~10ms
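
Each segment was presumably timed along these lines, using the MyTimer class from the full listing further down; the exact measurement code is not part of the post, and names such as d_BJ and main_stream_ refer to that listing:

MyTimer timer;
timer.startCounter();
cudaMemcpyAsync(d_BJ, BJ, N_rs * N_ps * (l_max + 1) * sizeof(double),
                cudaMemcpyHostToDevice, main_stream_);
cudaDeviceSynchronize();                      // wait until the asynchronous copy has really finished
double h2d_ms = timer.getCounterMsPrecise();  // the ~50ms figure above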

Predicting 2080ti vs A100 performance:

  • FP32: 13.45 TFLOP vs 19.75 TFLOP
  • FP64: 0.42 TFLOP vs 9.75 TFLOP
  • Memory bandwidth: 616 GB/s vs 1555 GB/s
  • PCIE 4.0 is 2x faster than 3.0

So I predict the A100 will run this in (50ms + 10ms) * 0.5 (pcie_transfer) + 15ms * 616 / 1555 (memory_bound_kernel) + 180ms / 10 (fp64_compute_bound_kernel), which works out to roughly 30ms + 6ms + 18ms ≈ 54ms, so around 50ms. If anyone has free time, please run a benchmark because I'm also curious.


Finally, the code. First, the accepted answer (super interesting and well analyzed, btw) has a small mistake that causes it to output different results compared to your original code in the question.

    std::complex<double> factor(-sqrt_of_2_over_pi);
    if(l & 1)
        factor *= I;
    if(l & 2)
        factor = -factor;

it should be

    std::complex<double> factor(sqrt_of_2_over_pi);
    if(l & 1)
        factor *= -I;
    if(l & 2)
        factor = -factor;

The code below is a full runnable program that benchmarks and checks the correctness of all 3 implementations in this post. The GPU version uses a RtopCalculator class object instead of a function, so that it can reuse resources (device arrays) instead of allocating/freeing them each time the function is called. The class is RAII and safely frees its resources when destructed.

I don't use pinned memory because the expected use case is passing arrays from Numpy (OP says so in a comment), so using normal memory is a better practical benchmark.
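
For comparison, a pinned-memory variant would allocate the host buffers with cudaMallocHost; a minimal sketch, not used in the code below (CUDA_CHECK and the size constants are from the listing that follows):

double* BJ_pinned = nullptr;
// page-locked host memory: faster transfers and real overlap of copies with compute
CUDA_CHECK(cudaMallocHost((void**)&BJ_pinned, N_ps * N_rs * (l_max + 1) * sizeof(double)));
// ... fill BJ_pinned and pass it to RtopCalculator::compute instead of BJ ...
CUDA_CHECK(cudaFreeHost(BJ_pinned));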

#include <iostream>
#include <complex>
#include <cmath>
#include <random>
#include <chrono>
#include <fstream>

#include <Eigen/Dense>
#include <omp.h>

#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <thrust/execution_policy.h>
#include <thrust/device_vector.h>
#include <thrust/complex.h>
//------------
// Utility stuffs to test

// Helper function to check cuBLAS status
#define CUBLAS_CHECK(err)                                                   \
    do {                                                                    \
        cublasStatus_t err_ = (err);                                        \
        if (err_ != CUBLAS_STATUS_SUCCESS) {                                \
            printf("cublas error %d at %s:%d\n", err_, __FILE__, __LINE__); \
            throw std::runtime_error("cublas error");                       \
        }                                                                   \
    } while (0)

inline void gpuAssert(cudaError_t code, const char *file, int line, bool printing = false)
{
    if (code != cudaSuccess)
    {
        std::string mess = std::string("GPUassert: ") + std::string(cudaGetErrorString(code))
            + " " + std::string(file) + " " + std::to_string(line);
        if (printing) std::cout << mess << std::endl;
        throw std::runtime_error(mess.c_str());
    }
    auto lastError = cudaGetLastError();
    if (lastError != cudaSuccess)
    {
        std::string mess = std::string("GPUassert: ") + std::string(cudaGetErrorString(lastError))
            + " " + std::string(file) + " " + std::to_string(line);
        std::cout << "UNDETECTED_ERROR " << mess << std::endl;
        throw std::runtime_error(mess.c_str());
    }
}

// CUDA API error checking
#define CUDA_CHECK(ans) { gpuAssert((ans), __FILE__, __LINE__, true); }
#define CUDA_CHECK_NOLOG(ans) { gpuAssert((ans), __FILE__, __LINE__); }

double eps_ = 1e-6;

template <typename num_t>
num_t rcmp(num_t a, num_t b, num_t eps = eps_) {
    if (std::isnan(a) && std::isnan(b)) return 0;
    if (std::isnan(a + b)) return NAN;
    num_t t = (a - b) / (std::max(std::abs(a), std::abs(b)) + 1e-18);
    return t < -eps ? -1 : +eps < t;
}

class MyTimer {
    std::chrono::time_point<std::chrono::system_clock> start;

public:
    void startCounter() {
        start = std::chrono::system_clock::now();
    }

    int64_t getCounterNs() {
        return std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::system_clock::now() - start).count();
    }

    int64_t getCounterMs() {
        return std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now() - start).count();
    }

    double getCounterMsPrecise() {
        return std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::system_clock::now() - start).count()
            / 1000000.0;
    }
};
//----------------
//----------------
//----------------

std::mt19937 rander(42);
double myrand() {
    return double(rander() % 10000) / (rander() % 10000 + 1);
}

const std::complex<double> I(0.0, 1.0); // imaginary unit, I*I = -1
std::complex<double> zero_imag (0.0, 0.0);
const double sqrt_of_2_over_pi = sqrt( 2.0 / M_PI );

void GenData(
    int64_t N_rs, int64_t l_max, int64_t N_ps,
    std::complex<double>* Psi_outer_spec, double* BJ, double* wrk
)
{
    int64_t N_thetas = l_max + 1;
    int64_t N_phis = 2 * l_max + 2;

    for (int64_t i = 0; i < N_rs; i++) wrk[i] = myrand();
    for (int64_t i = 0; i < N_ps * N_rs * N_thetas; i++) BJ[i] = myrand();
    for (int64_t i = 0; i < N_rs * N_thetas * N_phis; i++) {
        Psi_outer_spec[i].real(myrand());
        Psi_outer_spec[i].imag(myrand());
    }
}
void rtop(
    int64_t N_rs, int64_t l_max, int64_t N_ps,
    const std::complex<double>* __restrict__ Psi_outer_spec,
    std::complex<double>* __restrict__ Psi_outer_spec_plm,
    const double* __restrict__ BJ,
    const double* __restrict__ wrk,
    std::complex<double>* __restrict__ wrk2,
    double* __restrict__ ris_without_ends = nullptr,
    double* __restrict__ r_primes_without_ends = nullptr,
    double* __restrict__ weights_Lobatto_without_ends = nullptr
)
{
    int64_t N_thetas = l_max + 1;
    int64_t N_phis = 2 * l_max + 2;
    int64_t l, kk, kkk, m;
    int64_t idx, idxx, idxxx;

    for (int64_t i = 0; i <= (N_ps - 1); i++) { // THIS IS THE BOTTLENECK !!!
        std::complex<double> sum1 = std::complex<double> (0.0, 0.0); // each thread creates a sum1 on its own

        for (l = 0; l <= l_max; l++) {
            for (kkk = 0; kkk <= (N_rs-1); kkk++) {
                idx = i * (N_rs*(l_max+1)) + kkk * (l_max+1) + l;
                wrk2[kkk] = pow(-I, l) * BJ[idx] * wrk[kkk];
            }

            for (m = 0; m <= (N_phis-1); m++) {
                sum1 = zero_imag;
                for (kk = 0; kk <= (N_rs-1); kk++) {
                    idxx = kk * (N_thetas*N_phis) + l * N_phis + m;
                    sum1 += Psi_outer_spec[idxx] * wrk2[kk];
                }
                idxxx = i * (N_thetas*N_phis) + l * N_phis + m;
                Psi_outer_spec_plm[idxxx] = sum1 * sqrt_of_2_over_pi;
            }
            // END for m loop
        }
        // END for l loop
    }
    // END for i loop
}
void rtop_eigen(
    int64_t N_rs, int64_t l_max, int64_t N_ps,
    const std::complex<double>* __restrict__ Psi_outer_spec,
    std::complex<double>* __restrict__ Psi_outer_spec_plm,
    const double* __restrict__ BJ,
    const double* __restrict__ wrk,
    std::complex<double>* __restrict__ wrk2,
    double* __restrict__ ris_without_ends = nullptr,
    double* __restrict__ r_primes_without_ends = nullptr,
    double* __restrict__ weights_Lobatto_without_ends = nullptr
)
{
    int64_t N_thetas = l_max + 1;
    int64_t N_phis = 2 * l_max + 2;

    Eigen::MatrixXcd wrk2mat(N_rs, N_ps);
    for (int64_t l = 0; l <= l_max; l++) {
        std::complex<double> factor(sqrt_of_2_over_pi);
        if(l & 1)
            factor *= -I;
        if(l & 2)
            factor = -factor;

#       pragma omp parallel for
        for (int64_t i = 0; i <= N_ps - 1; i++) {
            for (int64_t k = 0; k <= N_rs - 1; ++k) {
                int64_t idx = (i * N_rs + k) * (l_max + 1) + l;
                wrk2mat(k, i) = BJ[idx] * wrk[k] * factor;
            }
        }

        using ConstMatrixMap = Eigen::Map<const Eigen::MatrixXcd,
                Eigen::Unaligned, Eigen::OuterStride<>>;
        ConstMatrixMap Psi_slice(
                Psi_outer_spec + l * N_phis /*top left corner*/,
                N_phis /*rows*/, N_rs /*cols*/,
                Eigen::OuterStride<>(N_phis * N_thetas));

        using MatrixMap = Eigen::Map<Eigen::MatrixXcd,
                Eigen::Unaligned, Eigen::OuterStride<>>;
        MatrixMap Psi_plm_mapped(
                Psi_outer_spec_plm + l * N_phis,
                N_phis, N_ps,
                Eigen::OuterStride<>((l_max + 1) * N_phis));

        Psi_plm_mapped.noalias() = Psi_slice * wrk2mat;
    }
}
namespace {
__global__
void compute_wrk2mat(
    int64_t N_ps, int64_t N_rs, int64_t l_max,
    const double* __restrict__ BJ,
    const double* __restrict__ wrk,
    cuDoubleComplex* __restrict__ wrk2mat
)
{
    constexpr double sqrt_of_2_over_pi = 0.79788456080286535587989;
    for (int l = 0; l <= (int)l_max; l++) {
        cuDoubleComplex* wrk2mat_offset = wrk2mat + l * N_rs * N_ps;
        double factor_real = sqrt_of_2_over_pi;
        double factor_imag = 0;
        if (l & 1) {
            double temp_real = factor_real;
            factor_real = factor_imag;
            factor_imag = -temp_real;
        }
        if (l & 2) {
            factor_real = -factor_real;
            factor_imag = -factor_imag;
        }

        for (int i = blockIdx.x; i < N_ps; i += gridDim.x)
            for (int k = threadIdx.x; k < N_rs; k += blockDim.x) {
                int64_t idx = (i * N_rs + k) * (l_max + 1) + l;
                wrk2mat_offset[k + i * N_rs].x = BJ[idx] * wrk[k] * factor_real;
                wrk2mat_offset[k + i * N_rs].y = BJ[idx] * wrk[k] * factor_imag;
            }
    }
}
}
class RtopCalculator {
private:
    cudaStream_t main_stream_;
    cudaStream_t side_stream_;
    cublasHandle_t cublas_handle_;

    int64_t N_rs_;
    int64_t l_max_;
    int64_t N_thetas_;
    int64_t N_phis_;
    int64_t N_ps_;

    thrust::device_vector<double> d_BJ_;
    thrust::device_vector<double> d_wrk_;
    thrust::device_vector<thrust::complex<double>> d_wrk2mat_;
    thrust::device_vector<thrust::complex<double>> d_Psi_outer_spec_;
    thrust::device_vector<thrust::complex<double>> d_Psi_outer_spec_plm_;

    void allocate_internal() {
        d_BJ_.resize(N_rs_ * N_ps_ * N_thetas_);
        d_wrk_.resize(N_rs_);
        d_wrk2mat_.resize(N_thetas_ * N_rs_ * N_ps_);
        d_Psi_outer_spec_.resize(N_rs_ * N_thetas_ * N_phis_);
        d_Psi_outer_spec_plm_.resize(N_ps_ * N_thetas_ * N_phis_);
    }

public:
    RtopCalculator() {
        cudaStreamCreate(&main_stream_);
        cudaStreamCreate(&side_stream_);
        cublasCreate(&cublas_handle_);
        cublasSetStream(cublas_handle_, main_stream_);
    }

    ~RtopCalculator() {
        cudaStreamDestroy(main_stream_);
        cudaStreamDestroy(side_stream_);
        cublasDestroy(cublas_handle_);
    }

    void allocate(int64_t N_rs, int64_t l_max, int64_t N_ps) {
        N_rs_ = N_rs;
        l_max_ = l_max;
        N_thetas_ = l_max + 1;
        N_phis_ = 2 * l_max + 2;
        N_ps_ = N_ps;
        allocate_internal();
    }

    void compute(
        int64_t N_rs, int64_t l_max, int64_t N_ps,
        const double* __restrict__ BJ,
        const double* __restrict__ wrk,
        const double* __restrict__ Psi_outer_spec, // std::complex<double>
        double* __restrict__ Psi_outer_spec_plm,
        double* __restrict__ ris_without_ends = nullptr,
        double* __restrict__ r_primes_without_ends = nullptr,
        double* __restrict__ weights_Lobatto_without_ends = nullptr
    )
    {
        allocate(N_rs, l_max, N_ps);
        int64_t N_phis = N_phis_;
        int64_t N_thetas = N_thetas_;

        double* d_BJ = thrust::raw_pointer_cast(d_BJ_.data());
        double* d_wrk = thrust::raw_pointer_cast(d_wrk_.data());
        thrust::complex<double>* d_wrk2mat = thrust::raw_pointer_cast(d_wrk2mat_.data());
        thrust::complex<double>* d_Psi_outer_spec = thrust::raw_pointer_cast(d_Psi_outer_spec_.data());
        thrust::complex<double>* d_Psi_outer_spec_plm = thrust::raw_pointer_cast(d_Psi_outer_spec_plm_.data());

        // the ordering of the next 4 statements is intended to interleave data transfer and compute
        // Cost 1
        cudaMemcpyAsync(d_BJ, BJ, N_rs * N_ps * (l_max + 1) * sizeof(double), cudaMemcpyHostToDevice, main_stream_);
        cudaMemcpyAsync(d_wrk, wrk, N_rs * sizeof(double), cudaMemcpyHostToDevice, main_stream_);
        compute_wrk2mat<<<512, 256, 0, main_stream_>>>(
            N_ps, N_rs, l_max, d_BJ, d_wrk, reinterpret_cast<cuDoubleComplex*>(d_wrk2mat)
        );
        cudaMemcpyAsync(
            d_Psi_outer_spec, Psi_outer_spec,
            N_rs * N_thetas * N_phis * sizeof(std::complex<double>),
            cudaMemcpyHostToDevice, side_stream_
        );
        CUDA_CHECK(cudaStreamSynchronize(main_stream_));
        CUDA_CHECK(cudaStreamSynchronize(side_stream_));

        // Cost 2
        int64_t M = N_phis;
        int64_t K = N_rs;
        int64_t N = N_ps;
        int64_t lda = N_phis * N_thetas;
        int64_t ldb = N_rs;
        int64_t ldc = (l_max + 1) * N_phis;

        cuDoubleComplex* d_A = reinterpret_cast<cuDoubleComplex*>(d_Psi_outer_spec);
        cuDoubleComplex* d_B = reinterpret_cast<cuDoubleComplex*>(d_wrk2mat);
        cuDoubleComplex* d_C = reinterpret_cast<cuDoubleComplex*>(d_Psi_outer_spec_plm);

        int64_t strideA = N_phis;
        int64_t strideB = N_rs * N_ps;
        int64_t strideC = N_phis;

        std::complex<double> alpha(1.0, 0.0);
        std::complex<double> beta(0.0, 0.0);

        CUBLAS_CHECK(cublasZgemmStridedBatched(
            cublas_handle_,
            CUBLAS_OP_N, CUBLAS_OP_N,
            M, N, K,
            (cuDoubleComplex*)&alpha,
            d_A, lda, strideA,
            d_B, ldb, strideB,
            (cuDoubleComplex*)&beta,
            d_C, ldc, strideC,
            l_max + 1
        ));

        // Cost 3
        cudaMemcpyAsync(
            Psi_outer_spec_plm, reinterpret_cast<double*>(d_Psi_outer_spec_plm),
            N_ps * N_thetas * N_phis * sizeof(std::complex<double>),
            cudaMemcpyDeviceToHost, main_stream_
        );
        CUDA_CHECK(cudaStreamSynchronize(main_stream_));
    }
};
int main(int argc, char* argv[]) {
    bool full_test = 0;
    if (argc > 1) full_test = 1;
    std::cout << "Full test = " << full_test << "\n";

    const int64_t N_rs = 1500;
    const int64_t l_max = 70;
    const int64_t N_thetas = l_max + 1;
    const int64_t N_phis = 2 * l_max + 2;
    const int64_t N_ps = 600;

    MyTimer timer;
    double total_cost[3] = {0};

    double* wrk = new double [N_rs];
    std::complex<double>* wrk2 = new std::complex<double> [N_rs * N_ps];
    double* BJ = new double [N_ps * N_rs * (l_max+1)];
    std::complex<double>* Psi_outer_spec = new std::complex<double> [N_rs * N_thetas * N_phis];
    std::complex<double>* Psi_outer_spec_plm_0 = new std::complex<double> [N_ps * N_thetas * N_phis];
    std::complex<double>* Psi_outer_spec_plm_1 = new std::complex<double> [N_ps * N_thetas * N_phis];
    std::complex<double>* Psi_outer_spec_plm_2 = new std::complex<double> [N_ps * N_thetas * N_phis];

    RtopCalculator calculator;
    calculator.allocate(N_rs, l_max, N_ps);

    int ntest = 5;
    int wrong = 0;
    for (int t = 1; t <= ntest; t++) {
        std::cout << "Start test " << t << "\n";
        GenData(N_rs, l_max, N_ps, Psi_outer_spec, BJ, wrk);

        if (full_test) {
            timer.startCounter();
            rtop(N_rs, l_max, N_ps, Psi_outer_spec, Psi_outer_spec_plm_0, BJ, wrk, wrk2);
            total_cost[0] += timer.getCounterMsPrecise();
        }

        timer.startCounter();
        rtop_eigen(N_rs, l_max, N_ps, Psi_outer_spec, Psi_outer_spec_plm_1, BJ, wrk, wrk2);
        total_cost[1] += timer.getCounterMsPrecise();

        timer.startCounter();
        calculator.compute(
            N_rs, l_max, N_ps,
            BJ,
            wrk,
            reinterpret_cast<double*>(Psi_outer_spec),
            reinterpret_cast<double*>(Psi_outer_spec_plm_2)
        );
        total_cost[2] += timer.getCounterMsPrecise();

        std::cout << "cost = " << total_cost[0] << " " << total_cost[1] << " " << total_cost[2] << "\n";

        for (int i = 0; i < N_ps; i++)
        for (int l = 0; l < N_thetas; l++)
        for (int m = 0; m < N_phis; m++) {
            int64_t idx = i * (N_thetas * N_phis) + l * N_phis + m;
            auto res0 = Psi_outer_spec_plm_0[idx];
            auto res1 = Psi_outer_spec_plm_1[idx];
            auto res2 = Psi_outer_spec_plm_2[idx];

            if (full_test) {
                if (rcmp(res0.real(), res1.real()) || rcmp(res0.imag(), res1.imag()) ||
                    rcmp(res0.real(), res2.real()) || rcmp(res0.imag(), res2.imag())
                ) {
                    std::cout << "Error at (i=" << i << ",l=" << l << ",m=" << m << "): " << res0 << "; " << res1 << " " << res2 << "\n";
                    wrong++;
                    if (wrong == 20) exit(1);
                }
            } else {
                if (rcmp(res1.real(), res2.real()) || rcmp(res1.imag(), res2.imag())) {
                    std::cout << "Error at (i=" << i << ",l=" << l << ",m=" << m << "): " << res1 << "; " << res2 << "\n";
                    wrong++;
                    if (wrong == 20) exit(1);
                }
            }
        }
    }

    return 0;
}

To run and test, use run.sh.

threads=$1  # The custom variable (number of threads) passed as the first argument
if [ -z "$threads" ]; then
    threads=8
fi

export OMP_NUM_THREADS=$threads
export OPENBLAS_NUM_THREADS=$threads

nvcc -o main quantum.cu -O3 -std=c++17 -lcudart -lcublas -Xcompiler -march=native -Xcompiler -fopenmp -arch=native
echo "Running with $1 threads"
time ./main $2

For example:

./run.sh 1 # test Eigen 1 thread vs GPU
./run.sh 16 # test Eigen 16 thread vs GPU
./run.sh 16 1 # test all 3 versions. Remember the original code is very slow.

It's possible to turn this into a shared library that can be imported into Python and used with Tensorflow/Torch. But that is out of the scope of the question, so you can send me an email.
