How to tune the SM utilization (across the entire GPU) of a CUDA kernel?

Question

I want to try experiments with CUDA MPS (Multi-Process Service). I want to launch many CUDA kernels on the GPU simultaneously (concurrently), such that each kernel has almost the same end-to-end running time whether it runs alone or concurrently with the other kernels.

For example, I have a kernel, KernA, and I would like to launch many instances of this kernel KernA on the GPU simultaneously. So, I want to manually control (even hard-coding it into the .cu program is fine) the SM utilization of each instance. If I launch two kernel instances of KernA simultaneously, I want each instance alone to have an SM utilization of 50%. [The input size does not matter to me. All I want is a kernel launch where I can tune the SM utilization of each instance.] That way, when the two of them run simultaneously on the GPU under CUDA MPS, 50% each would roughly add up to 100% of the GPU, and hence running those 2 kernels together would take about the same time as running each instance alone.

Similarly, if I want to launch 3 kernels simultaneously, I want each instance to have an SM utilization of around 33% so that 3 of these kernels would make up a total of 100% of the GPU.

Also, if I want to launch 4 kernels simultaneously, I want each instance to have an SM utilization of around 25% so that 4 of these kernels would make up a total of 100%.

...

Going by this line of reasoning, if I want to launch n kernels simultaneously, I want each instance to have an SM utilization of around (100/n)% such that the n kernels together make up a total of 100% of the GPU.


The first approach which I tried to take:

I tried using the CUDA runtime API cudaOccupancyMaxActiveBlocksPerMultiprocessor. I tried out the program example given here.
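
For reference, a minimal, self-contained sketch along the lines of that documentation example (blockSize = 32 is just an example value; the kernel is the same MyKernel shown further below):

#include <cstdio>

// The same simple kernel that is shown further below in the question.
__global__ void MyKernel(float* a, float* b, float* c, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        c[idx] = a[idx] + b[idx];
}

int main()
{
    int blockSize = 32;                  // example value
    int maxBlocksPerSM = 0;
    // How many blocks of MyKernel can be resident on one SM at this block size?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, MyKernel, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Occupancy = resident warps / maximum resident warps per SM
    double occupancy = (maxBlocksPerSM * blockSize / (double)prop.warpSize) /
                       (prop.maxThreadsPerMultiProcessor / (double)prop.warpSize);
    printf("max active blocks per SM: %d, occupancy: %.1f%%\n",
           maxBlocksPerSM, occupancy * 100.0);
    return 0;
}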

But the problems I get with this approach are as follows (I am using an RTX 2080 GPU):

  • I take the block size as input here. For block size = 32, I get 50% utilization per SM.
  • If I halve the block size to 16, 8, 4, 2, 1..., the utilization per SM halves as well, i.e., 25%, 12.5%, ...
  • The warp size of the GPU I am using is 32. If I use a block size smaller than the warp size, then, as far as I know, the GPU pads the warp with dummy threads to make up a total of 32 threads so that those 32 threads can work in lock-step. Because of this padding, I guess I cannot launch more than two kernels concurrently (say, with block size = 16 and 4 concurrent kernels) and expect them to pack nicely into the GPU as in the example I mentioned above.
  • The other parameter I can tweak is the number of blocks. The API mentioned above reports the maximum number of active blocks per SM. So if, for a kernel, the API sets the numBlocks variable to, say, 16 and reports Occupancy: 50%, we cannot just set numBlocks=8, launch the kernel (i.e., kernA<<<8, 32>>>()), and expect 25% utilization of the entire GPU, because my RTX 2080 has 46 streaming multiprocessors (SMs). Why? Since the API computes numBlocks for a single SM, the expected utilization over all the SMs would only be 25/46 %, i.e., roughly 0.5% rather than 25% (see the sketch after this list).
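
A sketch of the GPU-wide scaling discussed in the last bullet, assuming the goal is to give each of n concurrent instances roughly 1/n of the whole GPU. The variable names and the division by n are only illustrative; the occupancy API bounds how many blocks can be resident at once, it does not force the hardware scheduler to split the SMs this way:

// Sketch: size one kernel instance to cover roughly 1/n of the whole GPU.
// Uses the MyKernel shown further below; blockSize and n are example values.
int blockSize = 32;
int n = 4;                               // number of concurrent instances

int numSMs = 0;
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);

int maxBlocksPerSM = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, MyKernel, blockSize, 0);

// Blocks that can be resident on the entire GPU at once, split n ways.
int blocksWholeGPU    = maxBlocksPerSM * numSMs;
int blocksPerInstance = blocksWholeGPU / n;

// Each instance would then be launched with this grid (data sized to match):
// MyKernel<<<blocksPerInstance, blockSize>>>(a, b, c, blocksPerInstance * blockSize);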

The second approach I tried is to keep blockSize fixed and vary n, thereby adjusting numBlocks = n / blockSize manually by trial and error while checking the instantaneous change in GPU utilization through the nvidia-smi command. But I do not find this approach very rigorous, and for smaller kernels the execution time is so short that the kernel finishes without showing any change in GPU % utilization under watch -n 0.1 nvidia-smi.
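
For kernels that are too short to show up in nvidia-smi, one alternative (just an illustration, with placeholder sizes) is to time the launches with CUDA events and compare a single instance against two instances issued to separate streams:

// Sketch: launch two instances of MyKernel (defined below) on separate streams
// and time the pair with CUDA events. Comparing this against the time of a
// single instance (measured the same way) shows whether the two overlapped.
int n = 1 << 20;                         // placeholder problem size
int blockSize = 32;
int numBlocks = (n + blockSize - 1) / blockSize;

float *a, *b, *c1, *c2;
cudaMalloc(&a,  n * sizeof(float));
cudaMalloc(&b,  n * sizeof(float));
cudaMalloc(&c1, n * sizeof(float));
cudaMalloc(&c2, n * sizeof(float));

cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
MyKernel<<<numBlocks, blockSize, 0, s1>>>(a, b, c1, n);
MyKernel<<<numBlocks, blockSize, 0, s2>>>(a, b, c2, n);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("two concurrent instances: %.3f ms\n", ms);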


I am using this simple kernel for testing:

__global__ void MyKernel(float* a, float* b, float* c, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
    {
        c[idx] = a[idx] + b[idx];
    }
}


Also, is there any specific way to confirm that my kernel, while executing, is using x% of the GPU, apart from the watch -n 0.1 nvidia-smi command? Is there a profiling tool, for example ncu, and how would I check this with it?

Answer 1

Score: 1

Every algorithm and hardware combination will have a different balance point for each specific usage percentage. You can write a benchmarking tool that gradually increases the number of CUDA threads to launch until it reaches 100% usage (or until the total run time of all the work starts increasing exactly linearly). Then just pick the closest integer multiple of the number of threads per block to split the work into N clients with equal percentages.
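
A rough sketch of such a sweep, assuming the saturation point is read off from where the run time starts growing roughly linearly with the grid size. MyKernel is the kernel from the question, the buffers a, b and c are assumed to be pre-allocated large enough for the biggest step, and the sweep range itself is a placeholder:

// Sketch: sweep the grid size, time each launch, and look for the point where
// the time starts growing roughly linearly with the number of blocks; that is
// where the GPU is saturated.
int blockSize = 32;
for (int numBlocks = 32; numBlocks <= 65536; numBlocks *= 2)
{
    int n = numBlocks * blockSize;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    MyKernel<<<numBlocks, blockSize>>>(a, b, c, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%6d blocks: %.3f ms\n", numBlocks, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}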

Though it is less efficient, you can also do this:

  • launch all clients' kernels
  • measure the (overlapped) time taken
  • equalize the times by subtracting/adding (CUDA) threads to clients iteratively on each new kernel launch
  • repeat until the timings are equalized (see the sketch below)
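
One possible reading of that loop, sketched as a host-side helper: after each round, every client's thread count is rescaled toward the mean of the measured times, assuming (as with the vector-add kernel in the question) that the number of launched threads is also the amount of work. The function name and the rounding policy are placeholders, not part of the answer above:

#include <vector>
#include <numeric>

// Sketch: rescale each client's thread count so the measured times converge.
// times[i]   = kernel time of client i from the previous round (ms)
// threads[i] = number of CUDA threads client i launches in the next round
void equalize(const std::vector<double>& times, std::vector<int>& threads, int blockSize)
{
    double mean = std::accumulate(times.begin(), times.end(), 0.0) / times.size();
    for (size_t i = 0; i < times.size(); ++i)
    {
        // A client that ran longer than the mean gets proportionally fewer
        // threads (i.e., less work) in the next round, and vice versa.
        double scaled = threads[i] * (mean / times[i]);
        // Round to a whole number of blocks so no warp is left partially filled.
        int blocks = static_cast<int>(scaled / blockSize + 0.5);
        if (blocks < 1) blocks = 1;
        threads[i] = blocks * blockSize;
    }
}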
