How to tune the SM utilization (across the entire GPU) of a CUDA kernel?
Question
I want to experiment with CUDA MPS (Multi-Process Service). I want to launch many CUDA kernels on the GPU simultaneously (concurrently), such that each kernel has almost the same end-to-end running time as it would have without the other kernels running concurrently.
For example, I have a kernel, KernA, and I would like to launch many instances of this kernel KernA on the GPU simultaneously. So I want to manually control (even hard-coding it into the .cu program is fine) the SM utilization of each instance. If I launch two kernel instances of KernA simultaneously, I want each instance alone to have an SM utilization of 50%. [The input size does not matter to me. All I want is a kernel launch where I can tune the SM utilization of each instance.] That way, when two of them run simultaneously on the GPU using CUDA MPS, 50% from each roughly makes up 100% of the GPU, and hence running each instance alone takes about the same time as running those 2 kernels together.
Similarly, if I want to launch 3 kernels simultaneously, I want each instance to have an SM utilization of around 33%, so that these 3 kernels together make up 100% of the GPU.
Also, if I want to launch 4 kernels simultaneously, I want each instance to have an SM utilization of around 25%, so that these 4 kernels together make up 100% of the GPU.
...
Following this pattern, if I want to launch n kernels simultaneously, I want each instance to have an SM utilization of around (100/n)%, so that these n kernels together make up 100% of the GPU.
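To make the intent concrete, here is a minimal sketch of the kind of sizing I have in mind (the helper name is mine, and assuming one resident block per SM is a simplification on my part):

#include <cuda_runtime.h>

// Hypothetical helper (my naming): give each of n concurrent instances
// roughly 1/n of the device's SMs by launching that many blocks.
// This assumes one resident block per SM, which is a simplification.
int blocksPerInstance(int n)
{
    int device = 0, numSMs = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
    return (numSMs + n - 1) / n;   // e.g. 46 SMs, n = 2 -> 23 blocks per instance
}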
The first approach I tried:
I used the CUDA runtime API cudaOccupancyMaxActiveBlocksPerMultiprocessor and tried out the program example given here.
But the problems I ran into with this approach are as follows (I am using an RTX2080 GPU):
- I took the block size as input here. For block size = 32, I get 50% utilization per SM.
- If I halve the block size to 16, 8, 4, 2, 1, ..., the utilization per SM halves as well, i.e., 25%, 12.5%, ...
- The warp size of the GPU I am using is 32. If I use a block size smaller than the warp size, then as far as I know the GPU pads the warp with dummy threads to make up a total of 32 threads, so that those 32 threads can work in lock-step. Because of this padding, I guess I cannot launch more than two kernels concurrently (say, with block size = 16 and 4 concurrent kernels) and expect them to pack nicely into the GPU, as in the example I described above.
- The other parameter I can tweak is the number of blocks. The API above reports the maximum number of active blocks per SM. So if, for a kernel, the API sets the numBlocks variable to, say, 16 and reports Occupancy: 50%, we cannot just set numBlocks=8, launch the kernel (i.e., kernA<<<8, 32>>>()), and expect 25% utilization of the entire GPU, because my RTX2080 has 46 streaming multiprocessors (SMs). Why? Since the API computes numBlocks for a single SM, the expected GPU utilization over all the SMs would be (25/46)% (see the sketch after this list).
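To turn the per-SM figure from the API into a whole-GPU target, I would expect something along these lines (a sketch based on my understanding; scaling the per-SM result by the SM count is my own assumption, not something the API does for you):

#include <cuda_runtime.h>

__global__ void MyKernel(float* a, float* b, float* c, int n);  // defined further below

// Sketch: number of blocks of `blockSize` threads that corresponds to a
// target fraction of the *entire* GPU, obtained by scaling the per-SM
// occupancy result by the number of SMs (46 on my RTX2080).
int blocksForFraction(double fraction, int blockSize)
{
    int device = 0, numSMs = 0, maxBlocksPerSM = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, MyKernel, blockSize, 0);

    int fullGpuBlocks = maxBlocksPerSM * numSMs;              // roughly "100% of all SMs"
    int blocks = static_cast<int>(fullGpuBlocks * fraction);  // e.g. fraction = 0.5 for 2 instances
    return blocks > 0 ? blocks : 1;
}

// e.g. for two concurrent instances:
// MyKernel<<<blocksForFraction(0.5, 256), 256>>>(d_a, d_b, d_c, n);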
The second approach I tried is to keep blockSize fixed; I change the value of n, and hence numBlocks = n / blockSize, manually by trial and error, checking the instantaneous change in GPU utilization through the nvidia-smi command. But I do not find this approach very rigorous. And for smaller kernels, the execution time is so short that the kernel finishes without showing any change in GPU % utilization in the watch -n 0.1 nvidia-smi command (a longer-running variant that stays visible to nvidia-smi is sketched after the kernel below).
I am using this simple kernel for testing:
__global__ void MyKernel(float* a, float* b, float* c, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
    {
        c[idx] = a[idx] + b[idx];
    }
}
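Since this kernel finishes too quickly to be caught by nvidia-smi, one thing I have considered (my own sketch, not part of the original experiment) is a longer-running variant, so that watch -n 0.1 nvidia-smi has something to sample:

// Longer-running variant of the test kernel (the `iters` repeat count and
// the grid-stride loop are additions purely for measurement): each thread
// repeats the addition many times so the kernel stays resident long enough
// to show up in nvidia-smi.
__global__ void MyKernelLong(float* a, float* b, float* c, int n, int iters)
{
    for (int r = 0; r < iters; ++r)
    {
        for (int idx = blockIdx.x * blockDim.x + threadIdx.x;
             idx < n;
             idx += gridDim.x * blockDim.x)
        {
            c[idx] = a[idx] + b[idx];
        }
    }
}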
Also, is there any specific way to confirm that my kernel, while executing, is using x% of the GPU, apart from the watch -n 0.1 nvidia-smi command? For example, a profiling tool such as ncu, and how would I check that with it?
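One check I can think of besides nvidia-smi (a sketch of my own, not an established method): have each MPS client time its own launch with CUDA events and compare the concurrent runtime with the isolated runtime; if each instance really gets about 1/n of the SMs, the two times should be close.

// Sketch: time one kernel instance on a given stream with CUDA events.
// If each instance really gets ~1/n of the SMs, its elapsed time when run
// concurrently with the other instances should stay close to its time
// when run alone.
float timeInstance(cudaStream_t stream, float* a, float* b, float* c,
                   int n, int blocks, int threads)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);
    MyKernel<<<blocks, threads, 0, stream>>>(a, b, c, n);
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}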
Answer 1
Score: 1
Every algorithm and hardware combination has a different balance point for each specific usage percentage. You can write a benchmarking tool that gradually increments the number of CUDA threads launched until it reaches 100% usage (or until the total run time of all the work starts increasing exactly linearly). Then just pick the closest integer multiple of the number of threads per block to split the work into N clients with equal percentages.
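A rough sketch of what such a sweep could look like (the grid sizes, helper name, and timing scheme are illustrative assumptions; it assumes the device buffers were allocated large enough for the largest step):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void MyKernel(float* a, float* b, float* c, int n);

// Sketch: double the grid each step and time it with events.  Once the GPU
// is saturated, the elapsed time starts growing roughly linearly with the
// number of blocks; the knee of that curve is the capacity to divide
// among the N clients.
void sweep(float* d_a, float* d_b, float* d_c, int threadsPerBlock)
{
    for (int blocks = 1; blocks <= 4096; blocks *= 2)
    {
        int n = blocks * threadsPerBlock;   // scale the work with the grid

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        MyKernel<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("blocks=%d  time=%.3f ms\n", blocks, ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
}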
Though less efficient, you can also do this:
- launch all client kernels
- measure the (overlapped) time taken by each
- equalize the clients' times by iteratively subtracting/adding (CUDA) threads on each new kernel launch
- repeat until the timings are equalized (a sketch of this loop follows)
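A sketch of that equalization loop within one process, using streams to stand in for the MPS clients (the 5% tolerance, iteration cap, and one-block adjustment step are arbitrary choices for illustration):

#include <vector>
#include <cuda_runtime.h>

__global__ void MyKernel(float* a, float* b, float* c, int n);

// Sketch: launch all clients concurrently in their own streams, measure each
// client's overlapped time with events, then move blocks from the slowest
// client to the fastest one and repeat until the times are close.
void equalize(std::vector<int>& blocksPerClient,
              std::vector<cudaStream_t>& streams,
              float* a, float* b, float* c, int n, int threads)
{
    const size_t k = blocksPerClient.size();
    for (int iter = 0; iter < 20; ++iter)
    {
        std::vector<cudaEvent_t> start(k), stop(k);
        for (size_t i = 0; i < k; ++i)
        {
            cudaEventCreate(&start[i]);
            cudaEventCreate(&stop[i]);
            cudaEventRecord(start[i], streams[i]);
            MyKernel<<<blocksPerClient[i], threads, 0, streams[i]>>>(a, b, c, n);
            cudaEventRecord(stop[i], streams[i]);
        }
        cudaDeviceSynchronize();   // wait for every client to finish

        std::vector<float> t(k);
        size_t fast = 0, slow = 0;
        for (size_t i = 0; i < k; ++i)
        {
            cudaEventElapsedTime(&t[i], start[i], stop[i]);
            cudaEventDestroy(start[i]);
            cudaEventDestroy(stop[i]);
            if (t[i] < t[fast]) fast = i;
            if (t[i] > t[slow]) slow = i;
        }

        if (t[slow] - t[fast] < 0.05f * t[slow]) break;   // within ~5%: done

        // shift one block of work from the slowest to the fastest client
        if (blocksPerClient[slow] > 1)
        {
            --blocksPerClient[slow];
            ++blocksPerClient[fast];
        }
    }
}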