CUDA math function register usage
Question
I am trying to understand the significant register usage incurred when using a few of the built-in CUDA math ops like atan2() or division, and how the register usage might be reduced/eliminated.
I'm using the following program:
#include <stdint.h>
#include <cuda_runtime.h>

extern "C" {
__global__ void kernel(float* out) {
    uint32_t n = threadIdx.x + blockIdx.x * blockDim.x;
    out[n] = atan2f(static_cast<float>(n), 2.0f);
}
}

int main(int argc, char const* argv[]) {
    float* d_ary;
    cudaMalloc(&d_ary, 32 * sizeof(float));  // 32 floats, not 32 bytes
    kernel<<<1,32>>>(d_ary);
    float ary[32];
    cudaMemcpy(ary, d_ary, 32 * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_ary);
}
}
and building it with:
nvcc -arch=sm_80 -Xptxas="-v" kernel.cu
Profiling the kernel produces results in the image attached below.
The massive spike in register usage occurs when atan2() is called (or some function within atan2), increasing the register count by more than 100. As far as I can tell this seems to be due to the fact that atan2() is not inlined. Is there any way to get these more expensive floating point operations inlined, other than resorting to compiler flags like --use_fast_math?
EDIT:
@njuffa pointed out that the function call causing the register spike is associated with a slow path taken within atan2, which calls into an internal CUDA function that is not inlined. After some testing, the register spike seems to be associated with any non-inlined function call (CALL.ABS.NOINC). Any device function decorated with __noinline__ results in the same phenomenon. Further, nested __noinline__ calls result in the live register count reported by Nsight increasing even further, up to the cap of 255.
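The nested-__noinline__ experiment described above can be reproduced with a sketch like the following (illustrative function names, not code from the original post):

```cuda
#include <stdint.h>

// Two nested non-inlined device functions: each call compiles to a
// CALL.ABS.NOINC in SASS, matching the pattern described above.
__device__ __noinline__ float inner(float x) { return x * 2.0f + 1.0f; }
__device__ __noinline__ float outer(float x) { return inner(x) + 0.5f; }

extern "C" __global__ void kernel(float* out) {
    uint32_t n = threadIdx.x + blockIdx.x * blockDim.x;
    out[n] = outer(static_cast<float>(n));
}
```

Compiled with `nvcc -arch=sm_80 -Xptxas="-v"`, this is the kind of kernel for which Nsight reported the inflated live-register counts described in the edit.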
Answer 1
Score: 0
I posted about this on the Nsight Compute forums and was informed that it is a bug and will be fixed in a future release.
Link: https://forums.developer.nvidia.com/t/contraditory-register-count-report-when-calling-a-non-inlined-function/259908