CUDA math function register usage

Question


I am trying to understand the significant register usage incurred when using a few of the built-in CUDA math ops like atan2() or division and how the register usage might be reduced/eliminated.

I'm using the following program:

#include <stdint.h>
#include <cuda_runtime.h>

extern "C" {
	__global__ void kernel(float* out) {
		uint32_t n = threadIdx.x + blockIdx.x*blockDim.x;
		out[n] = atan2f(static_cast<float>(n), 2.0f);
	}
}

int main(int argc, char const* argv[]) {
	float* d_ary;
	cudaMalloc(&d_ary, 32*sizeof(float));  // 32 floats, not 32 bytes
	kernel<<<1,32>>>(d_ary);
	float ary[32];
	cudaMemcpy(ary, d_ary, 32*sizeof(float), cudaMemcpyDeviceToHost);
	cudaFree(d_ary);
}

and building it with:

nvcc -arch=sm_80 -Xptxas="-v" kernel.cu

Profiling the kernel produces results in the image attached below.

The massive spike in register usage occurs when atan2() is called (or some function within atan2()), increasing the register count by more than 100. As far as I can tell, this is because atan2() is not inlined. Is there any way to get these more expensive floating-point operations inlined, other than resorting to compiler flags like --use_fast_math?

[Image: profiler output showing the register-usage spike at the atan2 call]

EDIT:

@njuffa pointed out that the function call causing the register spike is associated with a slow path taken within atan2, which calls into an internal CUDA function that is not inlined. After some testing, the register spike seems to be associated with any non-inlined function call (CALL.ABS.NOINC). Any device function decorated with __noinline__ results in the same phenomenon. Further, nested __noinline__ calls result in the live register count reported by Nsight increasing even further, up to the cap of 255.
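A minimal reproducer for the effect described above (my own sketch, not code from the original post) is any kernel that calls a `__noinline__` device function:

```cuda
#include <stdint.h>

// The __noinline__ qualifier forces an ABI function call,
// which shows up as CALL.ABS.NOINC in the SASS.
__device__ __noinline__ float add_one(float x) {
    return x + 1.0f;
}

__global__ void noinline_kernel(float* out) {
    uint32_t n = threadIdx.x + blockIdx.x * blockDim.x;
    out[n] = add_one(out[n]);
}
```

Compiling this with `nvcc -arch=sm_80 -cubin` and disassembling with `cuobjdump -sass` should show the CALL.ABS.NOINC instruction, and Nsight Compute reports the inflated live-register count at that call site.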

Answer 1

Score: 0

I posted about this on the Nsight Compute forums and was informed that it is a bug and will be fixed in a future release.

https://forums.developer.nvidia.com/t/contraditory-register-count-report-when-calling-a-non-inlined-function/259908

Posted by huangapple on 2023-07-14 08:21:00. Original link: https://go.coder-hub.com/76683998.html