CUDA math function register usage
Question
I am trying to understand the significant register usage incurred when using a few of the built-in CUDA math ops like atan2() or division, and how the register usage might be reduced/eliminated.
I'm using the following program:
#include <stdint.h>
#include <cuda_runtime.h>

extern "C" {
__global__ void kernel(float* out) {
    uint32_t n = threadIdx.x + blockIdx.x * blockDim.x;
    out[n] = atan2f(static_cast<float>(n), 2.0f);
}
}

int main(int argc, char const* argv[]) {
    float* d_ary;
    cudaMalloc(&d_ary, 32 * sizeof(float));  // 32 floats, not 32 bytes
    kernel<<<1,32>>>(d_ary);
    float ary[32];
    cudaMemcpy(ary, d_ary, 32 * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_ary);
}
}
and building it with:
nvcc -arch=sm_80 -Xptxas="-v" kernel.cu
Profiling the kernel produces results in the image attached below.
The massive spike in register usage occurs when atan2() is called (or some function within atan2), increasing the register count by more than 100. As far as I can tell this seems to be due to the fact that atan2() is not inlined. Is there any way to get these more expensive floating point operations inlined, other than resorting to compiler flags like --use_fast_math?
EDIT:
@njuffa pointed out that the function call causing the register spike is associated with a slow path taken within atan2, which calls into an internal CUDA function that is not inlined. After some testing, the register spike seems to be associated with any non-inlined function call (CALL.ABS.NOINC). Any device function decorated with __noinline__ results in the same phenomenon. Further, nested __noinline__ calls result in the live register count reported by Nsight increasing even further, up to the cap of 255.
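The nested-__noinline__ experiment described above can be reproduced with a sketch like the following (illustrative function names, not code from the original post):

```cuda
#include <stdint.h>

// Two nested non-inlined device functions: each call compiles to a
// CALL.ABS.NOINC in SASS, matching the pattern described above.
__device__ __noinline__ float inner(float x) { return x * 2.0f + 1.0f; }
__device__ __noinline__ float outer(float x) { return inner(x) + 0.5f; }

extern "C" __global__ void kernel(float* out) {
    uint32_t n = threadIdx.x + blockIdx.x * blockDim.x;
    out[n] = outer(static_cast<float>(n));
}
```

Compiled with `nvcc -arch=sm_80 -Xptxas="-v"`, this is the kind of kernel for which Nsight reported the inflated live-register counts described in the edit.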
Answer 1
Score: 0
I posted about this on the Nsight Compute forums and was informed that it is a bug and will be fixed in a future release.
Link: https://forums.developer.nvidia.com/t/contraditory-register-count-report-when-calling-a-non-inlined-function/259908