How are registers allocated to threads inside a GPU?

Question


How is the number of registers per thread decided inside the GPU? I want to know: if the GPU has 65536 registers per SM that it can allocate among the threads, do these registers all get allocated to the active thread block running on the SM? Right now I have a CUDA program with 1024 threads per thread block and 65536 available registers per block. My confusion is that the profiler says each thread only gets 40 registers. Another observation is that each thread actually makes use of exactly 64 registers in its assembly code, which means the performance could have been better if each thread had been assigned that many registers. Why doesn't it get 64? Who makes this decision? Is it decided at compile time per compute capability, at runtime, etc.?

Edit:
Here is the sample code and its assembly. I'm looking at %f64 at the end of the code to draw the conclusion above.
https://godbolt.org/z/eMzW8dY19

Answer 1

Score: 3

"在GPU内部,每个线程的寄存器数量是如何确定的?

实际(非PTX虚拟)寄存器分配是在运行ptxas工具(nvcc编译器驱动工具链的一部分)或等效工具时确定的,该工具是驱动API加载程序或NVRTC机制的一部分。

ptxas是将PTX转换为SASS(机器码)的工具。SASS实际上在GPU上运行,而PTX不会。必须首先将PTX转换为SASS。

PTX和PTX中的虚拟寄存器系统对于理解这些概念并不有用。在PTX中可以定义的虚拟寄存器数量基本上没有限制,而在PTX中定义的虚拟寄存器数量对GPU硬件中实际寄存器的使用方式一点信息都没有。PTX对于这类研究并不有用。

这个时候寄存器的分配完全是静态确定的。您可以通过将-Xptxas=-v编译开关传递给nvcc来获取一些证据,当您的nvcc编译命令已经指定了一个有效的SASS目标时。没有运行时的变化性(忽略通过CUDA JIT PTX->SASS转换机制产生的“变化性”;这里关注的是SASS而不是PTX。一旦定义了SASS,就没有运行时的变化性了)。

这些寄存器是否都分配给了在SM上运行的活动线程块?

分配的寄存器数量将由每个线程的寄存器数、一些粒度/取整效应以及每个线程块的线程数(即这两者的乘积)决定。这些寄存器的数量将从SM中的总寄存器中“划出”,在一个线程块被CUDA工作分配器(CWD或CUDA块调度器)“放置”在该SM上的时候,CWD会进行这个操作。只有在有足够数量的寄存器可用于分配时,CWD才会将一个线程块放置在那个SM上。

并非所有寄存器(例如65536或SM容量)都会自动或总是为一个单独的线程块分配。这将取决于该线程块的实际需求。如果CWD决定在该SM上放置另一个线程块,那么剩余/未分配的寄存器可以在将来使用。CUDA SM具有同时支持多个线程块的能力,并为每个线程块分配寄存器。除非有足够数量的未分配寄存器以满足潜在线程块的需求,否则CWD不会在该SM上放置新的线程块。

我的困惑是,分析器显示每个线程只有40个寄存器。另一个观察是,每个线程在其汇编代码中实际使用了确切的64个寄存器,

分析器报告的数字是正确的(其中包括可能包含在-Xptxas=-v输出中的粒度/取整效应)。您的困惑在于您试图通过PTX来理解发生了什么。不要这样做。对于这次讨论,这是不相关的。"

英文:

>How is the number of registers per thread decided inside the GPU?

Actual (non-PTX-virtual) register assignments are determined at the point of running the ptxas tool on your code (part of the nvcc compiler driver toolchain), or the equivalent tool as part of the driver API loader or the NVRTC mechanism.

ptxas is the tool that converts PTX to SASS (machine code). SASS is what actually runs on the GPU; PTX is not. PTX must first be converted to SASS.

PTX and the virtual register system in PTX are not useful for understanding these concepts. There is essentially no limit to the number of virtual registers that can be defined in PTX, and the number of virtual registers defined in PTX tells you nothing at all about how actual registers will be used in GPU hardware. PTX is not useful for this sort of study.

The register assignments are entirely statically determined at this point. You can get some evidence of this by passing the -Xptxas=-v compile switch to nvcc when your nvcc compile command specifies a valid SASS target. There is no runtime variability (ignoring the "variability" that would come about via the CUDA JIT PTX->SASS conversion mechanism; the item in focus here is SASS, not PTX. Once the SASS is defined, there is no runtime variability).
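
As a concrete illustration (my own sketch, not part of the original answer; the file and kernel names are made up), per-thread register usage can be inspected, and if desired constrained, entirely at compile time:

```
// saxpy.cu -- a made-up example kernel.
//
// Compile for a real SASS target with register reporting, e.g.:
//   nvcc -arch=sm_70 -Xptxas=-v -c saxpy.cu
// ptxas then prints the per-thread register count it settled on for this kernel.
// To look at the resulting SASS itself (rather than PTX), one option is:
//   cuobjdump -sass saxpy.o
// The per-thread count can be capped globally with nvcc's -maxrregcount option,
// or per kernel with __launch_bounds__; capping too aggressively forces spills
// to local memory, so it is a trade-off rather than a free win.
__global__ void
__launch_bounds__(1024)   // promise ptxas this kernel never launches with more than 1024 threads/block
saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```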

>do these registers all get allocated to the active thread block running on the SM?

The number of registers allocated will be determined by the registers per thread, some granularity/rounding effects, and the number of threads per threadblock (i.e. the product of these). This quantity of registers will be "carved out" of the total available in the SM, at the point at which a threadblock is "deposited" on that SM, by the CUDA Work Distributor (CWD or CUDA block scheduler). The CWD will not deposit a block until a sufficient number of registers are available to be allocated.
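
To make the arithmetic concrete with the numbers from the question: 40 registers/thread × 1024 threads/block = 40960 registers per block, so a 65536-register SM has room for only one such block at a time (the remaining 24576 registers are not enough for a second 1024-thread block). Below is a minimal sketch of my own (toy_kernel and the hard-coded 1024 threads/block are placeholders) that queries the per-thread count ptxas chose and computes this product at runtime:

```
#include <cstdio>
#include <cuda_runtime.h>

// Made-up kernel, only here so the example is self-contained.
__global__ void toy_kernel(float *out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = threadIdx.x * 2.0f;
}

int main()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, toy_kernel);   // attr.numRegs = registers per thread chosen by ptxas

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);          // prop.regsPerMultiprocessor = register file size per SM

    const int threadsPerBlock = 1024;
    // Ignoring allocation granularity: registers one resident block of this kernel consumes.
    int regsPerBlock = attr.numRegs * threadsPerBlock;
    printf("regs/thread: %d, regs/block: %d, regs/SM: %d, blocks that fit by registers alone: %d\n",
           attr.numRegs, regsPerBlock, prop.regsPerMultiprocessor,
           prop.regsPerMultiprocessor / regsPerBlock);
    return 0;
}
```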

The entire complement of registers (e.g. 65536, or whatever the SM capacity is) is not automatically or always allocated for a single threadblock. It will depend on the actual needs of that threadblock. Remaining/unallocated registers can be used in the future if the CWD decides to deposit another threadblock on that SM. CUDA SMs have the ability to support multiple threadblocks simultaneously, with registers allocated for each. Unless unallocated registers are available in sufficient quantity to meet the needs of a prospective threadblock, the CWD will not deposit a new threadblock on that SM.
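
The runtime can also report the net effect of these limits directly. Here is a short sketch under the same assumptions as above (the occupancy API accounts for registers, shared memory, and the hardware block/warp limits together, so its answer can be lower than the register arithmetic alone suggests):

```
#include <cstdio>
#include <cuda_runtime.h>

// Made-up kernel again, just to keep this sketch self-contained.
__global__ void toy_kernel(float *out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = threadIdx.x * 2.0f;
}

int main()
{
    int blocksPerSM = 0;
    // How many 1024-thread blocks of this kernel can be resident on one SM at once,
    // given its register and shared-memory requirements?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, toy_kernel,
                                                  /*blockSize=*/1024,
                                                  /*dynamicSMemSize=*/0);
    printf("resident blocks per SM at 1024 threads/block: %d\n", blocksPerSM);
    return 0;
}
```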

>My confusion is, the profiler says each thread only gets 40 registers. Another observation is that each thread actually makes use of exactly 64 registers in its assembly code,

The profiler-reported number is correct (and it includes the granularity/rounding effects, which may or may not be included in the -Xptxas=-v output). Your confusion is that you are attempting to understand what is happening via the PTX. Do not do that. It is irrelevant for this discussion.
