Confusion around the number of CUDA cores and the number of parallel threads

Question

I have an NVIDIA Corporation TU117 [GeForce GTX 1650] as my GPU. The specs are as below.

Number of SMs = 14
Number of CUDA cores per SM = 64
Total number of CUDA cores = 64 * 14 = 896

I have read that blocks are executed sequentially and that one SM can execute one block at once. Each CUDA core can execute 32 threads (a warp) simultaneously. So, each SM can execute 64 * 32 = 2048 threads simultaneously. But the max number of threads per block is 1024. So, when I assign one block to one SM (as told earlier, I cannot assign multiple blocks to one SM), that means the SM can run at most 1024 threads at once (limited by the max block thread size).

Doesn't that mean the SM only runs at half (1024/2048) of its max capacity?

So, the max number of threads I can run simultaneously is 1024 (max block thread size) times 14 (number of SMs). Is that correct?

Also, is the data given here for the NVIDIA Corporation TU117 [GeForce GTX 1650] correct?

Answer 1

Score: 2

>I have read that blocks are executed sequentially and that one SM can execute one block at once

I'm not sure where you read that. It's incorrect. An SM can have multiple blocks resident, and on a cycle-by-cycle basis the warp schedulers in an SM can choose any assigned warp to issue instructions from.
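As a hedged illustration of multiple resident blocks, the sketch below launches 256-thread blocks; with Turing's 1024-resident-thread limit per SM, up to 1024/256 = 4 such blocks can be resident on one SM at the same time (fewer if register or shared-memory usage limits occupancy). The kernel name and sizes are illustrative assumptions, not figures from the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel; the name and the work it does are placeholders.
__global__ void scale(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * i;
}

int main() {
    const int threads_per_block = 256;
    const int blocks = 56;                 // 4 blocks per SM across 14 SMs
    const int n = threads_per_block * blocks;

    float *d_out;
    cudaMalloc(&d_out, n * sizeof(float));

    // All 56 blocks can be resident at once: 4 blocks * 256 threads = 1024
    // threads per SM, which is exactly the Turing residency limit.
    scale<<<blocks, threads_per_block>>>(d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    return 0;
}
```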

>Each CUDA core can execute 32 threads (a warp) simultaneously.

No, that is not correct. A CUDA core, as that term is used by NVIDIA marketing, most closely corresponds to a single-precision (FP32) functional unit in an SM. On any given clock cycle, a CUDA core can be issued a single FADD, FMUL, or FFMA instruction for a single thread in a warp. Handling such an instruction warp-wide in a single cycle therefore requires 32 CUDA cores, just for that one warp, for that one instruction.
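The arithmetic behind that point can be sketched as follows, using the TU117 figures from the question (treat them as assumptions rather than authoritative specs):

```cuda
#include <cstdio>

int main() {
    const int fp32_cores_per_sm = 64;  // "CUDA cores" per TU117 SM
    const int warp_size         = 32;  // threads per warp

    // Issuing one FP32 instruction (FADD/FMUL/FFMA) warp-wide in a single
    // cycle occupies 32 cores, so 64 cores cover FP32 issue for at most
    // two warps per cycle -- not 32 threads per core.
    printf("warps that can be issued FP32 work per cycle: %d\n",
           fp32_cores_per_sm / warp_size);
    return 0;
}
```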

>So, each SM can execute 64 times 32 = 2048 threads simultaneously.

The previous statements are incorrect, and so is this calculation. That is not how you compute the number of simultaneous threads an SM can handle. A Turing SM can handle at most 1024 threads, as indicated here ("maximum number of resident threads per SM"). It's a hardware limit.

>Doesn't that mean the SM only runs at half (1024/2048) of its max capacity?

The max capacity of the Turing SM is 1024 threads. There is no inconsistency.

>So, the max number of threads I can run simultaneously is 1024 (max block thread size) times 14 (number of SMs). Is that correct?

Yes, the maximum number of threads that can be simultaneously "in flight" on any CUDA GPU is the maximum number of threads per SM times the number of SMs. This doesn't mean a kernel launch is limited to that many threads. Additional threads will "wait in the wings" until threads currently running on an SM complete and retire, making room for more threads to be scheduled.
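For the TU117 numbers in the question, that works out as below; the oversized launch shown is a hypothetical example to illustrate that a grid may request far more threads than can be in flight:

```cuda
#include <cstdio>

int main() {
    const int sms                = 14;    // SMs on TU117 (from the question)
    const int max_threads_per_sm = 1024;  // Turing resident-thread limit

    // 14 * 1024 = 14336 threads can be in flight at once on this GPU.
    printf("max threads in flight: %d\n", sms * max_threads_per_sm);

    // A larger launch is still legal; surplus blocks simply wait their turn.
    const int blocks = 1000, threads = 256;
    printf("threads requested by a <<<1000, 256>>> launch: %d\n",
           blocks * threads);
    return 0;
}
```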

>Also, is the data given here on NVIDIA Corporation TU117 [GeForce GTX 1650] correct?

You can get the number of SMs and the max number of threads per multiprocessor on any CUDA GPU by running the deviceQuery sample code.
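If you would rather query those properties directly than run deviceQuery, a minimal sketch using the CUDA runtime API looks like this (device 0 is assumed):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // device 0
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("SMs (multiProcessorCount):      %d\n", prop.multiProcessorCount);
    printf("max threads per multiprocessor: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("max threads per block:          %d\n", prop.maxThreadsPerBlock);
    printf("warp size:                      %d\n", prop.warpSize);
    return 0;
}
```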

Unit 3 of this online training series may be of interest, as it covers GPU architecture and warp-scheduling details. In addition, there are numerous questions under the cuda SO tag that cover these topics. Here is one example; there are many others.

huangapple
  • Published on 2023-07-13 17:52:14
  • Please retain this link when reposting: https://go.coder-hub.com/76678083.html