Confusion around the number of CUDA cores and the number of parallel threads

Question

I have an NVIDIA Corporation TU117 [GeForce GTX 1650] as my GPU. The specs are as below.

Number of SMs = 14
Number of CUDA cores per SM = 64
Total number of CUDA cores = 64 * 14 = 896

I have read that blocks are executed sequentially and that one SM can execute one block at once. Each CUDA core can execute 32 threads (a warp) simultaneously. So, each SM can execute 64 * 32 = 2048 threads simultaneously. But the max number of threads per block is 1024. So, when I assign one block to one SM (as told earlier, I cannot assign multiple blocks to one SM), that means the SM can run at most 1024 threads at once (limited by the max block thread size).

Doesn't that mean the SM only runs at half (1024/2048) of its max capacity?

So, the max number of threads I can run simultaneously is 1024 (max block thread size) times 14 (number of SMs). Is that correct?

Also, is the data given here for the NVIDIA Corporation TU117 [GeForce GTX 1650] correct?

Answer 1

Score: 2

>I have read that blocks are executed sequentially and that one SM can execute one block at once

I'm not sure where you read that. It's incorrect. An SM can have multiple blocks resident, and on a cycle-by-cycle basis the warp schedulers in an SM can choose any assigned warp to issue instructions from.
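As a hedged illustration of multiple resident blocks, the sketch below launches 256-thread blocks; with Turing's 1024-resident-thread limit per SM, up to 1024/256 = 4 such blocks can be resident on one SM at the same time (fewer if register or shared-memory usage limits occupancy). The kernel name and sizes are illustrative assumptions, not figures from the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel; the name and the work it does are placeholders.
__global__ void scale(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * i;
}

int main() {
    const int threads_per_block = 256;
    const int blocks = 56;                 // 4 blocks per SM across 14 SMs
    const int n = threads_per_block * blocks;

    float *d_out;
    cudaMalloc(&d_out, n * sizeof(float));

    // All 56 blocks can be resident at once: 4 blocks * 256 threads = 1024
    // threads per SM, which is exactly the Turing residency limit.
    scale<<<blocks, threads_per_block>>>(d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    return 0;
}
```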

>Each CUDA core can execute 32 threads (a warp) simultaneously.

No, that is not correct. A CUDA core, as that term is used by NVIDIA marketing, most closely corresponds to a single-precision (FP32) functional unit in an SM. On any given clock cycle, a CUDA core can be issued a single FADD, FMUL, or FFMA instruction for a single thread in a warp. Handling such an instruction warp-wide in a single cycle therefore requires 32 CUDA cores, just for that one warp, for that one instruction.
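The arithmetic behind that point can be sketched as follows, using the TU117 figures from the question (treat them as assumptions rather than authoritative specs):

```cuda
#include <cstdio>

int main() {
    const int fp32_cores_per_sm = 64;  // "CUDA cores" per TU117 SM
    const int warp_size         = 32;  // threads per warp

    // Issuing one FP32 instruction (FADD/FMUL/FFMA) warp-wide in a single
    // cycle occupies 32 cores, so 64 cores cover FP32 issue for at most
    // two warps per cycle -- not 32 threads per core.
    printf("warps that can be issued FP32 work per cycle: %d\n",
           fp32_cores_per_sm / warp_size);
    return 0;
}
```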

>So, each SM can execute 64 times 32 = 2048 threads simultaneously.

The previous statements are incorrect, and so is this calculation. That is not how you compute the number of simultaneous threads an SM can handle. A Turing SM can handle at most 1024 threads, as indicated here ("maximum number of resident threads per SM"). It's a hardware limit.

>Doesn't that mean the SM only runs at half (1024/2048) of its max capacity?

The max capacity of the Turing SM is 1024 threads. There is no inconsistency.

>So, the max number of threads I can run simultaneously is 1024 (max block thread size) times 14 (number of SMs). Is that correct?

Yes, the maximum number of threads that can be simultaneously "in flight" on any CUDA GPU is the maximum number of threads per SM times the number of SMs. This doesn't mean a kernel launch is limited to that many threads. Additional threads will "wait in the wings" until threads currently running on an SM complete and retire, making room for more threads to be scheduled.
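For the TU117 numbers in the question, that works out as below; the oversized launch shown is a hypothetical example to illustrate that a grid may request far more threads than can be in flight:

```cuda
#include <cstdio>

int main() {
    const int sms                = 14;    // SMs on TU117 (from the question)
    const int max_threads_per_sm = 1024;  // Turing resident-thread limit

    // 14 * 1024 = 14336 threads can be in flight at once on this GPU.
    printf("max threads in flight: %d\n", sms * max_threads_per_sm);

    // A larger launch is still legal; surplus blocks simply wait their turn.
    const int blocks = 1000, threads = 256;
    printf("threads requested by a <<<1000, 256>>> launch: %d\n",
           blocks * threads);
    return 0;
}
```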

>Also, is the data given here on NVIDIA Corporation TU117 [GeForce GTX 1650] correct?

You can get the number of SMs and the max number of threads per multiprocessor on any CUDA GPU by running the deviceQuery sample code.
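If you would rather query those properties directly than run deviceQuery, a minimal sketch using the CUDA runtime API looks like this (device 0 is assumed):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // device 0
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("SMs (multiProcessorCount):      %d\n", prop.multiProcessorCount);
    printf("max threads per multiprocessor: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("max threads per block:          %d\n", prop.maxThreadsPerBlock);
    printf("warp size:                      %d\n", prop.warpSize);
    return 0;
}
```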

Unit 3 of this online training series may be of interest, as it covers GPU architecture and warp-scheduling details. In addition, there are numerous questions under the cuda SO tag that cover these topics. Here is one example; there are many others.

huangapple
  • Published on 2023-07-13 17:52:14
  • Please retain this link when reposting: https://go.coder-hub.com/76678083.html