July 24, 2023, 17:05:21

Are blocks of only one thread efficient?

Question

Assuming there is no inter-thread communication and no other process is using the GPU, which is faster: launching N blocks of 1 thread each, or N/32 blocks of 32 (warp size) threads each? I am interested in two cases: N <= 32, and 32 < N < 32 * number_of_SMs.

I'm assuming a block size of 32 is optimal in terms of latency when N is small enough, since a warp is executed in parallel by a single SM. Please let me know if there are better block sizes.

Answer 1

Score: 2

> Are blocks of only one thread efficient?

No, almost by definition. GPUs are processors built for parallel execution, particularly within a single physical core (which executes complete blocks of threads). See this answer.
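
To put a rough number on this, here is a minimal host-only C++ sketch (plain arithmetic, no GPU required). It assumes a warp width of 32, one work item per thread, and that every block occupies a whole number of warps. `LaunchShape` and `describe_launch` are hypothetical names made up for this illustration and are not part of CUDA.

```cpp
// Assumption: warps are 32 threads wide.
constexpr int kWarpSize = 32;

struct LaunchShape {
    int blocks;         // number of blocks in the grid
    int warps;          // total warps the hardware has to schedule
    double lane_usage;  // fraction of warp lanes doing useful work
};

// Hypothetical helper: describes a launch of n independent work items
// with block_size threads per block and one item per thread.
LaunchShape describe_launch(int n, int block_size) {
    // ceil(n / block_size) blocks are needed to cover all items.
    int blocks = (n + block_size - 1) / block_size;
    // A block always occupies whole warps, even if it has a single thread.
    int warps_per_block = (block_size + kWarpSize - 1) / kWarpSize;
    int warps = blocks * warps_per_block;
    double lane_usage =
        static_cast<double>(n) / (static_cast<double>(warps) * kWarpSize);
    return {blocks, warps, lane_usage};
}
```

Under these assumptions, `describe_launch(256, 1)` gives 256 blocks, 256 warps and a lane usage of 1/32, while `describe_launch(256, 32)` gives 8 blocks, 8 warps and a lane usage of 1.0. This only counts idle lanes; it says nothing about actual run time, which still has to be measured.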

> which is faster, launching N blocks of 1 thread, or N/32 blocks of 32 (warp size) threads?

N/32 blocks of 32 threads - for reasonable N; but in your case...

> when N <= 32

If your problem size is so small, the question becomes practically irrelevant. "Don't sweat the small stuff": Profile your app and see what actually takes the bulk of time.

Now, if you've partitioned your problem into N units of independent work, each of which takes a huge amount of time, with no communication among the threads doing that work, then your partition of the problem is inefficient. Let more threads share what is now a single thread's work.

> 32 < N < 32 * number_of_SMs

Still too small, same point as before.

> I'm assuming block size of 32 is optimal

It usually isn't.

> Please let me know if there are better block sizes.

It depends on the specifics of your kernel. See also this answer. While you can often estimate a good choice of block dimensions, you usually also need to check empirically, if only to verify your assumptions.

Answer 2

Score: 0

A block of one thread is the least efficient choice!
A good starting point is 32 threads per block.
I usually have a test phase at the start of my simulations where I try different configurations, typically 32, 64, 128, 256, 512 and 1024 threads per block, to find the most efficient one. Bear in mind that the best choice may change depending on the computational load and memory pressure.
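
Such a test phase could look roughly like the host-side C++ sketch below. This is not the actual code from my simulations, only an illustration. `pick_block_size` is a hypothetical helper, and `measure_ms` is an assumed callback that launches the kernel with the given block size, waits for it to finish, and returns the elapsed time in milliseconds (in a CUDA program it would typically wrap the launch in event-based timing).

```cpp
#include <functional>
#include <limits>
#include <vector>

// Hypothetical helper: returns the candidate block size with the lowest
// measured time, or -1 if there are no candidates.
int pick_block_size(const std::vector<int>& candidates,
                    const std::function<double(int)>& measure_ms,
                    int repeats = 5) {
    int best_size = -1;
    double best_ms = std::numeric_limits<double>::infinity();
    for (int block_size : candidates) {
        // Keep the fastest of several runs to damp warm-up and timing noise.
        double fastest = std::numeric_limits<double>::infinity();
        for (int r = 0; r < repeats; ++r) {
            double ms = measure_ms(block_size);
            if (ms < fastest) fastest = ms;
        }
        if (fastest < best_ms) {
            best_ms = fastest;
            best_size = block_size;
        }
    }
    return best_size;
}
```

It would be called with the candidate list, for example `pick_block_size({32, 64, 128, 256, 512, 1024}, measure_ms)`. Since the best choice can shift with the workload, it may be worth repeating the sweep whenever the problem size or the kernel changes.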
