Are blocks of only one thread efficient?

Question

Assuming there is no inter-thread communication and no other processes on the GPU, which is faster: launching N blocks of 1 thread, or N/32 blocks of 32 (warp size) threads, when N <= 32 and when 32 < N < 32 * number_of_SMs, respectively?

I'm assuming a block size of 32 is optimal in terms of latency when N is small enough, since a warp will be executed in parallel by the same SM. Please let me know if there are better block sizes.

Answer 1

Score: 2


> Are blocks of only one thread efficient?

No, almost by definition. GPUs are processors focused on parallel execution, particularly within a single physical core (which executes complete blocks of threads). See this answer.
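To put a rough number on this: the hardware schedules threads in warps of 32, so a one-thread block still occupies a full warp and leaves 31 lanes idle. The sketch below is only an illustration. `scale_kernel`, `lane_utilization`, and the two launcher functions are hypothetical names, and the warp size of 32 is hard-coded as an assumption.

```cuda
#include <cuda_runtime.h>

// Hypothetical per-element work: y[i] = 2 * x[i] + 1.
__global__ void scale_kernel(const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = 2.0f * x[i] + 1.0f;
    }
}

// Fraction of allocated warp lanes that do useful work when n items are
// processed by blocks of block_size threads (warp size assumed to be 32).
__host__ inline double lane_utilization(int n, int block_size)
{
    const int warp_size = 32;
    int blocks = (n + block_size - 1) / block_size;
    int warps_per_block = (block_size + warp_size - 1) / warp_size;
    long long lanes = (long long)blocks * warps_per_block * warp_size;
    return (double)n / (double)lanes;
}

// Same work, two launch shapes (n must be positive; x and y are device pointers).
void launch_one_thread_blocks(const float* x, float* y, int n)
{
    scale_kernel<<<n, 1>>>(x, y, n);        // n blocks of 1 thread each
}

void launch_warp_sized_blocks(const float* x, float* y, int n)
{
    int blocks = (n + 31) / 32;             // ceil(n / 32)
    scale_kernel<<<blocks, 32>>>(x, y, n);  // blocks of one full warp
}
```

With these definitions, `lane_utilization(1024, 1)` is 1/32 while `lane_utilization(1024, 32)` is 1.0. Lane utilization is only one factor, so the actual timing difference still has to be measured.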

> which is faster, launching N blocks of 1 thread, or N/32 blocks of 32 (warp size) threads?

N/32 blocks of 32 threads, for reasonable N; but in your case...

> when N <= 32

If your problem size is so small, the question becomes practically irrelevant. "Don't sweat the small stuff": profile your app and see what actually takes the bulk of the time.

Now, if you've partitioned your problem into N units of independent work that each take a huge amount of time, with no communication among the threads doing that work, then your partition of the problem is inefficient. Let more threads work on what is now a single thread's work.
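One way to read "let more threads work on a single thread's work" is sketched below. It assumes, beyond what the question states, that each unit is an array of `m` elements that can be processed independently. All names here (`process`, `per_unit_kernel`, `per_element_kernel`, `blocks_for`) are hypothetical.

```cuda
#include <cuda_runtime.h>

// Hypothetical per-element operation standing in for the real work.
__device__ inline float process(float v)
{
    return v * v + 1.0f;
}

// Before: one thread per unit; each thread walks its m elements serially.
__global__ void per_unit_kernel(const float* in, float* out, int n_units, int m)
{
    int u = blockIdx.x * blockDim.x + threadIdx.x;
    if (u >= n_units) return;
    for (int j = 0; j < m; ++j) {
        out[u * m + j] = process(in[u * m + j]);
    }
}

// After: one thread per element, so n_units * m threads share the work.
__global__ void per_element_kernel(const float* in, float* out, long long total)
{
    long long i = (long long)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < total) {
        out[i] = process(in[i]);
    }
}

// Number of blocks needed to cover `total` elements with block_size threads each.
__host__ inline int blocks_for(long long total, int block_size)
{
    return (int)((total + block_size - 1) / block_size);
}
```

For example, 16 units of 4096 elements would be launched as `per_element_kernel<<<blocks_for(16LL * 4096, 256), 256>>>(in, out, 16LL * 4096)`, which gives the GPU 65536 threads to schedule instead of 16. If the work inside a unit is not element-wise independent, a different decomposition is needed.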

> 32 < N < 32 * number_of_SMs

Still too small, same point as before.

> I'm assuming block size of 32 is optimal

It usually isn't.

> Please let me know if there are better block sizes.

It depends on the specifics of your kernel. See also this answer. And while you can often estimate a good choice of block dimensions, you usually also need to check it empirically, if only to verify your assumptions.
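For the estimation step, one option (not something the answer itself prescribes) is the CUDA runtime's occupancy helper, `cudaOccupancyMaxPotentialBlockSize`. In the minimal sketch below, `my_kernel` is a hypothetical placeholder for your own kernel and `suggested_block_size` is an illustrative wrapper.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical placeholder; substitute the kernel you actually want to tune.
__global__ void my_kernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] += 1.0f;
    }
}

// Returns the block size the occupancy API suggests for my_kernel,
// or -1 if the query fails (for example, when no CUDA device is available).
int suggested_block_size()
{
    int min_grid_size = 0;  // smallest grid that can reach full occupancy
    int block_size = 0;     // suggested threads per block
    cudaError_t err = cudaOccupancyMaxPotentialBlockSize(
        &min_grid_size, &block_size, my_kernel,
        0,   // dynamic shared memory per block, in bytes
        0);  // 0 means no upper limit on the block size
    if (err != cudaSuccess) {
        fprintf(stderr, "occupancy query failed: %s\n", cudaGetErrorString(err));
        return -1;
    }
    return block_size;
}
```

The value returned maximizes theoretical occupancy only. It is a starting point for the empirical check, not a replacement for it.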

Answer 2

Score: 0


A block of one thread is the most inefficient choice!
The starting point is 32 threads.
I usually have a test phase at the start of my simulations where I try different launch configurations, typically 32, 64, 128, 256, 512, and 1024 threads per block, to find the most efficient one. Bear in mind that the best choice may change depending on the computational load and memory pressure.
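A test phase like this could look roughly as follows. This is a sketch under assumptions, not the author's actual code: `step_kernel` is a hypothetical stand-in for the simulation kernel, and `time_block_size` and `pick_block_size` are illustrative helper names. Timing uses CUDA events.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical stand-in for the real simulation kernel.
__global__ void step_kernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = data[i] * 1.0001f + 0.5f;
    }
}

// Times `reps` launches with one block size; returns milliseconds, or -1 on error.
float time_block_size(float* d_data, int n, int block_size, int reps)
{
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }
    int blocks = (n + block_size - 1) / block_size;

    step_kernel<<<blocks, block_size>>>(d_data, n);  // warm-up launch, not timed
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r) {
        step_kernel<<<blocks, block_size>>>(d_data, n);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = -1.0f;
    if (cudaGetLastError() != cudaSuccess ||
        cudaEventElapsedTime(&ms, start, stop) != cudaSuccess) {
        ms = -1.0f;
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Tries the block sizes 32 through 1024 and returns the fastest one, or -1 on failure.
int pick_block_size(int n, int reps)
{
    const int candidates[] = {32, 64, 128, 256, 512, 1024};
    float* d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return -1;
    cudaMemset(d_data, 0, n * sizeof(float));

    int best = -1;
    float best_ms = 0.0f;
    for (int bs : candidates) {
        float ms = time_block_size(d_data, n, bs, reps);
        if (ms < 0.0f) continue;  // skip configurations that failed
        printf("block size %4d: %.3f ms for %d launches\n", bs, ms, reps);
        if (best < 0 || ms < best_ms) {
            best = bs;
            best_ms = ms;
        }
    }
    cudaFree(d_data);
    return best;
}
```

Calling `pick_block_size(1 << 20, 100)` once at startup prints one timing line per block size and returns the fastest candidate. Because the winner can shift with problem size and memory pressure, it is worth rerunning the sweep when the workload changes.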

huangapple
  • Published on July 24, 2023, 17:05:21
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76752913.html