Randomizing OpenCL/CUDA indexing by some permutation

Question

Typically I work with a float volume of integer dimensions vdimx, vdimy, vdimz in OpenCL. At the beginning of the kernel, I typically have code such as

int i = get_global_id(0);
int j = get_global_id(1);
int k = get_global_id(2);

It seems that the OpenCL runtime schedules i, j, k very close to each other at any given time. So if vdimx = vdimy = vdimz = 1024, the first indices to run are likely to be (0,0,0), (0,0,1), (0,0,2) rather than (123,5,951), (350,3,221), (1,1021,151). But in certain applications, e.g. a CT projector, the closeness of the voxels means they project onto the same pixels, which in turn makes atomic operations slow, as multiple GPU threads try to write to the same position in memory.

I would like to avoid this by having, e.g.,

int i = get_global_id(0);
int j = get_global_id(1);
int k = get_global_id(2);
i = perm(i);
j = perm(j);
k = perm(k);

to randomize the positions of the voxels. Now the question is how to implement the permutation efficiently and correctly. For example, if I do

#define perm(i) ((i * 10) % vdimx)

will I obtain a 1-1 mapping for any vdimx? If I multiply by a prime number that does not divide vdimx, do I obtain a permutation? Is there some other technique to randomize indices?
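For reference on the modular-arithmetic part: the map i → (a·i + b) mod n is a bijection on {0, …, n−1} exactly when gcd(a, n) = 1, so a prime a that does not divide n does yield a permutation, while a = 10 fails for vdimx = 1024 because gcd(10, 1024) = 2. A minimal sketch of such an affine index permutation (CUDA-style; a, b, and n are hypothetical parameters that must be chosen per dimension, with a coprime to n):

__device__ unsigned int perm(unsigned int i, unsigned int a,
                             unsigned int b, unsigned int n)
{
    // 64-bit intermediate so that a * i cannot overflow for large volumes.
    return (unsigned int)(((unsigned long long)a * i + b) % n);
}

For example, perm(i, 997, 0, vdimx) visits every index in [0, vdimx) exactly once, provided vdimx is not a multiple of 997.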

Answer 1

Score: 1

Not exactly an answer to the posed question, but hopefully an answer to the underlying problem: dealing with conflicts.

When you access memory, all threads in a warp need to wait until the data for every thread in the warp has been fetched. If all the data is physically next to each other, that is resolved in one or two memory transactions. But if your data is located at random places, all threads need to wait for 32 memory transactions to complete (not to mention cache thrashing).

That is why I would advise you against randomizing your indices. Instead, focus your efforts on making the writing of the results faster. Try to resolve the atomic conflicts at the warp/block level before committing to global memory.

For example: each thread in a block knows which cell it wants to write to. It then needs to compare its own cell index with the indices that the other threads in the warp or block want to write to. If they are the same, a winner has to be chosen somehow (e.g. the thread with the lowest threadIdx wins). Then only the winners perform an atomic operation on global memory.
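A minimal sketch of the warp-level variant in CUDA (compute capability 7.0 or newer, for __match_any_sync); the counter-increment use case and the names counters and cell are illustrative assumptions, not part of the original answer:

__device__ void warpAggregatedInc(int *counters, int cell)
{
    // Bitmask of the lanes that are currently active in this warp.
    unsigned int active = __activemask();

    // Bitmask of the active lanes whose cell index equals ours.
    unsigned int peers = __match_any_sync(active, cell);

    // Lane index within the warp (assumes a 1-D thread block).
    int lane = threadIdx.x & (warpSize - 1);

    // The lowest lane in each group is the winner.
    int leader = __ffs(peers) - 1;

    if (lane == leader) {
        // One atomic per group instead of one per thread; the winner
        // commits the whole group's contribution at once.
        atomicAdd(&counters[cell], __popc(peers));
    }
}

If each thread additionally needs its own unique slot, the winner's return value from atomicAdd can be broadcast to the group with __shfl_sync(peers, base, leader) and offset by __popc(peers & ((1u << lane) - 1)).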

Picking winners among n threads in a block is something that (I believe) can be done in log-n steps with reduction algorithms, using only registers and shared memory, and it will be faster than the subsequent atomic operations on global memory. I am not sure about the most efficient approach, but a more straightforward one is to simply sort all values by index and then have each thread check whether its predecessor has the same index.

I just found an example of a warp-level bitonic sort over here: https://tschmidt23.github.io/cse599i/CSE%20599%20I%20Accelerated%20Computing%20-%20Programming%20GPUs%20Lecture%2018.pdf
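For the sort-based variant, assuming the (cell, value) pairs of one block have already been sorted by cell in shared memory (for instance with the bitonic sort from the linked lecture, followed by a __syncthreads()), the predecessor check and the per-segment commit could look like this; BLOCK_SIZE, cells, vals, and hist are hypothetical names:

#define BLOCK_SIZE 256

__device__ void commitSortedSegments(const int *cells, const float *vals,
                                     float *hist)
{
    int tid = threadIdx.x;

    // A thread wins its segment if it holds the first occurrence of its
    // cell index in the sorted arrays.
    bool winner = (tid == 0) || (cells[tid] != cells[tid - 1]);

    if (winner) {
        // Accumulate the whole run of equal cell indices, then issue a
        // single global atomic for the segment.
        float sum = 0.0f;
        for (int t = tid; t < BLOCK_SIZE && cells[t] == cells[tid]; ++t)
            sum += vals[t];
        atomicAdd(&hist[cells[tid]], sum);
    }
}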

If you know a better approach for choosing the winners, please do share! There must be some articles on the topic, even if not CUDA-specific.
