Cellular automata on GPU with WGSL
Question
I am writing a physics simulation that works like a cellular automaton. Each step depends on the previous one; more precisely, each cell needs its own state and the state of its direct neighbors to compute its new state. I am using two buffers that swap roles at each step (multiple reads / single write).
I am using WGSL (WebGPU). For the moment, I issue one dispatch per step (a whole-grid update, in other words t+1) to ensure synchronization between steps, but performance is quite slow. (EDIT: because I was not using workgroups properly)
Because I suspected that communication between the CPU and the GPU was the limiting factor (SPOILER ALERT: no, it is not), I tried to perform the steps in a loop directly in the shader, but I am unable to synchronize all workgroups between steps.
<s>I tried using storageBarrier and workgroupBarrier, which does not work (synchronization does not occur). Nonetheless, if I run only two successive steps with one barrier between them, performance improves by a factor of 2, meaning I am losing most of the time in the dispatch. And the result is almost perfect (meaning some synchronization did not happen, but it did not affect the result much).</s>
EDIT: the previous paragraph was a misunderstanding; the result of my test was misleading.
I read that it is impossible to synchronize all workgroups within a single dispatch under the current WGSL specification. But then I don't understand why workgroupBarrier and storageBarrier exist.
How can I force all workgroups to synchronize between each step of the cellular automaton?
More generally, I guess I am not the first person to write a cellular automaton on the GPU with this direct-neighbor dependency:
How do I write a fast cellular automaton on the GPU?
Answer 1
Score: 3
I'm not sure exactly how you're going about writing your program. I'm guessing you're using compute, and maybe you're trying to read from and write to the same buffer?
Cellular automata are usually coded using two buffers: one for the state from the last step (read-only) and one for the new state in the current step (write-only). Each invocation can read multiple values from the previous step and usually writes one value to the current buffer.
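To make the read-many / write-one pattern concrete, here is a minimal CPU-side sketch in TypeScript. It is only an illustration: the rule (Conway's Game of Life), the wrap-around borders, and the name step are my assumptions, not something taken from the question. Each iteration of the inner body plays the role of one shader invocation: it reads nine cells from prev and writes exactly one cell to next.

```typescript
// Hypothetical example rule: Conway's Game of Life on a grid with wrap-around edges.
// `prev` is only read and `next` is only written, so no cell can observe a
// half-updated neighbor, whatever order the cells are processed in.
function step(prev: Uint8Array, next: Uint8Array, width: number, height: number): void {
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      // Count the live cells among the 8 direct neighbors (many reads).
      let alive = 0;
      for (let dy = -1; dy <= 1; dy++) {
        for (let dx = -1; dx <= 1; dx++) {
          if (dx === 0 && dy === 0) continue;
          const nx = (x + dx + width) % width;
          const ny = (y + dy + height) % height;
          alive += prev[ny * width + nx];
        }
      }
      const self = prev[y * width + x];
      // Exactly one write per cell, always into the other buffer.
      next[y * width + x] = alive === 3 || (self === 1 && alive === 2) ? 1 : 0;
    }
  }
}
```

On the GPU the two loops over x and y would be replaced by the invocation id, but the buffer discipline stays the same.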
At the end of each step, you swap them. You should not need any barriers this way, and it can be implemented in either a graphics or a compute pipeline.
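On the host side, the swap can be done by preparing two bind groups up front and alternating between them, one dispatch per step. The sketch below is a rough illustration under several assumptions: ComputePassLike is a hypothetical stand-in that only mirrors the method names of WebGPU's compute pass encoder (setPipeline, setBindGroup, dispatchWorkgroups), encodeSteps is a made-up helper, and the 8x8 workgroup size is an arbitrary choice that would have to match the @workgroup_size declared in the shader.

```typescript
// Hypothetical stand-in mirroring the compute pass methods used below.
interface ComputePassLike<P, B> {
  setPipeline(pipeline: P): void;
  setBindGroup(index: number, bindGroup: B): void;
  dispatchWorkgroups(x: number, y?: number, z?: number): void;
}

// bindGroups[0] is assumed to read buffer A and write buffer B;
// bindGroups[1] is assumed to read buffer B and write buffer A.
// Returns which buffer holds the latest state once the pass has run.
function encodeSteps<P, B>(
  pass: ComputePassLike<P, B>,
  pipeline: P,
  bindGroups: [B, B],
  steps: number,
  width: number,
  height: number,
  workgroupSize: number = 8, // assumption: shader declares @workgroup_size(8, 8)
): "A" | "B" {
  // Dispatch enough workgroups to cover the grid; the shader is assumed to
  // bounds-check invocations that fall outside width x height.
  const wgX = Math.ceil(width / workgroupSize);
  const wgY = Math.ceil(height / workgroupSize);
  pass.setPipeline(pipeline);
  for (let i = 0; i < steps; i++) {
    pass.setBindGroup(0, bindGroups[i % 2]); // swap roles by alternating bind groups
    pass.dispatchWorkgroups(wgX, wgY);       // one dispatch per automaton step
  }
  return steps % 2 === 0 ? "A" : "B";
}
```

With a real device, pass would come from beginComputePass() on a command encoder, and many steps can be recorded this way before a single queue.submit, so the CPU does not have to round-trip per step. Dispatching ceil(width / 8) by ceil(height / 8) workgroups rather than one workgroup per cell is likely what the question's EDIT about workgroups refers to, though that is a guess on my part.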