February 14, 2023, 01:40:01

Device-wide synchronization in SYCL on NVIDIA GPUs

Context
I'm porting a complex CUDA application to SYCL. It uses multiple CUDA streams (cudaStream_t) to launch kernels and, in some cases, the default stream, which forces a device-wide synchronization.

Problem
CUDA streams map quite easily to in-order SYCL queues. However, at a device-wide synchronization point (i.e. cudaDeviceSynchronize()), I must explicitly wait on all the queues, because queue::wait() only waits for the commands submitted to that queue.

Question
Is there a way to wait on all the commands for a specific device, without having to explicitly call wait() on every queue?

Answer 1

Score: 1

In general, there are two ways you might be able to mimic this behavior in SYCL.

1. You can wait on every queue, as you suggest.
2. You can wait on all the events that comprise your CUDA stream using event::wait(const std::vector<event> &) or event::wait_and_throw(const std::vector<event> &).

The former is precisely what you suggest, but of course then you are waiting on the whole queue to empty. The second option allows you to wait just for the events to complete without waiting on the whole queue.
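
A minimal sketch of the second option, assuming you keep the event returned by each queue::submit call. EventTracker and its member names are hypothetical, not part of SYCL. The class is templated on the event type so that it compiles without a SYCL toolchain; with SYCL you would presumably instantiate it as EventTracker<sycl::event>. The only SYCL call it relies on is the static event::wait_and_throw(const std::vector<event> &) mentioned above.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: collects events and waits on all of them at a sync point.
// Event is expected to be sycl::event, or any type that offers the same static
// wait_and_throw(const std::vector<Event>&) member.
template <typename Event>
class EventTracker {
public:
    // Store the event returned by queue::submit (or a queue shortcut).
    void record(Event e) { pending_.push_back(std::move(e)); }

    // Block until every recorded event has completed, then forget them.
    void synchronize() {
        if (pending_.empty()) return;
        Event::wait_and_throw(pending_);
        pending_.clear();
    }

    // Number of events recorded since the last synchronize().
    std::size_t pending() const { return pending_.size(); }

private:
    std::vector<Event> pending_;
};
```

Usage would look like tracker.record(q.submit(...)) for each submission, followed by tracker.synchronize() where the CUDA code called cudaDeviceSynchronize(). With in-order queues it should be enough to record only the last event of each queue, since earlier commands in that queue finish before it.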

In either case, though, you do have to do some bookkeeping to ensure that you are waiting on each item you expect to complete before proceeding with your algorithm.

As Sri mentioned, you can use SYCLomatic. The way SYCLomatic translates this code is to create a function that loops over all the queues and waits on each one, as in the first option above.

Hopefully this helps. I wish it were a one-liner as well, but the abstractions are slightly different.
