2023年2月8日 20:02:25go评论89阅读模式

英文:

Reading global flag does not work for CPU>GPU data exchange in CUDA

问题

I try to do a simple uni-directional communication between a CPU and a K80 GPU using CUDA. I want to have a bool cancel flag that resides in global memory and is polled by all running GPU/kernel threads. The flag should default to false and can be set by a CPU/host thread to true during ongoing computation. The GPU/kernel threads then should exit.

This is what I tried. I have simplified code. I removed error checking and application logic (including the application logic that prevents concurrent access to cancelRequested).

On the host side, global definition (.cpp):

// Host side thread safety of this pointer is covered by application logic
volatile bool* cancelRequested = nullptr;

On the host side in the compute thread (.cpp):

initialize(&amp;cancelRequested);
compute(cancelRequested);
finalize(&amp;cancelRequested);

On the host side in a main thread (.cpp):

cancel(cancelRequested); // Called after init is finished

Host routines (.cu file):

void initialize(volatile bool** pCancelRequested)
{
   cudaMalloc(const_cast&lt;bool**&gt;(pCancelRequested), sizeof(bool));
   const bool aFalse = false;
   cudaMemcpy(*const_cast&lt;bool**&gt;(pCancelRequested), &amp;aFalse, sizeof(bool), cudaMemcpyHostToDevice);
}
void compute(volatile bool* pCancelRequested) 
{
   ....
   computeKernel&lt;&lt;&lt;pBlocksPerGPU, aThreadsPerBlock&gt;&gt;&gt;(pCancelRequested);
   cudaDeviceSynchronize(); // Non-busy wait
   ....
}
void finalize(volatile bool** pCancelRequested)
{
   cudaFree(*const_cast&lt;bool**&gt;(pCancelRequested));
   *pCancelRequested = nullptr;
}
void cancel(volatile bool* pCancelRequested)
{
   const bool aTrue = true;
   cudaMemcpy(const_cast&lt;bool*&gt;(pCancelRequested), &amp;aTrue, sizeof(bool), cudaMemcpyHostToDevice);
}

Device routines (.cu file):

__global__ void computeKernel(volatile bool* pCancelRequested)
{
   while (someCondition)
   {
      // Computation step here
      if (*pCancelRequested)
      {
         printf(&quot;-&gt; Cancel requested!\n&quot;);
         return;
      }
   }
}

The code runs fine. But it does never enter the cancel case. I read back the false and true values in initialize() and cancel() successfully and checked them using gdb. I.e. writing to the global flag works fine, at least from the host side viewpoint. However, the kernels never see the cancel flag set to true and exit normally from the outer while loop.

Any idea why this doesn't work?

英文:

This is what I tried. I have simplified code. I removed error checking and application logic (including the application logic that prevents concurrent access to cancelRequested).

On the host side, global definition (.cpp):

// Host side thread safety of this pointer is covered by application logic
volatile bool* cancelRequested = nullptr;

On the host side in the compute thread (.cpp):

initialize(&amp;cancelRequested);
compute(cancelRequested);
finalize(&amp;cancelRequested);

On the host side in a main thread (.cpp):

cancel(cancelRequested); // Called after init is finished

Host routines (.cu file):

void initialize(volatile bool** pCancelRequested)
{
   cudaMalloc(const_cast&lt;bool**&gt;(pCancelRequested), sizeof(bool));
   const bool aFalse = false;
   cudaMemcpy(*const_cast&lt;bool**&gt;(pCancelRequested), &amp;aFalse, sizeof(bool), cudaMemcpyHostToDevice);
}
void compute(volatile bool* pCancelRequested) 
{
   ....
   computeKernel&lt;&lt;&lt;pBlocksPerGPU, aThreadsPerBlock&gt;&gt;&gt;(pCancelRequested);
   cudaDeviceSynchronize(); // Non-busy wait
   ....
}
void finalize(volatile bool** pCancelRequested)
{
   cudaFree(*const_cast&lt;bool**&gt;(pCancelRequested));
   *pCancelRequested = nullptr;
}
void cancel(volatile bool* pCancelRequested)
{
   const bool aTrue = true;
   cudaMemcpy(const_cast&lt;bool*&gt;(pCancelRequested), &amp;aTrue, sizeof(bool), cudaMemcpyHostToDevice);
}

Device routines (.cu file):

__global__ void computeKernel(volatile bool* pCancelRequested)
{
   while (someCondition)
   {
      // Computation step here
      if (*pCancelRequested)
      {
         printf(&quot;-&gt; Cancel requested!\n&quot;);
         return;
      }
   }
}

The code runs fine. But it does never enter the cancel case. I read back the false and true values in initialize() and cancel() successfully and checked them using gdb. I.e. writing to the global flag works fine, at least from host side view point. However the kernels never see the cancel flag set to true and exit normally from the outer while loop.

Any idea why this doesn't work?

答案1

得分: 2

你的方法存在一个根本性问题，即CUDA流将阻止它正常工作。

CUDA流有两个基本原则：

发布到同一流中的项不会重叠，它们将串行执行。
发布到不同创建的流中的项有可能重叠，CUDA没有定义这些操作的顺序。

即使你不明确使用流，你仍在操作“默认流”上，并且相同的流语义适用。

我在这个简短的摘要中没有覆盖关于流的所有内容。你可以在这个在线培训系列的第7单元中了解更多关于CUDA流的信息。

由于CUDA流的存在，以下代码：

computeKernel<<<pBlocksPerGPU, aThreadsPerBlock>>>(pCancelRequested);

和

cudaMemcpy(const_cast<bool*>(pCancelRequested), &aTrue, sizeof(bool), cudaMemcpyHostToDevice);

不可能重叠（它们都发布到相同的“默认”CUDA流中，因此第1条规则表示它们不可能重叠）。但是如果你想要“通知”正在运行的内核，重叠是必要的。我们必须允许cudaMemcpy操作与内核同时进行。

我们可以通过直接应用CUDA流（注意第2条规则），将复制操作和计算（内核）操作放入不同的创建的流中，以允许它们重叠。当我们这样做时，事情按预期工作：

$ cat t2184.cu
#include <iostream>
#include <unistd.h>
__global__ void k(volatile int *flag){
  while (*flag != 0);
}
int main(){
  int *flag, *h_flag = new int;
  cudaStream_t s[2];
  cudaStreamCreate(s+0);
  cudaStreamCreate(s+1);
  cudaMalloc(&flag, sizeof(h_flag[0]));
  *h_flag = 1;
  cudaMemcpy(flag, h_flag, sizeof(h_flag[0]), cudaMemcpyHostToDevice);
  k<<<32, 256, 0, s[0]>>>(flag);
  sleep(5);
  *h_flag = 0;
  cudaMemcpyAsync(flag, h_flag, sizeof(h_flag[0]), cudaMemcpyHostToDevice, s[1]);
  cudaDeviceSynchronize();
}

$ nvcc -o t2184 t2184.cu
$ compute-sanitizer ./t2184
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors
$


注意：
- 虽然从静态文本输出中不明显，但程序在退出前大约花费5秒。如果你注释掉像`*h_flag = 0;`这样的行，程序将挂起，表明标志信号方法正常工作。
- 请注意使用`volatile`。这是[必要的](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#volatile-qualifier)，以告诉编译器对该数据的任何访问必须是实际访问，编译器不允许进行修改，以防止内存读取或写入发生在预期位置之外。
这种主机到设备的信号行为也可以实现，而无需明确使用流，而是使用主机固定内存作为信号位置，因为它对主机和设备代码都是“可见”的，可以“同时”访问。以下是一个示例：
```cpp
#include <iostream>
#include <unistd.h>
__global__ void k(volatile int *flag){
  while (*flag != 0);
}
int main(){
  int *flag;
  cudaHostAlloc(&flag, sizeof(flag[0]), cudaHostAllocDefault);
  *flag = 1;
  k<<<32, 256>>>(flag);
  sleep(5);
  *flag = 0;
  cudaDeviceSynchronize();
}

对于其他信号示例，例如从设备到主机的信号，其他读者可能对这个感兴趣。

英文:

The fundamental problem I see with your approach is that cuda streams will prevent it from working.

CUDA streams have two basic principles:

Items issued into the same stream will not overlap; they will serialize.
Items issued into separate created streams have the possibility to overlap; there is no defined ordering of those operations provided by CUDA.

Even if you don't explicitly use streams, you are operating in the "default stream" and the same stream semantics apply.

I'm not covering everything there is to know about streams in this brief summary. You can learn more about CUDA streams in unit 7 of this online training series

Because of CUDA streams, this:

 computeKernel&lt;&lt;&lt;pBlocksPerGPU, aThreadsPerBlock&gt;&gt;&gt;(pCancelRequested);

and this:

 cudaMemcpy(const_cast&lt;bool*&gt;(pCancelRequested), &amp;aTrue, sizeof(bool), cudaMemcpyHostToDevice);

could not possibly overlap (they are being issued into the same "default" CUDA stream, and so rule 1 above says that they cannot possibly overlap). But overlap is essential if you want to "signal" the running kernel. We must allow the cudaMemcpy operation to take place at the same time that the kernel is running.

We can fix this via a direct application of CUDA streams (taking note of rule 2 above), to put the copy operation and the compute (kernel) operation into separate created streams, so as to allow them to overlap. When we do that, things work as desired:

$ cat t2184.cu
#include &lt;iostream&gt;
#include &lt;unistd.h&gt;
__global__ void k(volatile int *flag){
  while (*flag != 0);
}
int main(){
  int *flag, *h_flag = new int;
  cudaStream_t s[2];
  cudaStreamCreate(s+0);
  cudaStreamCreate(s+1);
  cudaMalloc(&amp;flag, sizeof(h_flag[0]));
  *h_flag = 1;
  cudaMemcpy(flag, h_flag, sizeof(h_flag[0]), cudaMemcpyHostToDevice);
  k&lt;&lt;&lt;32, 256, 0, s[0]&gt;&gt;&gt;(flag);
  sleep(5);
  *h_flag = 0;
  cudaMemcpyAsync(flag, h_flag, sizeof(h_flag[0]), cudaMemcpyHostToDevice, s[1]);
  cudaDeviceSynchronize();
}
$ nvcc -o t2184 t2184.cu
$ compute-sanitizer ./t2184
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors
$

NOTES:

Although not evident from the static text printout, the program spends approximately 5 seconds before exiting. If you comment out a line such as *h_flag = 0; then the program will hang, indicating that the flag signal method is working correctly.
Note the use of volatile. This is necessary to instruct the compiler that any access to that data must be an actual access, the compiler is not allowed to make modifications that would prevent a memory read or write from occurring at the expected location.

This kind of host->device signal behavior can also be realized without explicit use of streams, but with host pinned memory as the signalling location, since it is "visible" to both host and device code, "simultaneously". Here is an example:

#include &lt;iostream&gt;
#include &lt;unistd.h&gt;
__global__ void k(volatile int *flag){
  while (*flag != 0);
}
int main(){
  int *flag;
  cudaHostAlloc(&amp;flag, sizeof(flag[0]), cudaHostAllocDefault);
  *flag = 1;
  k&lt;&lt;&lt;32, 256&gt;&gt;&gt;(flag);
  sleep(5);
  *flag = 0;
  cudaDeviceSynchronize();
}

For other examples of signalling, such as from device to host, other readers may be interested in this.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Reading global flag does not work for CPU>GPU data exchange in CUDA

问题

答案1

Am I doing something wrong or is this a bug in Go's C compiler?

函数定义是否存在于头文件中？

如何在C中使一个函数返回一个左值？

Coarse-grained multithreading和Fine-grained multithreading的管道约束为什么不同？

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。