Reading global flag does not work for CPU>GPU data exchange in CUDA

huangapple go评论58阅读模式
英文:

Reading global flag does not work for CPU>GPU data exchange in CUDA

问题

I try to do a simple uni-directional communication between a CPU and a K80 GPU using CUDA. I want to have a bool cancel flag that resides in global memory and is polled by all running GPU/kernel threads. The flag should default to false and can be set by a CPU/host thread to true during ongoing computation. The GPU/kernel threads then should exit.

This is what I tried. I have simplified code. I removed error checking and application logic (including the application logic that prevents concurrent access to cancelRequested).

On the host side, global definition (.cpp):

// Host side thread safety of this pointer is covered by application logic
volatile bool* cancelRequested = nullptr; 

On the host side in the compute thread (.cpp):

initialize(&cancelRequested);
compute(cancelRequested);
finalize(&cancelRequested);

On the host side in a main thread (.cpp):

cancel(cancelRequested); // Called after init is finished

Host routines (.cu file):

void initialize(volatile bool** pCancelRequested)
{
   cudaMalloc(const_cast<bool**>(pCancelRequested), sizeof(bool));
   const bool aFalse = false;
   cudaMemcpy(*const_cast<bool**>(pCancelRequested), &aFalse, sizeof(bool), cudaMemcpyHostToDevice);
}

void compute(volatile bool* pCancelRequested) 
{
   ....
   computeKernel<<<pBlocksPerGPU, aThreadsPerBlock>>>(pCancelRequested);
   cudaDeviceSynchronize(); // Non-busy wait
   ....
}

void finalize(volatile bool** pCancelRequested)
{
   cudaFree(*const_cast<bool**>(pCancelRequested));
   *pCancelRequested = nullptr;
}

void cancel(volatile bool* pCancelRequested)
{
   const bool aTrue = true;
   cudaMemcpy(const_cast<bool*>(pCancelRequested), &aTrue, sizeof(bool), cudaMemcpyHostToDevice);
}

Device routines (.cu file):

__global__ void computeKernel(volatile bool* pCancelRequested)
{
   while (someCondition)
   {
      // Computation step here

      if (*pCancelRequested)
      {
         printf("-> Cancel requested!\n");
         return;
      }
   }
}

The code runs fine. But it does never enter the cancel case. I read back the false and true values in initialize() and cancel() successfully and checked them using gdb. I.e. writing to the global flag works fine, at least from the host side viewpoint. However, the kernels never see the cancel flag set to true and exit normally from the outer while loop.

Any idea why this doesn't work?

英文:

I try to do a simple uni-directional communication between a CPU and a K80 GPU using CUDA. I want to have a bool cancel flag that resides in global memory and is polled by all running GPU/kernel threads. The flag should default to false and can be set by a CPU/host thread to true during ongoing computation. The GPU/kernel threads then should exit.

This is what I tried. I have simplified code. I removed error checking and application logic (including the application logic that prevents concurrent access to cancelRequested).

On the host side, global definition (.cpp):

// Host side thread safety of this pointer is covered by application logic
volatile bool* cancelRequested = nullptr; 

On the host side in the compute thread (.cpp):

initialize(&cancelRequested);
compute(cancelRequested);
finalize(&cancelRequested);

On the host side in a main thread (.cpp):

cancel(cancelRequested); // Called after init is finished

Host routines (.cu file):

void initialize(volatile bool** pCancelRequested)
{
   cudaMalloc(const_cast<bool**>(pCancelRequested), sizeof(bool));
   const bool aFalse = false;
   cudaMemcpy(*const_cast<bool**>(pCancelRequested), &aFalse, sizeof(bool), cudaMemcpyHostToDevice);
}

void compute(volatile bool* pCancelRequested) 
{
   ....
   computeKernel<<<pBlocksPerGPU, aThreadsPerBlock>>>(pCancelRequested);
   cudaDeviceSynchronize(); // Non-busy wait
   ....
}

void finalize(volatile bool** pCancelRequested)
{
   cudaFree(*const_cast<bool**>(pCancelRequested));
   *pCancelRequested = nullptr;
}

void cancel(volatile bool* pCancelRequested)
{
   const bool aTrue = true;
   cudaMemcpy(const_cast<bool*>(pCancelRequested), &aTrue, sizeof(bool), cudaMemcpyHostToDevice);
}

Device routines (.cu file):

__global__ void computeKernel(volatile bool* pCancelRequested)
{
   while (someCondition)
   {
      // Computation step here

      if (*pCancelRequested)
      {
         printf("-> Cancel requested!\n");
         return;
      }
   }
}

The code runs fine. But it does never enter the cancel case. I read back the false and true values in initialize() and cancel() successfully and checked them using gdb. I.e. writing to the global flag works fine, at least from host side view point. However the kernels never see the cancel flag set to true and exit normally from the outer while loop.

Any idea why this doesn't work?

答案1

得分: 2

你的方法存在一个根本性问题,即CUDA流将阻止它正常工作。

CUDA流有两个基本原则:

  1. 发布到同一流中的项不会重叠,它们将串行执行。
  2. 发布到不同创建的流中的项有可能重叠,CUDA没有定义这些操作的顺序。

即使你不明确使用流,你仍在操作“默认流”上,并且相同的流语义适用。

我在这个简短的摘要中没有覆盖关于流的所有内容。你可以在这个在线培训系列的第7单元中了解更多关于CUDA流的信息。

由于CUDA流的存在,以下代码:

computeKernel<<<pBlocksPerGPU, aThreadsPerBlock>>>(pCancelRequested);

cudaMemcpy(const_cast<bool*>(pCancelRequested), &aTrue, sizeof(bool), cudaMemcpyHostToDevice);

不可能重叠(它们都发布到相同的“默认”CUDA流中,因此第1条规则表示它们不可能重叠)。但是如果你想要“通知”正在运行的内核,重叠是必要的。我们必须允许cudaMemcpy操作与内核同时进行。

我们可以通过直接应用CUDA流(注意第2条规则),将复制操作和计算(内核)操作放入不同的创建的流中,以允许它们重叠。当我们这样做时,事情按预期工作:

$ cat t2184.cu
#include <iostream>
#include <unistd.h>

__global__ void k(volatile int *flag){

  while (*flag != 0);
}

int main(){

  int *flag, *h_flag = new int;
  cudaStream_t s[2];
  cudaStreamCreate(s+0);
  cudaStreamCreate(s+1);
  cudaMalloc(&flag, sizeof(h_flag[0]));
  *h_flag = 1;
  cudaMemcpy(flag, h_flag, sizeof(h_flag[0]), cudaMemcpyHostToDevice);
  k<<<32, 256, 0, s[0]>>>(flag);
  sleep(5);
  *h_flag = 0;
  cudaMemcpyAsync(flag, h_flag, sizeof(h_flag[0]), cudaMemcpyHostToDevice, s[1]);
  cudaDeviceSynchronize();
}

$ nvcc -o t2184 t2184.cu
$ compute-sanitizer ./t2184
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors
$


注意:

- 虽然从静态文本输出中不明显,但程序在退出前大约花费5秒。如果你注释掉像`*h_flag = 0;`这样的行,程序将挂起,表明标志信号方法正常工作。
- 请注意使用`volatile`。这是[必要的](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#volatile-qualifier),以告诉编译器对该数据的任何访问必须是实际访问,编译器不允许进行修改,以防止内存读取或写入发生在预期位置之外。

这种主机到设备的信号行为也可以实现,而无需明确使用流,而是使用主机固定内存作为信号位置,因为它对主机和设备代码都是“可见”的,可以“同时”访问。以下是一个示例:

```cpp
#include <iostream>
#include <unistd.h>

__global__ void k(volatile int *flag){

  while (*flag != 0);
}

int main(){

  int *flag;
  cudaHostAlloc(&flag, sizeof(flag[0]), cudaHostAllocDefault);
  *flag = 1;
  k<<<32, 256>>>(flag);
  sleep(5);
  *flag = 0;
  cudaDeviceSynchronize();
}

对于其他信号示例,例如从设备到主机的信号,其他读者可能对这个感兴趣。

英文:

The fundamental problem I see with your approach is that cuda streams will prevent it from working.

CUDA streams have two basic principles:

  1. Items issued into the same stream will not overlap; they will serialize.
  2. Items issued into separate created streams have the possibility to overlap; there is no defined ordering of those operations provided by CUDA.

Even if you don't explicitly use streams, you are operating in the "default stream" and the same stream semantics apply.

I'm not covering everything there is to know about streams in this brief summary. You can learn more about CUDA streams in unit 7 of this online training series

Because of CUDA streams, this:

 computeKernel&lt;&lt;&lt;pBlocksPerGPU, aThreadsPerBlock&gt;&gt;&gt;(pCancelRequested);

and this:

 cudaMemcpy(const_cast&lt;bool*&gt;(pCancelRequested), &amp;aTrue, sizeof(bool), cudaMemcpyHostToDevice);

could not possibly overlap (they are being issued into the same "default" CUDA stream, and so rule 1 above says that they cannot possibly overlap). But overlap is essential if you want to "signal" the running kernel. We must allow the cudaMemcpy operation to take place at the same time that the kernel is running.

We can fix this via a direct application of CUDA streams (taking note of rule 2 above), to put the copy operation and the compute (kernel) operation into separate created streams, so as to allow them to overlap. When we do that, things work as desired:

$ cat t2184.cu
#include &lt;iostream&gt;
#include &lt;unistd.h&gt;

__global__ void k(volatile int *flag){

  while (*flag != 0);
}

int main(){

  int *flag, *h_flag = new int;
  cudaStream_t s[2];
  cudaStreamCreate(s+0);
  cudaStreamCreate(s+1);
  cudaMalloc(&amp;flag, sizeof(h_flag[0]));
  *h_flag = 1;
  cudaMemcpy(flag, h_flag, sizeof(h_flag[0]), cudaMemcpyHostToDevice);
  k&lt;&lt;&lt;32, 256, 0, s[0]&gt;&gt;&gt;(flag);
  sleep(5);
  *h_flag = 0;
  cudaMemcpyAsync(flag, h_flag, sizeof(h_flag[0]), cudaMemcpyHostToDevice, s[1]);
  cudaDeviceSynchronize();
}

$ nvcc -o t2184 t2184.cu
$ compute-sanitizer ./t2184
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors
$

NOTES:

  • Although not evident from the static text printout, the program spends approximately 5 seconds before exiting. If you comment out a line such as *h_flag = 0; then the program will hang, indicating that the flag signal method is working correctly.
  • Note the use of volatile. This is necessary to instruct the compiler that any access to that data must be an actual access, the compiler is not allowed to make modifications that would prevent a memory read or write from occurring at the expected location.

This kind of host->device signal behavior can also be realized without explicit use of streams, but with host pinned memory as the signalling location, since it is "visible" to both host and device code, "simultaneously". Here is an example:

#include &lt;iostream&gt;
#include &lt;unistd.h&gt;

__global__ void k(volatile int *flag){

  while (*flag != 0);
}

int main(){

  int *flag;
  cudaHostAlloc(&amp;flag, sizeof(flag[0]), cudaHostAllocDefault);
  *flag = 1;
  k&lt;&lt;&lt;32, 256&gt;&gt;&gt;(flag);
  sleep(5);
  *flag = 0;
  cudaDeviceSynchronize();
}

For other examples of signalling, such as from device to host, other readers may be interested in this.

huangapple
  • 本文由 发表于 2023年2月8日 20:02:25
  • 转载请务必保留本文链接:https://go.coder-hub.com/75385530.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定