August 11, 2023 03:19:27

CUDA program printing fewer values than expected

Question

This experimental CUDA program is supposed to print 16 values.

However, it prints only 12.

What could be the reason?

#include <cuda_runtime.h>
#include <stdio.h>
#define IDX blockIdx.x * blockDim.x + threadIdx.x
#define IDY blockIdx.y * blockDim.y + threadIdx.y
#define IDZ blockIdx.z * blockDim.z + threadIdx.z
#define WIDTH  4
#define LENGTH 4
#define HEIGHT 1
#define TENSOR_LENGTH WIDTH*LENGTH*HEIGHT
__global__ void printTensor(int *tensor, int width, int length, int height)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    int idz = blockIdx.z * blockDim.z + threadIdx.z;
    if (idx < width && idy < length && idz < height)
    {
        printf("%d ", tensor[idz * width * length + idy * width + idx]);
    }
}
int main()
{
  int tensor[WIDTH*LENGTH*HEIGHT];
  // Initialize the tensor on the host (CPU)
  for (int i = 0; i < TENSOR_LENGTH; i++) {
      tensor[i] = i;
  }
  for (int i = 0; i < TENSOR_LENGTH; i++) {
      printf("%d  ", tensor[i]);
  }
  printf("\n\n");
  // Allocate memory on GPU
  int *dev_tensor;
  cudaMalloc((void**)&dev_tensor, WIDTH*LENGTH*HEIGHT * sizeof(int));
  // Copy tensor from CPU to GPU
  cudaMemcpy(dev_tensor, tensor, WIDTH*LENGTH*HEIGHT * sizeof(int), cudaMemcpyHostToDevice);
  // Setup grid and block sizes
  dim3 grid(3,5,7);
  dim3 block(1,1,1);
  // Launch CUDA kernel
  printTensor<<<grid, block>>>(dev_tensor, WIDTH, LENGTH, HEIGHT);
  // Free memory
  cudaFree(dev_tensor);
}

Answer 1

Score: 1

Your kernel is effectively this:

__global__ void printTensor(int *tensor, int width, int length, int height)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    int idz = blockIdx.z * blockDim.z + threadIdx.z;
    if (idx < 4 && idy < 4 && idz < 1)
    {
        printf("%d ", tensor[idz * width * length + idy * width + idx]);
    }
}

You launch the kernel using:

dim3 grid(3,5,7);
dim3 block(1,1,1);

That is, you have launched 105 blocks (3 × 5 × 7), each with 1 thread, with block indices running from (0,0,0) to (2,4,6). Because every block holds a single thread, threadIdx is always (0,0,0) and blockDim is (1,1,1), so idx, idy and idz are simply blockIdx.x, blockIdx.y and blockIdx.z. Given those index calculations, it is easy to verify for yourself that only the threads in blocks with (x,y,z)

(0,0,0) (1,0,0) (2,0,0)
(0,1,0) (1,1,0) (2,1,0)
(0,2,0) (1,2,0) (2,2,0)
(0,3,0) (1,3,0) (2,3,0)

satisfy (idx < 4 && idy < 4 && idz < 1): idx never reaches 3 because the grid has only 3 blocks in x. If you count those blocks, you should expect to see 12 values printed, provided your program takes steps to correctly flush the CUDA kernel printf buffer.
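The count above can be checked without a GPU. The following is a minimal host-only sketch (plain C++, which nvcc also accepts) that walks the same block and thread indices a launch would create and counts how many pass the kernel's bounds check. The names Dim3, countPrintingThreads, blocksNeeded and coveringGrid are hypothetical helpers introduced for illustration; they are not part of the CUDA API or of the original program.

```cpp
// Host-only sketch: enumerate every (block, thread) pair a launch would
// create and count how many pass the kernel's bounds check.
struct Dim3 { int x, y, z; };  // hypothetical stand-in for CUDA's dim3

// Number of threads whose global index lies inside the tensor.
int countPrintingThreads(Dim3 grid, Dim3 block, int width, int length, int height)
{
    int count = 0;
    for (int bz = 0; bz < grid.z; ++bz)
    for (int by = 0; by < grid.y; ++by)
    for (int bx = 0; bx < grid.x; ++bx)
    for (int tz = 0; tz < block.z; ++tz)
    for (int ty = 0; ty < block.y; ++ty)
    for (int tx = 0; tx < block.x; ++tx) {
        // Same index arithmetic as the printTensor kernel.
        int idx = bx * block.x + tx;
        int idy = by * block.y + ty;
        int idz = bz * block.z + tz;
        if (idx < width && idy < length && idz < height) ++count;
    }
    return count;
}

// Ceiling division: smallest number of blocks of size `block` covering `n`.
int blocksNeeded(int n, int block) { return (n + block - 1) / block; }

// Grid that covers a width x length x height tensor for a given block shape.
Dim3 coveringGrid(Dim3 block, int width, int length, int height)
{
    return Dim3{ blocksNeeded(width, block.x),
                 blocksNeeded(length, block.y),
                 blocksNeeded(height, block.z) };
}
```

Under these assumptions, countPrintingThreads(Dim3{3,5,7}, Dim3{1,1,1}, 4, 4, 1) gives 12, matching the table of blocks above, while the grid returned by coveringGrid(Dim3{1,1,1}, 4, 4, 1) is (4,4,1) and gives 16. In the actual program that would correspond to launching with dim3 grid(4,4,1), and calling cudaDeviceSynchronize() after the launch is one way to make sure the device printf buffer is flushed before the program exits.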
