CUDA program printing fewer values than expected

Question

This experimental CUDA program is supposed to print 16 values.

However, it prints only 12 values.

What could be the reason?

#include <cuda_runtime.h>
#include <stdio.h>

#define IDX blockIdx.x * blockDim.x + threadIdx.x
#define IDY blockIdx.y * blockDim.y + threadIdx.y
#define IDZ blockIdx.z * blockDim.z + threadIdx.z

#define WIDTH  4
#define LENGTH 4
#define HEIGHT 1

#define TENSOR_LENGTH WIDTH*LENGTH*HEIGHT


__global__ void printTensor(int *tensor, int width, int length, int height)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    int idz = blockIdx.z * blockDim.z + threadIdx.z;

    if (idx < width && idy < length && idz < height)
    {
        printf("%d ", tensor[idz * width * length + idy * width + idx]);
    }
}


int main()
{
  int tensor[WIDTH*LENGTH*HEIGHT];

  // Initialize the tensor on the host (CPU)
  for (int i = 0; i < TENSOR_LENGTH; i++) {
      tensor[i] = i;
  }

  for (int i = 0; i < TENSOR_LENGTH; i++) {
      printf("%d  ", tensor[i]);
  }

  printf("\n\n");

  // Allocate memory on GPU
  int *dev_tensor;
  cudaMalloc((void**)&dev_tensor, WIDTH*LENGTH*HEIGHT * sizeof(int));

  // Copy tensor from CPU to GPU
  cudaMemcpy(dev_tensor, tensor, WIDTH*LENGTH*HEIGHT * sizeof(int), cudaMemcpyHostToDevice);

  // Setup grid and block sizes
  dim3 grid(3,5,7);
  dim3 block(1,1,1);

  // Launch CUDA kernel
  printTensor<<<grid, block>>>(dev_tensor, WIDTH, LENGTH, HEIGHT);

  // Free memory
  cudaFree(dev_tensor);
}

Answer 1

Score: 1

Your kernel is effectively this:

__global__ void printTensor(int *tensor, int width, int length, int height)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    int idz = blockIdx.z * blockDim.z + threadIdx.z;

    if (idx < 4 && idy < 4 && idz < 1)
    {
        printf("%d ", tensor[idz * width * length + idy * width + idx]);
    }
}

You launch the kernel using:

dim3 grid(3,5,7);
dim3 block(1,1,1);

That is, you launch 3 × 5 × 7 = 105 blocks of one thread each, with block indices running from (0,0,0) to (2,4,6). Because each block holds a single thread, idx, idy and idz are simply blockIdx.x, blockIdx.y and blockIdx.z, so idx only takes the values 0 to 2. Working through the index calculations in your code, only the threads in blocks with (x,y,z)

(0,0,0) (1,0,0) (2,0,0)
(0,1,0) (1,1,0) (2,1,0)
(0,2,0) (1,2,0) (2,2,0)
(0,3,0) (1,3,0) (2,3,0)

satisfy (idx < 4 && idy < 4 && idz < 1). Counting those blocks gives 12, so you should expect 12 values of output (the four elements with idx == 3 are never reached), provided your program takes steps to correctly flush the CUDA kernel printf buffer.
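The counting argument above can be checked on the host without a GPU. The following is a minimal sketch in plain C++ (which nvcc also accepts as host code); `Dim3`, `ceilDiv`, `countPrintingThreads` and `coveringGrid` are hypothetical helper names introduced only for this illustration and are not part of the CUDA API or the original program.

```cpp
#include <cassert>

// Hypothetical host-side stand-in for CUDA's dim3 (not the CUDA type itself).
struct Dim3 { unsigned x, y, z; };

// Ceiling division: how many blocks of blockSize are needed to cover n elements.
unsigned ceilDiv(unsigned n, unsigned blockSize) {
    assert(blockSize > 0);
    return (n + blockSize - 1) / blockSize;
}

// Count the threads of a launch that would pass the kernel's guard
// (idx < width && idy < length && idz < height).
// Each loop variable plays the role of blockIdx * blockDim + threadIdx.
unsigned countPrintingThreads(Dim3 grid, Dim3 block,
                              unsigned width, unsigned length, unsigned height) {
    unsigned count = 0;
    for (unsigned idz = 0; idz < grid.z * block.z; ++idz)
        for (unsigned idy = 0; idy < grid.y * block.y; ++idy)
            for (unsigned idx = 0; idx < grid.x * block.x; ++idx)
                if (idx < width && idy < length && idz < height)
                    ++count;
    return count;
}

// Smallest grid that covers a width x length x height tensor for a given block size.
Dim3 coveringGrid(Dim3 block, unsigned width, unsigned length, unsigned height) {
    return { ceilDiv(width, block.x),
             ceilDiv(length, block.y),
             ceilDiv(height, block.z) };
}
```

With these helpers, `countPrintingThreads({3, 5, 7}, {1, 1, 1}, 4, 4, 1)` returns 12, while the grid from `coveringGrid({1, 1, 1}, 4, 4, 1)` is (4, 4, 1) and yields 16. In the original program that would correspond to launching with `dim3 grid(4, 4, 1);`. Adding a `cudaDeviceSynchronize()` call after the launch is also worth considering, so that the device printf buffer is flushed before the program exits.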

huangapple
  • Published on August 11, 2023, 03:19:27
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76878753.html
