CUDA program printing fewer values than expected
Question
This experimental CUDA program is supposed to print 16 values.
However, it prints only 12.
What could be the reason?
#include <cuda_runtime.h>
#include <stdio.h>

#define IDX blockIdx.x * blockDim.x + threadIdx.x
#define IDY blockIdx.y * blockDim.y + threadIdx.y
#define IDZ blockIdx.z * blockDim.z + threadIdx.z
#define WIDTH 4
#define LENGTH 4
#define HEIGHT 1
#define TENSOR_LENGTH WIDTH*LENGTH*HEIGHT

__global__ void printTensor(int *tensor, int width, int length, int height)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    int idz = blockIdx.z * blockDim.z + threadIdx.z;
    if (idx < width && idy < length && idz < height)
    {
        printf("%d ", tensor[idz * width * length + idy * width + idx]);
    }
}

int main()
{
    int tensor[WIDTH*LENGTH*HEIGHT];
    // Initialize the tensor on the host (CPU)
    for (int i = 0; i < TENSOR_LENGTH; i++) {
        tensor[i] = i;
    }
    for (int i = 0; i < TENSOR_LENGTH; i++) {
        printf("%d ", tensor[i]);
    }
    printf("\n\n");
    // Allocate memory on GPU
    int *dev_tensor;
    cudaMalloc((void**)&dev_tensor, WIDTH*LENGTH*HEIGHT * sizeof(int));
    // Copy tensor from CPU to GPU
    cudaMemcpy(dev_tensor, tensor, WIDTH*LENGTH*HEIGHT * sizeof(int), cudaMemcpyHostToDevice);
    // Setup grid and block sizes
    dim3 grid(3,5,7);
    dim3 block(1,1,1);
    // Launch CUDA kernel
    printTensor<<<grid, block>>>(dev_tensor, WIDTH, LENGTH, HEIGHT);
    // Free memory
    cudaFree(dev_tensor);
}
Answer 1
Score: 1
Your kernel is effectively this:
__global__ void printTensor(int *tensor, int width, int length, int height)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    int idz = blockIdx.z * blockDim.z + threadIdx.z;
    if (idx < 4 && idy < 4 && idz < 1)
    {
        printf("%d ", tensor[idz * width * length + idy * width + idx]);
    }
}
You launch the kernel using:
dim3 grid(3,5,7);
dim3 block(1,1,1);
i.e. you have launched 105 blocks, each with 1 thread, numbered from (0,0,0) to (2,4,6) in the order that you should be familiar with. Given the index calculations in your code, it is trivial to prove to yourself that only the threads in the blocks with these (x,y,z) indices
(0,0,0) (1,0,0) (2,0,0)
(0,1,0) (1,1,0) (2,1,0)
(0,2,0) (1,2,0) (2,2,0)
(0,3,0) (1,3,0) (2,3,0)
satisfy (idx < 4 && idy < 4 && idz < 1). If you count those blocks, you should expect to see 12 values printed, provided your program takes steps to correctly flush the CUDA kernel printf buffer. The grid has only 3 blocks in x, so idx never reaches 3 and the fourth column of the tensor (values 3, 7, 11 and 15) is never printed.
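The counting argument above can be checked on the host without a GPU. The following is a minimal sketch, not the asker's code: Shape3, countPrintingThreads and ceilDiv are hypothetical names introduced here for illustration, with Shape3 standing in for CUDA's dim3. It walks every global thread coordinate that a launch would produce and counts the ones that pass the kernel's bounds check.

```cpp
// Hypothetical stand-in for CUDA's dim3, so this compiles as plain C++.
struct Shape3 { int x, y, z; };

// Count the global thread coordinates that pass the kernel's guard
// (idx < width && idy < length && idz < height) for a given launch shape.
// Along each axis the global index runs from 0 to grid * block - 1.
int countPrintingThreads(Shape3 grid, Shape3 block, int width, int length, int height)
{
    int count = 0;
    for (int idz = 0; idz < grid.z * block.z; ++idz)
        for (int idy = 0; idy < grid.y * block.y; ++idy)
            for (int idx = 0; idx < grid.x * block.x; ++idx)
                if (idx < width && idy < length && idz < height)
                    ++count;
    return count;
}

// Ceiling division: the smallest number of blocks of size blockSize
// that covers n elements along one axis.
int ceilDiv(int n, int blockSize)
{
    return (n + blockSize - 1) / blockSize;
}
```

With the original launch, countPrintingThreads({3, 5, 7}, {1, 1, 1}, 4, 4, 1) returns 12. With a grid sized by ceilDiv along each axis, which is (4,4,1) for this tensor and block shape, it returns 16. In the real program that would correspond to dim3 grid(4,4,1). Calling cudaDeviceSynchronize() after the launch is one documented way to make sure the device printf buffer is flushed before the program exits.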