What is the proper way to print an FP16 element in a CUDA kernel without casting it to other formats?
Question
I wrote a simple kernel that uses printf to log some values after computation. It works perfectly with types such as int, but not with FP16, for which it incorrectly prints zero. The problem seems to be an incorrect format specifier. Which format specifier should I use, without casting the value to another data type such as float or double?
#include <stdio.h>
#include <cuda_fp16.h>
#define DTYPE half //int
#define PRINT_FMT "output %f\n" //"output %d\n"
__global__ void __launch_bounds__(1024) test_print_kernel(DTYPE *__restrict__ O)
{
    // printf("test kernel\n");
    if (((int)blockIdx.x == 0) && ((int)threadIdx.x == 0))
    {
        O[0] = ((DTYPE)(2));
        __syncthreads();
        printf(PRINT_FMT, O[0]);
    }
}
int main(int argc, char **argv)
{
    DTYPE *h_O;
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
    cudaMallocHost(&h_O, 1 * sizeof(DTYPE));
    test_print_kernel<<<dim3(1, 1, 1), dim3(1, 1, 1), 0, (cudaStream_t)stream>>>(h_O);
    cudaDeviceSynchronize();
}
The warnings I got during compilation:
test_printf.cu(16): warning #1290-D: a class type that is not trivially copyable passed through ellipsis
test_printf.cu(16): warning #181-D: argument is incompatible with corresponding format string conversion
And it prints to the terminal:
output 0.000000
Answer 1
Score: 1
There is no output format specifier for half precision floating point values in either the C++ standard definition of printf, or in the CUDA implementation.
Your only real choice is to convert the half value to a float and use the standard float format specifier. The CUDA Math API has an intrinsic function for this, so something like:
printf("output %f\n", __half2float(O[0]));
should work at a minimum. As pointed out in comments, there is a performance penalty for the conversion, but this is printf, which has a huge performance penalty anyway.
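As a minimal sketch, the question's kernel with that conversion applied might look like the following. The host wrapper run_test_print is a hypothetical helper added only so the result can be checked, error checking is omitted, and the sketch assumes a system with unified virtual addressing so that the pinned pointer from cudaMallocHost can be used directly in the kernel, as the question's code already does.

```cuda
#include <cstdio>
#include <cuda_fp16.h>

// Same kernel as in the question, except that the half value is widened
// to float before it is handed to printf.
__global__ void test_print_kernel(half *__restrict__ O)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
    {
        O[0] = __float2half(2.0f);
        // %f expects a double; the float returned by __half2float is
        // promoted to double by the variadic call.
        printf("output %f\n", __half2float(O[0]));  // prints "output 2.000000"
    }
}

// Hypothetical host wrapper (not in the original post): launches the kernel
// and returns the stored value as a float.
float run_test_print()
{
    half *h_O = nullptr;
    cudaMallocHost(&h_O, sizeof(half));  // pinned host memory, as in the question
    test_print_kernel<<<1, 1>>>(h_O);
    cudaDeviceSynchronize();             // wait for the kernel; also flushes device printf output
    float result = __half2float(h_O[0]);
    cudaFreeHost(h_O);
    return result;
}
```

With the conversion in place, both compiler warnings from the question should go away, since printf now receives a plain float rather than a class type.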
Note that this will make the kernel specific to the half case. If you genuinely need a generic kernel for different types with output, there are variadic template solutions which you could use to generalise this to different types, for example as discussed here.
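The linked discussion is not reproduced here. As a rough sketch of the variadic template idea, assuming device-side printf accepts an expanded parameter pack like any other argument list, a wrapper could widen half arguments and pass everything else through unchanged. The names to_printable, safe_printf and demo_kernel are hypothetical and not from the original answer.

```cuda
#include <cstdio>
#include <cuda_fp16.h>

// Pass-through for types that printf already understands (int, float, ...).
template <typename T>
__host__ __device__ inline T to_printable(T v)
{
    return v;
}

// Overload for half: widen to float. Being a non-template function, it is
// preferred over the template above when the argument is a half.
__host__ __device__ inline float to_printable(half v)
{
    return __half2float(v);
}

// Variadic wrapper: every argument goes through to_printable before it
// reaches printf, so half values never cross the ellipsis directly.
template <typename... Args>
__device__ void safe_printf(const char *fmt, const Args &...args)
{
    printf(fmt, to_printable(args)...);
}

// Small demonstration kernel mixing a half and an int argument.
__global__ void demo_kernel()
{
    half h = __float2half(2.0f);
    int i = 7;
    safe_printf("half %f, int %d\n", h, i);
}
```

A launch such as demo_kernel<<<1, 1>>>() followed by cudaDeviceSynchronize() would exercise it. The format string still has to match each argument type, so a generic kernel would still need something like the question's PRINT_FMT macro to select %f or %d.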