What is the proper way to print an FP16 element in a CUDA kernel without casting it to other formats?

Question

I wrote a simple kernel that uses printf to log some values after computation. It works perfectly with types such as int, but not with fp16: it incorrectly prints a zero value. The problem seems to be that the format specifier is not correct. Which format should I use, without casting the value to another data type such as float or double?


#include <stdio.h>
#include <cuda_fp16.h>

#define DTYPE half //int
#define PRINT_FMT "output %f\n" //"output %d\n"

__global__ void __launch_bounds__(1024) test_print_kernel(DTYPE *__restrict__ O)
{
    // printf("test kernel\n");
    if (((int)blockIdx.x == 0) && ((int)threadIdx.x == 0))
    {
        O[0] = ((DTYPE)(2));
        __syncthreads();

        printf(PRINT_FMT, O[0]);

    }
}

int main(int argc, char **argv)
{
    DTYPE *h_O;
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
    cudaMallocHost(&h_O, 1 * sizeof(DTYPE));
    test_print_kernel<<<dim3(1, 1, 1), dim3(1, 1, 1), 0, (cudaStream_t)stream>>>(h_O);
    cudaDeviceSynchronize();
}

The warning I got during compilation:
test_printf.cu(16): warning #1290-D: a class type that is not trivially copyable passed through ellipsis

test_printf.cu(16): warning #181-D: argument is incompatible with corresponding format string conversion

And it prints to the terminal:
output 0.000000

Answer 1

Score: 1

There is no output format specifier for half precision floating point values in either the C++ standard definition of printf, or in the CUDA implementation.

Your only real choice is to convert the half value to a float and use the standard float format specifier. The CUDA Math API has an intrinsic function for this, so something like:

printf("output %f\n", __half2float(O[0]));

should work at a minimum. As pointed out in comments, there is a performance penalty for the conversion, but this is printf, which has a huge performance penalty anyway.
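
For completeness, here is a minimal self-contained sketch of that fix. The kernel name print_half_kernel and the plain cudaMalloc allocation are illustrative choices of mine, not part of the question's code, and I am assuming a toolkit where __float2half and __half2float are available from cuda_fp16.h:

```cuda
#include <cstdio>
#include <cuda_fp16.h>

// Illustrative kernel (hypothetical name): stores one half value and prints it.
__global__ void print_half_kernel(half *out)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
    {
        out[0] = __float2half(2.0f);
        // printf has no half specifier, so widen to float explicitly.
        // The float is promoted to double by the variadic call, which is what %f expects.
        printf("output %f\n", __half2float(out[0]));
    }
}

int main()
{
    half *d_out = nullptr;
    if (cudaMalloc(&d_out, sizeof(half)) != cudaSuccess)
    {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    print_half_kernel<<<1, 1>>>(d_out);

    // Device-side printf output is flushed at synchronization points.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

Built with nvcc, this should print "output 2.000000", since 2.0 is exactly representable in half precision and both compiler warnings from the question should go away because a plain float is now passed through the ellipsis.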

Note that this will make the kernel specific to the half case. If you genuinely need a generic kernel for different types with output, there are variadic template solutions which you could use to generalise this to different types, for example as discussed here.
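
If the full variadic template machinery is more than you need, a simpler overload-based sketch can cover the common case. To be clear, this is not the variadic template approach mentioned above but a swapped-in, smaller technique, and to_printable and print_value are hypothetical helper names I made up for illustration:

```cuda
#include <cstdio>
#include <cuda_fp16.h>

// Hypothetical helpers: map each element type to a type printf understands.
__host__ __device__ inline float to_printable(half v)  { return __half2float(v); }
__host__ __device__ inline float to_printable(float v) { return v; }
__host__ __device__ inline int   to_printable(int v)   { return v; }

// Hypothetical helpers: one overload per printable type, each with a literal format.
__device__ inline void print_value(float v) { printf("output %f\n", v); }
__device__ inline void print_value(int v)   { printf("output %d\n", v); }

template <typename T>
__global__ void test_print_kernel(T *O)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
    {
        O[0] = static_cast<T>(2.0f);
        // Overload resolution picks the conversion and the format for T.
        print_value(to_printable(O[0]));
    }
}

// Allocates one element of T, launches the kernel, and waits for its output.
template <typename T>
void run()
{
    T *d_O = nullptr;
    if (cudaMalloc(&d_O, sizeof(T)) != cudaSuccess)
    {
        fprintf(stderr, "cudaMalloc failed\n");
        return;
    }
    test_print_kernel<T><<<1, 1>>>(d_O);
    cudaDeviceSynchronize(); // flushes device-side printf output
    cudaFree(d_O);
}

int main()
{
    run<half>(); // prints via the float overload
    run<int>();  // prints via the int overload
    return 0;
}
```

The kernel body no longer mentions a format string at all, so switching the element type only requires adding a to_printable overload (and a print_value overload if the printable type is new). The half case still goes through __half2float, because there is no way around the conversion.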

huangapple
  • Posted on June 15, 2023 at 01:48:34
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76476301.html