June 15, 2023, 01:48:34

What is the proper way to print an FP16 element in a CUDA kernel without casting it to other formats?

Question

I wrote a simple kernel that uses printf to log some values after computation. It works perfectly with types such as int, but not with FP16: it prints zero instead of the stored value. The problem seems to be an incorrect format specifier. Which format specifier should I use, without casting the value to another data type such as float or double?


#include <stdio.h>
#include <cuda_fp16.h>
#define DTYPE half //int
#define PRINT_FMT "output %f\n" //"output %d\n"
__global__ void __launch_bounds__(1024) test_print_kernel(DTYPE *__restrict__ O)
{
    // printf("test kernel\n");
    if (((int)blockIdx.x == 0) && ((int)threadIdx.x == 0))
    {
        O[0] = ((DTYPE)(2));
        __syncthreads();
        printf(PRINT_FMT, O[0]);
    }
}
int main(int argc, char **argv)
{
    DTYPE *h_O;
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
    cudaMallocHost(&h_O, 1 * sizeof(DTYPE));
    test_print_kernel<<<dim3(1, 1, 1), dim3(1, 1, 1), 0, (cudaStream_t)stream>>>(h_O);
    cudaDeviceSynchronize();
}

The warnings I got during compilation:
test_printf.cu(16): warning #1290-D: a class type that is not trivially copyable passed through ellipsis

test_printf.cu(16): warning #181-D: argument is incompatible with corresponding format string conversion

And it prints to the terminal:
output 0.000000

Answer 1

Score: 1

There is no output format specifier for half precision floating point values in either the C++ standard definition of printf, or in the CUDA implementation.

Your only real choice is to convert the half value to a float and use the standard float format specifier. The CUDA Math API has an intrinsic function for this, so something like:

printf("output %f\n", __half2float(O[0]));

should work at a minimum. As pointed out in comments, there is a performance penalty for the conversion, but this is printf, which has a huge performance penalty anyway.
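For reference, here is a minimal sketch of the question's program with that one-line change applied. It is an illustration rather than a definitive version: it assumes a plain cudaMalloc allocation and the default stream instead of the pinned host buffer and non-blocking stream used in the question, and it drops the macros so the half-specific conversion is explicit.

```cuda
#include <cstdio>
#include <cuda_fp16.h>

// Stores 2.0 in O[0] as a half, then prints it by widening to float.
__global__ void test_print_kernel(half *__restrict__ O)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
    {
        O[0] = __float2half(2.0f);
        // printf has no half specifier, so convert first; the resulting
        // float matches the %f conversion.
        printf("output %f\n", __half2float(O[0]));
    }
}

int main()
{
    half *d_O = nullptr;  // assumption: device memory instead of cudaMallocHost
    if (cudaMalloc(&d_O, sizeof(half)) != cudaSuccess) return 1;
    test_print_kernel<<<1, 1>>>(d_O);
    // Device-side printf output is flushed at synchronization points.
    cudaError_t err = cudaDeviceSynchronize();
    cudaFree(d_O);
    return err == cudaSuccess ? 0 : 1;
}
```

Because the argument passed through the ellipsis is now a float rather than the half class type, warnings #1290-D and #181-D should no longer be triggered by this call.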

Note that this will make the kernel specific to the half case. If you genuinely need a generic kernel for different types with output, there are variadic template solutions which you could use to generalise this to different types, for example as discussed here.
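The linked discussion is not reproduced here, but the general idea can be sketched as follows. This is one possible approach under stated assumptions, not the linked solution: to_printable, output_fmt, print_values and run are hypothetical helper names introduced for illustration and are not part of CUDA. Each argument is mapped to a type printf understands before being forwarded through a variadic template.

```cuda
#include <cstdio>
#include <cuda_fp16.h>

// Hypothetical helper: most types are already printable, so pass them through.
template <typename T>
__host__ __device__ inline T to_printable(T v) { return v; }

// Hypothetical overload: half must be widened to float before printf.
__host__ __device__ inline float to_printable(half v) { return __half2float(v); }

// Hypothetical per-type format strings, selected by overload resolution.
__host__ __device__ inline const char *output_fmt(int)   { return "output %d\n"; }
__host__ __device__ inline const char *output_fmt(float) { return "output %f\n"; }
__host__ __device__ inline const char *output_fmt(half)  { return "output %f\n"; }

// Variadic wrapper: converts every argument, then forwards to printf.
template <typename... Args>
__host__ __device__ void print_values(const char *fmt, Args... args)
{
    printf(fmt, to_printable(args)...);
}

// Generic version of the question's kernel; T may be half, float or int.
template <typename T>
__global__ void test_print_kernel(T *__restrict__ O)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
    {
        O[0] = static_cast<T>(2.0f);
        print_values(output_fmt(O[0]), O[0]);
    }
}

// Launches the kernel for one element type and reports success as 0.
template <typename T>
int run()
{
    T *d_O = nullptr;
    if (cudaMalloc(&d_O, sizeof(T)) != cudaSuccess) return 1;
    test_print_kernel<T><<<1, 1>>>(d_O);
    cudaError_t err = cudaDeviceSynchronize();
    cudaFree(d_O);
    return err == cudaSuccess ? 0 : 1;
}

int main()
{
    return run<half>() + run<int>();
}
```

The conversion still happens for half; the overloads only hide it from the kernel body, so the same kernel source can be instantiated for several element types.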
