2023年6月29日 14:33:21go评论55阅读模式

英文:

Higher Tested Bandwidth with Caches Compared to Theoretical Memory Bandwidth in GPU when Accessing Global Memory

问题

I am currently working on optimizing memory access in a GPU-accelerated application and have encountered an intriguing scenario. When I use linear read in kernel code while accessing global memory, I have noticed that the tested memory bandwidth appears to be higher than the theoretical memory bandwidth. I'm seeking to understand the possible reasons behind this behavior.

Here is my kernel code:


__kernel void read_single(__global DATATYPE *input,__global DATATYPE *output)
{
        DATATYPE val = (DATATYPE)(0.0f);
        IDXTYPE gid = get_global_id(0);

        val = val + input[0];
        val = val + input[1];
        val = val + input[2];
        val = val + input[3];
        val = val + input[4];
        val = val + input[5];
        val = val + input[6];
        val = val + input[7];
        val = val + input[8];
        val = val + input[9];
        val = val + input[10];
        val = val + input[11];
        val = val + input[12];
        val = val + input[13];
        val = val + input[14];
        val = val + input[15];
        val = val + input[16];
        val = val + input[17];
        val = val + input[18];
        val = val + input[19];
        val = val + input[20];
        val = val + input[21];
        val = val + input[22];
        val = val + input[23];
        val = val + input[24];
        val = val + input[25];
        val = val + input[26];
        val = val + input[27];
        val = val + input[28];
        val = val + input[29];
        val = val + input[30];
        val = val + input[31];

        output[gid] = val;
}

I used GPU Navi 14 [Radeon Pro W5500]

Memory Size: 8 GB
Memory Type: GDDR6
Memory Bus : 128 bit
Bandwidth : 224.0 GB/s

But my test results are:

Global Memory Read: Single
Size (Bytes) 33554432
Avg. Kernel Time (sec) 1.56687e-05
Avg Bandwidth (GBPS) 2141.5

I did a global memory bandwidth test and expected that it would be lower than the theoretical value.

英文:

Here is my kernel code


__kernel void read_single(__global DATATYPE *input,__global DATATYPE *output)
{
        DATATYPE val = (DATATYPE)(0.0f);
        IDXTYPE gid = get_global_id(0);

        val = val + input[0];
        val = val + input[1];
        val = val + input[2];
        val = val + input[3];
        val = val + input[4];
        val = val + input[5];
        val = val + input[6];
        val = val + input[7];
        val = val + input[8];
        val = val + input[9];
        val = val + input[10];
        val = val + input[11];
        val = val + input[12];
        val = val + input[13];
        val = val + input[14];
        val = val + input[15];
        val = val + input[16];
        val = val + input[17];
        val = val + input[18];
        val = val + input[19];
        val = val + input[20];
        val = val + input[21];
        val = val + input[22];
        val = val + input[23];
        val = val + input[24];
        val = val + input[25];
        val = val + input[26];
        val = val + input[27];
        val = val + input[28];
        val = val + input[29];
        val = val + input[30];
        val = val + input[31];

        output[gid] = val;
}

I used GPU Navi 14 [Radeon Pro W5500]

Memory Size: 8 GB
Memory Type: GDDR6
Memory Bus : 128 bit
Bandwidth : 224.0 GB/s

But my test results is

Global Memory Read: Single
Size (Bytes) 33554432
Avg. Kernel Time (sec) 1.56687e-05
Avg Bandwidth (GBPS) 2141.5

I did global memory bandwidth test and expected that it will be lower than theoretical value

答案1

得分: 2

OpenCL对OpenCL程序员来说有点像一个黑盒子。谁知道它在盒子里实际上做了什么。不过我对你的测试有一些想法。

就像编译C程序时，编译器会寻找优化计算的方法一样。有时，这可能涉及重新排列代码或在编译时评估函数的返回值，如果输入在编译时是已知的话。可以明显地优化你的测试，观察到大部分内核计算（32个val+=...语句）在每个线程中都是相同的，因此只需在运行时评估一次，而不是每个线程都要评估一次。它肯定不需要读取超过输入数组的前32个元素。然后它只需要将那个值复制“全局大小”次到输出中。我认为如果你检查你的输出缓冲区，每个元素都会持有相同的值。

你没有说你的全局大小是多少，但假设是25万的话，那么你的输出缓冲区大约会是1MB（假设你的DATATYPE是32位浮点数）。然后，你在1.57e-5秒内更新了1MB，这给出了一个带宽约为63.7 GB/s。

如果你有32MB用于输入并且想要完全读取，你需要在每个线程上使用不同的索引。你可以做一些像这样的事情：

private DATATYPE *threadindexstart = input + 32*gid;
val = *threadindexstart;
val += *(threadindexstart + 1);...

英文:

OpenCL is something of a black box to OpenCL programmers. Who knows what it actually does inside the box. I have some thoughts on your test though.

Like when you compile a c program, the compiler will look for ways to optimise the computation. Sometimes, this will involve re-ordering code or evaluating the return value of functions if the input is known at compile time. One obvious optimisation that could be made with your test is from observing that most of the kernel calculation (the 32 val+=... statements) is the same in each thread, so only has to be evaluated at runtime once, not once per thread. It certainly doesn't have to read more than the first 32 elements of the input array. Then all it has to do is copy that value 'global-size' times to the output. I think if you inspect your output buffer, every element will hold the same value.

You didn't say what your global-size was, but let's say it was 250,000. Then, your output buffer would be about 1MB (assuming your DATATYPE is 32 bit float). You would then have updated 1MB in 1.57e-5 seconds, which gives a bandwidth of approx 63.7 GB/s.

If you have 32MB for your input and want to read all the way through, you need to be using different indices on each thread. You could do something like

private DATATYPE *threadindexstart = input + 32*gid;
val = *threadindexstart;
val += *(threadindexstart + 1);...

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Higher Tested Bandwidth with Caches Compared to Theoretical Memory Bandwidth in GPU when Accessing Global Memory

问题

答案1

LWJGL在IDE之外运行时发现不同的OpenCL安装。

“CL_TARGET_OPENCL_VERSION is not defined” – 为什么会出现这个错误？

如何释放一个cl::Context对象 [C++ API] – 无法将’cl::Context’转换为’cl_context’

访问向量类型的第i个组件

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论