Tested Bandwidth Higher than Theoretical Memory Bandwidth When Accessing Global Memory on a GPU with Caches

Question
I am currently working on optimizing memory access in a GPU-accelerated application and have encountered an intriguing scenario. When I use linear reads of global memory in my kernel code, the measured bandwidth appears to be higher than the theoretical memory bandwidth. I'm seeking to understand the possible reasons behind this behavior.
Here is my kernel code:
__kernel void read_single(__global DATATYPE *input, __global DATATYPE *output)
{
    DATATYPE val = (DATATYPE)(0.0f);
    IDXTYPE gid = get_global_id(0);
    val = val + input[0];
    val = val + input[1];
    val = val + input[2];
    val = val + input[3];
    val = val + input[4];
    val = val + input[5];
    val = val + input[6];
    val = val + input[7];
    val = val + input[8];
    val = val + input[9];
    val = val + input[10];
    val = val + input[11];
    val = val + input[12];
    val = val + input[13];
    val = val + input[14];
    val = val + input[15];
    val = val + input[16];
    val = val + input[17];
    val = val + input[18];
    val = val + input[19];
    val = val + input[20];
    val = val + input[21];
    val = val + input[22];
    val = val + input[23];
    val = val + input[24];
    val = val + input[25];
    val = val + input[26];
    val = val + input[27];
    val = val + input[28];
    val = val + input[29];
    val = val + input[30];
    val = val + input[31];
    output[gid] = val;
}
I used a Navi 14 GPU [Radeon Pro W5500]:
- Memory Size: 8 GB
- Memory Type: GDDR6
- Memory Bus: 128-bit
- Bandwidth: 224.0 GB/s
But my test results are:
- Global Memory Read: Single
- Size (Bytes): 33554432
- Avg. Kernel Time (sec): 1.56687e-05
- Avg. Bandwidth (GB/s): 2141.5
I did a global memory bandwidth test and expected that it would be lower than the theoretical value.
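For reference, the reported figure is exactly what you get by dividing the full input-buffer size by the kernel time; a quick sketch of that arithmetic (assuming the benchmark computes bandwidth this way):

```python
size_bytes = 33554432        # input buffer size reported by the benchmark
kernel_time = 1.56687e-05    # average kernel time in seconds
bandwidth_gbps = size_bytes / kernel_time / 1e9
print(round(bandwidth_gbps, 1))  # ~2141.5, far above the 224 GB/s GDDR6 peak
```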
Answer 1
Score: 2
OpenCL is something of a black box to OpenCL programmers. Who knows what it actually does inside the box. I have some thoughts on your test, though.
Just as when you compile a C program, the compiler will look for ways to optimise the computation. Sometimes this involves re-ordering code or evaluating a function's return value at compile time if its inputs are known. One obvious optimisation available in your test comes from observing that most of the kernel's work (the 32 "val = val + ..." statements) is identical in every thread, so it only has to be evaluated once at runtime, not once per thread. The kernel certainly never has to read more than the first 32 elements of the input array, and after the first fetch those elements are served from cache rather than from DRAM. All that remains is to copy that value 'global-size' times into the output. I think if you inspect your output buffer, every element will hold the same value.
You didn't say what your global-size was, but let's say it was 250,000. Then your output buffer would be about 1 MB (assuming your DATATYPE is a 32-bit float). You would then have updated 1 MB in 1.57e-5 seconds, which gives a bandwidth of approximately 63.7 GB/s.
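The arithmetic above, as a quick sketch (the 250,000 global-size and the 32-bit DATATYPE are assumptions carried over from that paragraph):

```python
global_size = 250_000                # assumed number of work-items
bytes_per_element = 4                # assuming DATATYPE is a 32-bit float
output_bytes = global_size * bytes_per_element   # ~1 MB actually written
kernel_time = 1.57e-5                # seconds
bandwidth_gbps = output_bytes / kernel_time / 1e9
print(round(bandwidth_gbps, 1))      # ~63.7 GB/s, well under the 224 GB/s peak
```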
If you have 32 MB of input and want to read all the way through it, you need to use different indices in each thread. You could do something like this (note the pointer must carry the __global qualifier, since it points into global memory):
__global DATATYPE *threadindexstart = input + 32*gid;
val = threadindexstart[0];
val += threadindexstart[1];
...
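A minimal sketch of a full kernel along those lines (the kernel name is illustrative; DATATYPE and IDXTYPE are as in the question, and the input buffer is assumed to hold global-size × 32 elements):

```c
__kernel void read_strided(__global DATATYPE *input, __global DATATYPE *output)
{
    IDXTYPE gid = get_global_id(0);
    // Each work-item starts at its own 32-element block of the input.
    __global DATATYPE *threadindexstart = input + 32 * gid;
    DATATYPE val = (DATATYPE)(0.0f);
    for (int i = 0; i < 32; ++i)
        val = val + threadindexstart[i];   // the compiler will typically unroll this
    output[gid] = val;
}
```

Be aware that with this layout consecutive work-items read addresses 32 elements apart, which is not coalesced; indexing as input[gid + i * get_global_size(0)] keeps adjacent work-items on adjacent addresses and usually measures closer to peak bandwidth.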
Comments