Unbalanced Memory Read & Write in CUDA
Question
I noticed an unbalanced memory read and write amount when profiling the underneath cuda kernel using ncu.
__global__ void kernel(void* mem, int n) {
    int* ptr = reinterpret_cast<int*>(mem);
    for (int offset = (threadIdx.x + blockIdx.x * blockDim.x) * 32; offset < n; offset += blockDim.x * gridDim.x * 32) {
#pragma unroll
        for (int i = 0; i < 16; i++) {
            ptr[offset + i] = ptr[offset + i + 16];
        }
    }
}

int main() {
    int* mem;
    int N = 1024 * 256 * 256;
    cudaMalloc((void**)&mem, sizeof(int) * N);
    cudaMemset(mem, 0, sizeof(int) * N);
    kernel<<<8192, 256>>>(mem, N);
    cudaFree(mem);
    return 0;
}
In ncu, it tells me that the memory read amount is 305 MB while the memory write amount is 1.07 GB. I understand that there is global memory coalescing, but shouldn't the memory reads and writes both be approximately 1 GB, instead of only 305 MB of reads?
And even if there is no global memory coalescing for the reads, shouldn't the read amount be around 128 MB?
Thanks.
Answer 1
Score: 1
>shouldn’t the memory read and write both be equal to approximately 1GB, instead of only 305 MB memory read?
The traffic you have identified:
>it tells me that memory read is 305 MB while memory write is 1.07GB.
is actually traffic between the L1 and L2 cache.
The GPU L1 cache is typically described as "write-through" (e.g. slide 43). This can result in a significant "imbalance" in L1<->L2 traffic even for a "balanced" read/write code: each write has the potential to trigger traffic to the L2, while reads have the potential to hit in L1 and therefore generate no corresponding L2 traffic.
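The write side of the reported numbers can be estimated with a simple back-of-the-envelope model. This is only a sketch, and it assumes 32-byte L1<->L2 sectors, a write-through L1, and no merging of write sectors across store instructions; those are assumptions about the GPU being profiled, not facts stated in the question:

```python
# Rough model of the L1->L2 write traffic for the kernel above.
# Assumed: 32-byte sectors, write-through L1, no write combining
# across store instructions (hedged assumptions about the GPU).

SECTOR_BYTES = 32
N = 1024 * 256 * 256        # ints touched by the kernel launch

# In each 32-int chunk, the first 16 ints are written and the
# last 16 are read, so half of all elements are written.
elems_written = N // 2

# Threads are 32 ints (128 bytes) apart, so at each unrolled store
# instruction the 32 lanes of a warp hit 32 different sectors.
# With a write-through L1, every 4-byte store then costs a full
# 32-byte sector of traffic to the L2.
write_traffic = elems_written * SECTOR_BYTES
print(write_traffic / 1e9)  # ~1.07 GB, matching the ncu figure
```

Under this model the 8x amplification (32 sector bytes per 4 useful bytes) turns the 128 MiB of logically written data into roughly the 1.07 GB that ncu reports.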
>shouldn’t the memory read amount be equal to around 128MB?
The traffic from L1 to L2 is higher than the actual memory traffic because the L1 cache is relatively small, and cannot contain the entire memory footprint of your code. Your code has a dynamic footprint much higher than necessary to do the actual work you are doing, because of the uncoalesced access pattern and inefficient usage of memory resources. Therefore the L1 to L2 traffic can be much higher than 128MB.
With respect to the L2 to memory traffic, depending on your GPU, the L2 may also be smaller than 128MB. In this case, again, having a larger than necessary dynamic footprint (the memory being touched based on the warps in flight) coupled with inefficient memory usage means that effectively you can thrash the L2 as well, resulting in higher than necessary traffic to memory.
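For the read side, the same sector model (again assuming 32-byte sectors; a sketch, not an exact account of the profiler's counters) brackets the 305 MB figure between a fully cached lower bound and a fully thrashed upper bound:

```python
# Bounds on L1->L2 read traffic under an assumed 32-byte-sector
# model (the sector size is an assumption about the GPU).

SECTOR_BYTES = 32
INT_BYTES = 4
N = 1024 * 256 * 256
elems_read = N // 2        # the upper 16 ints of every 32-int chunk

# Lower bound: each thread's 16 consecutive int loads cover exactly
# two sectors and every fetched byte is eventually used, so traffic
# equals the useful data volume.
min_read = elems_read * INT_BYTES      # 134 MB
# Upper bound: every load misses and fetches its own sector, as the
# stores do.
max_read = elems_read * SECTOR_BYTES   # ~1.07 GB

print(min_read / 1e6, max_read / 1e9)
```

The 305 MB that ncu reports sits between these bounds: many loads hit sectors already resident in L1, but the cache is too small to hold the kernel's whole dynamic footprint, so it cannot reach the lower bound.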