2023年6月29日 11:42:29go评论68阅读模式

英文:

Unbalanced Memory Read & Write in CUDA

问题

我注意到在使用ncu对底层的cuda内核进行性能分析时，发现了不平衡的内存读写量。

__global__ void kernel(void* mem, int n) {
    int* ptr = reinterpret_cast<int*>(mem);

    for (int offset = (threadIdx.x + blockIdx.x * blockDim.x) * 32; offset < n; offset += blockDim.x * gridDim.x * 32) {
        #pragma unroll
        for (int i = 0; i < 16; i++) {
            ptr[offset + i] = ptr[offset + i + 16];
        }
    }
}

int main() {
    int* mem;

    int N = 1024 * 256 * 256;
    cudaMalloc((void**)&mem, sizeof(int) * N);
    cudaMemset(mem, 0, sizeof(int) * N);

    kernel<<<8192, 256>>>(mem, N);

    cudaFree(mem);

    return 0;
}

在ncu中，它告诉我内存读取量为305 MB，而内存写入量为1.07 GB。我理解存在全局内存合并，但内存读取和写入应该都接近1GB，而不仅仅是305MB的内存读取，对吗？即使内存读取没有全局内存合并，内存读取量也应该接近128MB，是吗？

谢谢。

英文:

I noticed an unbalanced memory read and write amount when profiling the underneath cuda kernel using ncu.

__global__ void kernel(void* mem, int n) {
    int* ptr = reinterpret_cast&lt;int*&gt;(mem);

    for (int offset = (threadIdx.x + blockIdx.x * blockDim.x)*32; offset &lt; n; offset += blockDim.x * gridDim.x * 32) {
        #pragma unroll
        for (int i = 0; i &lt; 16; i++) {
            ptr[offset + i] = ptr[offset + i + 16];
        }
    }
}

int main() {
    int* mem;

    int N = 1024 * 256 * 256;
    cudaMalloc((void**)&amp;mem, sizeof(int) * N);
    cudaMemset(mem, 0, sizeof(int) * N);

    kernel&lt;&lt;&lt;8192, 256&gt;&gt;&gt;(mem, N);

    cudaFree(mem);

    return 0;
}

In ncu, it tells me that memory read is 305 MB while memory write is 1.07GB. I understand that there is global memory coalescing, but shouldn’t the memory read and write both be equal to approximately 1GB, instead of only 305 MB memory read?
And even if there is no global memory coalescing for memory read, shouldn’t the memory read amount be equal to around 128MB?

Thanks.

答案1

得分: 1

>内存读写应该都接近1GB，而不仅仅是305MB的内存读吗？

你已经确定的流量：

>它告诉我内存读取为305MB，而内存写入为1.07GB。

实际上是L1和L2缓存之间的流量。

GPU的L1缓存通常被描述为“写透”（例如，第43页）。这可能会导致L1<->L2的流量在“平衡”的读写代码中出现显著的“不平衡”：写入可能会在每次写入时触发到L2的流量，而读取有可能在L1中命中，因此不会生成相应的L2流量。

>内存读取量不应该接近128MB吗？

从L1到L2的流量高于实际的内存流量，因为L1缓存相对较小，无法容纳您的代码的整个内存占用。由于未合并的访问模式和内存资源的低效使用，您的代码具有比实际工作所需更高的动态内存占用。因此，L1到L2的流量可能会远高于128MB。

至于从L2到内存的流量，取决于您的GPU，L2可能也小于128MB。在这种情况下，再次拥有比实际所需更大的动态内存占用（根据运行的warp而触及的内存）以及低效的内存使用意味着您实际上也会使L2受到影响，从而导致不必要的流量到内存。

英文:

>shouldn’t the memory read and write both be equal to approximately 1GB, instead of only 305 MB memory read?

The traffic you have identified:

>it tells me that memory read is 305 MB while memory write is 1.07GB.

is actually traffic between the L1 and L2 cache.

The GPU L1 cache is typically described as "write-through" (e.g. slide 43). This can result in a significant “imbalance” in L1<->L2 traffic for a “balanced” read/write code: writes have the potential to trigger traffic to the L2 on each write, reads have the potential to hit in L1, therefore not generating corresponding traffic to the L2.

>shouldn’t the memory read amount be equal to around 128MB?

The traffic from L1 to L2 is higher than the actual memory traffic because the L1 cache is relatively small, and cannot contain the entire memory footprint of your code. Your code has a dynamic footprint much higher than necessary to do the actual work you are doing, because of the uncoalesced access pattern and inefficient usage of memory resources. Therefore the L1 to L2 traffic can be much higher than 128MB.

With respect to the L2 to memory traffic, depending on your GPU, the L2 may also be smaller than 128MB. In this case, again, having a larger than necessary dynamic footprint (the memory being touched based on the warps in flight) coupled with inefficient memory usage means that effectively you can thrash the L2 as well, resulting in higher than necessary traffic to memory.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

不平衡的CUDA内存读取与写入

问题

答案1

DiskCacheStrategy无法解析为变量

cuModuleGetSurfRef和cuModuleGetTexRef的替代方法是什么？

Angular：优化对多次重新渲染大型列表的性能

缓存CMS的资产以供离线PWA使用？

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论