GPU (Nvidia) TLB misses

Question

There are plenty of documentation/publications on CUDA/Nvidia GPUs, but I never encountered anything about TLBs.

- Do GPUs use TLBs similar to CPUs (and, therefore, have TLB hits/misses)?
- How are TLB misses handled: by the CUDA driver or by GPU hardware?
- Are there cases where TLB misses cause a significant/noticeable performance impact?
Answer 1
Score: 1
A TLB does exist. I am not aware of any official documentation, but its size can be determined via reverse engineering. See, for example, Zhe Jia et al., Dissecting the NVidia Turing T4 GPU via Microbenchmarking:
> […] within the available global memory size, there are two levels of TLB on the Turing GPUs. The L1 TLB has 2 MiB page entries and 32 MiB coverage. The coverage of the L2 TLB is about 8192 MiB, which is the same as Volta.
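The quoted numbers imply an L1 TLB of roughly 32 MiB / 2 MiB = 16 entries, and they also explain why such sizes are measurable: once a pointer-chasing working set exceeds the TLB's coverage, the miss rate jumps sharply. The sketch below simulates that effect with a hypothetical LRU TLB using the quoted Turing L1 parameters (the LRU replacement policy and the sequential access pattern are assumptions for illustration, not documented NVIDIA behavior):

```python
from collections import OrderedDict

PAGE_BYTES = 2 * 1024 * 1024   # 2 MiB pages, per the quoted Turing numbers
TLB_ENTRIES = 16               # derived: 32 MiB coverage / 2 MiB per entry

def miss_rate(working_set_bytes, accesses=10_000):
    """Cycle through the pages of a working set against a simulated LRU TLB
    and return the fraction of accesses that miss."""
    tlb = OrderedDict()        # page number -> present, ordered by recency
    pages = max(1, working_set_bytes // PAGE_BYTES)
    misses = 0
    for i in range(accesses):
        page = i % pages       # sequential, page-strided access pattern
        if page in tlb:
            tlb.move_to_end(page)          # hit: refresh recency
        else:
            misses += 1                    # miss: install entry
            tlb[page] = True
            if len(tlb) > TLB_ENTRIES:
                tlb.popitem(last=False)    # evict least recently used
    return misses / accesses

for mib in (16, 32, 64):
    print(f"{mib:3d} MiB working set -> miss rate {miss_rate(mib << 20):.2f}")
```

Within the 32 MiB coverage the only misses are the initial compulsory ones; just past it, the cyclic pattern thrashes the LRU TLB and every access misses. Real GPU microbenchmarks detect the same cliff as a step increase in memory latency rather than a directly observable miss counter.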
Comments