GPU driver version error on Google virtual machine

Question

I got the error "Failed to initialize NVML: Driver/library version mismatch" on a cloud virtual machine for no apparent reason. The system had been working normally, then suddenly crashed and reported this error.

I'm confused about the cause. Can someone with experience in this area help? I'd like to know why this error occurs and whether there is any way to prevent it. Thanks.

Answer 1

Score: 2

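According to this doc from the Bright Computing knowledge base, the "Failed to initialize NVML: Driver/library version mismatch" error generally means that the NVIDIA kernel driver still loaded is an older release that no longer matches the driver libraries or CUDA toolkit version currently installed. This typically happens when a package upgrade replaces the user-space libraries while the old kernel module stays loaded, which is why a machine that was working can suddenly start failing.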
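
Before changing anything, it may help to confirm the mismatch. The following is a minimal sketch, not part of the quoted doc: it assumes the loaded driver reports its version in /proc/driver/nvidia/version and that `modinfo` can see the module installed on disk; the function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: file locations are common defaults and may differ per distro.

# Extract the first dotted version number (e.g. 535.54.03) from text on stdin.
extract_version() {
    grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Version of the NVIDIA kernel module that is currently loaded (empty if none).
loaded_driver_version() {
    local proc_file="${1:-/proc/driver/nvidia/version}"
    if [ -r "$proc_file" ]; then
        grep 'NVRM version' "$proc_file" | extract_version
    fi
}

# Version of the NVIDIA kernel module installed on disk for the running kernel.
installed_driver_version() {
    modinfo -F version nvidia 2>/dev/null
}

# Print "match", "mismatch", or "unknown" for a pair of version strings.
compare_versions() {
    local loaded="$1" installed="$2"
    if [ -z "$loaded" ] || [ -z "$installed" ]; then
        echo "unknown"
    elif [ "$loaded" = "$installed" ]; then
        echo "match"
    else
        echo "mismatch"
    fi
}

echo "loaded:    $(loaded_driver_version)"
echo "installed: $(installed_driver_version)"
compare_versions "$(loaded_driver_version)" "$(installed_driver_version)"
```

If the last line prints "mismatch", the running module is older or newer than what is installed, and the reboot or module reload described below should bring the two back in line.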

> Rebooting the VM is the easiest way to fix the issue. Rebooting the VM
> will ensure that the drivers are properly initialized after the
> upgrade.
>
> If you do not wish to reboot the VM, you will need to remove the
> existing Nvidia kernel module and load the new module.
>
> On the VM:
>
> Remove the existing Nvidia kernel module:
>
> modprobe -r nvidia nvidia_uvm
>
> Reload the systemd units:
>
> systemctl daemon-reload
>
> Build and load the new kernel module:
>
> systemctl restart cuda-driver
>
> If the old Nvidia Kernel module is still loading, you may need to
> delete the module from the software image and node. You can check this
> with:
>
> find /lib/modules | grep nvidia
>
> find /cm/images/default-image/lib/modules | grep nvidia
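
The quoted steps target a Bright Cluster Manager node, so the `cuda-driver` service may not exist on a plain cloud VM. As a hedged alternative, the sketch below only prints a reload plan rather than executing it. It assumes the usual module dependency chain (nvidia_drm and nvidia_modeset sit on top of nvidia, as does nvidia_uvm) and assumes that `modprobe nvidia` is enough to load the new module; `plan_unload` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the commands instead of running them.

# Dependent modules must be unloaded before the core "nvidia" module.
UNLOAD_ORDER="nvidia_drm nvidia_modeset nvidia_uvm nvidia"

# Read `lsmod` output on stdin and print the unload commands that apply,
# in an order that respects the dependencies above.
plan_unload() {
    local loaded module
    # Skip the lsmod header line and keep only the module names.
    loaded="$(awk 'NR > 1 { print $1 }')"
    for module in $UNLOAD_ORDER; do
        if printf '%s\n' "$loaded" | grep -qx "$module"; then
            echo "modprobe -r $module"
        fi
    done
}

# Print the plan for this machine; review it, then run the lines as root.
if command -v lsmod >/dev/null 2>&1; then
    lsmod | plan_unload
fi
echo "modprobe nvidia"   # assumption: loads the newly installed module
echo "nvidia-smi"        # should now report the driver without the NVML error
```

Unloading fails while any process still holds the GPU, so stop those processes first or fall back to a reboot.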

Refer to this official document to remove all previous CUDA and NVIDIA driver files, then reinstall by following the steps in the CUDA Linux installation guide.

huangapple
  • Posted on June 29, 2023, 08:53:23
  • Please keep the link to this article when reposting: https://go.coder-hub.com/76577488.html