GPU driver version error on a Google Cloud virtual machine
Question
I got the error "Failed to initialize NVML: Driver/library version mismatch" on a cloud virtual machine for no apparent reason. The system had been working normally, then suddenly crashed and reported this error.
I'm confused about the cause. Can someone with experience in this area help? I'd like to know why this error occurs and whether there is a way to prevent it. Thanks.
Answer 1
Score: 2
As per this doc from the Bright Computing knowledge base, the "Failed to initialize NVML: Driver/library version mismatch" error generally means that the CUDA driver is still running an older release that is incompatible with the CUDA toolkit version currently in use. This typically shows up after the driver packages have been upgraded while the old kernel module is still loaded.
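One way to check for this condition is to compare the version of the NVIDIA kernel module that is currently loaded with the version installed on disk. The following is a minimal sketch, not part of the quoted article: `compare_versions` is a hypothetical helper, and comparing `/sys/module/nvidia/version` against `modinfo` output is only a proxy for the driver/library check that NVML performs.

```shell
#!/bin/sh
# Sketch: compare the NVIDIA kernel module loaded right now with the one
# installed on disk. Differing versions suggest the driver was upgraded
# without the old module being unloaded.

# compare_versions LOADED INSTALLED: hypothetical helper, prints a verdict.
compare_versions() {
    if [ -z "$1" ] || [ -z "$2" ]; then
        echo "unknown"
    elif [ "$1" = "$2" ]; then
        echo "match"
    else
        echo "mismatch"
    fi
}

# Version of the module currently loaded in the kernel (empty if not loaded).
loaded=$(cat /sys/module/nvidia/version 2>/dev/null)
# Version of the module file that modprobe would load next (empty if absent).
installed=$(modinfo -F version nvidia 2>/dev/null)

echo "loaded=${loaded:-none} installed=${installed:-none}"
compare_versions "$loaded" "$installed"
```

A result of "mismatch" is consistent with the explanation above; "unknown" means one of the two versions could not be read on this machine.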
> Rebooting the VM is the easiest way to fix the issue. Rebooting the VM
> will ensure that the drivers are properly initialized after the
> upgrade.
>
> If you do not wish to reboot the VM, you will need to remove the
> existing Nvidia kernel module and load the new module.
>
> On the VM:
>
> Remove the existing Nvidia kernel module:
>
>     modprobe -r nvidia nvidia_uvm
>
> Reload the systemd units:
>
>     systemctl daemon-reload
>
> Build and load the new kernel module:
>
>     systemctl restart cuda-driver
>
> If the old Nvidia Kernel module is still loading, you may need to
> delete the module from the software image and node. You can check this
> with:
>
>     find /lib/modules | grep nvidia
>     find /cm/images/default-image/lib/modules | grep nvidia
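The no-reboot steps quoted above can be collected into a small script. This is a sketch under stated assumptions rather than a tested procedure: the `run` helper and the `DRY_RUN` variable are hypothetical additions, and `cuda-driver` is the service name used in the quoted article, which may not exist on other setups. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch of the quoted no-reboot steps. By default it only prints the
# commands; set DRY_RUN=0 and run as root to actually execute them.
DRY_RUN=${DRY_RUN:-1}

# run CMD...: hypothetical helper that prints or executes a command.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reload_nvidia_driver() {
    # Unload the old modules. This fails while any process still uses the
    # GPU, or while other NVIDIA modules that depend on these are loaded.
    run modprobe -r nvidia nvidia_uvm || return 1
    # Reload the systemd unit files.
    run systemctl daemon-reload || return 1
    # Service name taken from the quoted article (assumption elsewhere).
    run systemctl restart cuda-driver || return 1
}

reload_nvidia_driver
```

Stopping at the first failing step is deliberate: if the old module cannot be unloaded, restarting the service would not load the new one, and a reboot is the simpler option.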
Refer to this official document to remove all previous CUDA and NVIDIA driver files, then reinstall by following the steps in the CUDA Linux installation guide.