Posted May 25, 2023, 23:03:50


TensorFlow does not Detect GPU on Cluster

Question

I'm using a cluster to train my Machine Learning Model (TensorFlow) in Jupyter Notebook. The cluster already has JupyterHub (Python 3.7.5), CUDA, and cuDNN installed before I started using it. The cluster is running on Ubuntu 18.04 with GCC Version 8.4.0. When I execute the nvidia-smi command, I get the following output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
(...)
|===============================+======================+======================|
|   0  Quadro P4000        On   | 00000000:00:05.0 Off |                  N/A |
(...)

I am not the system administrator, so I installed TensorFlow-GPU using pip. However, when I train the model, Jupyter Notebook and TensorFlow do not detect any GPU, as can be seen below:

Code:

import tensorflow as tf
print(tf.__version__)
print(tf.config.list_physical_devices('GPU'))

Output:

2.8.3
[]

I hope that you can help me.

Here are the steps I have already taken:

Reinstalled the TensorFlow packages;
Installed a different version of TensorFlow.

Answer 1

Score: 1

1: https://www.tensorflow.org/install/source?hl=pt-br#linux


Thanks for the answer. I solved the problem by following the steps below.

I used the ls /usr/local command to check the installed version of the CUDA Toolkit on the system.
I set PATH and LD_LIBRARY_PATH in .bashrc as follows:

export PATH=/usr/local/cuda-10.1/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

I ran source .bashrc to apply the changes.
Now the nvcc -V command works and reports the CUDA Toolkit release that the TensorFlow version installed below requires:

Cuda compilation tools, release 10.1, V10.1.243

I ran pip uninstall tensorflow-gpu and pip uninstall tensorflow-estimator.
I ran pip install tensorflow-gpu==2.3.0 (compatible with CUDA 10.1 according to the tested build configurations at https://www.tensorflow.org/install/source?hl=pt-br#linux).

Now, when I run the code below:

import tensorflow as tf
print(tf.__version__)
print(tf.config.list_physical_devices('GPU'))

The output is:

2.3.0
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
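
The toolkit and path checks from this answer can also be scripted. The following is only a minimal sketch, assuming nvcc prints a "release X.Y" string as in the output shown in this answer; parse_nvcc_release and lib_dir_on_path are hypothetical helper names, and the /usr/local/cuda-10.1 path is the one used in this answer and may differ on other systems.

```python
import os
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output):
    """Return the CUDA release (e.g. '10.1') found in `nvcc -V` output, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def lib_dir_on_path(lib_dir, ld_library_path):
    """Check whether lib_dir is one of the entries of an LD_LIBRARY_PATH string."""
    wanted = os.path.normpath(lib_dir)
    entries = [entry for entry in ld_library_path.split(":") if entry]
    return any(os.path.normpath(entry) == wanted for entry in entries)


if __name__ == "__main__":
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        print("nvcc not found on PATH")
    else:
        # capture_output requires Python 3.7 or newer
        out = subprocess.run([nvcc, "-V"], capture_output=True, text=True).stdout
        print("CUDA toolkit release:", parse_nvcc_release(out))
    # Assumed path: the cuda-10.1 directory from this answer
    on_path = lib_dir_on_path(
        "/usr/local/cuda-10.1/lib64", os.environ.get("LD_LIBRARY_PATH", "")
    )
    print("cuda-10.1 lib64 on LD_LIBRARY_PATH:", on_path)
```

If the script reports no nvcc or a missing lib64 entry, the .bashrc changes have probably not been loaded into the environment that starts the notebook.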

Thank you!

Answer 2

Score: 0


The TensorFlow documentation lists the following specifications:

Version	Python version	Compiler	Build tools	cuDNN	CUDA
tensorflow-2.9.0	3.7-3.10	GCC 9.3.1	Bazel 5.0.0	8.1	11.2
tensorflow-2.8.0	3.7-3.10	GCC 7.3.1	Bazel 4.2.1	8.1	11.2

For TensorFlow 2.8.0, the table lists GCC 7.3.1, while the cluster in the question runs GCC 8.4.0. This mismatch could be the reason why TensorFlow does not detect the GPU. I had a similar issue: when I used a version slightly different from the one specified, my GPU did not show up either.
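
The CUDA column of the same table can be checked in the same way. The sketch below is an illustration only: the dictionary holds just the pairings that appear on this page (the two rows above and the 2.3.0 / CUDA 10.1 pairing from Answer 1), and TESTED_CUDA, major_minor, and cuda_matches are hypothetical names. Note that the "CUDA Version" in the nvidia-smi header reflects what the driver supports, which may differ from the toolkit installed under /usr/local.

```python
# Pairings taken from this page only; extend from the official table as needed.
# Keys are TensorFlow major.minor versions, values are CUDA toolkit releases.
TESTED_CUDA = {
    "2.9": "11.2",
    "2.8": "11.2",
    "2.3": "10.1",
}


def major_minor(version):
    """Reduce a version string such as '10.1.243' to 'major.minor'."""
    return ".".join(version.split(".")[:2])


def cuda_matches(tf_version, toolkit_release):
    """Return True/False for a known TF version, or None if it is not in the table."""
    expected = TESTED_CUDA.get(major_minor(tf_version))
    if expected is None:
        return None
    return major_minor(toolkit_release) == expected


if __name__ == "__main__":
    # The question's setup (TF 2.8.3) against the toolkit found in Answer 1
    print(cuda_matches("2.8.3", "10.1"))
    # The setup that worked in Answer 1
    print(cuda_matches("2.3.0", "10.1.243"))
```

A False result suggests, but does not prove, that the installed toolkit is the reason the GPU list comes back empty.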
