TensorFlow Does Not Detect GPU on Cluster
Question
I'm using a cluster to train a machine learning model (TensorFlow) in Jupyter Notebook. The cluster already had JupyterHub (Python 3.7.5), CUDA, and cuDNN installed before I started using it. It runs Ubuntu 18.04 with GCC 8.4.0. When I execute the nvidia-smi command, I get the following output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
(...)
|===============================+======================+======================|
|   0  Quadro P4000        On   | 00000000:00:05.0 Off |                  N/A |
(...)
I am not the system administrator, so I installed TensorFlow-GPU using pip. However, when I train the model, Jupyter Notebook and TensorFlow do not detect any GPU, as can be seen below:
Code:
import tensorflow as tf
print(tf.__version__)
print(tf.config.list_physical_devices('GPU'))
Output:
2.8.3
[]
I hope that you can help me.
Here are the steps I have already taken:
- Reinstalled the TensorFlow packages;
- Installed a different version of TensorFlow.
Answer 1
Score: 1
1: https://www.tensorflow.org/install/source?hl=pt-br#linux
Thanks for the answer. I solved the problem with the following steps:
- I ran ls /usr/local to check which version of the CUDA Toolkit was installed on the system.
- I added the following lines to my .bashrc file so that PATH and LD_LIBRARY_PATH point to that toolkit:
export PATH=/usr/local/cuda-10.1/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64/${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
- I ran source .bashrc to apply the changes.
- The nvcc -V command now works and reports the installed CUDA release, which tells me which TensorFlow version to install:
Cuda compilation tools, release 10.1, V10.1.243
- I ran pip uninstall tensorflow-gpu and pip uninstall tensorflow-estimator.
- I ran pip install tensorflow-gpu==2.3.0 (compatible with CUDA 10.1 according to this link [1]).
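The discovery and path steps above can also be sketched in Python. This is only an illustration, not part of the original fix: the helper names find_cuda_toolkits, prepend, and cuda_environment are made up for this example, and it assumes toolkits are installed as cuda-X.Y directories under /usr/local, as on this cluster. The prepend helper mirrors the shell idiom ${PATH:+:${PATH}}, which appends a colon and the old value only when the variable is already set and non-empty.

```python
import os
import re
from pathlib import Path


def find_cuda_toolkits(root="/usr/local"):
    """Return {version: path} for every cuda-X.Y directory under root."""
    toolkits = {}
    root_path = Path(root)
    if not root_path.is_dir():
        return toolkits
    for entry in root_path.iterdir():
        match = re.fullmatch(r"cuda-(\d+\.\d+)", entry.name)
        if match and entry.is_dir():
            toolkits[match.group(1)] = str(entry)
    return toolkits


def prepend(new_dir, current):
    """Mimic new_dir${VAR:+:${VAR}}: add ':' and the old value only if it is non-empty."""
    return f"{new_dir}:{current}" if current else new_dir


def cuda_environment(cuda_home, env=None):
    """Build the PATH and LD_LIBRARY_PATH values for the given toolkit directory."""
    env = os.environ if env is None else env
    return {
        "PATH": prepend(f"{cuda_home}/bin", env.get("PATH", "")),
        "LD_LIBRARY_PATH": prepend(
            f"{cuda_home}/lib64", env.get("LD_LIBRARY_PATH", "")
        ),
    }


# Example: show the values that would be exported for each toolkit found.
for version, path in sorted(find_cuda_toolkits().items()):
    print(version, cuda_environment(path))
```

Note that LD_LIBRARY_PATH is read by the dynamic loader when a process starts, so the values have to be in the environment before the Jupyter kernel is launched; setting them from inside a running notebook is not enough.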
Now, when I run the code below:
import tensorflow as tf
print(tf.__version__)
print(tf.config.list_physical_devices('GPU'))
The output is:
2.3.0
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
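To avoid silently training on the CPU if this regresses, the same check can be turned into a guard at the top of the notebook. This is a minimal sketch; require_gpu is a hypothetical helper name, not a TensorFlow function.

```python
def require_gpu(devices):
    """Return the number of detected GPUs, or raise if the list is empty."""
    if not devices:
        raise RuntimeError(
            "TensorFlow did not detect a GPU; check the CUDA paths "
            "and the installed TensorFlow version."
        )
    return len(devices)


# Intended use inside the notebook (requires TensorFlow):
#   import tensorflow as tf
#   require_gpu(tf.config.list_physical_devices('GPU'))
```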
Thank you!
Answer 2
Score: 0
The TensorFlow documentation lists the following specifications:
Version | Python version | Compiler | Build tools | cuDNN | CUDA |
---|---|---|---|---|---|
tensorflow-2.9.0 | 3.7-3.10 | GCC 9.3.1 | Bazel 5.0.0 | 8.1 | 11.2 |
tensorflow-2.8.0 | 3.7-3.10 | GCC 7.3.1 | Bazel 4.2.1 | 8.1 | 11.2 |
For TensorFlow 2.8.0, GCC 7.3.1 is listed, whereas your cluster has GCC 8.4.0. This mismatch could be the reason why TensorFlow does not detect the GPU. I had a similar issue: when I used a version slightly different from the one specified, my GPU did not show up either.
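The CUDA column of the table can be checked against the installed toolkit in the same way. The sketch below is only an illustration: the dictionary holds just the versions quoted in this thread (the two table rows plus the 2.3.0 / CUDA 10.1 pairing from the other answer), not the full official table, and the function names are made up for this example. It also assumes that patch releases such as 2.8.3 share the CUDA requirement of their 2.8.0 row.

```python
# Tested configurations quoted in this thread (not the full official table).
TESTED_CUDA = {
    "2.9.0": "11.2",
    "2.8.0": "11.2",
    "2.3.0": "10.1",
}


def major_minor(version):
    """Reduce '10.1.243' or '2.8.3' to the tuple (10, 1) or (2, 8)."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def expected_cuda(tf_version):
    """Look up the CUDA release for a TensorFlow version, matching on major.minor.

    Assumption: patch releases use the same CUDA release as the listed x.y.0 row.
    """
    for listed, cuda in TESTED_CUDA.items():
        if major_minor(listed) == major_minor(tf_version):
            return cuda
    return None


def toolkit_matches(tf_version, toolkit_version):
    """Return True/False for a known TensorFlow version, None if it is not listed."""
    expected = expected_cuda(tf_version)
    if expected is None:
        return None
    return major_minor(expected) == major_minor(toolkit_version)


# The combination from the question: TensorFlow 2.8.3 with the CUDA 10.1 toolkit.
print(toolkit_matches("2.8.3", "10.1.243"))
# The combination that worked: TensorFlow 2.3.0 with the CUDA 10.1 toolkit.
print(toolkit_matches("2.3.0", "10.1.243"))
```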