TensorFlow does not Detect GPU on Cluster

Question

I'm using a cluster to train my machine learning model (TensorFlow) in Jupyter Notebook. The cluster already had JupyterHub (Python 3.7.5), CUDA, and cuDNN installed before I started using it. It runs Ubuntu 18.04 with GCC 8.4.0. When I run the nvidia-smi command, I get the following output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
(...)
|===============================+======================+======================|
|   0  Quadro P4000        On   | 00000000:00:05.0 Off |                  N/A |
(...)

I am not the system administrator, so I installed tensorflow-gpu using pip. However, when I train the model, neither Jupyter Notebook nor TensorFlow detects a GPU, as shown below:

Code:

import tensorflow as tf

print(tf.__version__)
print(tf.config.list_physical_devices('GPU'))

Output:

2.8.3
[]

I hope that you can help me.

Here are the steps I have already taken:

  • Reinstalled the TensorFlow packages;
  • Installed a different version of TensorFlow.

Answer 1

Score: 1

[1] https://www.tensorflow.org/install/source?hl=pt-br#linux (the link referenced in the answer below)

Thanks for the answer. I solved the problem by following the steps below.

  1. I used the ls /usr/local command to check the installed version of the CUDA Toolkit on the system.

  2. I added the following PATH and LD_LIBRARY_PATH exports to .bashrc:

export PATH=/usr/local/cuda-10.1/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

  3. I ran source .bashrc to apply the changes.

  4. Now the nvcc -V command works and shows the CUDA toolkit version:

Cuda compilation tools, release 10.1, V10.1.243

  5. I ran pip uninstall tensorflow-gpu and pip uninstall tensorflow-estimator.

  6. I ran pip install tensorflow-gpu==2.3.0 (compatible with CUDA 10.1 according to this link [1]).
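
The first two steps can also be sketched in Python. This is only an illustration, assuming the toolkits sit in directories named cuda-<version> under /usr/local, as they did on this cluster. The helper names find_cuda_toolkits and export_lines are made up for this example and are not part of any library.

```python
import re
from pathlib import Path

# Matches directory names such as "cuda-10.1" or "cuda-11.2".
CUDA_DIR = re.compile(r"^cuda-(\d+(?:\.\d+)*)$")


def find_cuda_toolkits(root="/usr/local"):
    """Return {version: path} for every cuda-<version> directory under root."""
    root_path = Path(root)
    if not root_path.is_dir():
        return {}
    found = {}
    for entry in root_path.iterdir():
        match = CUDA_DIR.match(entry.name)
        if match and entry.is_dir():
            found[match.group(1)] = str(entry)
    return found


def export_lines(cuda_home):
    """Build the two .bashrc lines for the chosen toolkit directory."""
    return [
        f"export PATH={cuda_home}/bin${{PATH:+:${{PATH}}}}",
        f"export LD_LIBRARY_PATH={cuda_home}/lib64"
        f"${{LD_LIBRARY_PATH:+:${{LD_LIBRARY_PATH}}}}",
    ]


if __name__ == "__main__":
    # Print every toolkit found, with the exports that would select it.
    for version, path in sorted(find_cuda_toolkits().items()):
        print(version, path)
        for line in export_lines(path):
            print("   ", line)
```

The script only prints the lines; copying them into .bashrc and running source .bashrc is still a manual step.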

Now, when I run the code below:

import tensorflow as tf

print(tf.__version__)
print(tf.config.list_physical_devices('GPU'))

The output is:

2.3.0
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
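
For anyone who wants to check the pairing before reinstalling, here is a rough sketch that compares the toolkit reported by nvcc -V with the CUDA version the installed TensorFlow wheel was built against. It assumes nvcc is on PATH and uses tf.sysconfig.get_build_info(), which as far as I know exists from TensorFlow 2.3 onward. The function names are my own, and each function returns None when the information is unavailable.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Extract the 'release X.Y' number from nvcc -V output, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", text)
    return match.group(1) if match else None


def toolkit_release():
    """Run nvcc -V if nvcc is on PATH and return its release string."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(["nvcc", "-V"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


def tensorflow_cuda_version():
    """Return the CUDA version the installed TensorFlow was built against."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # CPU-only builds have no "cuda_version" key, hence .get().
    return tf.sysconfig.get_build_info().get("cuda_version")


if __name__ == "__main__":
    print("nvcc release:        ", toolkit_release())
    print("TensorFlow built for:", tensorflow_cuda_version())
```

If the two values differ, that is a likely reason for the empty device list, which matches what happened here with TensorFlow 2.8.3 and the CUDA 10.1 toolkit.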

Thank you!

Answer 2

Score: 0

The TensorFlow documentation lists the following specifications:

Version            Python version   Compiler    Build tools   cuDNN   CUDA
tensorflow-2.9.0   3.7-3.10         GCC 9.3.1   Bazel 5.0.0   8.1     11.2
tensorflow-2.8.0   3.7-3.10         GCC 7.3.1   Bazel 4.2.1   8.1     11.2

For TensorFlow 2.8.0, the table lists GCC 7.3.1, whereas the cluster in the question has GCC 8.4.0. This mismatch could be why TensorFlow does not detect the GPU. I had a similar issue: I used a slightly different version from the one specified, and my GPU did not show up either.
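
The table rows can be turned into a small lookup to check a given setup. The sketch below is only an illustration: it holds the two rows above plus the 2.3.0 / CUDA 10.1 pairing reported in the first answer, and it matches on the major.minor part of the TensorFlow version, which is a simplification of mine. TESTED_CONFIGS, expected_cuda and cuda_matches are hypothetical names.

```python
# Rows copied from the table above, plus the 2.3.0 / CUDA 10.1 pairing
# reported in the first answer. Keys are "major.minor" TensorFlow versions.
TESTED_CONFIGS = {
    "2.9": {"gcc": "9.3.1", "cudnn": "8.1", "cuda": "11.2"},
    "2.8": {"gcc": "7.3.1", "cudnn": "8.1", "cuda": "11.2"},
    "2.3": {"cuda": "10.1"},
}


def expected_cuda(tf_version):
    """Return the CUDA version listed for a TensorFlow release, or None."""
    major_minor = ".".join(tf_version.split(".")[:2])
    row = TESTED_CONFIGS.get(major_minor)
    return row["cuda"] if row else None


def cuda_matches(tf_version, toolkit_release):
    """True if the toolkit release equals the version listed for tf_version."""
    expected = expected_cuda(tf_version)
    return expected is not None and expected == toolkit_release


if __name__ == "__main__":
    # Values taken from the question and the first answer in this thread.
    print(cuda_matches("2.8.3", "10.1"))  # False: 2.8 is listed with CUDA 11.2
    print(cuda_matches("2.3.0", "10.1"))  # True
```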

huangapple
  • Published on May 25, 2023 at 23:03:50
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76333746.html