TensorFlow Does Not Detect GPU on Cluster

Question

I'm using a cluster to train my machine learning model (TensorFlow) in Jupyter Notebook. JupyterHub (Python 3.7.5), CUDA, and cuDNN were already installed on the cluster before I started using it. The cluster runs Ubuntu 18.04 with GCC 8.4.0. When I execute the nvidia-smi command, I get the following output:

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
    |-------------------------------+----------------------+----------------------+
    (...)
    |===============================+======================+======================|
    | 0 Quadro P4000 On | 00000000:00:05.0 Off | N/A |
    (...)

I am not the system administrator, so I installed TensorFlow-GPU using pip. However, when I train the model, Jupyter Notebook and TensorFlow do not detect any GPU, as can be seen below:

Code:

    import tensorflow as tf
    print(tf.__version__)
    print(tf.config.list_physical_devices('GPU'))

Output:

    2.8.3
    []

I hope that you can help me.

Here are the steps I have already taken:

  • Reinstalled the TensorFlow packages;
  • Installed a different version of TensorFlow.
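
One further check that may help narrow this down is whether the CUDA runtime libraries can be loaded at all from the notebook's Python process. The sketch below uses only the standard library. The library names in ASSUMED_LIBRARIES are assumptions based on what a CUDA 11.x / cuDNN 8 build of TensorFlow typically looks for; they are not taken from this thread and should be adjusted to the installed TensorFlow version.

```python
import ctypes


def can_load(library_names):
    """Return {name: bool} telling whether each shared library can be loaded."""
    results = {}
    for name in library_names:
        try:
            ctypes.CDLL(name)  # searches LD_LIBRARY_PATH and the system library paths
            results[name] = True
        except OSError:
            results[name] = False
    return results


# Assumption: sonames commonly requested by CUDA 11.x builds of TensorFlow.
ASSUMED_LIBRARIES = ["libcudart.so.11.0", "libcudnn.so.8"]

if __name__ == "__main__":
    for name, ok in can_load(ASSUMED_LIBRARIES).items():
        print(f"{name}: {'found' if ok else 'NOT found'}")
```

If a library is reported as not found inside the notebook but the file exists on disk, the kernel's library search path is a likely suspect.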

Answer 1

Score: 1

Thanks for the answer. I solved the problem with the following steps.

  1. I ran ls /usr/local to check which version of the CUDA Toolkit is installed on the system.

  2. I updated PATH and LD_LIBRARY_PATH in my .bashrc file as follows:

         export PATH=/usr/local/cuda-10.1/bin${PATH:+:${PATH}}
         export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

  3. I ran source .bashrc to apply the changes.

  4. The nvcc -V command now works and reports the CUDA release that the TensorFlow version installed in step 6 expects:

         Cuda compilation tools, release 10.1, V10.1.243

  5. I ran pip uninstall tensorflow-gpu and pip uninstall tensorflow-estimator.

  6. I ran pip install tensorflow-gpu==2.3.0 (compatible with CUDA 10.1 according to this link [1]).
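
The `${VAR:+:${VAR}}` expansion used in the export lines appends a colon and the old value only when the variable is already set and non-empty, which avoids a stray trailing colon. A minimal Python sketch of the same rule follows; the function names prepend_path and cuda_environment are made up for illustration and are not part of any tool.

```python
import os
import posixpath


def prepend_path(directory, current):
    """Mimic DIR${VAR:+:${VAR}}: add ':' + current only if current is non-empty."""
    return directory + (":" + current if current else "")


def cuda_environment(cuda_home, env):
    """Return the PATH and LD_LIBRARY_PATH values the two export lines would yield."""
    return {
        "PATH": prepend_path(posixpath.join(cuda_home, "bin"), env.get("PATH", "")),
        "LD_LIBRARY_PATH": prepend_path(
            posixpath.join(cuda_home, "lib64"), env.get("LD_LIBRARY_PATH", "")
        ),
    }


if __name__ == "__main__":
    # Show what the variables would look like for the toolkit used in this answer.
    for key, value in cuda_environment("/usr/local/cuda-10.1", os.environ).items():
        print(f"{key}={value}")
```

Note that a notebook kernel only sees these variables if the process that launched it had them set, so the Jupyter server may need a restart after editing .bashrc.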

Now, when I run the code below:

    import tensorflow as tf
    print(tf.__version__)
    print(tf.config.list_physical_devices('GPU'))

The output is:

    2.3.0
    [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
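
To re-check the toolkit later, for example after the cluster is updated, the nvcc -V output from step 4 can be parsed in a few lines. This is only a sketch: parse_nvcc_release is a hypothetical helper, and the regular expression assumes the "release X.Y" wording shown above.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Extract the 'major.minor' CUDA release from `nvcc -V` text, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


# Sample line copied from the nvcc -V output in this answer.
sample = "Cuda compilation tools, release 10.1, V10.1.243"
print(parse_nvcc_release(sample))  # prints 10.1
```

The text to parse could come from subprocess.run(["nvcc", "-V"], capture_output=True, text=True).stdout on a machine where nvcc is on PATH.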

Thank you!

[1]: https://www.tensorflow.org/install/source?hl=pt-br#linux


Answer 2

Score: 0

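The TensorFlow documentation lists the following specifications:

    Version           Python version  Compiler   Build tools  cuDNN  CUDA
    tensorflow-2.9.0  3.7-3.10        GCC 9.3.1  Bazel 5.0.0  8.1    11.2
    tensorflow-2.8.0  3.7-3.10        GCC 7.3.1  Bazel 4.2.1  8.1    11.2

For TensorFlow 2.8.0 the table lists GCC 7.3.1, whereas the cluster in the question has GCC 8.4.0. This could be the reason why TensorFlow does not detect the GPU. I had a similar issue: when I used a version slightly different from the one specified, my GPU did not show up either.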
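
A version table like this can be turned into a small lookup to check a TensorFlow/CUDA pairing before installing. The sketch below contains only the CUDA values quoted in this thread (the two rows above plus the 2.3.0 / CUDA 10.1 pairing from Answer 1); TESTED_CUDA and the helper names are illustrative, not an official API. For a pip-installed wheel, the CUDA and cuDNN columns are usually the ones that matter at run time, since the compiler column describes how the release itself was built. Also note that the "CUDA Version" shown by nvidia-smi is the highest version the driver supports, not necessarily the toolkit installed under /usr/local.

```python
# CUDA versions per TensorFlow minor release, limited to rows quoted in this thread.
TESTED_CUDA = {
    "2.9": "11.2",
    "2.8": "11.2",
    "2.3": "10.1",
}


def expected_cuda(tf_version):
    """Return the listed CUDA version for e.g. '2.8.3', or None if not in the table."""
    minor_release = ".".join(tf_version.split(".")[:2])
    return TESTED_CUDA.get(minor_release)


def is_tested_pair(tf_version, cuda_version):
    """True when the installed CUDA toolkit matches the listed one for this release."""
    return expected_cuda(tf_version) == cuda_version


# The question's setup: TensorFlow 2.8.3 with a CUDA 10.1 toolkit on disk.
print(is_tested_pair("2.8.3", "10.1"))  # prints False
# The setup from Answer 1: TensorFlow 2.3.0 with CUDA 10.1.
print(is_tested_pair("2.3.0", "10.1"))  # prints True
```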


  • Posted by huangapple on May 25, 2023, at 23:03:50
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76333746.html