Cannot run TensorFlow GPU in Docker (although it seems to be installed outside of it)

Question

Here is my situation:

I pulled the tensorflow/tensorflow:latest-gpu image. To run it, I start a container with the following command:

docker run -it --rm \
  --ipc=host \
  --gpus all \
  --volume="/tmp/.X11-unix:/tmp/.X11-unix:rw" \
  --volume="$(pwd):/mydir:rw" \
  --workdir="/mydir/" \
  tensorflow/tensorflow:latest-gpu bash -c 'bash'

However, whenever I run the following Python commands:

>>> import tensorflow as tf
>>> print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

Here is the output I get (inside the container):

2023-06-26 13:10:46.768093: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:266] failed call to cuInit: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW
2023-06-26 13:10:46.768177: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:168] retrieving CUDA diagnostic information for host: 466cc7912253
2023-06-26 13:10:46.768189: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:175] hostname: 466cc7912253
2023-06-26 13:10:46.768314: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:199] libcuda reported version is: NOT_FOUND: was unable to find libcuda.so DSO loaded into this program
2023-06-26 13:10:46.768355: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:203] kernel reported version is: 515.65.1
Num GPUs Available:  0

Here is what I see both outside and inside the container:

## nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:04:00.0 Off |                  N/A |
|  0%   37C    P0    58W / 250W |      0MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:05:00.0 Off |                  N/A |
|  0%   35C    P0    59W / 250W |      0MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:84:00.0 Off |                  N/A |
|  0%   38C    P0    60W / 250W |      0MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:85:00.0 Off |                  N/A |
|  0%   32C    P0    59W / 250W |      0MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce ...  Off  | 00000000:88:00.0 Off |                  N/A |
|  0%   26C    P0    58W / 250W |      0MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce ...  Off  | 00000000:89:00.0 Off |                  N/A |
|  0%   28C    P0    57W / 250W |      0MiB / 11264MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I can also see that a CUDA version is installed:

## nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0

So what can I do to make the container find my CUDA installation and run my code on the GPU?

Answer 1

Score: 2

The tensorflow/tensorflow:latest-gpu tag pulls the latest TensorFlow release, which is 2.13.0 (as of 07/08/23). That release requires CUDA 11.8, which is newer than the CUDA 11.7 your driver reports, and the forward-compatibility path the container then attempts is not supported by your graphics card. You can see this in the error message:

tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:266] failed call to cuInit: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW

Check the latest CUDA version supported by your graphics card and driver, and use a TensorFlow version compatible with that CUDA version: https://www.tensorflow.org/install/source#gpu

I hit the same issue when running tensorflow/tensorflow:latest-gpu, but tensorflow/tensorflow:2.4.3-gpu works. My GPU is slightly older, as you can tell from the TensorFlow version I am using.
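To make the version check above concrete, here is a minimal sketch that reads the "CUDA Version" field from the nvidia-smi header and lists the TensorFlow releases whose CUDA requirement does not exceed it. The TF_CUDA table is a small hand-copied excerpt from the tested build configurations page linked above and should be verified against it; the function names are hypothetical helpers, not part of TensorFlow. The comparison is deliberately conservative: it treats any release built against a newer CUDA than the driver reports as incompatible.

```python
import re
from typing import List, Optional, Tuple

# CUDA version each TensorFlow release was built against. Hand-copied excerpt
# from the "tested build configurations" table (assumption: verify before use).
TF_CUDA = {
    "2.13.0": (11, 8),
    "2.12.0": (11, 8),
    "2.11.0": (11, 2),
    "2.4.0": (11, 0),
}


def driver_cuda_version(smi_header: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from the 'CUDA Version: X.Y' field of nvidia-smi."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_header)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def compatible_releases(driver_cuda: Tuple[int, int]) -> List[str]:
    """Return releases whose required CUDA version is <= the driver's version."""
    return [tf for tf, cuda in TF_CUDA.items() if cuda <= driver_cuda]


# Header line taken from the nvidia-smi output in the question.
header = "| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |"
print(compatible_releases(driver_cuda_version(header)))
```

With the header from the question, 2.13.0 and 2.12.0 are filtered out, which matches the observation that an older image tag works where latest-gpu does not.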

Answer 2

Score: 1

It still seems like the Docker container is not able to access the GPUs on the host machine even though you are passing --gpus all as an argument. A few things to try:

  1. Make sure the Nvidia container toolkit is installed on the host - this is required for Docker to access Nvidia GPUs. You can install it with:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker

  2. Pass additional runtime arguments to expose the GPUs:

--runtime=nvidia \
--gpus all \
-e NVIDIA_VISIBLE_DEVICES=all

  3. Make sure the TensorFlow Docker image has GPU support

  4. Double check GPU drivers are up-to-date on the host machine.

Some combination of these steps should allow the container to access the GPUs properly...
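One way to tell which of these steps is actually at fault is to call the CUDA driver directly from inside the container, without going through TensorFlow. The sketch below is only an illustration: it loads libcuda.so.1 with ctypes and calls cuInit, the same driver call that fails in the log from the question. The helper names cu_init_status and describe are hypothetical, and the result-code table covers only a few values from cuda.h.

```python
import ctypes
from typing import Optional

# A few CUresult codes from cuda.h that are relevant to this problem.
CU_RESULTS = {
    0: "CUDA_SUCCESS",
    100: "CUDA_ERROR_NO_DEVICE",
    803: "CUDA_ERROR_SYSTEM_DRIVER_MISMATCH",
    804: "CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE",
}


def cu_init_status(lib_name: str = "libcuda.so.1") -> Optional[int]:
    """Call cuInit(0) and return its CUresult, or None if libcuda cannot be loaded."""
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError:
        return None
    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    return libcuda.cuInit(0)


def describe(status: Optional[int]) -> str:
    """Turn a cuInit result into a short human-readable message."""
    if status is None:
        return "libcuda not loadable (was the container started with --gpus?)"
    return CU_RESULTS.get(status, f"CUresult {status}")


if __name__ == "__main__":
    print(describe(cu_init_status()))
```

If libcuda cannot be loaded at all, the container toolkit or the --gpus flag is the likely culprit. A result of CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE instead points at a mismatch between the image's CUDA version and the host driver.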

huangapple
  • Posted on June 26, 2023 at 21:17:15
  • Please keep this link when reposting: https://go.coder-hub.com/76557066.html