Cannot run Tensorflow GPU on Docker (although it seems to be installed outside of it)

Question

Here is my situation

I downloaded the tensorflow/tensorflow:latest-gpu image. To run it, I start the container with the following command:

  docker run -it --rm \
    --ipc=host \
    --gpus all \
    --volume="/tmp/.X11-unix:/tmp/.X11-unix:rw" \
    --volume="$(pwd)/://mydir:rw" \
    --workdir="/mydir/" \
    tensorflow/tensorflow:latest-gpu bash -c 'bash'

However, whenever I run the following Python commands:

  >> import tensorflow as tf
  >> print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

Here is the output I get (inside the Docker container):

  2023-06-26 13:10:46.768093: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:266] failed call to cuInit: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW
  2023-06-26 13:10:46.768177: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:168] retrieving CUDA diagnostic information for host: 466cc7912253
  2023-06-26 13:10:46.768189: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:175] hostname: 466cc7912253
  2023-06-26 13:10:46.768314: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:199] libcuda reported version is: NOT_FOUND: was unable to find libcuda.so DSO loaded into this program
  2023-06-26 13:10:46.768355: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:203] kernel reported version is: 515.65.1
  Num GPUs Available: 0

Here is what I see both outside and INSIDE Docker:

  ## nvidia-smi
  +-----------------------------------------------------------------------------+
  | NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
  |-------------------------------+----------------------+----------------------+
  | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
  | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
  | | | MIG M. |
  |===============================+======================+======================|
  | 0 NVIDIA GeForce ... Off | 00000000:04:00.0 Off | N/A |
  | 0% 37C P0 58W / 250W | 0MiB / 11264MiB | 0% Default |
  | | | N/A |
  +-------------------------------+----------------------+----------------------+
  | 1 NVIDIA GeForce ... Off | 00000000:05:00.0 Off | N/A |
  | 0% 35C P0 59W / 250W | 0MiB / 11264MiB | 0% Default |
  | | | N/A |
  +-------------------------------+----------------------+----------------------+
  | 2 NVIDIA GeForce ... Off | 00000000:84:00.0 Off | N/A |
  | 0% 38C P0 60W / 250W | 0MiB / 11264MiB | 0% Default |
  | | | N/A |
  +-------------------------------+----------------------+----------------------+
  | 3 NVIDIA GeForce ... Off | 00000000:85:00.0 Off | N/A |
  | 0% 32C P0 59W / 250W | 0MiB / 11264MiB | 0% Default |
  | | | N/A |
  +-------------------------------+----------------------+----------------------+
  | 4 NVIDIA GeForce ... Off | 00000000:88:00.0 Off | N/A |
  | 0% 26C P0 58W / 250W | 0MiB / 11264MiB | 0% Default |
  | | | N/A |
  +-------------------------------+----------------------+----------------------+
  | 5 NVIDIA GeForce ... Off | 00000000:89:00.0 Off | N/A |
  | 0% 28C P0 57W / 250W | 0MiB / 11264MiB | 1% Default |
  | | | N/A |
  +-------------------------------+----------------------+----------------------+
  +-----------------------------------------------------------------------------+
  | Processes: |
  | GPU GI CI PID Type Process name GPU Memory |
  | ID ID Usage |
  |=============================================================================|
  | No running processes found |
  +-----------------------------------------------------------------------------+

I can also see that a CUDA toolkit version is installed:

  ## nvcc --version
  nvcc: NVIDIA (R) Cuda compiler driver
  Copyright (c) 2005-2020 NVIDIA Corporation
  Built on Wed_Jul_22_19:09:09_PDT_2020
  Cuda compilation tools, release 11.0, V11.0.221
  Build cuda_11.0_bu.TC445_37.28845127_0

So what can I do to make my Docker container find CUDA and run my code on the GPU?

Answer 1

Score: 2

tensorflow/tensorflow:latest-gpu pulls the latest TensorFlow release, which is 2.13.0 (as of 07/08/23), and that TensorFlow version requires CUDA 11.8, which is not supported by your graphics card. You can see this logged in the error message:

  tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:266] failed call to cuInit: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW

Check the latest CUDA version supported by your graphics card and use a TensorFlow version that is compatible with that CUDA version: https://www.tensorflow.org/install/source#gpu

I was facing the same issue when I tried to run tensorflow/tensorflow:latest-gpu, but tensorflow/tensorflow:2.4.3-gpu works. I have a slightly older GPU, as you can see from the TensorFlow version I am using.
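
For example, since nvidia-smi above reports driver 515.65.01 (CUDA 11.7), an image pinned to an older TensorFlow release should be able to see the GPUs. A minimal check, assuming a tag such as tensorflow/tensorflow:2.11.0-gpu (substitute whichever tag the compatibility table maps to your driver):

  # The pinned tag below is only an example; pick the TensorFlow version whose
  # bundled CUDA release is supported by the driver reported by nvidia-smi.
  docker run -it --rm --gpus all tensorflow/tensorflow:2.11.0-gpu \
      python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"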

Answer 2

Score: 1

It still seems like the Docker container is not able to access the GPUs on the host machine, even though you are passing --gpus all as an argument. A few things to try:

  1. Make sure the NVIDIA Container Toolkit is installed on the host - this is required for Docker to access NVIDIA GPUs. You can install it with:

     distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
       && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
       && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
     sudo apt-get update
     sudo apt-get install -y nvidia-docker2
     sudo systemctl restart docker

  2. Pass additional runtime arguments to expose the GPUs (see the combined command after this list):

     --runtime=nvidia \
     --gpus all \
     -e NVIDIA_VISIBLE_DEVICES=all

  3. Make sure the TensorFlow Docker image has GPU support.

  4. Double-check that the GPU drivers are up to date on the host machine.
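
Putting steps 1 and 2 together, the original docker run command would look roughly like this (a sketch that assumes the nvidia-docker2 install from step 1 succeeded, so that the "nvidia" runtime exists):

  docker run -it --rm \
    --runtime=nvidia \
    --gpus all \
    -e NVIDIA_VISIBLE_DEVICES=all \
    --ipc=host \
    --volume="/tmp/.X11-unix:/tmp/.X11-unix:rw" \
    --volume="$(pwd)/://mydir:rw" \
    --workdir="/mydir/" \
    tensorflow/tensorflow:latest-gpu bash -c 'bash'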

Some combination of these steps should allow the container to access the GPUs properly...
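
Also, since the log in the question ends with "was unable to find libcuda.so", a quick diagnostic (a sketch, not a guaranteed fix) is to check whether the driver library gets injected into the container at all:

  # If the NVIDIA runtime is wired up, the host driver's libcuda should appear here;
  # an empty result means the container never received the driver libraries.
  docker run --rm --gpus all tensorflow/tensorflow:latest-gpu \
      bash -c 'ldconfig -p | grep libcuda'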

huangapple
  • This article was posted on 2023-06-26 21:17:15.
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76557066.html