Cannot run Tensorflow GPU on Docker (although it seems to be installed outside of it)
Here is my situation
I pulled the tensorflow/tensorflow:latest-gpu image and start a container from it with the following command:
docker run -it --rm \
--ipc=host \
--gpus all \
--volume="/tmp/.X11-unix:/tmp/.X11-unix:rw" \
--volume="$(pwd)/://mydir:rw" \
--workdir="/mydir/" \
tensorflow/tensorflow:latest-gpu bash -c 'bash'
However, whenever I run the following Python commands:
>>> import tensorflow as tf
>>> print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
This is the output I get (inside the container):
2023-06-26 13:10:46.768093: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:266] failed call to cuInit: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW
2023-06-26 13:10:46.768177: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:168] retrieving CUDA diagnostic information for host: 466cc7912253
2023-06-26 13:10:46.768189: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:175] hostname: 466cc7912253
2023-06-26 13:10:46.768314: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:199] libcuda reported version is: NOT_FOUND: was unable to find libcuda.so DSO loaded into this program
2023-06-26 13:10:46.768355: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:203] kernel reported version is: 515.65.1
Num GPUs Available: 0
Here is what nvidia-smi shows, both outside and inside the container:
## nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:04:00.0 Off | N/A |
| 0% 37C P0 58W / 250W | 0MiB / 11264MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... Off | 00000000:05:00.0 Off | N/A |
| 0% 35C P0 59W / 250W | 0MiB / 11264MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce ... Off | 00000000:84:00.0 Off | N/A |
| 0% 38C P0 60W / 250W | 0MiB / 11264MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce ... Off | 00000000:85:00.0 Off | N/A |
| 0% 32C P0 59W / 250W | 0MiB / 11264MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA GeForce ... Off | 00000000:88:00.0 Off | N/A |
| 0% 26C P0 58W / 250W | 0MiB / 11264MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA GeForce ... Off | 00000000:89:00.0 Off | N/A |
| 0% 28C P0 57W / 250W | 0MiB / 11264MiB | 1% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I can also see that a CUDA toolkit is installed:
## nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
So what can I do to make the container find CUDA and run my code on the GPU?
Answer 1
Score: 2
The tensorflow/tensorflow:latest-gpu tag pulls the latest TensorFlow release, which is 2.13.0 (as of 07/08/23). That release requires CUDA 11.8, which your current setup does not support: your nvidia-smi output reports CUDA 11.7 for driver 515.65.01. You can see this in the error message:
tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:266] failed call to cuInit: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW
Check the latest CUDA version that your graphics card and driver support, then use a TensorFlow version compatible with that CUDA version: https://www.tensorflow.org/install/source#gpu
I hit the same issue with tensorflow/tensorflow:latest-gpu, but tensorflow/tensorflow:2.4.3-gpu works. As that TensorFlow version suggests, my GPU is slightly older.
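As a rough illustration of that version check, here is a minimal Python sketch that reads the "CUDA Version" field from an nvidia-smi header line and compares it with the CUDA version an image needs. The helper names (driver_cuda_version, image_is_supported) are hypothetical, and the strict "at least" comparison is a simplifying assumption: it ignores CUDA minor-version compatibility. The (11, 0) value used for the TensorFlow 2.4 line is an assumption taken from the compatibility table linked above.

```python
import re


def driver_cuda_version(nvidia_smi_text):
    """Return the (major, minor) CUDA version found in nvidia-smi output, or None."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def image_is_supported(nvidia_smi_text, required_cuda):
    """True if the driver reports a CUDA version at least as new as required_cuda.

    Assumption: a plain tuple comparison is a good enough first check.
    """
    found = driver_cuda_version(nvidia_smi_text)
    return found is not None and found >= required_cuda


# Header line copied from the question's nvidia-smi output.
header = "| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |"

print(driver_cuda_version(header))          # (11, 7)
print(image_is_supported(header, (11, 8)))  # False: the 2.13.0 image wants CUDA 11.8
print(image_is_supported(header, (11, 0)))  # True: assumed requirement of the 2.4 line
```

If the check fails, the two options that follow from the above are to pick an older image tag or to update the host driver so that it reports a newer CUDA version.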
Answer 2
Score: 1
It still seems like the Docker container is not able to access the GPUs on the host machine even though you are passing --gpus all as an argument. A few things to try:
- Make sure the NVIDIA Container Toolkit is installed on the host; Docker needs it to access NVIDIA GPUs. You can install it with:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
- Pass additional runtime arguments to expose the GPUs:
--runtime=nvidia \
--gpus all \
-e NVIDIA_VISIBLE_DEVICES=all
- Make sure the TensorFlow Docker image has GPU support.
- Double check GPU drivers are up-to-date on the host machine.
Some combination of these steps should allow the container to access the GPUs properly...
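To try these flags without retyping long command lines, a small Python sketch like the one below can assemble the docker run invocation. It is only an illustration: build_docker_command is a hypothetical helper, the image tag is the one reported as working in the other answer, and it assumes a python executable is on the image's PATH. The script prints the command rather than running it.

```python
import shlex


def build_docker_command(image, python_code, use_nvidia_runtime=False):
    """Build an argv list for a one-off GPU check inside a container."""
    cmd = ["docker", "run", "--rm", "--gpus", "all"]
    if use_nvidia_runtime:
        # Extra flags suggested in this answer; only add them when requested.
        cmd += ["--runtime=nvidia", "-e", "NVIDIA_VISIBLE_DEVICES=all"]
    cmd += [image, "python", "-c", python_code]
    return cmd


# The same check the question runs interactively, as a one-liner.
check = "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

cmd = build_docker_command("tensorflow/tensorflow:2.4.3-gpu", check)
print(shlex.join(cmd))  # copy-paste into a shell, or pass cmd to subprocess.run
```

Building the command as a list avoids shell-quoting mistakes, and swapping the image tag makes it easy to compare several TensorFlow versions against the same host driver.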