June 26, 2023, 21:17:15

Cannot run Tensorflow GPU on Docker (although it seems to be installed outside of it)

Question

Here is my situation

I downloaded the tensorflow/tensorflow:latest-gpu image. I start a container from it with the following command:

docker run -it --rm \
  --ipc=host \
  --gpus all \
  --volume="/tmp/.X11-unix:/tmp/.X11-unix:rw" \
  --volume="$(pwd)/://mydir:rw" \
  --workdir="/mydir/" \
  tensorflow/tensorflow:latest-gpu bash -c 'bash'

However, whenever I run the following Python commands:

>>> import tensorflow as tf
>>> print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

Here is the output I get (inside the container):

2023-06-26 13:10:46.768093: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:266] failed call to cuInit: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW
2023-06-26 13:10:46.768177: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:168] retrieving CUDA diagnostic information for host: 466cc7912253
2023-06-26 13:10:46.768189: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:175] hostname: 466cc7912253
2023-06-26 13:10:46.768314: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:199] libcuda reported version is: NOT_FOUND: was unable to find libcuda.so DSO loaded into this program
2023-06-26 13:10:46.768355: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:203] kernel reported version is: 515.65.1
Num GPUs Available:  0

Here is what I see both outside and inside Docker:

## nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:04:00.0 Off |                  N/A |
|  0%   37C    P0    58W / 250W |      0MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:05:00.0 Off |                  N/A |
|  0%   35C    P0    59W / 250W |      0MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:84:00.0 Off |                  N/A |
|  0%   38C    P0    60W / 250W |      0MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:85:00.0 Off |                  N/A |
|  0%   32C    P0    59W / 250W |      0MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce ...  Off  | 00000000:88:00.0 Off |                  N/A |
|  0%   26C    P0    58W / 250W |      0MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce ...  Off  | 00000000:89:00.0 Off |                  N/A |
|  0%   28C    P0    57W / 250W |      0MiB / 11264MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I can also see that a CUDA version is installed:

## nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0

So what can I do to make my Docker container find CUDA and run my code on the GPU?

Answer 1

Score: 2

tensorflow/tensorflow:latest-gpu pulls the latest TensorFlow release, which is 2.13.0 (as of 07/08/23). That release requires CUDA 11.8, which your setup does not support: your nvidia-smi output shows the installed driver (515.65.01) supporting CUDA only up to 11.7. You can see this in the error message:

tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:266] failed call to cuInit: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW
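
To confirm that this error comes from the CUDA driver layer rather than from TensorFlow itself, the driver can be probed directly from inside the container. The following is only a minimal sketch: it assumes the driver library is exposed as libcuda.so.1, and the helper names probe_cuda_driver and describe are made up for illustration. The numeric code 804 is the CUresult value that corresponds to CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE.

```python
import ctypes

# Numeric values of two CUresult codes from the CUDA driver API.
CUDA_SUCCESS = 0
CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE = 804


def probe_cuda_driver(lib_name="libcuda.so.1"):
    """Return the integer result of cuInit(0), or None if the library cannot be loaded.

    lib_name is an assumption; adjust it if the driver library has another name.
    """
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError:
        return None
    lib.cuInit.argtypes = [ctypes.c_uint]
    lib.cuInit.restype = ctypes.c_int
    return int(lib.cuInit(0))


def describe(result):
    """Turn a probe result into a short human-readable message (hypothetical helper)."""
    if result is None:
        return "libcuda could not be loaded"
    if result == CUDA_SUCCESS:
        return "cuInit succeeded"
    if result == CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE:
        return "forward compatibility was attempted on non supported HW (804)"
    return f"cuInit failed with CUresult {result}"


if __name__ == "__main__":
    print(describe(probe_cuda_driver()))
```

If this prints the 804 message inside the container, the mismatch is between the CUDA libraries shipped in the image and the host driver, which is consistent with the explanation above.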

Check the latest CUDA version your setup supports (the "CUDA Version" field in the nvidia-smi output) and use a TensorFlow version compatible with that CUDA version: https://www.tensorflow.org/install/source#gpu

I ran into the same issue when I tried to run tensorflow/tensorflow:latest-gpu, but tensorflow/tensorflow:2.4.3-gpu works. I have a slightly older GPU, as you can tell from the TensorFlow version I am using.
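
As a rough illustration of that advice, the sketch below reads the "CUDA Version" field from nvidia-smi output and lists image tags whose CUDA requirement does not exceed it. Everything here is an assumption to verify: the TF_CUDA table is a small hand-copied excerpt of the tested-configurations page linked above, the tags are assumed to follow the tensorflow/tensorflow:<version>-gpu pattern, and the "CUDA less than or equal to the driver's version" rule is a conservative reading of this answer, not an official compatibility check.

```python
import re

# Assumption: CUDA versions copied by hand from the tested build
# configurations table linked above; verify them before relying on this.
TF_CUDA = {
    "2.13.0": (11, 8),
    "2.12.0": (11, 8),
    "2.11.0": (11, 2),
    "2.4.3": (11, 0),
}


def parse_cuda_version(nvidia_smi_output):
    """Extract (major, minor) from the 'CUDA Version: X.Y' header field."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def compatible_tags(driver_cuda):
    """List image tags whose CUDA requirement is not newer than driver_cuda."""
    return [
        f"tensorflow/tensorflow:{version}-gpu"
        for version, cuda in TF_CUDA.items()
        if cuda <= driver_cuda
    ]


if __name__ == "__main__":
    header = "| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |"
    print(compatible_tags(parse_cuda_version(header)))
```

With the header line from the question, this keeps only the 2.11.0 and 2.4.3 entries of the table. Updating the host driver so that it supports a newer CUDA version would widen the list.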

Answer 2

Score: 1

It still seems like the Docker container is not able to access the GPUs on the host machine even though you are passing --gpus all as an argument. A few things to try:

Make sure the Nvidia container toolkit is installed on the host - this is required for Docker to access Nvidia GPUs. You can install it with:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker

Pass additional runtime arguments to expose the GPUs:

--runtime=nvidia \
--gpus all \
-e NVIDIA_VISIBLE_DEVICES=all

Make sure the TensorFlow Docker image has GPU support.

Double check that the GPU drivers are up-to-date on the host machine.

Some combination of these steps should allow the container to access the GPUs properly...
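
One way to check whether a given combination works is a one-shot container that only lists the GPUs TensorFlow can see. The sketch below just assembles such a command from the flags mentioned above. The function name gpu_smoke_test_command and the example tag are hypothetical, and it assumes the image provides a python executable on its PATH.

```python
import shlex

# One-liner run inside the container; same check as in the question.
CHECK = "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"


def gpu_smoke_test_command(image, use_nvidia_runtime=False):
    """Build a `docker run` argument list that prints the GPUs TensorFlow sees."""
    cmd = ["docker", "run", "--rm", "--gpus", "all"]
    if use_nvidia_runtime:
        # Extra arguments suggested in this answer.
        cmd += ["--runtime=nvidia", "-e", "NVIDIA_VISIBLE_DEVICES=all"]
    cmd += [image, "python", "-c", CHECK]
    return cmd


if __name__ == "__main__":
    # Hypothetical tag; substitute whichever image you are testing.
    cmd = gpu_smoke_test_command("tensorflow/tensorflow:2.11.0-gpu", use_nvidia_runtime=True)
    print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to subprocess.run. An empty list in the container's output means TensorFlow still cannot see the GPUs.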
