March 1, 2023, 11:15:45

Unable to use GPU in custom Docker container built on top of nvidia/cuda image despite --gpus all flag

Question

I am trying to run a Docker container that needs access to my host's NVIDIA GPU, using the --gpus all flag to enable GPU access. When I run nvidia-smi in the container, it shows an active GPU, which indicates that the container can see the device. However, when I run TensorFlow, PyTorch, or ONNX Runtime inside the container, these libraries do not detect or use the GPU.
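A small probe run inside the container can show what each framework reports, rather than relying on nvidia-smi alone. The following is a minimal sketch, not part of the original setup: the function name probe_gpu_support is hypothetical, and it only uses calls that exist in the respective libraries (torch.cuda.is_available, tf.config.list_physical_devices, onnxruntime.get_available_providers). A framework that is not installed is reported as such instead of raising.

```python
import importlib


def probe_gpu_support():
    """Return a dict mapping framework name to what it reports about the GPU."""

    # One small probe per framework, so a failure in one does not stop the others.
    def torch_probe(mod):
        return f"cuda available: {mod.cuda.is_available()}"

    def tensorflow_probe(mod):
        return f"GPUs: {mod.config.list_physical_devices('GPU')}"

    def onnxruntime_probe(mod):
        return f"providers: {mod.get_available_providers()}"

    probes = {
        "torch": torch_probe,
        "tensorflow": tensorflow_probe,
        "onnxruntime": onnxruntime_probe,
    }

    report = {}
    for name, probe in probes.items():
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = "not installed"
            continue
        except Exception as exc:  # e.g. a shared library that fails to load
            report[name] = f"import failed: {exc!r}"
            continue
        try:
            report[name] = probe(module)
        except Exception as exc:
            report[name] = f"probe failed: {exc!r}"
    return report


if __name__ == "__main__":
    for framework, status in probe_gpu_support().items():
        print(f"{framework}: {status}")
```

Running this as the container's command (with --gpus all) gives one line per framework, which makes it easier to tell whether all three fail in the same way or only one of them does.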

Specifically, when I run the container with the following command, ONNX Runtime lists only the CPUExecutionProvider and not the CUDAExecutionProvider:

sudo docker run --gpus all mycontainer:latest

However, when I run the same container with the nvidia-smi command, the GPU is listed as active:

sudo docker run --gpus all mycontainer:latest nvidia-smi

This is the nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   44C    P0    27W /  N/A |     10MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

And this is the Dockerfile I built mycontainer with:

FROM nvidia/cuda:11.5.0-base-ubuntu20.04
WORKDIR /home
COPY requirements.txt /home/requirements.txt
# Add the deadsnakes PPA for Python 3.10
RUN apt-get update && \
    apt-get install -y software-properties-common libgl1-mesa-glx cmake protobuf-compiler && \
    add-apt-repository ppa:deadsnakes/ppa && \
    apt-get update
# Install Python 3.10 and dev packages
RUN apt-get update && \
    apt-get install -y python3.10 python3.10-dev python3-pip  && \
    rm -rf /var/lib/apt/lists/*
# Install virtualenv
RUN pip3 install virtualenv
# Create a virtual environment with Python 3.10
RUN virtualenv -p python3.10 venv
# Activate the virtual environment
ENV PATH="/home/venv/bin:$PATH"
# Install Python dependencies
RUN pip3 install --upgrade pip \
    && pip3 install --default-timeout=10000000 torch torchvision --extra-index-url https://download.pytorch.org/whl/cu116 \
    && pip3 install --default-timeout=10000000 -r requirements.txt
# Copy files
COPY /src /home/src
# Set the PYTHONPATH and LD_LIBRARY_PATH environment variable to include the CUDA libraries
ENV PYTHONPATH=/usr/local/cuda-11.5/lib64
ENV LD_LIBRARY_PATH=/usr/local/cuda-11.5/lib64
# Set the CUDA_PATH and CUDA_HOME environment variable to point to the CUDA installation directory
ENV CUDA_PATH=/usr/local/cuda-11.5
ENV CUDA_HOME=/usr/local/cuda-11.5
# Set the default command
CMD ["sh", "-c", ". /home/venv/bin/activate && python main.py $@"]

I have checked that the versions of TensorFlow, PyTorch, and ONNX Runtime I am using are compatible with the CUDA version installed on my system. I have also set the LD_LIBRARY_PATH environment variable to include the path to the CUDA libraries. Finally, I include the --gpus all flag when starting the container, and I have configured the NVIDIA Docker runtime and device plugin. Despite these steps, I am still unable to access the GPU inside the container when using TensorFlow, PyTorch, or ONNX Runtime. What could be causing this issue, and how can I resolve it? Please let me know if you need further information.
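One way to verify the LD_LIBRARY_PATH claim is to list which CUDA libraries are actually present in those directories. The sketch below is an illustration under assumptions: the helper name find_cuda_libraries and the default library names (libcudart, libcublas, libcudnn) are choices made here, not taken from the Dockerfile. As far as I know, the base flavor of the nvidia/cuda images ships only a minimal CUDA runtime, so libraries such as cuBLAS and cuDNN may be missing even though nvidia-smi works.

```python
import glob
import os


def find_cuda_libraries(search_path, names=("libcudart", "libcublas", "libcudnn")):
    """Map each library name to the matching .so files found on search_path.

    search_path uses the LD_LIBRARY_PATH format: directories joined by
    os.pathsep (':' on Linux).
    """
    found = {name: [] for name in names}
    for directory in filter(None, search_path.split(os.pathsep)):
        for name in names:
            # glob.escape guards against directories containing glob characters.
            pattern = os.path.join(glob.escape(directory), name + ".so*")
            found[name].extend(sorted(glob.glob(pattern)))
    return found


if __name__ == "__main__":
    path = os.environ.get("LD_LIBRARY_PATH", "")
    for name, files in find_cuda_libraries(path).items():
        print(name, "->", files or "not found")
```

An empty result for a library means that a framework relying on the system copy of it cannot load it from those directories, regardless of what the environment variables say.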

Answer 1

Score: 3

You should install onnxruntime-gpu to get CUDAExecutionProvider.

docker run --gpus all -it nvcr.io/nvidia/pytorch:22.12-py3 bash
pip install onnxruntime-gpu
python3 -c "import onnxruntime as rt; print(rt.get_device())"
GPU
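
To confirm which ONNX Runtime package ended up in the image, something like the following sketch can be run inside the container. The helper names installed_onnxruntime_packages and has_cuda_provider are hypothetical; the only ONNX Runtime call used is get_available_providers. The assumption here is that requirements.txt may pull in the CPU-only onnxruntime package, which provides the same onnxruntime module as onnxruntime-gpu.

```python
from importlib import metadata


def installed_onnxruntime_packages():
    """Return {distribution name: version} for installed onnxruntime* packages."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name and name.lower().startswith("onnxruntime"):
            packages[name] = dist.version
    return packages


def has_cuda_provider(providers):
    """True if the CUDA execution provider appears in a provider list."""
    return "CUDAExecutionProvider" in providers


if __name__ == "__main__":
    print("installed:", installed_onnxruntime_packages() or "none")
    try:
        import onnxruntime as ort
    except ImportError:
        print("onnxruntime is not importable")
    else:
        providers = ort.get_available_providers()
        print("providers:", providers)
        print("CUDA provider present:", has_cuda_provider(providers))
```

If both onnxruntime and onnxruntime-gpu are listed, the usual advice, as I understand it, is to uninstall both and reinstall only onnxruntime-gpu, since the two packages install the same module.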
