Unable to use GPU in custom Docker container built on top of nvidia/cuda image despite --gpus all flag

Question

I am trying to run a Docker container that requires access to my host NVIDIA GPU, using the --gpus all flag to enable GPU access. When I run the container with the nvidia-smi command, I can see an active GPU, indicating that the container has access to the GPU. However, when I simply try to run TensorFlow, PyTorch, or ONNX Runtime inside the container, these libraries do not seem to be able to detect or use the GPU.

Specifically, when I run the container with the following command, I see only the CPUExecutionProvider, but not the CUDAExecutionProvider in ONNX Runtime:

  sudo docker run --gpus all mycontainer:latest

However, when I run the same container with the nvidia-smi command, I see an active GPU:

  sudo docker run --gpus all mycontainer:latest nvidia-smi

This is the nvidia-smi output:

  +-----------------------------------------------------------------------------+
  | NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
  |-------------------------------+----------------------+----------------------+
  | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
  | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
  |                               |                      |               MIG M. |
  |===============================+======================+======================|
  |   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
  | N/A   44C    P0    27W /  N/A |     10MiB /  7982MiB |      0%      Default |
  |                               |                      |                  N/A |
  +-------------------------------+----------------------+----------------------+

  +-----------------------------------------------------------------------------+
  | Processes:                                                                  |
  |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
  |        ID   ID                                                   Usage      |
  |=============================================================================|
  +-----------------------------------------------------------------------------+
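A working nvidia-smi only shows that the driver is visible inside the container; each framework then has its own runtime requirements. A small probe like the following (the `check` helper is my own illustration, not part of any of these libraries) reports per-framework GPU visibility without crashing on packages that are not installed:

```python
import importlib

def check(module_name, probe):
    """Import a module by name and run a probe on it; report missing modules."""
    try:
        mod = importlib.import_module(module_name)
    except ImportError:
        return "not installed"
    return probe(mod)

# Each probe uses that framework's own GPU-detection call.
print("torch:", check("torch", lambda m: m.cuda.is_available()))
print("tensorflow:", check("tensorflow", lambda m: m.config.list_physical_devices("GPU")))
print("onnxruntime:", check("onnxruntime", lambda m: m.get_available_providers()))
```

Run inside the container, this makes it easy to see at a glance which frameworks fail and in what way (missing package vs. installed but CPU-only).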

And this is the Dockerfile I built mycontainer with:

  FROM nvidia/cuda:11.5.0-base-ubuntu20.04
  WORKDIR /home
  COPY requirements.txt /home/requirements.txt
  # Add the deadsnakes PPA for Python 3.10
  RUN apt-get update && \
      apt-get install -y software-properties-common libgl1-mesa-glx cmake protobuf-compiler && \
      add-apt-repository ppa:deadsnakes/ppa && \
      apt-get update
  # Install Python 3.10 and dev packages
  RUN apt-get update && \
      apt-get install -y python3.10 python3.10-dev python3-pip && \
      rm -rf /var/lib/apt/lists/*
  # Install virtualenv
  RUN pip3 install virtualenv
  # Create a virtual environment with Python 3.10
  RUN virtualenv -p python3.10 venv
  # Activate the virtual environment
  ENV PATH="/home/venv/bin:$PATH"
  # Install Python dependencies
  RUN pip3 install --upgrade pip \
      && pip3 install --default-timeout=10000000 torch torchvision --extra-index-url https://download.pytorch.org/whl/cu116 \
      && pip3 install --default-timeout=10000000 -r requirements.txt
  # Copy files
  COPY /src /home/src
  # Set the PYTHONPATH and LD_LIBRARY_PATH environment variables to include the CUDA libraries
  ENV PYTHONPATH=/usr/local/cuda-11.5/lib64
  ENV LD_LIBRARY_PATH=/usr/local/cuda-11.5/lib64
  # Set the CUDA_PATH and CUDA_HOME environment variables to point to the CUDA installation directory
  ENV CUDA_PATH=/usr/local/cuda-11.5
  ENV CUDA_HOME=/usr/local/cuda-11.5
  # Set the default command
  CMD ["sh", "-c", ". /home/venv/bin/activate && python main.py $@"]
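As an aside, the `sh -c` form of the CMD above has a subtlety unrelated to GPU visibility: the first argument after the script string becomes `$0`, not part of `$@`, so the first extra argument passed to `docker run` is silently dropped. A quick sketch of the behaviour:

```shell
# The first trailing word fills $0 and is silently dropped from "$@".
sh -c 'echo "$@"' a b c     # prints: b c
# Conventional fix: pass a placeholder name for $0.
sh -c 'echo "$@"' sh a b c  # prints: a b c
```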

I have checked that the versions of TensorFlow, PyTorch, and ONNX Runtime that I am using are compatible with the version of CUDA installed on my system. I have also made sure to set the LD_LIBRARY_PATH environment variable correctly to include the path to the CUDA libraries. Finally, I have made sure to include the --gpus all flag when starting the container, and to properly configure the NVIDIA Docker runtime and device plugin. Despite these steps, I am still unable to access the GPU inside the container when using TensorFlow, PyTorch, or ONNX Runtime. What could be causing this issue, and how can I resolve it? Please let me know if you need further information.
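Since the Dockerfile points LD_LIBRARY_PATH, CUDA_PATH, and CUDA_HOME at /usr/local/cuda-11.5, it is worth confirming those directories actually exist in the image: the nvidia/cuda `*-base` images ship only a minimal CUDA setup, not the full set of runtime libraries (such as cuBLAS and cuDNN) that these frameworks load at import time. A small sanity check, using a `missing_paths` helper of my own sketching, can be run inside the container:

```python
import os

def missing_paths(var_names, environ=None):
    """Return (variable, path) pairs whose path is not an existing directory."""
    environ = os.environ if environ is None else environ
    missing = []
    for name in var_names:
        for path in environ.get(name, "").split(os.pathsep):
            if path and not os.path.isdir(path):
                missing.append((name, path))
    return missing

# Check the variables the Dockerfile sets; an empty list means all paths exist.
print(missing_paths(["LD_LIBRARY_PATH", "CUDA_PATH", "CUDA_HOME"]))
```

If this reports `/usr/local/cuda-11.5/lib64` as missing, the environment variables are pointing at libraries that were never installed in the image.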

Answer 1

Score: 3

You should install onnxruntime-gpu to get the CUDAExecutionProvider:

  docker run --gpus all -it nvcr.io/nvidia/pytorch:22.12-py3 bash
  pip install onnxruntime-gpu
  python3 -c "import onnxruntime as rt; print(rt.get_device())"
  GPU

huangapple
  • Posted on 2023-03-01 11:15:45
  • Please retain this link when reposting: https://go.coder-hub.com/75599261.html