Azure ML experiment using custom GPU CUDA environment

During the last week I have been trying to create a Python experiment in Azure ML studio. The job consists of training a PyTorch (1.12.1) neural network using a custom environment with CUDA 11.6 for GPU acceleration. However, when attempting any tensor movement operation I get a RuntimeError:

```python
device = torch.device("cuda")
test_tensor = torch.rand((3, 4), device="cpu")
test_tensor.to(device)
```

```
CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```

I have tried to set CUDA_LAUNCH_BLOCKING=1, but this does not change the result.
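One subtlety worth noting here: `CUDA_LAUNCH_BLOCKING` only takes effect if it is present in the process environment before the CUDA runtime initializes, so setting it inside the script after CUDA has already been touched does nothing. A minimal sketch of setting it at the very top of the training script:

```python
import os

# CUDA_LAUNCH_BLOCKING must be in the environment before the CUDA
# runtime initializes, i.e. before the first CUDA call (in practice,
# set it at the top of the script, before `import torch`).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

Alternatively, the variable can be set in the job's environment-variable configuration so it is already present when the interpreter starts.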

I have also tried to check if CUDA is available:

```python
print(f"Is cuda available? {torch.cuda.is_available()}")
print(f"Which is the current device? {torch.cuda.current_device()}")
print(f"How many devices do we have? {torch.cuda.device_count()}")
print(f"How is the current device named? {torch.cuda.get_device_name(torch.cuda.current_device())}")
```

and the result is completely normal:

```
Is cuda available? True
Which is the current device? 0
How many devices do we have? 1
How is the current device named? Tesla K80
```

I also tried to downgrade and change the CUDA, Torch and Python versions, but this does not seem to affect the error.

As far as I can tell, this error appears only when using a custom environment. When a curated environment is used, the script runs with no problem. However, as the script needs some libraries like OpenCV, I am forced to use a custom Dockerfile to create my environment, which you can read here for reference:

```dockerfile
FROM mcr.microsoft.com/azureml/aifx/stable-ubuntu2004-cu116-py39-torch1121:biweekly.202301.1
USER root
RUN apt update
# Necessary dependencies for OpenCV
RUN apt install ffmpeg libsm6 libxext6 libgl1-mesa-glx -y
RUN pip install numpy matplotlib pandas opencv-python Pillow scipy tqdm mlflow joblib onnx ultralytics
RUN pip install 'ipykernel~=6.0' \
    'azureml-core' \
    'azureml-dataset-runtime' \
    'azureml-defaults' \
    'azure-ml' \
    'azure-ml-component' \
    'azureml-mlflow' \
    'azureml-telemetry' \
    'azureml-contrib-services'
COPY --from=mcr.microsoft.com/azureml/o16n-base/python-assets:20220607.v1 /artifacts /var/
RUN /var/requirements/install_system_requirements.sh && \
    cp /var/configuration/rsyslog.conf /etc/rsyslog.conf && \
    cp /var/configuration/nginx.conf /etc/nginx/sites-available/app && \
    ln -sf /etc/nginx/sites-available/app /etc/nginx/sites-enabled/app && \
    rm -f /etc/nginx/sites-enabled/default
ENV SVDIR=/var/runit
ENV WORKER_TIMEOUT=400
EXPOSE 5001 8883 8888
```

The COPY statement is taken from one of the curated environments predefined by Azure. I would like to highlight that I also tried the Dockerfile from one of these curated environments without any modification, and I get the same result.

Hence, my question is: How can I run a CUDA job using a custom environment? Is it possible?

I have tried to find a solution for this, but I have not been able to find anyone with the same problem, nor any place in the Microsoft documentation where I could ask about it. I hope this is not a duplicate and that any of you can help me out here.

Answer 1

Score: 5

The problem is indeed subtle and hard to debug. I suspect it has to do with the underlying hardware on which the Docker container is deployed, not with the actual custom Docker container and its corresponding dependencies.

Since you have a Tesla K80, I suspect you are running on NC-series VMs (the hardware on which these environments are deployed).

As of this writing (10 February 2023), the following note from the curated-environments documentation applies (https://learn.microsoft.com/en-us/azure/machine-learning/resource-curated-environments):

> Note
>
> Currently, due to underlying cuda and cluster incompatibilities, on NC
> series only AzureML-ACPT-pytorch-1.11-py38-cuda11.3-gpu with cuda 11.3
> can be used.

Therefore, in my opinion, this can be traced back to the supported combinations of CUDA, PyTorch, and Python versions.

In my case, I simply installed my dependencies via a .yaml dependency file when creating the environment, starting from this base image in the Azure container registry:

```
mcr.microsoft.com/azureml/curated/acpt-pytorch-1.11-py38-cuda11.3-gpu:9
```
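For reference, the .yaml dependency file in this approach is an ordinary conda environment spec layered on top of the base image. A minimal sketch (the environment name and package list are illustrative, chosen to cover the OpenCV need from the question; they are not taken from the original post):

```yaml
# conda dependency file passed when creating the Azure ML environment
# (names and packages are illustrative)
name: project-deps
channels:
  - conda-forge
dependencies:
  - pip
  - pip:
      - opencv-python
      - numpy
```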

You can build your own Docker container starting from this URI as the base image so that it works properly on Tesla K80s.
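Concretely, the question's Dockerfile could be rebased onto this image. A minimal, untested sketch, keeping only the OpenCV-related steps from the original Dockerfile:

```dockerfile
# Sketch: the custom environment rebased on the K80-compatible curated image
FROM mcr.microsoft.com/azureml/curated/acpt-pytorch-1.11-py38-cuda11.3-gpu:9
USER root
# System libraries needed by OpenCV (taken from the original Dockerfile)
RUN apt update && apt install -y ffmpeg libsm6 libxext6 libgl1-mesa-glx
RUN pip install opencv-python numpy
```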

IMPORTANT NOTE: using this base image did work in my case; I was able to train PyTorch models.

Posted by huangapple on 2023-02-08 22:34:16. Original link: https://go.coder-hub.com/75387306.html