Azure ML experiment using custom GPU CUDA environment

Question
During the last week I have been trying to create a Python experiment in Azure ML studio. The job consists of training a PyTorch (1.12.1) neural network in a custom environment with CUDA 11.6 for GPU acceleration. However, any attempt to move a tensor to the GPU raises a runtime error:

device = torch.device("cuda")
test_tensor = torch.rand((3, 4), device="cpu")
test_tensor.to(device)

RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

I have tried to set CUDA_LAUNCH_BLOCKING=1, but this does not change the result.
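One detail worth noting: CUDA_LAUNCH_BLOCKING only takes effect if it is set in the environment before the CUDA context is created, i.e. before the first CUDA call (and safest before `import torch`). A minimal way to set it from inside the training script itself:

```python
import os

# Must run before torch initializes its CUDA context, so place this
# at the very top of the training script, before `import torch`.
# With this set, CUDA kernel launches become synchronous and errors
# are reported at the call site instead of asynchronously.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

If the variable is instead exported in the job definition's environment variables, it is guaranteed to be in place before Python starts at all.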

I have also tried to check if CUDA is available:

print(f"Is cuda available? {torch.cuda.is_available()}")
print(f"Which is the current device? {torch.cuda.current_device()}")
print(f"How many devices do we have? {torch.cuda.device_count()}")
print(f"How is the current device named? {torch.cuda.get_device_name(torch.cuda.current_device())}")

and the result is completely normal:

Is cuda available? True
Which is the current device? 0
How many devices do we have? 1
How is the current device named? Tesla K80

I also tried to downgrade and change the CUDA, Torch and Python versions, but this does not seem to affect the error.

As far as I can tell, this error appears only when using a custom environment. When a curated environment is used, the script runs with no problem. However, since the script needs some libraries like OpenCV, I am forced to use a custom Dockerfile to create my environment, which you can read here for reference:

FROM mcr.microsoft.com/azureml/aifx/stable-ubuntu2004-cu116-py39-torch1121:biweekly.202301.1


USER root
RUN apt update
# Necessary dependencies for OpenCV
RUN apt install ffmpeg libsm6 libxext6 libgl1-mesa-glx -y 

RUN pip install numpy matplotlib pandas opencv-python Pillow scipy tqdm mlflow joblib onnx ultralytics
RUN pip install 'ipykernel~=6.0' \
                'azureml-core' \
                'azureml-dataset-runtime' \
                'azureml-defaults' \
                'azure-ml' \
                'azure-ml-component' \
                'azureml-mlflow' \
                'azureml-telemetry' \
                'azureml-contrib-services'

COPY --from=mcr.microsoft.com/azureml/o16n-base/python-assets:20220607.v1 /artifacts /var/
RUN /var/requirements/install_system_requirements.sh && \
    cp /var/configuration/rsyslog.conf /etc/rsyslog.conf && \
    cp /var/configuration/nginx.conf /etc/nginx/sites-available/app && \
    ln -sf /etc/nginx/sites-available/app /etc/nginx/sites-enabled/app && \
    rm -f /etc/nginx/sites-enabled/default
ENV SVDIR=/var/runit
ENV WORKER_TIMEOUT=400
EXPOSE 5001 8883 8888

The code from the COPY statement onward is copied from one of the curated environments already predefined by Azure. I would like to highlight that I also tried the Dockerfile provided by one of these environments without any modification, and I got the same result.
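For context, a custom Dockerfile like the one above is typically registered as an Azure ML environment before it can be used in a job. A minimal sketch of an Azure ML CLI v2 environment spec that builds from a local Dockerfile (the name and paths are placeholders, not taken from the question):

```yaml
# environment.yml (Azure ML CLI v2) -- illustrative sketch
$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: custom-cuda-opencv-env   # placeholder name
build:
  path: .                      # build context containing the Dockerfile
  dockerfile_path: Dockerfile
```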

Hence, my question is: How can I run a CUDA job using a custom environment? Is it possible?

I have tried to find a solution for this, but I have not been able to find anyone with the same problem, nor any place in the Microsoft documentation where I could ask about it. I hope this is not a duplicate and that one of you can help me out here.

Answer 1

Score: 5

The problem is indeed subtle and hard to debug. I suspect it has to do with the underlying hardware on which the Docker container is deployed, not with the custom Docker container itself or its dependencies.

Since you have a Tesla K80, I suspect you are on NC-series GPUs (the hardware the environments are deployed on).

As of writing this comment (10 February 2023), the following note applies (https://learn.microsoft.com/en-us/azure/machine-learning/resource-curated-environments):

> Note
>
> Currently, due to underlying cuda and cluster incompatibilities, on NC
> series only AzureML-ACPT-pytorch-1.11-py38-cuda11.3-gpu with cuda 11.3
> can be used.

Therefore, in my opinion, this can be traced back to the supported combination of CUDA, PyTorch, and Python versions.

In my case, I simply installed my dependencies via a .yaml dependency file when creating the environment, starting from this base image:

Azure Container Registry:

mcr.microsoft.com/azureml/curated/acpt-pytorch-1.11-py38-cuda11.3-gpu:9

You can build your Docker container starting from this URI as the base image in order for it to work properly on Tesla K80s.

IMPORTANT NOTE: Using this base image did work in my case; I was able to train PyTorch models.
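The approach described above — the curated base image plus a .yaml dependency file — can be sketched roughly as follows. This is an illustrative conda dependency file (the environment name and package list are placeholders, not the answerer's actual file), attached to an Azure ML environment whose base image is set to the URI above:

```yaml
# environment.yml -- illustrative conda dependency file; attach it when
# creating an Azure ML environment whose base image is:
#   mcr.microsoft.com/azureml/curated/acpt-pytorch-1.11-py38-cuda11.3-gpu:9
name: k80-training-env          # placeholder name
channels:
  - conda-forge
dependencies:
  - python=3.8                  # matches the py38 base image
  - pip
  - pip:
      - opencv-python           # the library that motivated the custom setup
      - numpy
      - azureml-mlflow
```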

huangapple
  • Published 2023-02-08 22:34:16
  • Original link: https://go.coder-hub.com/75387306.html