Azure ML experiment using custom GPU CUDA environment
Question
During the last week I have been trying to create a Python experiment in Azure ML Studio. The job consists of training a PyTorch (1.12.1) neural network using a custom environment with CUDA 11.6 for GPU acceleration. However, when attempting any operation that moves a tensor to the GPU, I get a runtime error:
device = torch.device("cuda")
test_tensor = torch.rand((3, 4), device="cpu")
test_tensor.to(device)
CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
I have tried to set CUDA_LAUNCH_BLOCKING=1, but this does not change the result.
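For reference, a minimal sketch of setting the flag from inside the script itself; the variable must be set before the CUDA context is created, i.e. before the first CUDA call:

import os
# Must happen before the first CUDA call, or the setting is ignored
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
import torch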
I have also tried to check if CUDA is available:
print(f"Is cuda available? {torch.cuda.is_available()}")
print(f"Which is the current device? {torch.cuda.current_device()}")
print(f"How many devices do we have? {torch.cuda.device_count()}")
print(f"How is the current device named? {torch.cuda.get_device_name(torch.cuda.current_device())}")
and the result is completely normal:
Is cuda available? True
Which is the current device? 0
How many devices do we have? 1
How is the current device named? Tesla K80
I also tried to downgrade and change the CUDA, Torch and Python versions, but this does not seem to affect the error.
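A quick way to verify which versions actually ended up inside the container is to print them from the job itself (a small sketch; note that torch.version.cuda is the CUDA version the PyTorch wheel was built against, not the driver version on the host):

import sys
import torch

print(f"Python: {sys.version}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA (build): {torch.version.cuda}")       # toolkit the wheel was compiled for
print(f"cuDNN: {torch.backends.cudnn.version()}")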
As far as I can tell, this error appears only when using a custom environment. When a curated environment is used, the script runs with no problem. However, since the script needs some libraries such as OpenCV, I am forced to use a custom Dockerfile to create my environment, which you can read here for reference:
FROM mcr.microsoft.com/azureml/aifx/stable-ubuntu2004-cu116-py39-torch1121:biweekly.202301.1
USER root
RUN apt update
# Necessary dependencies for OpenCV
RUN apt install ffmpeg libsm6 libxext6 libgl1-mesa-glx -y
RUN pip install numpy matplotlib pandas opencv-python Pillow scipy tqdm mlflow joblib onnx ultralytics
RUN pip install 'ipykernel~=6.0' \
'azureml-core' \
'azureml-dataset-runtime' \
'azureml-defaults' \
'azure-ml' \
'azure-ml-component' \
'azureml-mlflow' \
'azureml-telemetry' \
'azureml-contrib-services'
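# The lines below are copied verbatim from one of Azure's curated environments (see the note after this Dockerfile)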
COPY --from=mcr.microsoft.com/azureml/o16n-base/python-assets:20220607.v1 /artifacts /var/
RUN /var/requirements/install_system_requirements.sh && \
cp /var/configuration/rsyslog.conf /etc/rsyslog.conf && \
cp /var/configuration/nginx.conf /etc/nginx/sites-available/app && \
ln -sf /etc/nginx/sites-available/app /etc/nginx/sites-enabled/app && \
rm -f /etc/nginx/sites-enabled/default
ENV SVDIR=/var/runit
ENV WORKER_TIMEOUT=400
EXPOSE 5001 8883 8888
The code from the COPY statement onwards is copied from one of the curated environments already predefined by Azure. I would like to highlight that I also tried using the Dockerfile provided with one of these environments, without any modification, and I got the same result.
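For context, this is roughly how the Dockerfile above can be registered as an Azure ML environment with the v1 Python SDK (a sketch under my setup assumptions; the environment name is illustrative):

from azureml.core import Environment, Workspace

ws = Workspace.from_config()

env = Environment(name="pytorch-cuda116-custom")   # illustrative name
env.docker.base_image = None                       # build from the Dockerfile instead
env.docker.base_dockerfile = "./Dockerfile"        # path to the Dockerfile shown above
env.python.user_managed_dependencies = True        # the image already provides all Python deps
env.register(workspace=ws)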
Hence, my question is: How can I run a CUDA job using a custom environment? Is it possible?
I have tried to find a solution for this, but I have not been able to find anyone with the same problem, nor any place in the Microsoft documentation where I could ask about it. I hope this is not a duplicate and that one of you can help me out here.
Answer 1
Score: 5
The problem is indeed subtle and hard to debug. I suspect it has to do with the underlying hardware on which the Docker container is deployed, not with the actual custom Docker container and its corresponding dependencies.
Since you have a Tesla K80, I suspect the NC series of GPU VMs (on which the environments are deployed).
As of writing this comment (10th of February 2023), the following observation is valid (https://learn.microsoft.com/en-us/azure/machine-learning/resource-curated-environments):
> Note
>
> Currently, due to underlying cuda and cluster incompatibilities, on NC
> series only AzureML-ACPT-pytorch-1.11-py38-cuda11.3-gpu with cuda 11.3
> can be used.
Therefore, in my opinion, this can be traced back to the supported versions of CUDA + PyTorch and Python.
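One way to probe this from inside the job is to compare the card's compute capability with the GPU architectures the installed PyTorch build was compiled for (a small sketch; a Tesla K80 reports compute capability 3.7, i.e. sm_37):

import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"Device compute capability: sm_{major}{minor}")   # Tesla K80 -> sm_37
print(f"Architectures supported by this build: {torch.cuda.get_arch_list()}")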
In my case, I simply installed my dependencies via a .yaml dependency file when creating the environment, starting from this base image:
Azure container registry
mcr.microsoft.com/azureml/curated/acpt-pytorch-1.11-py38-cuda11.3-gpu:9
You can build your Docker container from this URI as the base image in order for it to work properly on Tesla K80s.
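For reference, a minimal sketch of that setup with the v1 Python SDK; the environment name and the environment.yaml file name are illustrative placeholders for the .yaml dependency file mentioned above:

from azureml.core import Environment

# Dependencies come from a conda .yaml file; CUDA/PyTorch come from the curated base image
env = Environment.from_conda_specification(
    name="pytorch-cuda113-k80",       # illustrative name
    file_path="environment.yaml",     # your .yaml dependency file
)
env.docker.base_image = "mcr.microsoft.com/azureml/curated/acpt-pytorch-1.11-py38-cuda11.3-gpu:9"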
IMPORTANT NOTE: Using this base image did work in my case; I was able to train PyTorch models.