Azure ML experiment using custom GPU CUDA environment

During the last week I have been trying to create a Python experiment in Azure ML studio. The job consists of training a PyTorch (1.12.1) neural network using a custom environment with CUDA 11.6 for GPU acceleration. However, when attempting any tensor movement operation I get a RuntimeError:

```python
device = torch.device("cuda")
test_tensor = torch.rand((3, 4), device="cpu")
test_tensor.to(device)
```

```
CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```

I have tried to set CUDA_LAUNCH_BLOCKING=1, but this does not change the result.
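One subtlety worth noting here: `CUDA_LAUNCH_BLOCKING` only takes effect if it is present in the process environment before the CUDA runtime initializes, so setting it inside the script after CUDA has already been touched does nothing. A minimal sketch of setting it at the very top of the training script:

```python
import os

# CUDA_LAUNCH_BLOCKING must be in the environment before the CUDA
# runtime initializes, i.e. before the first CUDA call (in practice,
# set it at the top of the script, before `import torch`).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

Alternatively, the variable can be set in the job's environment-variable configuration so it is already present when the interpreter starts.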

I have also tried to check if CUDA is available:

```python
print(f"Is cuda available? {torch.cuda.is_available()}")
print(f"Which is the current device? {torch.cuda.current_device()}")
print(f"How many devices do we have? {torch.cuda.device_count()}")
print(f"How is the current device named? {torch.cuda.get_device_name(torch.cuda.current_device())}")
```

and the result is completely normal:

```
Is cuda available? True
Which is the current device? 0
How many devices do we have? 1
How is the current device named? Tesla K80
```

I also tried to downgrade and change the CUDA, Torch and Python versions, but this does not seem to affect the error.

As far as I can tell, this error appears only when using a custom environment. When a curated environment is used, the script runs with no problem. However, as the script needs some libraries like OpenCV, I am forced to use a custom Dockerfile to create my environment, which you can read here for reference:

```dockerfile
FROM mcr.microsoft.com/azureml/aifx/stable-ubuntu2004-cu116-py39-torch1121:biweekly.202301.1
USER root
RUN apt update
# Necessary dependencies for OpenCV
RUN apt install ffmpeg libsm6 libxext6 libgl1-mesa-glx -y
RUN pip install numpy matplotlib pandas opencv-python Pillow scipy tqdm mlflow joblib onnx ultralytics
RUN pip install 'ipykernel~=6.0' \
    'azureml-core' \
    'azureml-dataset-runtime' \
    'azureml-defaults' \
    'azure-ml' \
    'azure-ml-component' \
    'azureml-mlflow' \
    'azureml-telemetry' \
    'azureml-contrib-services'
COPY --from=mcr.microsoft.com/azureml/o16n-base/python-assets:20220607.v1 /artifacts /var/
RUN /var/requirements/install_system_requirements.sh && \
    cp /var/configuration/rsyslog.conf /etc/rsyslog.conf && \
    cp /var/configuration/nginx.conf /etc/nginx/sites-available/app && \
    ln -sf /etc/nginx/sites-available/app /etc/nginx/sites-enabled/app && \
    rm -f /etc/nginx/sites-enabled/default
ENV SVDIR=/var/runit
ENV WORKER_TIMEOUT=400
EXPOSE 5001 8883 8888
```

The COPY statement is taken from one of the curated environments predefined by Azure. I would like to highlight that I also tried the Dockerfile from one of these curated environments without any modification, and I get the same result.

Hence, my question is: How can I run a CUDA job using a custom environment? Is it possible?

I have tried to find a solution for this, but I have not been able to find anyone with the same problem, nor any place in the Microsoft documentation where I could ask about it. I hope this is not a duplicate and that any of you can help me out here.

Answer 1

Score: 5

The problem is indeed subtle and hard to debug. I suspect it has to do with the underlying hardware on which the Docker container is deployed, not with the actual custom Docker container and its corresponding dependencies.

Since you have a Tesla K80, I suspect you are running on NC-series VMs (the hardware on which these environments are deployed).

As of this writing (10 February 2023), the following note from the curated-environments documentation applies (https://learn.microsoft.com/en-us/azure/machine-learning/resource-curated-environments):

> Note
>
> Currently, due to underlying cuda and cluster incompatibilities, on NC
> series only AzureML-ACPT-pytorch-1.11-py38-cuda11.3-gpu with cuda 11.3
> can be used.

Therefore, in my opinion, this can be traced back to the supported combinations of CUDA, PyTorch, and Python versions.

In my case, I simply installed my dependencies via a .yaml dependency file when creating the environment, starting from this base image in the Azure container registry:

```
mcr.microsoft.com/azureml/curated/acpt-pytorch-1.11-py38-cuda11.3-gpu:9
```
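For reference, the .yaml dependency file in this approach is an ordinary conda environment spec layered on top of the base image. A minimal sketch (the environment name and package list are illustrative, chosen to cover the OpenCV need from the question; they are not taken from the original post):

```yaml
# conda dependency file passed when creating the Azure ML environment
# (names and packages are illustrative)
name: project-deps
channels:
  - conda-forge
dependencies:
  - pip
  - pip:
      - opencv-python
      - numpy
```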

You can build your own Docker container starting from this URI as the base image so that it works properly on Tesla K80s.
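Concretely, the question's Dockerfile could be rebased onto this image. A minimal, untested sketch, keeping only the OpenCV-related steps from the original Dockerfile:

```dockerfile
# Sketch: the custom environment rebased on the K80-compatible curated image
FROM mcr.microsoft.com/azureml/curated/acpt-pytorch-1.11-py38-cuda11.3-gpu:9
USER root
# System libraries needed by OpenCV (taken from the original Dockerfile)
RUN apt update && apt install -y ffmpeg libsm6 libxext6 libgl1-mesa-glx
RUN pip install opencv-python numpy
```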

IMPORTANT NOTE: using this base image did work in my case; I was able to train PyTorch models.

Posted by huangapple on 2023-02-08 22:34:16. Original link: https://go.coder-hub.com/75387306.html