Question

I currently use Vertex AI custom training, set up as follows:

Custom training (PyTorch) with the dataset stored in GCS.
Every time Vertex AI launches a training job, it clones and shards my data into a staging bucket.
My training job loads the data into my training application with TorchData, streaming it from the staging bucket (example: https://pytorch.org/data/beta/dp_tutorial.html#accessing-google-cloud-storage-gcs-with-fsspec-datapipes).

However, when doing so I notice bouts of 0% GPU utilisation (whereas GPU memory usage stays constant at about 80%). I presume that's because of an I/O bottleneck, since the data is being piped from a remote GCS bucket.

What's the most efficient way of loading data into my training application? Would it be to download my data into my training container and then load it locally, rather than piping it from a GCS bucket?
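
One way to check the suspected I/O bottleneck before changing the pipeline is to time how long the training loop waits for each batch versus how long it spends computing. The following is a minimal sketch using only the standard library; `measure_data_wait` and `slow_loader` are hypothetical names introduced for illustration, and the sleeping loader merely stands in for a pipeline that streams from a remote bucket.

```python
import time


def measure_data_wait(batches, train_step):
    """Return (seconds spent waiting for data, seconds spent in train_step)."""
    wait_s = 0.0
    compute_s = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # blocks while the loader fetches the next batch
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)  # forward/backward pass in a real job
        t2 = time.perf_counter()
        wait_s += t1 - t0
        compute_s += t2 - t1
    return wait_s, compute_s


def slow_loader(n, delay_s):
    """Hypothetical stand-in for a loader that streams from remote storage."""
    for i in range(n):
        time.sleep(delay_s)  # simulated network latency per batch
        yield i


if __name__ == "__main__":
    wait_s, compute_s = measure_data_wait(
        slow_loader(5, 0.05), lambda batch: time.sleep(0.01)
    )
    print(f"waiting on data: {wait_s:.2f}s, computing: {compute_s:.2f}s")
```

If the waiting time dominates, the loader rather than the GPU is the limiting factor. In a real PyTorch job, CUDA kernels run asynchronously, so the step function would need to call `torch.cuda.synchronize()` before returning for the compute time to be meaningful.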

Answer 1

Score: 1

I found this blog post from GCP that answers the question:

https://cloud.google.com/blog/products/ai-machine-learning/efficient-pytorch-training-with-vertex-ai

TL;DR - use torchdata.datapipes.iter.WebDataset.
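
For context, WebDataset-style loaders read tar shards in which files sharing a basename form one sample, so many small objects become a few large sequential reads. The sketch below illustrates only that shard layout with the standard library `tarfile` module; it does not use torchdata or reproduce the blog post's code, and `write_shard`, `read_shard`, and the sample keys are hypothetical names chosen for illustration.

```python
import io
import os
import tarfile
import tempfile


def write_shard(path, samples):
    """Write samples to a tar shard; files sharing a key form one sample."""
    with tarfile.open(path, "w") as tar:
        for key, fields in samples:
            for ext, payload in fields.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


def read_shard(path):
    """Yield (key, {extension: bytes}) by reading the shard sequentially."""
    key, fields = None, {}
    with tarfile.open(path, "r|") as tar:  # "r|" is stream mode: no seeking
        for member in tar:
            if not member.isfile():
                continue
            stem, ext = member.name.split(".", 1)
            if stem != key and fields:
                yield key, fields  # key changed, so the previous sample is complete
                fields = {}
            key = stem
            fields[ext] = tar.extractfile(member).read()
    if fields:
        yield key, fields


if __name__ == "__main__":
    shard = os.path.join(tempfile.mkdtemp(), "shard-000000.tar")
    write_shard(
        shard,
        [
            ("0000", {"jpg": b"fake-image", "cls": b"3"}),
            ("0001", {"jpg": b"another", "cls": b"7"}),
        ],
    )
    for key, fields in read_shard(shard):
        print(key, sorted(fields))
```

Because the reader never seeks, the same loop works over any byte stream, which is what makes this layout suit streaming from object storage such as GCS.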
