What's the most efficient way of loading data for training?
Question
I currently use Vertex AI custom training, where:
- Custom training (PyTorch) uses a dataset stored in GCS
- Every time Vertex AI launches a training job, it clones and shards my data into a staging bucket
- My training job streams data from the staging bucket into my training application using TorchData (example: https://pytorch.org/data/beta/dp_tutorial.html#accessing-google-cloud-storage-gcs-with-fsspec-datapipes); see the sketch below
However, when doing so I notice bouts of 0% utilisation on my GPU (while my GPU memory stays at ~80% throughout). I presume this is caused by an I/O bottleneck, since the data is being piped from a remote GCS bucket.
What's the most efficient way of loading data into my training application? Would it be to download the data into my training container and then load it locally, rather than piping it from a GCS bucket?
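Roughly, my streaming pipeline looks like the following (a minimal sketch based on the linked tutorial; the bucket path is a hypothetical placeholder, and the gcsfs package must be installed for the gs:// protocol):

```python
# Streaming from GCS with TorchData fsspec datapipes.
# "gs://my-staging-bucket/data/" is a hypothetical placeholder path.
from torchdata.datapipes.iter import IterableWrapper

# List every object under the GCS prefix, then open each as a binary stream.
dp = IterableWrapper(["gs://my-staging-bucket/data/"])
dp = dp.list_files_by_fsspec()
dp = dp.open_files_by_fsspec(mode="rb")

for path, stream in dp:
    # Each element is (object path, file-like stream); every read on the
    # stream goes over the network, which is where I/O stalls can appear.
    print(path)
    break
```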
Answer 1
Score: 1
I found this blog post from GCP that answers the question:
https://cloud.google.com/blog/products/ai-machine-learning/efficient-pytorch-training-with-vertex-ai
TL;DR: use torchdata.datapipes.iter.WebDataset.
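For illustration, a minimal sketch of that approach, assuming the data has been packed into WebDataset-style .tar shards (the bucket path is a hypothetical placeholder; requires torchdata and gcsfs):

```python
# Reading WebDataset-style .tar shards from GCS with TorchData.
# "gs://my-staging-bucket/shards/" is a hypothetical placeholder path.
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

dp = IterableWrapper(["gs://my-staging-bucket/shards/"])
dp = dp.list_files_by_fsspec(masks=["*.tar"])  # only the tar shards
dp = dp.open_files_by_fsspec(mode="rb")
dp = dp.load_from_tar()    # yields (path, stream) for each tar member
dp = dp.webdataset()       # groups members sharing a key prefix into one sample dict
dp = dp.shuffle(buffer_size=100)
dp = dp.sharding_filter()  # split samples across DataLoader workers

loader = DataLoader(dp, batch_size=None, num_workers=4)
for sample in loader:
    # sample is a dict keyed by extension, e.g. {"__key__": ..., ".jpg": <stream>};
    # decode each stream as appropriate for your data format.
    break
```

The idea behind the shard format is that many small objects get packed into a few large sequential files, so each worker issues far fewer GCS requests and can keep the GPU fed.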