Impact of growing manifest file in gcloud storage cp
I am copying a large number of files from a source bucket to a destination bucket, where the source bucket is encrypted with AES256. `gcloud storage cp` is the fastest option to achieve this, and we can pass encryption keys. However, I want to skip files that have already been copied, and there is a way to do that by passing a manifest file.

My concern is what happens when this manifest file grows larger. For example, transferring 3.5 GiB of data spread across 837,136 files created a manifest file of roughly 278 MB. Currently, the data transfer service doesn't support transfers where the source bucket is encrypted with AES256.
Question
So for transferring terabytes of data, this file will become even bigger. The question is: how does `gcloud storage cp` handle and read this file? Will the size of the manifest file become a bottleneck and cause memory or throttling issues? Is there any documentation on how gcloud storage handles this?
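For a rough sense of scale, here is a minimal sketch (illustration only; this is not how `gcloud storage cp` is implemented, and it assumes a CSV-style manifest with the source object name in the first column, similar to gsutil's `-L` log) that streams the manifest line by line and keeps only the already-copied names in memory:

```python
# Illustration only: estimate the memory cost of tracking already-copied
# objects from a CSV-style manifest (source object name assumed to be in the
# first column). Not the actual gcloud storage implementation.
import csv
import sys

def load_copied_names(manifest_path):
    """Stream the manifest line by line and keep only the object names."""
    copied = set()
    with open(manifest_path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row if present
        for row in reader:
            if row:
                copied.add(row[0])  # first column: source object name
    return copied

if __name__ == "__main__":
    names = load_copied_names(sys.argv[1])
    # Rough size of the set plus the strings it holds, in MiB.
    approx = sys.getsizeof(names) + sum(sys.getsizeof(n) for n in names)
    print(f"{len(names)} entries, ~{approx / 2**20:.0f} MiB resident")
```

Even with hundreds of thousands of entries, the working set here is only on the order of the total length of the object names, which is why I want to understand whether gcloud does something comparable.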
Answer 1
Score: 1
Based on this Google blog on Faster Cloud Storage transfers using the gcloud command-line:
> When transferring a single large file, the difference is even more pronounced. With a 10GB file, `gcloud storage` was 94% faster than `gsutil` on download and 57% faster on upload. This performance improvement comes without the need for extensive testing and tweaking, making it easy to see much faster transfer times.
Also, `gcloud storage cp` takes advantage of parallel composite uploads, wherein a file is divided into 32 chunks and uploaded in parallel to temporary objects; the final object is then recreated from the temporary objects, and the temporary objects are deleted.
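To make that mechanism concrete, here is a minimal sketch of the same idea using the `google-cloud-storage` Python client: chunks are uploaded in parallel as temporary objects, combined with a compose call, and then deleted. The bucket and file names are placeholders, and this only illustrates the technique; it is not gcloud's internal code.

```python
# Sketch of a parallel composite upload with the google-cloud-storage client.
# Placeholder bucket/object names; not gcloud's internal implementation.
from concurrent.futures import ThreadPoolExecutor
from google.cloud import storage

CHUNK_SIZE = 256 * 1024 * 1024  # split the local file into 256 MiB pieces

def upload_chunk(bucket, name, data):
    blob = bucket.blob(name)
    blob.upload_from_string(data)
    return blob

def parallel_composite_upload(bucket_name, local_path, final_name):
    bucket = storage.Client().bucket(bucket_name)
    # Read the chunks up front (kept simple here; a real tool would stream).
    chunks = []
    with open(local_path, "rb") as f:
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            chunks.append(data)
    # Upload each chunk in parallel as a temporary object.
    with ThreadPoolExecutor(max_workers=8) as pool:
        temp_blobs = list(
            pool.map(
                lambda i: upload_chunk(bucket, f"{final_name}.part{i}", chunks[i]),
                range(len(chunks)),
            )
        )
    # Recreate the final object from the temporary objects, then clean up.
    # Note: compose accepts at most 32 source objects per call.
    final_blob = bucket.blob(final_name)
    final_blob.compose(temp_blobs)
    for blob in temp_blobs:
        blob.delete()

# parallel_composite_upload("my-bucket", "bigfile.bin", "bigfile.bin")
```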
With regard to bottlenecks, it is suggested to avoid the sequential naming bottleneck, as this can cause upload speed issues: the majority of your connections will be directed to the same shard because the filenames are so similar. A simple solution is to rename your folder or file structure so that the names are no longer sequential.
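One common way to break up sequential names is to prepend a short hash of each name so uploads spread across shards. This is just a sketch with made-up object names, not an official recipe:

```python
# Sketch: spread sequentially named objects across shards by prepending a
# short hash prefix. Object names here are made up for illustration.
import hashlib

def sharded_name(name, prefix_len=4):
    """Prefix the object name with a few hex chars of its MD5 hash."""
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{name}"

# Sequential names like image_000001.jpg, image_000002.jpg all start with the
# same bytes; the hashed prefix makes the keyspace non-linear.
for i in range(3):
    print(sharded_name(f"image_{i:06d}.jpg"))
```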
Here is some documentation that you may find useful and can test on your projects:
It is also recommended to perform resumable uploads, as this is very important in case there is a network or connection interruption and you don't want to start uploading chunks of data all over again.
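As a sketch of what that can look like with the `google-cloud-storage` Python client (placeholder names; the client already switches to resumable uploads for larger files by default), setting a chunk size makes the upload go through the resumable media API in retryable pieces:

```python
# Sketch: force a chunked, resumable upload with the google-cloud-storage
# client. Bucket and file names are placeholders.
from google.cloud import storage

def resumable_upload(bucket_name, local_path, object_name):
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(object_name)
    # Setting a chunk size makes the client use the resumable upload API,
    # sending the file in retryable pieces (must be a multiple of 256 KiB).
    blob.chunk_size = 8 * 1024 * 1024
    blob.upload_from_filename(local_path)

# resumable_upload("my-bucket", "bigfile.bin", "bigfile.bin")
```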