Impact of growing manifest file in gcloud storage cp
I am copying a large number of files from a source bucket to a destination bucket, where the source bucket is encrypted with AES256. `gcloud storage cp` is the fastest option to achieve this, and we can pass encryption keys. However, I want to skip files that have already been copied, and there is a way to do that by passing a manifest file.

My concern is what happens when this manifest file grows larger. For example, transferring 3.5 GiB of data spread across 837,136 files created a manifest file of roughly 278 MB. Currently, the data transfer service doesn't support transfers where the source bucket is encrypted with AES256.
Question
So for transferring terabytes of data, this file will become even bigger. The question is: how does `gcloud storage cp` handle and read this file? Will the size of the manifest file become a bottleneck and cause memory or throttling issues? Is there any documentation on how gcloud storage handles this?
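For a rough sense of scale, here is a minimal sketch (illustration only; this is not how `gcloud storage cp` is implemented, and it assumes a CSV-style manifest with the source object name in the first column, similar to gsutil's `-L` log) that streams the manifest line by line and keeps only the already-copied names in memory:

```python
# Illustration only: estimate the memory cost of tracking already-copied
# objects from a CSV-style manifest (source object name assumed to be in the
# first column). Not the actual gcloud storage implementation.
import csv
import sys

def load_copied_names(manifest_path):
    """Stream the manifest line by line and keep only the object names."""
    copied = set()
    with open(manifest_path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row if present
        for row in reader:
            if row:
                copied.add(row[0])  # first column: source object name
    return copied

if __name__ == "__main__":
    names = load_copied_names(sys.argv[1])
    # Rough size of the set plus the strings it holds, in MiB.
    approx = sys.getsizeof(names) + sum(sys.getsizeof(n) for n in names)
    print(f"{len(names)} entries, ~{approx / 2**20:.0f} MiB resident")
```

Even with hundreds of thousands of entries, the working set here is only on the order of the total length of the object names, which is why I want to understand whether gcloud does something comparable.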
Answer 1
Score: 1
Based on this Google blog on Faster Cloud Storage transfers using the gcloud command-line:
> When transferring a single large file, the difference is even more pronounced. With a 10GB file, `gcloud storage` was 94% faster than `gsutil` on download and 57% faster on upload. This performance improvement comes without the need for extensive testing and tweaking, making it easy to see much faster transfer times.
Also, `gcloud storage cp` takes advantage of parallel composite uploads, wherein a file is divided into 32 chunks and uploaded in parallel to temporary objects; the final object is then recreated from the temporary objects, and the temporary objects are deleted.
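To make that mechanism concrete, here is a minimal sketch of the same idea using the `google-cloud-storage` Python client: chunks are uploaded in parallel as temporary objects, combined with a compose call, and then deleted. The bucket and file names are placeholders, and this only illustrates the technique; it is not gcloud's internal code.

```python
# Sketch of a parallel composite upload with the google-cloud-storage client.
# Placeholder bucket/object names; not gcloud's internal implementation.
from concurrent.futures import ThreadPoolExecutor
from google.cloud import storage

CHUNK_SIZE = 256 * 1024 * 1024  # split the local file into 256 MiB pieces

def upload_chunk(bucket, name, data):
    blob = bucket.blob(name)
    blob.upload_from_string(data)
    return blob

def parallel_composite_upload(bucket_name, local_path, final_name):
    bucket = storage.Client().bucket(bucket_name)
    # Read the chunks up front (kept simple here; a real tool would stream).
    chunks = []
    with open(local_path, "rb") as f:
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            chunks.append(data)
    # Upload each chunk in parallel as a temporary object.
    with ThreadPoolExecutor(max_workers=8) as pool:
        temp_blobs = list(
            pool.map(
                lambda i: upload_chunk(bucket, f"{final_name}.part{i}", chunks[i]),
                range(len(chunks)),
            )
        )
    # Recreate the final object from the temporary objects, then clean up.
    # Note: compose accepts at most 32 source objects per call.
    final_blob = bucket.blob(final_name)
    final_blob.compose(temp_blobs)
    for blob in temp_blobs:
        blob.delete()

# parallel_composite_upload("my-bucket", "bigfile.bin", "bigfile.bin")
```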
With regard to bottlenecks, it is suggested to avoid the sequential naming bottleneck, as this can cause upload speed issues: the majority of your connections will be directed to the same shard because the filenames are so similar. A simple solution is to rename your folder or file structure so that the names are no longer sequential.
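One common way to break up sequential names is to prepend a short hash of each name so uploads spread across shards. This is just a sketch with made-up object names, not an official recipe:

```python
# Sketch: spread sequentially named objects across shards by prepending a
# short hash prefix. Object names here are made up for illustration.
import hashlib

def sharded_name(name, prefix_len=4):
    """Prefix the object name with a few hex chars of its MD5 hash."""
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{name}"

# Sequential names like image_000001.jpg, image_000002.jpg all start with the
# same bytes; the hashed prefix makes the keyspace non-linear.
for i in range(3):
    print(sharded_name(f"image_{i:06d}.jpg"))
```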
Here is some documentation that you may find useful and can test on your projects:
It is also recommended to perform resumable uploads, as this is very important in case there is a network or connection interruption and you don't want to start uploading chunks of data all over again.
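As a sketch of what that can look like with the `google-cloud-storage` Python client (placeholder names; the client already switches to resumable uploads for larger files by default), setting a chunk size makes the upload go through the resumable media API in retryable pieces:

```python
# Sketch: force a chunked, resumable upload with the google-cloud-storage
# client. Bucket and file names are placeholders.
from google.cloud import storage

def resumable_upload(bucket_name, local_path, object_name):
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(object_name)
    # Setting a chunk size makes the client use the resumable upload API,
    # sending the file in retryable pieces (must be a multiple of 256 KiB).
    blob.chunk_size = 8 * 1024 * 1024
    blob.upload_from_filename(local_path)

# resumable_upload("my-bucket", "bigfile.bin", "bigfile.bin")
```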