Zipping large files (>15 GB) and uploading to S3 without OOM

Question

I have a memory issue while zipping large files/folders (resulting zip >15 GB) and uploading them to S3 storage. I can create the zip file on disk, append the files/folders to it, and upload that file to S3 in parts, but from my experience that is not a good way to solve this. Do you know any good patterns for zipping large files/folders and uploading them to S3 without memory issues (such as OOM)? It would be great if I could append these files/folders directly to some already-uploaded zip in S3.

Zip the files/folders to disk and upload that zip file to S3 in parts.

Answer 1

Score: 1

You can use AWS Lambda to zip your files for you before uploading them to an S3 bucket. You can even configure Lambda to be triggered and zip your files on upload. Here is a Java example of a Lambda function for zipping large files. This library is limited to 10 GB, but this can be overcome by using EFS.

Lambda’s ephemeral storage is limited to 10 GB, but you can attach EFS storage to handle larger files. The cost should be close to none if you delete the files after use.

Also, remember to use multipart upload when uploading files larger than 100 MB to S3. If you are using the SDK, it should handle this for you.
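
For illustration, a minimal sketch of such an SDK upload, assuming the AWS SDK for Java v2 with the S3 Transfer Manager module on the classpath; the bucket name, key, and file path are placeholders:

import java.nio.file.Paths;

import software.amazon.awssdk.transfer.s3.S3TransferManager;
import software.amazon.awssdk.transfer.s3.model.FileUpload;
import software.amazon.awssdk.transfer.s3.model.UploadFileRequest;

public class TransferManagerUploadSketch {
    public static void main(String[] args) {
        // The transfer manager splits large files into parts and uploads them for you.
        try (S3TransferManager transferManager = S3TransferManager.create()) {
            FileUpload upload = transferManager.uploadFile(UploadFileRequest.builder()
                    .putObjectRequest(req -> req
                            .bucket("my-bucket")                    // placeholder bucket
                            .key("archives/huge-archive.zip"))      // placeholder key
                    .source(Paths.get("/mnt/efs/huge-archive.zip")) // placeholder path, e.g. on EFS
                    .build());

            upload.completionFuture().join();  // block until the upload completes
        }
    }
}

Because the parts are read from the file as they are needed, the whole archive never has to fit in memory at once.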

Answer 2

Score: 1

👋

The main reason you are getting an OOM is simply how zlib's deflate algorithm works.

Imagine this setup:

  1. It starts by reading the whole file through a readable stream.
  2. It creates a temporary 0-byte output file right from the start.
  3. It then reads the data in chunks, called the dictionary size, and sends them to the CPU for further processing and compression; the results are propagated back to RAM.
  4. When it has finished with a given fixed-size dictionary, it moves on to the next one, and so on until it reaches the end-of-file terminator.
  5. After that, it takes all the deflated (compressed) bytes from RAM and writes them to the actual file.

You can observe and deduce that behavior by initiating a deflate operation, as in the example below.
(The file is created, 372 MB are processed, but nothing is written to the file until the last byte has been processed.)

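A minimal sketch of such an experiment, assuming only the JDK's built-in java.util.zip and placeholder file paths: it compresses a large file and periodically prints how many input bytes have been consumed versus how large the output file on disk currently is.

import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class DeflateObserver {
    public static void main(String[] args) throws Exception {
        Path source = Paths.get("big-input.bin");  // placeholder: any multi-GB file
        Path target = Paths.get("big-input.zip");  // placeholder output path

        try (InputStream in = Files.newInputStream(source);
             OutputStream fileOut = Files.newOutputStream(target);
             ZipOutputStream zipOut = new ZipOutputStream(fileOut)) {

            zipOut.putNextEntry(new ZipEntry(source.getFileName().toString()));

            byte[] buffer = new byte[64 * 1024];
            long bytesRead = 0;
            int chunks = 0;
            int n;
            while ((n = in.read(buffer)) != -1) {
                zipOut.write(buffer, 0, n);
                bytesRead += n;
                // roughly every 256 MB, compare input consumed vs. zip size on disk
                if (++chunks % 4096 == 0) {
                    System.out.printf("read %d MB, zip on disk is %d MB%n",
                            bytesRead / (1024 * 1024), Files.size(target) / (1024 * 1024));
                }
            }
            zipOut.closeEntry();
        }
    }
}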

You could technically grab all of the parts, archive them AGAIN into a tar.gz, and then upload that to AWS as one file, but you may run into the same memory problem, this time on the uploading side.

Here are the file size limitations:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/upload-objects.html

If you use the CLI you can technically do that; if you need or have to use the REST API, that is not an option for you, as the limit there is only 5 GB per request.

Also, you have not specified the maximum size, so if it is even larger than 160 GB, that is not an option EVEN with the AWS CLI (which takes care of releasing memory after each uploaded chunk). So your best bet would be multipart upload.

https://docs.aws.amazon.com/cli/latest/reference/s3api/create-multipart-upload.html
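
The command linked above is one step of a three-call flow: create the multipart upload, upload each part, then complete it. As an illustration only, here is a minimal sketch of that same flow with the AWS SDK for Java v2; the bucket, key, local file path, and 16 MB part size are placeholder assumptions.

import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.*;

public class MultipartUploadSketch {
    public static void main(String[] args) throws Exception {
        String bucket = "my-bucket";              // placeholder
        String key = "backups/huge-archive.zip";  // placeholder
        Path zipOnDisk = Paths.get("huge-archive.zip");

        try (S3Client s3 = S3Client.create()) {
            // 1. Start the multipart upload.
            String uploadId = s3.createMultipartUpload(
                    CreateMultipartUploadRequest.builder().bucket(bucket).key(key).build())
                    .uploadId();

            // 2. Upload fixed-size parts; only one part is held in memory at a time.
            List<CompletedPart> parts = new ArrayList<>();
            byte[] buffer = new byte[16 * 1024 * 1024];
            int partNumber = 1;
            try (InputStream in = Files.newInputStream(zipOnDisk)) {
                int n;
                while ((n = in.readNBytes(buffer, 0, buffer.length)) > 0) {
                    UploadPartResponse resp = s3.uploadPart(
                            UploadPartRequest.builder()
                                    .bucket(bucket).key(key)
                                    .uploadId(uploadId)
                                    .partNumber(partNumber)
                                    .build(),
                            RequestBody.fromByteBuffer(ByteBuffer.wrap(buffer, 0, n)));
                    parts.add(CompletedPart.builder()
                            .partNumber(partNumber).eTag(resp.eTag()).build());
                    partNumber++;
                }
            }

            // 3. Complete the upload so S3 assembles the parts into one object.
            s3.completeMultipartUpload(CompleteMultipartUploadRequest.builder()
                    .bucket(bucket).key(key).uploadId(uploadId)
                    .multipartUpload(CompletedMultipartUpload.builder().parts(parts).build())
                    .build());
        }
    }
}

If anything fails part-way, calling AbortMultipartUpload prevents the already-uploaded parts from lingering and incurring storage costs.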

All the best!

Answer 3

Score: 0

Zipping the file in one go is not really the right way to go about it. The better way is to break the problem down so that you don't load the whole data set in one go, but read it byte by byte and send it to your destination byte by byte. This way you not only gain speed (~x10) but also address those OOMs.

Your destination could be a web endpoint on an EC2 instance or a web service fronted by API Gateway, depending on your architectural choice.

So essentially, part 1 of the solution is to STREAM: zip the data byte by byte and send it to an HTTP endpoint. Part 2 might be to use the multipart upload interfaces from the AWS SDK (at your destination) and push the parts to S3 in parallel.

import java.io.BufferedOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Requires Apache Commons Compress (org.apache.commons:commons-compress)
import org.apache.commons.compress.compressors.gzip.GzipCompressorOutputStream;

Path source = Paths.get("abc.huge");
Path target = Paths.get("abc.huge.gz");

final int bufferSize = 8 * 1024;  // only this much data is held in memory at a time

try (InputStream in = Files.newInputStream(source);
     OutputStream fout = Files.newOutputStream(target);
     GzipCompressorOutputStream gzOut =
             new GzipCompressorOutputStream(new BufferedOutputStream(fout))) {

    // Read and write in small buffered chunks
    final byte[] buffer = new byte[bufferSize];
    int n;
    while (-1 != (n = in.read(buffer))) {
        gzOut.write(buffer, 0, n);
    }
}
