Is persisting data to S3 in a "raw" format a good idea?

Question

I am new to S3 and am thinking of persisting some data into it using Java.
Right now, our application consists of two services, service1 and service2.

Service1 will persist intermediate results into S3, and service2 will pick them up from there and continue.

I would split service1's output into two categories:

  1. data that other teams will use and manipulate;
  2. intermediate results that only service2 will consume, so no other team is involved.

For data in category 1, I will store it as Parquet files and upload them to S3.
For data in category 2, I am thinking of saving the "raw" data directly into a file on S3 (e.g., a serialized map of complicated objects), because the data structure is really complicated.
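To make category 2 concrete, this is roughly what I have in mind: a minimal sketch assuming plain Java serialization and the AWS SDK for Java v2 (the bucket and key names are placeholders).

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    import java.util.HashMap;
    import software.amazon.awssdk.core.sync.RequestBody;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.PutObjectRequest;

    public class RawObjectStore {
        // Serialize any Serializable object graph (e.g. my map of
        // complicated objects) into a byte array.
        static byte[] toBytes(Serializable value) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(value);
            }
            return bos.toByteArray();
        }

        public static void main(String[] args) throws IOException {
            HashMap<String, Object> intermediate = new HashMap<>();
            intermediate.put("step", 1);

            try (S3Client s3 = S3Client.create()) {
                // Upload the serialized bytes as a single S3 object.
                s3.putObject(
                    PutObjectRequest.builder()
                        .bucket("my-service1-bucket")   // placeholder
                        .key("intermediate/run-42.bin") // placeholder
                        .build(),
                    RequestBody.fromBytes(toBytes(intermediate)));
            }
        }
    }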

I now have three questions about saving "raw" data into S3:

  1. Is it a good idea? If not, why, and what would be a better option? I am using Java.
  2. If there is nothing wrong with saving "raw" data, will I get back exactly the original when I read it?
  3. If I simply use the AWS S3 API to upload the object, will there be a potential issue when the data is very big? (My current approach is sketched after this list.)
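For question 3, this is how I am planning to handle large objects for now: a sketch assuming the AWS SDK for Java v2 transfer manager (bucket, key, and file path are placeholders).

    import java.nio.file.Paths;
    import software.amazon.awssdk.transfer.s3.S3TransferManager;
    import software.amazon.awssdk.transfer.s3.model.UploadFileRequest;

    public class BigUpload {
        public static void main(String[] args) {
            try (S3TransferManager tm = S3TransferManager.create()) {
                // The transfer manager performs multipart uploads for large
                // files, so object size is not a practical problem (S3 caps
                // a single object at 5 TB and a single PUT at 5 GB).
                tm.uploadFile(UploadFileRequest.builder()
                        .putObjectRequest(req -> req
                            .bucket("my-service1-bucket")    // placeholder
                            .key("intermediate/big-run.bin"))
                        .source(Paths.get("/tmp/big-run.bin"))
                        .build())
                    .completionFuture()
                    .join();
            }
        }
    }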

Please help. Thanks.

Answer 1

Score: 2

S3 doesn't care if your data is 'raw' or in some structured format - I store both types for different reasons and have never had a problem.

If you are concerned, run some verifications - i.e. upload files in whatever format you want, and then make sure you can consume them as you need to - but I don't think you are going to have any issues.
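As a concrete starting point for such a verification, a minimal round-trip check could look like this (a sketch assuming the AWS SDK for Java v2; bucket and key are placeholders):

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;
    import software.amazon.awssdk.core.sync.RequestBody;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.GetObjectRequest;
    import software.amazon.awssdk.services.s3.model.PutObjectRequest;

    public class RoundTripCheck {
        public static void main(String[] args) {
            byte[] original = "any bytes, raw or structured".getBytes(StandardCharsets.UTF_8);

            try (S3Client s3 = S3Client.create()) {
                // Upload, then read the same object back.
                s3.putObject(PutObjectRequest.builder()
                        .bucket("my-bucket").key("verify/sample.bin").build(),
                    RequestBody.fromBytes(original));

                byte[] roundTripped = s3.getObjectAsBytes(GetObjectRequest.builder()
                        .bucket("my-bucket").key("verify/sample.bin").build())
                    .asByteArray();

                // S3 stores objects byte-for-byte, so this prints true.
                System.out.println(Arrays.equals(original, roundTripped));
            }
        }
    }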

Answer 2

Score: 2

Amazon S3 is an excellent service for sharing data between processes because:

  • There is no limit to the amount of storage
  • There are fine-grained security controls
  • S3 can 'trigger' processes when data is uploaded
  • S3 supports 'ranged retrieval', so you can retrieve portions of a file. This is very handy when reading the Parquet file format (a sketch follows this list).
  • Many tools and services can read directly from S3; for example, Amazon Athena can query data in S3 in place, even when it is compressed or in Parquet format.
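For illustration, a ranged retrieval is a small request in the AWS SDK for Java v2 (a sketch; bucket and key are placeholders):

    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.GetObjectRequest;

    public class RangedGet {
        public static void main(String[] args) {
            try (S3Client s3 = S3Client.create()) {
                // Fetch only the first kilobyte of the object via an HTTP Range header.
                byte[] head = s3.getObjectAsBytes(GetObjectRequest.builder()
                        .bucket("my-bucket")                // placeholder
                        .key("data/part-0000.parquet")      // placeholder
                        .range("bytes=0-1023")
                        .build())
                    .asByteArray();
                System.out.println(head.length + " bytes retrieved");
            }
        }
    }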

Whatever you store in S3 will be exactly what you get back.

A common architecture when two systems need to communicate is (a producer-side sketch follows the list):

  • Store the data in S3
  • Send a message to the second application, pointing to the location of the data in S3 -- you might do this by sending a message to an Amazon SQS queue
  • The second application retrieves the message from the queue, fetches the data from S3, and processes it.
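On the producing side, that pattern can be as small as this (a sketch assuming the AWS SDK for Java v2; the bucket, key, and queue URL are placeholders):

    import software.amazon.awssdk.core.sync.RequestBody;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.sqs.SqsClient;
    import software.amazon.awssdk.services.sqs.model.SendMessageRequest;

    public class Service1Producer {
        public static void main(String[] args) {
            String bucket = "my-intermediate-bucket";   // placeholder
            String key = "results/run-42.bin";          // placeholder

            try (S3Client s3 = S3Client.create();
                 SqsClient sqs = SqsClient.create()) {

                // 1. Store the data in S3.
                s3.putObject(b -> b.bucket(bucket).key(key),
                    RequestBody.fromString("intermediate result payload"));

                // 2. Point the second application at it via an SQS message.
                sqs.sendMessage(SendMessageRequest.builder()
                    .queueUrl("https://sqs.us-east-1.amazonaws.com/123456789012/service2-queue")
                    .messageBody("s3://" + bucket + "/" + key)
                    .build());
            }
        }
    }

The consumer side does the reverse: receive the message, fetch the object at the key it points to, and delete the message once processing succeeds.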
