Is persisting data to S3 in a "raw" format a good idea?

Question

I am new to S3 and am thinking of persisting some data into it using Java.
Right now, our application consists of two services, service1 and service2.

Service1 will persist intermediate results into S3, and service2 will pick them up from there and continue.

I would split service1's output into two categories:

  1. data that other teams will use and manipulate;
  2. intermediate results that only service2 will consume, so no other team is involved.

For data in category 1, I will store it as Parquet files and upload them to S3.
For data in category 2, I am thinking of saving the "raw" data directly into a file on S3 (e.g., a serialized map of complicated objects), because the data structure is really complicated.
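To make category 2 concrete, this is roughly what I have in mind: a minimal sketch assuming plain Java serialization and the AWS SDK for Java v2 (the bucket and key names are placeholders).

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    import java.util.HashMap;
    import software.amazon.awssdk.core.sync.RequestBody;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.PutObjectRequest;

    public class RawObjectStore {
        // Serialize any Serializable object graph (e.g. my map of
        // complicated objects) into a byte array.
        static byte[] toBytes(Serializable value) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(value);
            }
            return bos.toByteArray();
        }

        public static void main(String[] args) throws IOException {
            HashMap<String, Object> intermediate = new HashMap<>();
            intermediate.put("step", 1);

            try (S3Client s3 = S3Client.create()) {
                // Upload the serialized bytes as a single S3 object.
                s3.putObject(
                    PutObjectRequest.builder()
                        .bucket("my-service1-bucket")   // placeholder
                        .key("intermediate/run-42.bin") // placeholder
                        .build(),
                    RequestBody.fromBytes(toBytes(intermediate)));
            }
        }
    }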

I now have three questions about saving "raw" data into S3:

  1. Is it a good idea? If not, why, and what would be a better option? I am using Java.
  2. If there is nothing wrong with saving "raw" data, will I get back exactly the original when I read it?
  3. If I simply use the AWS S3 API to upload the object, will there be a potential issue when the data is very big? (My current approach is sketched after this list.)
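For question 3, this is how I am planning to handle large objects for now: a sketch assuming the AWS SDK for Java v2 transfer manager (bucket, key, and file path are placeholders).

    import java.nio.file.Paths;
    import software.amazon.awssdk.transfer.s3.S3TransferManager;
    import software.amazon.awssdk.transfer.s3.model.UploadFileRequest;

    public class BigUpload {
        public static void main(String[] args) {
            try (S3TransferManager tm = S3TransferManager.create()) {
                // The transfer manager performs multipart uploads for large
                // files, so object size is not a practical problem (S3 caps
                // a single object at 5 TB and a single PUT at 5 GB).
                tm.uploadFile(UploadFileRequest.builder()
                        .putObjectRequest(req -> req
                            .bucket("my-service1-bucket")    // placeholder
                            .key("intermediate/big-run.bin"))
                        .source(Paths.get("/tmp/big-run.bin"))
                        .build())
                    .completionFuture()
                    .join();
            }
        }
    }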

Please help. Thanks.

Answer 1

Score: 2

S3 doesn't care if your data is 'raw' or in some structured format - I store both types for different reasons and have never had a problem.

If you are concerned, run some verifications - i.e. upload files in whatever format you want, and then make sure you can consume them as you need to - but I don't think you are going to have any issues.
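As a concrete starting point for such a verification, a minimal round-trip check could look like this (a sketch assuming the AWS SDK for Java v2; bucket and key are placeholders):

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;
    import software.amazon.awssdk.core.sync.RequestBody;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.GetObjectRequest;
    import software.amazon.awssdk.services.s3.model.PutObjectRequest;

    public class RoundTripCheck {
        public static void main(String[] args) {
            byte[] original = "any bytes, raw or structured".getBytes(StandardCharsets.UTF_8);

            try (S3Client s3 = S3Client.create()) {
                // Upload, then read the same object back.
                s3.putObject(PutObjectRequest.builder()
                        .bucket("my-bucket").key("verify/sample.bin").build(),
                    RequestBody.fromBytes(original));

                byte[] roundTripped = s3.getObjectAsBytes(GetObjectRequest.builder()
                        .bucket("my-bucket").key("verify/sample.bin").build())
                    .asByteArray();

                // S3 stores objects byte-for-byte, so this prints true.
                System.out.println(Arrays.equals(original, roundTripped));
            }
        }
    }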

Answer 2

Score: 2

Amazon S3 is an excellent service for sharing data between processes because:

  • There is no limit to the amount of storage
  • There are fine-grained security controls
  • S3 can 'trigger' processes when data is uploaded
  • S3 supports 'ranged retrieval', so you can retrieve portions of a file. This is very handy when reading the Parquet file format (a sketch follows this list).
  • Many tools and services can read directly from S3; for example, Amazon Athena can query data in S3 in place, even when it is compressed or in Parquet format.
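For illustration, a ranged retrieval is a small request in the AWS SDK for Java v2 (a sketch; bucket and key are placeholders):

    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.GetObjectRequest;

    public class RangedGet {
        public static void main(String[] args) {
            try (S3Client s3 = S3Client.create()) {
                // Fetch only the first kilobyte of the object via an HTTP Range header.
                byte[] head = s3.getObjectAsBytes(GetObjectRequest.builder()
                        .bucket("my-bucket")                // placeholder
                        .key("data/part-0000.parquet")      // placeholder
                        .range("bytes=0-1023")
                        .build())
                    .asByteArray();
                System.out.println(head.length + " bytes retrieved");
            }
        }
    }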

Whatever you store in S3 will be exactly what you get back.

A common architecture when two systems need to communicate is (a producer-side sketch follows the list):

  • Store the data in S3
  • Send a message to the second application, pointing to the location of the data in S3 -- you might do this by sending a message to an Amazon SQS queue
  • The second application retrieves the message from the queue, fetches the data from S3, and processes it.
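On the producing side, that pattern can be as small as this (a sketch assuming the AWS SDK for Java v2; the bucket, key, and queue URL are placeholders):

    import software.amazon.awssdk.core.sync.RequestBody;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.sqs.SqsClient;
    import software.amazon.awssdk.services.sqs.model.SendMessageRequest;

    public class Service1Producer {
        public static void main(String[] args) {
            String bucket = "my-intermediate-bucket";   // placeholder
            String key = "results/run-42.bin";          // placeholder

            try (S3Client s3 = S3Client.create();
                 SqsClient sqs = SqsClient.create()) {

                // 1. Store the data in S3.
                s3.putObject(b -> b.bucket(bucket).key(key),
                    RequestBody.fromString("intermediate result payload"));

                // 2. Point the second application at it via an SQS message.
                sqs.sendMessage(SendMessageRequest.builder()
                    .queueUrl("https://sqs.us-east-1.amazonaws.com/123456789012/service2-queue")
                    .messageBody("s3://" + bucket + "/" + key)
                    .build());
            }
        }
    }

The consumer side does the reverse: receive the message, fetch the object at the key it points to, and delete the message once processing succeeds.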
