How to store and aggregate ~300m JSON objects efficiently

Question

I have an app where I receive 300m JSON text files (10m daily, retention = 30 days) from a Kafka topic.

The data it contains needs to be aggregated every day based on different properties.

We would like to build it with Apache Spark on Azure Databricks, because the size of the data will grow, we cannot vertically scale this process anymore (it currently runs on a single Postgres server), and we also need something that is cost-effective.

Having this job in Apache Spark is straightforward in theory, but I haven't found any practical advice on how to process JSON objects efficiently.
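
For illustration, a minimal PySpark sketch of what that daily aggregation could look like. The field names (event_timestamp, customer_id, amount) and the paths are made up; the real grouping properties would come from the actual JSON payload:

```python
# Minimal sketch of the daily aggregation, with hypothetical field names and paths.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-aggregation").getOrCreate()

# Read from whatever storage layer ends up holding the raw JSON.
events = spark.read.json("/mnt/raw/events/")

daily = (
    events
    .withColumn("event_date", F.to_date("event_timestamp"))
    .groupBy("event_date", "customer_id")
    .agg(
        F.count("*").alias("event_count"),
        F.sum("amount").alias("total_amount"),
    )
)

daily.write.mode("overwrite").partitionBy("event_date").parquet("/mnt/aggregated/daily/")
```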

These are the options as I see them:

  1. Store the data in Postgres and ingest it with the Spark job (SQL) - may be slow to transfer the data

  2. Store the data in Azure Blob Storage in JSON format - we may hit the limit on the number of files that can be stored, and reading that many small files seems inefficient

  3. Store the JSON data in big chunks, e.g. 100,000 JSON objects in one file - deleting/reinserting a chunk could be slow when the data changes

  4. Convert the data to CSV or some binary format with a fixed structure and store it in Blob Storage in big chunks - changing the format would be a challenge, but that should rarely happen, and CSV/binary is quicker to parse (see the sketch below this list)
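
To make option 4 (and the file-count concern in option 2) more concrete, here is a rough sketch with made-up paths and fields: read one day of JSON with an explicit schema, so Spark does not have to infer it across millions of small files, and rewrite it as a few large Parquet files:

```python
# Rough sketch with made-up paths/fields: read one day of JSON with an explicit
# schema and compact it into a few Parquet files (the "big chunks" idea).
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

# Declaring the schema up front avoids an expensive inference pass over many small files.
schema = StructType([
    StructField("event_timestamp", TimestampType()),
    StructField("customer_id", StringType()),
    StructField("event_type", StringType()),
    StructField("amount", DoubleType()),
])

raw = spark.read.schema(schema).json("/mnt/raw/events/2023-02-08/")

# Rewrite the day's many small JSON files as a handful of columnar Parquet files.
(raw.coalesce(8)
    .write.mode("overwrite")
    .parquet("/mnt/curated/events/date=2023-02-08/"))
```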

Any practical advice would be really appreciated. Thanks in advance.

Answer 1

Score: 1

There are multiple factors to consider:

  1. If you are reading the data on a daily basis, then it is strongly suggested to store it in Parquet format in Databricks. If it is not accessed daily, then store it in Azure Blob Storage itself (the computation cost will be minimised).
  2. If the JSON data needs to be flattened, then do all the data manipulation and write the result into Delta tables, compacting them with OPTIMIZE (see the first sketch below).
  3. If a 30-day retention is really mandatory, then be careful with the file format, because the data volume will keep growing day by day. Otherwise, alter the table properties to set the retention period to 7 or 15 days (see the second sketch below).
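
A hedged sketch of point 2, assuming a Delta table and columns that are purely illustrative (events_flat, a nested payload struct) rather than anything from the original post:

```python
# Sketch of point 2: flatten the nested JSON, append it to a Delta table
# partitioned by day, then compact the small files with OPTIMIZE.
# Table and column names (events_flat, payload.*) are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("flatten-to-delta").getOrCreate()

raw = spark.read.json("/mnt/raw/events/")

flat = (
    raw
    .withColumn("event_date", F.to_date("event_timestamp"))
    .select(
        "event_date",
        "customer_id",
        F.col("payload.category").alias("category"),  # pull nested fields to the top level
        F.col("payload.amount").alias("amount"),
    )
)

(flat.write.format("delta")
     .mode("append")
     .partitionBy("event_date")
     .saveAsTable("events_flat"))

# Compact the many small files produced by frequent ingests (Databricks / Delta Lake).
spark.sql("OPTIMIZE events_flat")
```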

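One possible reading of point 3, reusing the placeholder events_flat table from the sketch above: delete rows older than the chosen retention window, shorten Delta's own file/log retention properties, and run VACUUM so the underlying files are actually removed:

```python
# Sketch of point 3: enforce a shorter retention window on the (placeholder)
# Delta table events_flat instead of keeping the full 30 days.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("retention").getOrCreate()

# Business retention: drop rows older than 15 days (cheap thanks to the date partitioning).
spark.sql("DELETE FROM events_flat WHERE event_date < date_sub(current_date(), 15)")

# Delta's own retention of removed files and transaction-log history.
spark.sql("""
    ALTER TABLE events_flat SET TBLPROPERTIES (
        'delta.deletedFileRetentionDuration' = 'interval 7 days',
        'delta.logRetentionDuration'         = 'interval 7 days'
    )
""")

# Physically remove files that are no longer referenced and are older than 7 days.
spark.sql("VACUUM events_flat RETAIN 168 HOURS")
```
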
huangapple
  • Posted on 8 February 2023, 16:51:14
  • Please keep this link when reposting: https://go.coder-hub.com/75383280.html