How to store and aggregate data in ~300m JSON objects efficiently

Question


I have an app where I receive 300m JSON text files (10m daily, retention = 30 days) from a Kafka topic.

The data they contain needs to be aggregated every day based on different properties.

We would like to build it with Apache Spark, using Azure Databricks, because the size of the data will grow, we cannot vertically scale this process anymore (it currently runs on a single Postgres server), and we also need something that is cost-effective.

Implementing this job in Apache Spark is straightforward in theory, but I haven't found any practical advice on how to process JSON objects efficiently.
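
For context, here is roughly the daily aggregation I have in mind (a minimal PySpark sketch; the input path and the fields `event_time`, `customer_id` and `amount` are placeholders, not our real schema):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-json-aggregation").getOrCreate()

# Placeholder input location; one day's worth of raw JSON documents.
raw = spark.read.json("wasbs://events@myaccount.blob.core.windows.net/raw/2023-02-08/")

daily = (
    raw
    .withColumn("event_date", F.to_date("event_time"))   # placeholder timestamp field
    .groupBy("event_date", "customer_id")                 # placeholder grouping properties
    .agg(
        F.count("*").alias("event_count"),
        F.sum("amount").alias("total_amount"),            # placeholder numeric field
    )
)

# Write the daily aggregates somewhere queryable.
daily.write.mode("overwrite").parquet(
    "wasbs://events@myaccount.blob.core.windows.net/aggregated/2023-02-08/"
)
```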

These are the options as I see them:

  1. Store the data in Postgres and ingest it with the Spark job (SQL) - transferring the data may be slow

  2. Store the data in Azure Blob Storage in JSON format - we may hit limits on the number of files that can be stored, and reading so many files seems inefficient

  3. Store the JSON data in big chunks, e.g. 100,000 JSON objects per file - it could be slow to delete/reinsert when the data changes

  4. Convert the data to CSV or some binary format with a fixed structure and store it in Blob Storage in big chunks - changing the format would be a challenge, but that should rarely happen in the future; also, CSV/binary is quicker to parse (see the sketch after this list)
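
To make options 3 and 4 concrete, this is the kind of compaction step I imagine (a rough sketch; the paths, the 2023-02-08 partition and the target of 8 output files are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-raw-json").getOrCreate()

day = "2023-02-08"  # placeholder partition

# Read the many small JSON documents for one day...
small = spark.read.json(f"wasbs://events@myaccount.blob.core.windows.net/raw/{day}/")

# ...and rewrite them as a handful of large Parquet files with a fixed schema,
# so later jobs read a few big binary files instead of millions of tiny JSON ones.
(
    small
    .repartition(8)
    .write.mode("overwrite")
    .parquet(f"wasbs://events@myaccount.blob.core.windows.net/compacted/{day}/")
)
```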

Any practical advice would be really appreciated. Thanks in advance.

Answer 1

Score: 1


There are multiple factors to consider:

  1. If you are reading the data on a daily basis, it is strongly suggested to store it in Parquet format in Databricks. If it is not accessed daily, store it directly in Azure Blob Storage (the computation cost will be minimised).
  2. If the JSON data needs to be flattened, do all the data manipulation and write the result into Delta tables, running OPTIMIZE on them (see the sketch after this list).
  3. If a 30-day retention is really mandatory, be careful with the file format, because the data will keep growing quickly day by day. Otherwise, alter the table properties to set the retention period to 7 or 15 days.
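
A rough sketch of points 2 and 3, assuming Databricks with Delta Lake (the table name `events_flat`, the paths and the column names are placeholders; "retention period" is interpreted here as the Delta file/log retention table properties, while removing rows older than 30 days would be a separate DELETE or partition-drop step):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("json-to-delta").getOrCreate()

raw = spark.read.json("wasbs://events@myaccount.blob.core.windows.net/raw/2023-02-08/")

# Flatten a hypothetical nested payload into plain columns.
flat = raw.select(
    F.to_date("event_time").alias("event_date"),         # assumed timestamp field
    F.col("payload.customer_id").alias("customer_id"),   # assumed nested fields
    F.col("payload.amount").alias("amount"),
)

# Append into a date-partitioned Delta table (Databricks / Delta Lake).
(
    flat.write.format("delta")
    .mode("append")
    .partitionBy("event_date")
    .saveAsTable("events_flat")
)

# Compact small files, then shorten how long old file versions and log entries are kept.
spark.sql("OPTIMIZE events_flat")
spark.sql("""
    ALTER TABLE events_flat SET TBLPROPERTIES (
        'delta.deletedFileRetentionDuration' = 'interval 7 days',
        'delta.logRetentionDuration'         = 'interval 7 days'
    )
""")
```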

Posted by huangapple on 2023-02-08 16:51:14. Please keep the original link when reposting: https://go.coder-hub.com/75383280.html