How can I convert extremely large JSON files (150GB+) that have line lengths greater than 2GB?

Question

Every month I receive extremely large and complex JSON files. The files are routinely 1-50GB and sometimes larger than 150GB. They are loaded into AWS EMR Hive tables and then processed using Hive's JSON functions.
When the process attempts to load a JSON file that contains a line longer than 2GB, it fails.

I tried using jq (jq -c . < source.json > output.json) to compact and reformat the files. This worked on smaller test files but does not scale. I also tried jq's streaming mode, but the process ran out of memory when rebuilding the structure.
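For reference, the usual streaming idiom for splitting a file like this into one line per element looks something like the following (a sketch, assuming source.json holds a single top-level JSON array; the fromstream rebuild is exactly the step that exhausts memory when a single element is enormous):

jq -cn --stream 'fromstream(1 | truncate_stream(inputs))' < source.json > output.json

Here --stream emits [path, value] events without building the whole document in memory, truncate_stream strips the top-level array index from each path, and fromstream reassembles each element before printing it compactly.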

I am looking for a way to reformat a JSON file with line lengths exceeding 2GB that does not require loading the entire file into memory.

Can this be accomplished with jq?

[50GB test file](https://webtpa-public-access.s3.us-west-2.amazonaws.com/subfolder/2023_06_430_65B0_in_network_rates.json.gz)

Answer 1

Score: 0

My understanding is that JSON can only contain escaped double quotes inside double-quoted strings. This leads me to believe it is safe to append newlines after something of the form "key":"value".
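As a contrived illustration (this snippet is mine, not from the test file):

{"note":"a copy of \"key\":\"value\" inside a string","key":"value"}

The copy inside the string value carries backslash-escaped quotes, so the raw, unescaped byte sequence "key":"value" can only occur at a structural position, where a following newline is insignificant whitespace.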

If you can choose a key/value pair that occurs sufficiently often, then you can use gawk to insert newlines.

Taking an example from your "standard" that seems to appear quite often in your test file:

gawk -v RS='"negotiation_arrangement":"(ffs|bundle|capitation)"' \
     'ORS=RT"\n"' <infile >outfile

inserts a newline after every occurrence of this regex.

gawk reads records one at a time, delimited by the regex in RS, and sets ORS to the actual matched text (RT) followed by a newline. Since the value of the ORS assignment is always "truthy", the default print action occurs for every record.
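You can watch the mechanism on a toy input (a made-up document, not the real schema):

printf '%s' '{"a":{"negotiation_arrangement":"ffs","x":1},"b":{"negotiation_arrangement":"bundle","y":2}}' |
gawk -v RS='"negotiation_arrangement":"(ffs|bundle|capitation)"' 'ORS=RT"\n"'

which prints

{"a":{"negotiation_arrangement":"ffs"
,"x":1},"b":{"negotiation_arrangement":"bundle"
,"y":2}}

Each break lands immediately after a matched key/value pair; the final record is terminated with a bare newline because RT is empty at end of input.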

Memory usage will be some small multiple of the maximum length between occurrences of the regex.

For the test file provided, my 5-year-old laptop took half an hour to run zcat | gawk | wc -l while I was using the machine for other things, needed less than 44MB of virtual memory, was CPU-bound, and finally reported 333443 lines. zcat | wc -c alone took 5 minutes.
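Spelled out, that benchmark pipeline would be something like the following (assuming the gzipped test file linked in the question):

zcat 2023_06_430_65B0_in_network_rates.json.gz |
gawk -v RS='"negotiation_arrangement":"(ffs|bundle|capitation)"' 'ORS=RT"\n"' |
wc -l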
