How can I convert extremely large JSON files (150GB+) that have line lengths greater than 2GB?

Question

Every month I receive extremely large and complex JSON files. The files are routinely 1-50GB and sometimes larger than 150GB. The files are loaded into AWS EMR Hive tables and then processed using Hive JSON functions. When the process attempts to load a JSON file that contains a line longer than 2GB, it fails.

I tried using jq (jq -c . < source.json > output.json) to compact and format the files. This worked on smaller test files but does not scale. I also tried jq's streaming mode, but the process exhausted memory when rebuilding the structure.
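A typical jq streaming invocation of this kind (shown here as an illustration; the exact filter used is not stated above) is:

    # split each top-level entry into its own compact line via jq's streaming parser
    jq -cn --stream 'fromstream(1|truncate_stream(inputs))' < source.json > output.json

fromstream still has to reassemble each truncated entity in memory, so when a single second-level value is itself many gigabytes this approach can still exhaust memory, consistent with the behavior described above.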

I am looking for a way to reformat a JSON file that has line lengths exceeding 2GB, without loading the entire file into memory.

Can this be accomplished with jq?

[50GB test file](https://webtpa-public-access.s3.us-west-2.amazonaws.com/subfolder/2023_06_430_65B0_in_network_rates.json.gz)

Answer 1

Score: 0

My understanding is that JSON can only contain escaped double quotes inside double-quoted strings. This leads me to believe it is safe to append newlines after something of the form "key":"value".

If you can choose a key/value pair that occurs sufficiently often, then you can use gawk to insert newlines.

Taking an example from your "standard" that seems to appear quite often in your test file:

    gawk -v RS='"negotiation_arrangement":"(ffs|bundle|capitation)"' \
         'ORS=RT"\n"' <infile >outfile

inserts a newline after every occurrence of this regex.
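Since the transformation only inserts newline bytes, it is easy to check that nothing else changed: stripping all newlines from both files must yield identical byte streams. A quick bash sanity check, assuming the infile/outfile names from the command above:

    # compare both files with every newline removed (bash process substitution)
    cmp <(tr -d '\n' < infile) <(tr -d '\n' < outfile) && echo lossless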

gawk reads records one at a time, delimited by the regex in RS, and sets ORS to the actual matched text (RT) followed by a newline. Since the result of the ORS assignment is always "truthy", the default print action occurs.
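To see the mechanics on a toy input (the one-line sample below is made up for illustration):

    printf '%s' '{"a":{"negotiation_arrangement":"ffs","x":1},"b":{"negotiation_arrangement":"bundle","y":2}}' |
        gawk -v RS='"negotiation_arrangement":"(ffs|bundle|capitation)"' 'ORS=RT"\n"'

This prints three lines; each record is followed by the delimiter text that matched RS:

    {"a":{"negotiation_arrangement":"ffs"
    ,"x":1},"b":{"negotiation_arrangement":"bundle"
    ,"y":2}}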

Memory usage will be some small multiple of the maximum length between occurrences of the regex.

For the test file provided, my 5-year-old laptop took half an hour to run zcat | gawk | wc -l while I was using the machine for other things, needed less than 44MB of virtual memory, was CPU-bound, and finally reported 333443 lines. For comparison, zcat | wc -c alone took 5 minutes.
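Those measurements correspond to pipelines along the following lines (the filename comes from the test-file link above; timings are machine-dependent):

    # split into lines and count them: ~30 minutes, <44MB memory, 333443 lines
    zcat 2023_06_430_65B0_in_network_rates.json.gz |
        gawk -v RS='"negotiation_arrangement":"(ffs|bundle|capitation)"' 'ORS=RT"\n"' |
        wc -l

    # baseline: raw decompression speed, ~5 minutes
    zcat 2023_06_430_65B0_in_network_rates.json.gz | wc -c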
