Using jq to extract multiple JSON objects

Question


I have been using jq to successfully extract one JSON blob at a time from some relatively large files and write it out to a file of one JSON object per line for further processing. Here is an example of the JSON format:

{
  "date": "2023-07-30",
  "results1":[
    {
      "data": [    
        {"row": [{"key1": "row1", "key2": "row1"}]},
        {"row": [{"key1": "row2", "key2": "row2"}]}
      ]
    },
    {
      "data": [    
        {"row": [{"key1": "row3", "key2": "row3"}]},
        {"row": [{"key1": "row4", "key2": "row4"}]}
      ]
    }
  ],
  "results2":[
    {
      "data": [    
        {"row": [{"key3": "row1", "key4": "row1"}]},
        {"row": [{"key3": "row2", "key4": "row2"}]}
      ]
    },
    {
      "data": [    
        {"row": [{"key3": "row3", "key4": "row3"}]},
        {"row": [{"key3": "row4", "key4": "row4"}]}
      ]
    }
  ]
}

My current approach is to run the following and redirect the stdout to a file:

jq -rc ".results1[]" my_json.json

This works fine, however, it seems like jq reads the entire file into memory in order to extract the chunk I am interested in.

Questions:

  1. Does jq read the entire file into memory when I execute the above
    statement?
  2. Assuming the answer is yes, is there a way that I can extract results1[] and results2[] on the same call to avoid reading the file twice?

I have used the --stream option but it is very slow. I also read that it sacrifices speed for memory savings, but memory is not an issue at this time so I would prefer to avoid using this option. Basically, what I need is to read in the above json once and output two files in JSON lines format.

Edit: (I changed the input data a bit to show the differences in the output)

Output file 1:

{"data":[{"row":[{"key1":"row1","key2":"row1"}]},{"row":[{"key1":"row2","key2":"row2"}]}]}
{"data":[{"row":[{"key1":"row3","key2":"row3"}]},{"row":[{"key1":"row4","key2":"row4"}]}]}

Output file 2:

{"data":[{"row":[{"key3":"row1","key4":"row1"}]},{"row":[{"key3":"row2","key4":"row2"}]}]}
{"data":[{"row":[{"key3":"row3","key4":"row3"}]},{"row":[{"key3":"row4","key4":"row4"}]}]}

It seems pretty well known that the streaming option is slow. See the discussion here.

My attempt at implementing it followed the answer here.

Answer 1

Score: 1

[tag:jq] doesn't have any file IO facilities, so you can't output multiple files.

You can output each piece of data with its key and post-process it:

jq -r '
    to_entries[]
    | select(.key != "date")
    | .key as $k
    | .value[]
    | [$k, @json]
    | @tsv
' my_json.json

Output:

results1	{"data":[{"row":[{"key1":"row1","key2":"row1"}]},{"row":[{"key1":"row2","key2":"row2"}]}]}
results1	{"data":[{"row":[{"key1":"row3","key2":"row3"}]},{"row":[{"key1":"row4","key2":"row4"}]}]}
results2	{"data":[{"row":[{"key3":"row1","key4":"row1"}]},{"row":[{"key3":"row2","key4":"row2"}]}]}
results2	{"data":[{"row":[{"key3":"row3","key4":"row3"}]},{"row":[{"key3":"row4","key4":"row4"}]}]}

So:

while IFS=$'\t' read -r key json; do
    printf '%s\n' "$json" >> "${key}.jsonl"
done < <(
    jq -r '...' my_json.json
)

or

jq -r '...' my_json.json | awk -F '\t' '{print $2 > ($1 ".jsonl")}'
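The awk splitter can be exercised on its own by feeding it hand-written TSV lines standing in for the jq output — a minimal sketch with made-up keys and payloads, no jq required:

```shell
#!/bin/sh
# Sketch: drive the awk splitter with hand-written key<TAB>json lines in
# place of the real jq output, to check the file-splitting step alone.
cd "$(mktemp -d)" || exit 1
printf 'results1\t{"a":1}\nresults2\t{"b":2}\nresults1\t{"a":3}\n' |
  awk -F '\t' '{print $2 > ($1 ".jsonl")}'
cat results1.jsonl   # -> {"a":1} then {"a":3}
cat results2.jsonl   # -> {"b":2}
```

Each line lands in the file named after its first field, so interleaved keys are handled without any sorting.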

Answer 2

Score: 1

With Bash ≥ 4, processing bigger chunks could be improved by reading n lines at once using mapfile:

jq -cr '$ARGS.positional[] as $key | .[$key] | $key, length, .[]' input.json \
  --args results1 results2 |
while read -r key && read -r len; do
  mapfile -t -n "$len"
  printf '%s\n' "${MAPFILE[@]}" > "$key.jsonl"
done
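The reading side of this pattern can be sanity-checked without jq by emulating its "key, length, elements…" output with printf — a sketch with invented keys and payloads, assuming bash ≥ 4 for mapfile:

```shell
#!/bin/bash
# Sketch: emulate jq's "key, count, then that many lines" stream with
# printf, to show how the read/mapfile loop slices it into per-key files.
cd "$(mktemp -d)" || exit 1
printf '%s\n' results1 2 '{"a":1}' '{"a":2}' results2 1 '{"b":1}' |
while read -r key && read -r len; do
  mapfile -t -n "$len"                       # grab the next $len lines
  printf '%s\n' "${MAPFILE[@]}" > "$key.jsonl"
done
wc -l results1.jsonl results2.jsonl
```

mapfile detects that stdin is a pipe and reads unbuffered, so it does not consume bytes past its n lines and the next read of a key picks up in the right place.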

huangapple
  • Published on 2023-07-31 20:42:57
  • Please retain this link when reposting: https://go.coder-hub.com/76803755.html