Using jq to extract multiple json objects
Question
I have been using jq to successfully extract one JSON blob at a time from some relatively large files and write it out to a file of one JSON object per line for further processing. Here is an example of the JSON format:
{
  "date": "2023-07-30",
  "results1": [
    {
      "data": [
        {"row": [{"key1": "row1", "key2": "row1"}]},
        {"row": [{"key1": "row2", "key2": "row2"}]}
      ]
    },
    {
      "data": [
        {"row": [{"key1": "row3", "key2": "row3"}]},
        {"row": [{"key1": "row4", "key2": "row4"}]}
      ]
    }
  ],
  "results2": [
    {
      "data": [
        {"row": [{"key3": "row1", "key4": "row1"}]},
        {"row": [{"key3": "row2", "key4": "row2"}]}
      ]
    },
    {
      "data": [
        {"row": [{"key3": "row3", "key4": "row3"}]},
        {"row": [{"key3": "row4", "key4": "row4"}]}
      ]
    }
  ]
}
My current approach is to run the following and redirect the stdout to a file:
jq -rc ".results1[]" my_json.json
This works fine; however, it seems like jq reads the entire file into memory in order to extract the chunk I am interested in.
Questions:
- Does jq read the entire file into memory when I execute the above statement?
- Assuming the answer is yes, is there a way that I can extract results1[] and results2[] in the same call to avoid reading the file twice?
I have used the --stream option, but it is very slow. I also read that it sacrifices speed for memory savings; memory is not an issue at this time, so I would prefer to avoid using this option. Basically, what I need is to read in the above JSON once and output two files in JSON Lines format.
Edit: (I changed the input data a bit to show the differences in the output)
Output file 1:
{"data":[{"row":[{"key1":"row1","key2":"row1"}]},{"row":[{"key1":"row2","key2":"row2"}]}]}
{"data":[{"row":[{"key1":"row3","key2":"row3"}]},{"row":[{"key1":"row4","key2":"row4"}]}]}
Output file 2:
{"data":[{"row":[{"key3":"row1","key4":"row1"}]},{"row":[{"key3":"row2","key4":"row2"}]}]}
{"data":[{"row":[{"key3":"row3","key4":"row3"}]},{"row":[{"key3":"row4","key4":"row4"}]}]}
It seems pretty well known that the streaming option is slow. See the discussion here.
My attempt at implementing it followed the answer here.
Answer 1
Score: 1
jq doesn't have any file I/O facilities, so you can't output multiple files from it directly.
You can output each piece of data with its key and post-process it:
jq -r '
to_entries[]
| select(.key != "date")
| .key as $k
| .value[]
| [$k, @json]
| @tsv
' my_json.json
outputs
results1 {"data":[{"row":[{"key1":"row1","key2":"row1"}]},{"row":[{"key1":"row2","key2":"row2"}]}]}
results1 {"data":[{"row":[{"key1":"row3","key2":"row3"}]},{"row":[{"key1":"row4","key2":"row4"}]}]}
results2 {"data":[{"row":[{"key3":"row1","key4":"row1"}]},{"row":[{"key3":"row2","key4":"row2"}]}]}
results2 {"data":[{"row":[{"key3":"row3","key4":"row3"}]},{"row":[{"key3":"row4","key4":"row4"}]}]}
So:
while IFS=$'\t' read -r key json; do
  printf '%s\n' "$json" >> "${key}.jsonl"
done < <(
  jq -r '...' my_json.json
)
or
jq -r '...' my_json.json | awk -F '\t' '{print $2 > ($1 ".jsonl")}'
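The awk post-processing half can be tried in isolation: the sketch below pipes a few of the TSV rows shown above into the same one-liner, with printf standing in for the jq call so no input file is needed.

```shell
# Simulate the key<TAB>json TSV that the jq filter above produces,
# then split it into one .jsonl file per key with the awk one-liner.
printf '%s\t%s\n' \
  results1 '{"data":[{"row":[{"key1":"row1","key2":"row1"}]}]}' \
  results1 '{"data":[{"row":[{"key1":"row3","key2":"row3"}]}]}' \
  results2 '{"data":[{"row":[{"key3":"row1","key4":"row1"}]}]}' |
awk -F '\t' '{print $2 > ($1 ".jsonl")}'

# results1.jsonl now holds two lines and results2.jsonl one line.
wc -l results1.jsonl results2.jsonl
```

Note that awk keeps each output file open for the duration of the run, so the redirection appends within a single invocation rather than truncating per line.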
Answer 2
Score: 1
With Bash ≥ 4, processing bigger chunks could be improved by reading n lines at once using mapfile:
jq -cr '$ARGS.positional[] as $key | .[$key] | $key, length, .[]' input.json \
  --args results1 results2 |
while read -r key; read -r len; do
  mapfile -t -n "$len"
  printf '%s\n' "${MAPFILE[@]}" > "$key.jsonl"
done
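The jq filter emits, for each requested key, the key name, the element count, and then that many JSON lines; the loop uses the count to mapfile exactly that many lines into one file. The consuming loop can be exercised on its own with a simulated stream (printf stands in for jq here, with shortened dummy objects):

```shell
# Simulate jq's output: key name, element count, then that many lines.
printf '%s\n' \
  results1 2 '{"data":"row1"}' '{"data":"row2"}' \
  results2 1 '{"data":"row3"}' |
while read -r key; read -r len; do
  # mapfile consumes exactly $len lines from the pipe into MAPFILE,
  # which are then written out to the per-key file in one go.
  mapfile -t -n "$len"
  printf '%s\n' "${MAPFILE[@]}" > "$key.jsonl"
done
```

This works because mapfile, like read, does not over-consume from a non-seekable pipe, so the next loop iteration picks up at the following key.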