Using jq to extract multiple json objects
Question
I have been using jq to successfully extract one JSON blob at a time from some relatively large files and write it out to a file of one JSON object per line for further processing. Here is an example of the JSON format:
{
  "date": "2023-07-30",
  "results1": [
    {
      "data": [
        {"row": [{"key1": "row1", "key2": "row1"}]},
        {"row": [{"key1": "row2", "key2": "row2"}]}
      ]
    },
    {
      "data": [
        {"row": [{"key1": "row3", "key2": "row3"}]},
        {"row": [{"key1": "row4", "key2": "row4"}]}
      ]
    }
  ],
  "results2": [
    {
      "data": [
        {"row": [{"key3": "row1", "key4": "row1"}]},
        {"row": [{"key3": "row2", "key4": "row2"}]}
      ]
    },
    {
      "data": [
        {"row": [{"key3": "row3", "key4": "row3"}]},
        {"row": [{"key3": "row4", "key4": "row4"}]}
      ]
    }
  ]
}
My current approach is to run the following and redirect the stdout to a file:
jq -rc ".results1[]" my_json.json
This works fine; however, it seems like jq reads the entire file into memory in order to extract the chunk I am interested in.
Questions:
- Does jq read the entire file into memory when I execute the above statement?
- Assuming the answer is yes, is there a way that I can extract results1[] and results2[] in the same call to avoid reading the file twice?
I have used the --stream option, but it is very slow. I also read that it sacrifices speed for memory savings; memory is not an issue at this time, so I would prefer to avoid using this option. Basically, what I need is to read in the above JSON once and output two files in JSON Lines format.
Edit: (I changed the input data a bit to show the differences in the output)
Output file 1:
{"data":[{"row":[{"key1":"row1","key2":"row1"}]},{"row":[{"key1":"row2","key2":"row2"}]}]}
{"data":[{"row":[{"key1":"row3","key2":"row3"}]},{"row":[{"key1":"row4","key2":"row4"}]}]}
Output file 2:
{"data":[{"row":[{"key3":"row1","key4":"row1"}]},{"row":[{"key3":"row2","key4":"row2"}]}]}
{"data":[{"row":[{"key3":"row3","key4":"row3"}]},{"row":[{"key3":"row4","key4":"row4"}]}]}
It seems pretty well known that the streaming option is slow. See the discussion here.
My attempt at implementing it followed the answer here.
Answer 1
Score: 1
jq doesn't have any file I/O facilities, so you can't output multiple files from it directly.
You can output each piece of data with its key and post-process it:
jq -r '
to_entries[]
| select(.key != "date")
| .key as $k
| .value[]
| [$k, @json]
| @tsv
' my_json.json
outputs
results1 {"data":[{"row":[{"key1":"row1","key2":"row1"}]},{"row":[{"key1":"row2","key2":"row2"}]}]}
results1 {"data":[{"row":[{"key1":"row3","key2":"row3"}]},{"row":[{"key1":"row4","key2":"row4"}]}]}
results2 {"data":[{"row":[{"key3":"row1","key4":"row1"}]},{"row":[{"key3":"row2","key4":"row2"}]}]}
results2 {"data":[{"row":[{"key3":"row3","key4":"row3"}]},{"row":[{"key3":"row4","key4":"row4"}]}]}
So:
while IFS=$'\t' read -r key json; do
  printf '%s\n' "$json" >> "${key}.jsonl"
done < <(
  jq -r '...' my_json.json
)
or
jq -r '...' my_json.json | awk -F '\t' '{print $2 > ($1 ".jsonl")}'
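The awk post-processing half can be tried in isolation: the sketch below pipes a few of the TSV rows shown above into the same one-liner, with printf standing in for the jq call so no input file is needed.

```shell
# Simulate the key<TAB>json TSV that the jq filter above produces,
# then split it into one .jsonl file per key with the awk one-liner.
printf '%s\t%s\n' \
  results1 '{"data":[{"row":[{"key1":"row1","key2":"row1"}]}]}' \
  results1 '{"data":[{"row":[{"key1":"row3","key2":"row3"}]}]}' \
  results2 '{"data":[{"row":[{"key3":"row1","key4":"row1"}]}]}' |
awk -F '\t' '{print $2 > ($1 ".jsonl")}'

# results1.jsonl now holds two lines and results2.jsonl one line.
wc -l results1.jsonl results2.jsonl
```

Note that awk keeps each output file open for the duration of the run, so the redirection appends within a single invocation rather than truncating per line.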
Answer 2
Score: 1
With Bash ≥ 4, processing bigger chunks could be improved by reading n lines at once using mapfile:
jq -cr '$ARGS.positional[] as $key | .[$key] | $key, length, .[]' input.json \
  --args results1 results2 |
while read -r key; read -r len; do
  mapfile -t -n "$len"
  printf '%s\n' "${MAPFILE[@]}" > "$key.jsonl"
done
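The jq filter emits, for each requested key, the key name, the element count, and then that many JSON lines; the loop uses the count to mapfile exactly that many lines into one file. The consuming loop can be exercised on its own with a simulated stream (printf stands in for jq here, with shortened dummy objects):

```shell
# Simulate jq's output: key name, element count, then that many lines.
printf '%s\n' \
  results1 2 '{"data":"row1"}' '{"data":"row2"}' \
  results2 1 '{"data":"row3"}' |
while read -r key; read -r len; do
  # mapfile consumes exactly $len lines from the pipe into MAPFILE,
  # which are then written out to the per-key file in one go.
  mapfile -t -n "$len"
  printf '%s\n' "${MAPFILE[@]}" > "$key.jsonl"
done
```

This works because mapfile, like read, does not over-consume from a non-seekable pipe, so the next loop iteration picks up at the following key.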