How to parse or control the control flow of a JSON stream in Golang?


Question

Background

I have large JSON files (2 GiB < myfile < 10 GiB) that I need to parse. Because of their size, I cannot hold a whole file in a variable and unmarshal it.

That is why I am trying to use json.NewDecoder, as shown in the example here.

A bit about the data

The data I have looks roughly like the following:

{
   "key1" : [ "hundreds_of_nested_objects" ],
   "key2" : [ "hundreds_of_nested_objects" ],
   "unknown_unexpected_key" : [ "many_nested_objects" ],
   ........
   "keyN" : [ n_objects ]
}

The code I am trying to use:

	file, err := os.Open("myfile.json")
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close()

	dec := json.NewDecoder(file)
	for {
		t, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%T: %v\n", t, t) // print each token's Go type and value
	}

Problem statement

  1. How should I approach this kind of data structure with json.NewDecoder?
  2. What are the best practices for dealing with this kind of problem?
  3. Pointers to any existing code that does something similar would be helpful.

To clarify, a few use cases could be the following:

  • Parse only key1 or keyN instead of the whole file.
  • Grab only a specific key and find some nested objects under that key.
  • Dump the contents of keys, or of some objects inside them, to another file.

N.B.: I am new to development, so my question might be too broad. Any guidance on improving it would also be helpful.

Answer 1

Score: 4

Use a streaming decoder

For starters, when you use json.Unmarshal you have to give it all the bytes of the JSON input, so you need to read the entire source file before you can even start allocating memory for the Go representation of the data.

I hardly ever use json.Unmarshal. Use json.NewDecoder like this, which streams the data into the unmarshaler bit by bit.

You still have to fit the whole data representation in memory (at least the parts you model), but depending on the data in the JSON, that can be quite a bit less memory than the JSON representation itself.

For example, in JSON, numbers are represented as strings of digit characters, but they often fit into much smaller int or float64 values. Booleans and nulls are also much bigger in JSON than their Go representations. The structural characters []{}: probably won't take as much space in Go's in-memory types. And of course any whitespace in the JSON does nothing but make the file larger. (I'd recommend minifying the JSON to remove unnecessary whitespace, but that only matters for storage and won't have much effect once you're streaming the JSON data.)
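
To make the streaming approach concrete, here is a minimal sketch under a few assumptions: the file is named myfile.json and has the top-level shape shown in the question (an object whose values are arrays). It walks the top-level keys with dec.Token() and decodes one array element at a time with dec.Decode, so only a single element is held in memory at once.

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
)

func main() {
	file, err := os.Open("myfile.json") // placeholder file name
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close()

	dec := json.NewDecoder(file)

	// Consume the opening '{' of the top-level object.
	if _, err := dec.Token(); err != nil {
		log.Fatal(err)
	}

	// Loop over the top-level keys.
	for dec.More() {
		keyTok, err := dec.Token()
		if err != nil {
			log.Fatal(err)
		}
		key := keyTok.(string)

		// Consume the opening '[' of this key's array.
		if _, err := dec.Token(); err != nil {
			log.Fatal(err)
		}

		// Decode the array one element at a time.
		for dec.More() {
			var elem json.RawMessage // or a concrete struct type
			if err := dec.Decode(&elem); err != nil {
				log.Fatal(err)
			}
			fmt.Printf("%s: element of %d bytes\n", key, len(elem))
		}

		// Consume the closing ']'.
		if _, err := dec.Token(); err != nil {
			log.Fatal(err)
		}
	}

	// Consume the closing '}'.
	if _, err := dec.Token(); err != nil {
		log.Fatal(err)
	}
}

This only prints the size of each element, but the inner loop is where you would decode just the key you care about, or write elements out to another file.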

Model your data with structs and omit as much data from your model as possible

If there is a lot of JSON data that isn't relevant to your operation, omit it from your models.

You can't do this if you let the decoder decode into generic types like map[string]interface{}. But if you use the struct-based decoding mechanism, you can specify which fields you want to store, which can significantly decrease the size of your in-memory representation. Combined with the streaming decoder, that might solve your memory constraint.

Clearly some of your data has unknown keys, so you can't store all of it in structs. If you can't define structs that match your data well enough, this option is off the table.
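
As a sketch of what the struct-based modelling might look like for the nested objects whose shape you do know (the Record type and its fields here are made up for illustration, since the real object layout isn't shown in the question):

// Hypothetical model: keep only the fields you actually need.
// JSON fields with no matching struct field are simply discarded
// during decoding, so they never take up memory.
type Record struct {
	ID   int    `json:"id"`
	Name string `json:"name"`
}

Inside the element loop of the earlier sketch you would then decode into a Record (var rec Record; dec.Decode(&rec)) instead of a json.RawMessage.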

Add memory or swap to your system

Memory is cheap these days, and disk space for swap is even cheaper. If you can get away with adding either to relieve the memory constraint, so that the whole data representation fits in memory, that is by far the simplest solution to your problem.

Convert the data to a format that can be accessed more efficiently in storage

JSON is a great format and one of my favorites. But it is not very good for storing large volumes of data, because accessing subsets of that data is cumbersome.

Formats like Parquet store data in a way that makes the underlying storage much more efficient to query and navigate, which means you can read just the parts of the data you want directly from on-disk storage instead of reading all of it and then representing it in memory.

The same is true of a well-indexed SQL or NoSQL database.

You could even break the data into multiple JSON files that you read and process sequentially.
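
As one sketch of that "split it up" idea, reusing the token-walking pattern from the first example: copy each top-level key's array into its own file, one JSON value per line, so each piece can later be read and processed on its own. The input name myfile.json is a placeholder, and this assumes the key names are safe to use as file names.

package main

import (
	"encoding/json"
	"log"
	"os"
)

func main() {
	in, err := os.Open("myfile.json")
	if err != nil {
		log.Fatal(err)
	}
	defer in.Close()

	dec := json.NewDecoder(in)
	if _, err := dec.Token(); err != nil { // opening '{'
		log.Fatal(err)
	}

	for dec.More() {
		keyTok, err := dec.Token()
		if err != nil {
			log.Fatal(err)
		}
		key := keyTok.(string)

		out, err := os.Create(key + ".json") // e.g. key1.json
		if err != nil {
			log.Fatal(err)
		}
		enc := json.NewEncoder(out)

		if _, err := dec.Token(); err != nil { // opening '['
			log.Fatal(err)
		}
		for dec.More() {
			var elem json.RawMessage
			if err := dec.Decode(&elem); err != nil {
				log.Fatal(err)
			}
			if err := enc.Encode(elem); err != nil { // one value per line
				log.Fatal(err)
			}
		}
		if _, err := dec.Token(); err != nil { // closing ']'
			log.Fatal(err)
		}
		out.Close()
	}
}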

Last resort: implement your own scanning functionality

If you really can't (or won't) add memory or swap space, change the data format, or break it into smaller parts, then you have to write a JSON scanner that keeps track of your position in the JSON file so that you know how to process the data. You won't have a representation of all the data at once, but you may be able to pick out the pieces you need without ever storing a representation of everything.

It's complicated, though, and specific to the task at hand; there is no generic answer for how to do it.
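
For a feel of what that position tracking involves, here is a small sketch of a helper that skips over whatever JSON value comes next in a json.Decoder by counting opening and closing delimiters. It still leans on encoding/json's tokenizer rather than a hand-rolled scanner, and it assumes the same encoding/json import as the earlier sketches, but combined with the key loop from the first example it lets you fast-forward past the keys you don't care about and decode only the one you want.

// skipValue reads and discards the next JSON value (scalar, array, or
// object) from dec by tracking nesting depth via the {} and [] delimiters.
func skipValue(dec *json.Decoder) error {
	depth := 0
	for {
		t, err := dec.Token()
		if err != nil {
			return err
		}
		switch t {
		case json.Delim('{'), json.Delim('['):
			depth++
		case json.Delim('}'), json.Delim(']'):
			depth--
		}
		if depth == 0 {
			return nil // the whole value has been consumed
		}
	}
}

After reading a key token in the top-level loop, call skipValue(dec) when it isn't the key you're after, and dec.Decode (or the element-by-element loop) when it is.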
