Unmarshal a JSON stream (not newline-separated)

Question

I want to turn a stream of JSON into a stream of objects. This is easy to do with newline-separated JSON. From the Go docs: https://golang.org/pkg/encoding/json/#Decoder.Buffered

However, I need to generate a stream from JSON arrays like this one:

        [{"Name": "Ed", "Text": "Knock knock."},
        {"Name": "Sam", "Text": "Who's there?"},
        {"Name": "Ed", "Text": "Go fmt."},
        {"Name": "Sam", "Text": "Go fmt who?"},
        {"Name": "Ed", "Text": "Go fmt yourself!"}]

What is a performant way to do this?

I've considered this method:

  1. Drop the outer square brackets.
  2. Whenever a pair of matching top-level curly braces is found, unmarshal the string between the braces (inclusive) to get one top-level object at a time.

I don't want to do this because of the performance cost of scanning each portion of the string twice.

The best alternative I can think of is to copy the source code for the decoder in Go's encoding/json package and modify it so that it returns a Reader that emits one object at a time. But that seems like too much work for such a simple requirement.

Is there a better way to decode a stream that is a JSON array?

EDIT

I'm looking to parse JSON with nested objects and arbitrary structure.

Answer 1

Score: 1

You can use a streaming parser, for example megajson's scanner:

package main

import (
	"fmt"
	"strings"

	"github.com/benbjohnson/megajson/scanner"
)

func main() {
	// our incoming data
	rdr := strings.NewReader(`[
		{"Name": "Ed", "Text": "Knock knock."},
		{"Name": "Sam", "Text": "Who's there?"},
		{"Name": "Ed", "Text": "Go fmt."},
		{"Name": "Sam", "Text": "Go fmt who?"},
		{"Name": "Ed", "Text": "Go fmt yourself!"}
	]`)

	// we want to create a list of these
	type Object struct {
		Name string
		Text string
	}
	objects := make([]Object, 0)

	// scan the JSON as we read
	s := scanner.NewScanner(rdr)

	// this is how we keep track of where we are parsing the JSON
	// if you needed to support nested objects you would need to
	// use a stack here ([]state{}) and push / pop each time you
	// see a brace
	var state struct {
		inKey   bool
		lastKey string
		object  Object
	}
	for {
		tok, data, err := s.Scan()
		if err != nil {
			break
		}

		switch tok {
		case scanner.TLBRACE:
			// just saw '{' so start a new object
			state.inKey = true
			state.lastKey = ""
			state.object = Object{}
		case scanner.TRBRACE:
			// just saw '}' so store the object
			objects = append(objects, state.object)
		case scanner.TSTRING:
			// for `key: value`, we just parsed 'key'
			if state.inKey {
				state.lastKey = string(data)
			} else {
				// now we are on `value`
				if state.lastKey == "Name" {
					state.object.Name = string(data)
				} else {
					state.object.Text = string(data)
				}
			}
			state.inKey = !state.inKey
		}
	}
	fmt.Println(objects)
}

This is probably as efficient as you can get, but it does require a lot of manual processing.

Answer 2

Score: 0

Assume the JSON stream looks like this:

{"Name": "Ed", "Text": "Knock knock."}{"Name": "Sam", "Text": "Who's there?"}{"Name": "Ed", "Text": "Go fmt."}

Here is an idea, in pseudocode:

1: skip leading whitespace
2: if the first char is not '{', throw an error
3: load some chars and find the first '}'
    4: if found, try json.Unmarshal() on the text up to and including it
        5: if unmarshal fails, load more chars and find the next '}'
             6: repeat step 4

Answer 3

Score: 0

Below is an implementation that is already working in my project:

package json

import (
	"bytes"
	j "encoding/json"
	"errors"
	"io"
	"strings"
)

// Stream represents a JSON stream.
type Stream struct {
	stream *bytes.Buffer
	object *bytes.Buffer
	scrap  *bytes.Buffer
}

// NewStream returns a Stream based on src.
func NewStream(src []byte) *Stream {
	return &Stream{
		stream: bytes.NewBuffer(src),
		object: new(bytes.Buffer),
		scrap:  new(bytes.Buffer),
	}
}

// Read reads one JSON object from the stream.
func (s *Stream) Read() ([]byte, error) {
	var obj []byte

	for {
		// read a rune from stream
		r, _, err := s.stream.ReadRune()
		switch err {
		case nil:
		case io.EOF:
			if strings.TrimSpace(s.object.String()) != "" {
				return nil, errors.New("Invalid JSON")
			}

			fallthrough
		default:
			return nil, err
		}

		// write the rune to object buffer
		if _, err := s.object.WriteRune(r); err != nil {
			return nil, err
		}

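		if r == '}' {
			// check whether the text buffered so far is valid JSON
			err := j.Compact(s.scrap, s.object.Bytes())
			s.scrap.Reset()
			if err != nil {
				continue
			}

			// copy the bytes out: the slice returned by Bytes() shares the
			// buffer's memory, which is reused after Reset
			obj = append([]byte(nil), s.object.Bytes()...)
			s.object.Reset()

			break
		}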
	}

	return obj, nil
}

Usage looks like this:

func process(src []byte) error {
	s := json.NewStream(src)

	for {
		obj, err := s.Read()
		switch err {
		case nil:
		case io.EOF:
			return nil 
		default:
			return err 
		}   

		// now you can try to decode obj into a struct/map/...
		// mixed streams are also supported, e.g.:
		a := new(TypeOne)
		b := new(TypeTwo)
		if err := j.Unmarshal(obj, a); err == nil && a.Error != "" {
			// it is a TypeOne object
		} else if err := j.Unmarshal(obj, b); err == nil && b.ID != "" {
			// it is a TypeTwo object
		} else {
			// unknown type
		}
	}

	return nil
}

huangapple
  • Posted on May 1, 2015, 05:47:45
  • When reposting, please keep the link to this article: https://go.coder-hub.com/29978394.html