Golang Decoding/Unmarshaling invalid unicode in JSON

Question

I am fetching JSON files in Go that are not formatted consistently.
For example, I can get the following:

{"email": "\"blah.blah@blah.com\""}
{"email": "robert@gmail.com"}
{"name": "m\3\3ead"}

The escape characters are the problem. When I decode with json.Decode and the input is:

{"name": "m\3\3ead"}

I get the error: invalid character '3' in string escape code

I have tried several approaches to normalise my data, for example going through a string array (it works, but there are too many edge cases), or even filtering out the escape characters.

Finally, I came across this article: http://blog.golang.org/normalization
The solution it proposes seemed very interesting.

I have tried the following:

isMn := func(r rune) bool {
    return unicode.Is(unicode.Mn, r)
}

t := transform.Chain(norm.NFC, transform.RemoveFunc(isMn), norm.NFD)

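fileReader, err := bucket.GetReader(filename)
if err != nil {
    // handle the error from the bucket
}

transformReader := transform.NewReader(fileReader, t)

decoder := json.NewDecoder(transformReader)

for {
    var dataModel Model
    if err := decoder.Decode(&dataModel); err == io.EOF {
        break
    } else if err != nil {
        // handle the decode error
        break
    }
    // DO SOMETHING with dataModel
}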

With Model being:

type Model struct {
    Name  string `json:"name" bson:"name"`
    Email string `json:"email" bson:"email"` 
}

I have tried several variations of it, but haven't been able to get it working.

So my question is: how can I easily handle decoding/unmarshaling JSON data with different encodings, given that I have no control over these JSON files?

If you are reading this, thank you anyway.

Answer 1

Score: 4

You can use json.RawMessage instead of string; that way json.Decode won't try to decode the invalid characters.

Playground: http://play.golang.org/p/fB-38KGAO0

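package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Model keeps the name as raw JSON bytes instead of a decoded string.
type Model struct {
	N json.RawMessage `json:"name" bson:"name"`
}

// Name returns the raw JSON value, including its surrounding quotes.
func (m *Model) Name() string {
	return string(m.N)
}

func main() {
	s := "{\"name\": \"m33ead\"}"
	r := strings.NewReader(s)
	d := json.NewDecoder(r)
	m := Model{}

	fmt.Println(d.Decode(&m)) // prints the decode error, or <nil> on success
	fmt.Println(m.Name())
}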
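
One caveat worth checking against your own data: as far as I can tell, encoding/json scans each value for syntactic validity before it fills any field, so a backslash followed by an unsupported character is still rejected when the target is a json.RawMessage. The sketch below compares an input containing a literal `\3` with a well-formed one. The `decodeRaw` helper and both sample inputs are made up for illustration and assume the file really contains a backslash before the 3, as the error message suggests.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Model mirrors the struct above: the name is kept as raw JSON bytes.
type Model struct {
	N json.RawMessage `json:"name"`
}

// decodeRaw is a hypothetical helper that decodes one JSON object from s
// and returns the raw bytes of its "name" field as a string.
func decodeRaw(s string) (string, error) {
	var m Model
	err := json.NewDecoder(strings.NewReader(s)).Decode(&m)
	return string(m.N), err
}

func main() {
	// Raw string literals keep the backslashes exactly as they would
	// appear in the file.
	inputs := []string{
		`{"name": "m\3\3ead"}`,  // "\3" is not a valid JSON escape
		`{"name": "m\u00e9ad"}`, // "\u00e9" is a valid JSON escape
	}
	for _, in := range inputs {
		raw, err := decodeRaw(in)
		fmt.Printf("input=%s raw=%s err=%v\n", in, raw, err)
	}
}
```

If the first input still fails for you, json.RawMessage alone is not enough and the bytes have to be cleaned up before they reach the decoder.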

Edit: You can also use a regex, though I'm not sure how viable that is for you: http://play.golang.org/p/VYJKTKmiYm

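// Needs the "regexp" package in addition to the imports and the Model type
// of the first example.

// cleanUp rewrites a backslash followed by three digits (for example \123)
// as \u0123. Because of the leading \b, the backslash must directly follow
// a letter, digit, or underscore for the pattern to match.
func cleanUp(s string) string {
	re := regexp.MustCompile(`\b(\\\d\d\d)`)
	return re.ReplaceAllStringFunc(s, func(s string) string {
		return `\u0` + s[1:]
	})
}

func main() {
	s := "{\"name\": \"m33ead\"}"
	s = cleanUp(s)
	r := strings.NewReader(s)
	d := json.NewDecoder(r)
	m := Model{}
	fmt.Println(d.Decode(&m)) // prints the decode error, or <nil> on success
	fmt.Println(m.Name())
}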
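
The regex above only targets a backslash followed by three digits. If the files contain other stray backslashes, one alternative is to double every backslash that does not start an escape JSON understands, then decode line by line. This is only a sketch under assumptions that may not hold for your files: one JSON object per line, and treating a stray backslash as a literal character is acceptable. `escapeStrayBackslashes` and `decodeLine` are hypothetical names, and the sample input assumes the file has a literal backslash before each 3.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// Model matches the struct from the question (bson tags omitted).
type Model struct {
	Name  string `json:"name"`
	Email string `json:"email"`
}

// escapeStrayBackslashes doubles each backslash that is not followed by a
// character JSON accepts after a backslash (" \ / b f n r t u), so the
// decoder reads it as a literal backslash instead of a broken escape.
// It does not check that \u is followed by four hex digits.
func escapeStrayBackslashes(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); i++ {
		c := s[i]
		if c != '\\' {
			b.WriteByte(c)
			continue
		}
		if i+1 < len(s) && strings.IndexByte(`"\/bfnrtu`, s[i+1]) >= 0 {
			// Valid escape: copy both bytes and skip past the pair.
			b.WriteByte(c)
			b.WriteByte(s[i+1])
			i++
			continue
		}
		// Stray backslash: emit it escaped.
		b.WriteString(`\\`)
	}
	return b.String()
}

// decodeLine sanitises one line and unmarshals it into a Model.
func decodeLine(line string) (Model, error) {
	var m Model
	err := json.Unmarshal([]byte(escapeStrayBackslashes(line)), &m)
	return m, err
}

func main() {
	input := `{"email": "\"blah.blah@blah.com\""}
{"email": "robert@gmail.com"}
{"name": "m\3\3ead"}`

	sc := bufio.NewScanner(strings.NewReader(input))
	for sc.Scan() {
		m, err := decodeLine(sc.Text())
		fmt.Printf("%+v err=%v\n", m, err)
	}
	if err := sc.Err(); err != nil {
		fmt.Println("read error:", err)
	}
}
```

Keep in mind that bufio.Scanner rejects lines longer than its buffer (64 KB by default), so very long records would need a larger buffer via its Buffer method. Whether a stray backslash should be kept, dropped, or reinterpreted depends on what produced the files.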

huangapple
  • Posted on 2014-06-19 03:41:07
  • When reposting, please keep the link to this article: https://go.coder-hub.com/24293790.html