Decode Marshalled JSON unicode

Question

I think the quickest way to explain my problem is with an example:

package main

import (
	"encoding/json"
	"fmt"
)

type JSON struct {
	Body string
}

func main() {
	body := "<html><body>Hello World</body></html>"

	obj := JSON{body}

	result, _ := json.Marshal(obj)
	fmt.Println(string(result))
}

Output:
> {"Body":"\u003chtml\u003e\u003cbody\u003eHello World\u003c/body\u003e\u003c/html\u003e"}

I'd like the result to be a utf8-encoded string that reads the same as it went in. How can I achieve this? I tried to use utf8.DecodeRune, in a loop:

str := ""

for _, res := range result {
	decoded, _ := utf8.DecodeRune(res)
	str += string(decoded)
}

but that causes a compilation error

> main.go:21: cannot use res (type byte) as type []byte in argument to utf8.DecodeRune

And calling DecodeRune on the marshalled object returns the first character, as you'd expect

> {

Edit: I'm using Go 1.6.2, which apparently doesn't have SetEscapeHTML for whatever reason.

Answer 1

Score: 10

This is intended behavior. From the docs:

> String values encode as JSON strings coerced to valid UTF-8, replacing
> invalid bytes with the Unicode replacement rune. The angle brackets
> "<" and ">" are escaped to "\u003c" and "\u003e" to keep some browsers
> from misinterpreting JSON output as HTML. Ampersand "&" is also
> escaped to "\u0026" for the same reason. This escaping can be disabled
> using an Encoder that had SetEscapeHTML(false) called on it.
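
To illustrate why this escaping is harmless, here is a minimal sketch showing that the escaped output is still valid JSON and decodes back to the original string. The helper name `roundTrip` is an assumption made for this example; the `JSON` struct mirrors the one in the question.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type JSON struct {
	Body string
}

// roundTrip (a hypothetical helper) marshals body inside a JSON struct,
// then unmarshals the result, returning both the encoded text and the
// decoded string.
func roundTrip(body string) (encoded string, decoded string, err error) {
	raw, err := json.Marshal(JSON{Body: body})
	if err != nil {
		return "", "", err
	}
	var out JSON
	if err := json.Unmarshal(raw, &out); err != nil {
		return "", "", err
	}
	return string(raw), out.Body, nil
}

func main() {
	body := "<html><body>Hello World</body></html>"
	encoded, decoded, err := roundTrip(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(encoded)         // the \u003c... form shown in the question
	fmt.Println(decoded == body) // true: any JSON parser restores the original
}
```

So the escapes only matter if the JSON text itself has to be read by a human or compared byte for byte.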

You can get the required result by using an Encoder and calling SetEscapeHTML(false) on it:

func main() {
	body := "<html><body>Hello World</body></html>"
	
	obj := JSON{body}
	
	enc := json.NewEncoder(os.Stdout)
	enc.SetEscapeHTML(false)
	enc.Encode(obj)
}

Working example: https://play.golang.org/p/lMNCJ16dIo
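
If the result is needed as a string rather than written to os.Stdout, the same Encoder can write into a buffer. The following is a sketch under a few assumptions: `marshalNoEscape` is a hypothetical helper name, and `SetEscapeHTML` requires Go 1.7 or later, so it will not compile on Go 1.6.2.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"strings"
)

type JSON struct {
	Body string
}

// marshalNoEscape (a hypothetical helper) encodes v as JSON without
// escaping <, > and &, and returns the result as a string.
func marshalNoEscape(v interface{}) (string, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false) // available since Go 1.7
	if err := enc.Encode(v); err != nil {
		return "", err
	}
	// Encode appends a newline after each value; drop it so the result
	// is comparable to what json.Marshal would return.
	return strings.TrimSuffix(buf.String(), "\n"), nil
}

func main() {
	s, err := marshalNoEscape(JSON{"<html><body>Hello World</body></html>"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s)
}
```

Note the trailing newline that Encode adds; it is easy to miss when comparing against json.Marshal output.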

Answer 2

Score: 3

Another way to achieve this is to replace the escaped characters with their unescaped UTF-8 equivalents. (I used to do this to make non-English letters human-readable in JSON.)

You can use the strconv.Quote() and strconv.Unquote() to do the conversion.

func _UnescapeUnicodeCharactersInJSON(_jsonRaw json.RawMessage) (json.RawMessage, error) {
	str, err := strconv.Unquote(strings.Replace(strconv.Quote(string(_jsonRaw)), `\\u`, `\u`, -1))
	if err != nil {
		return nil, err
	}
	return []byte(str), nil
}

func main() {
	// Both are valid JSON.
	var jsonRawEscaped json.RawMessage   // json raw with escaped unicode chars
	var jsonRawUnescaped json.RawMessage // json raw with unescaped unicode chars

	// '\u263a' == '☺'
	jsonRawEscaped = []byte(`{"HelloWorld": "\uC548\uB155, \uC138\uC0C1(\u4E16\u4E0A). \u263a"}`) // "\\u263a"
	jsonRawUnescaped, _ = _UnescapeUnicodeCharactersInJSON(jsonRawEscaped)                        // "☺"

	fmt.Println(string(jsonRawEscaped))   // {"HelloWorld": "\uC548\uB155, \uC138\uC0C1(\u4E16\u4E0A). \u263a"}
	fmt.Println(string(jsonRawUnescaped)) // {"HelloWorld": "안녕, 세상(世上). ☺"}
}

https://play.golang.org/p/pUsrzrrcDG-
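
For the case in the question (Go 1.6.2, where SetEscapeHTML is not available), a narrower variant of the same idea is to reverse only the three HTML escapes that json.Marshal adds. This is a rough sketch, not a general unescaper: `marshalUnescapedHTML` is a hypothetical name, and the code assumes the data never contains a literal backslash followed by `u003c`, `u003e` or `u0026`, since that would be encoded as `\\u003c` and corrupted by the replacement. Unlike a blanket `\u` conversion, it leaves escapes such as `\u0022` alone, which would otherwise become a bare quote and break the JSON.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

type JSON struct {
	Body string
}

// marshalUnescapedHTML (a hypothetical helper) marshals v and then turns
// the escapes \u003c, \u003e and \u0026 back into <, > and &.
// Assumption: the input contains no literal backslash directly followed
// by one of these "uXXXX" sequences.
func marshalUnescapedHTML(v interface{}) ([]byte, error) {
	b, err := json.Marshal(v)
	if err != nil {
		return nil, err
	}
	b = bytes.Replace(b, []byte(`\u003c`), []byte("<"), -1)
	b = bytes.Replace(b, []byte(`\u003e`), []byte(">"), -1)
	b = bytes.Replace(b, []byte(`\u0026`), []byte("&"), -1)
	return b, nil
}

func main() {
	out, err := marshalUnescapedHTML(JSON{"<html><body>Hello World</body></html>"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(out))
}
```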

I hope this helps.

Answer 3

Score: 0

By the way, here's the reason for the compiler error.

json.Marshal returns a byte slice ([]byte), not a string.

When you iterate over a byte slice using range, you iterate over single bytes, not over runes. You can't pass a single byte value to utf8.DecodeRune: it expects a byte slice ([]byte) and returns the first rune in it together with that rune's width in bytes. A rune is a 32-bit integer value representing a Unicode code point, and runes are what you get if you iterate using range on a string.
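
As an illustration of the difference, here is a small sketch. The helper `decodeRunes` is a hypothetical name, and the sample string is made up for the example; it shows how utf8.DecodeRune is meant to be called, on a slice that is advanced by the returned width.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeRunes (a hypothetical helper) walks a byte slice with
// utf8.DecodeRune, advancing by the width of each decoded rune.
func decodeRunes(b []byte) []rune {
	var runes []rune
	for len(b) > 0 {
		r, size := utf8.DecodeRune(b) // takes []byte, returns rune and width
		runes = append(runes, r)
		b = b[size:]
	}
	return runes
}

func main() {
	// "h\u00e9llo" is 5 runes but 6 bytes: é takes two bytes in UTF-8.
	data := []byte("h\u00e9llo")

	// Ranging over a []byte yields one byte per iteration.
	for i, b := range data {
		fmt.Printf("byte %d: %#x\n", i, b)
	}

	// Ranging over a string yields runes; i is the byte offset of each rune.
	for i, r := range string(data) {
		fmt.Printf("rune at offset %d: %c\n", i, r)
	}

	fmt.Println(string(decodeRunes(data)))
}
```

Decoding runes this way only reassembles the bytes that are already there; it does not interpret JSON `\uXXXX` escapes.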

Now, from what you are wanting to achieve, it doesn't look like you want DecodeRune at all.

The other answer adequately describes how to tell the JSON encoder not to escape < and > characters, i.e.:

enc := json.NewEncoder(os.Stdout)
enc.SetEscapeHTML(false)

huangapple
  • Posted on June 4, 2017, 18:35:00
  • When reposting, please keep the link to this article: https://go.coder-hub.com/44353109.html