Go: converting a byte array containing Unicode escapes to a string

Question

type MyStruct struct {
	Value json.RawMessage `json:"value"`
}

var resp *http.Response

if resp, err = http.DefaultClient.Do(req); err == nil {
	if resp.StatusCode == 200 {
		var buffer []byte
		if buffer, err = ioutil.ReadAll(resp.Body); err == nil {

			mystruct = &MyStruct{}
			err = json.Unmarshal(buffer, mystruct)

		}
	}
}

fmt.Println(string(mystruct.Value))

It produces something like:

   \u003Chead>\n  \u003C/head>\n  \u003Cbody>

The documentation at http://golang.org/pkg/encoding/json/#Unmarshal says:

When unmarshaling quoted strings, invalid UTF-8 or invalid UTF-16 surrogate pairs are not treated as an error. Instead, they are replaced by the Unicode replacement character U+FFFD.

I kind of think this is what is going on. I just can't see the answer, as my experience with Go is minimal and I'm tired.

Answer 1

Score: 6

There is a way to convert escaped Unicode characters in a json.RawMessage into plain UTF-8 characters without unmarshalling it. (I had to deal with this issue since my primary language is Korean.)

You can use strconv.Quote() and strconv.Unquote() to do the conversion.

package main

import (
    "encoding/json"
    "fmt"
    "strconv"
    "strings"
)

// _UnescapeUnicodeCharactersInJSON turns \uXXXX escapes in raw JSON into UTF-8 characters.
func _UnescapeUnicodeCharactersInJSON(_jsonRaw json.RawMessage) (json.RawMessage, error) {
    // Quote doubles every backslash; turning `\\u` back into `\u` lets Unquote decode the escapes.
    str, err := strconv.Unquote(strings.Replace(strconv.Quote(string(_jsonRaw)), `\\u`, `\u`, -1))
    if err != nil {
        return nil, err
    }
    return []byte(str), nil
}

func main() {
    // Both are valid JSON.
    var jsonRawEscaped json.RawMessage   // json raw with escaped unicode chars
    var jsonRawUnescaped json.RawMessage // json raw with unescaped unicode chars

    // '\u263a' == '☺'
    jsonRawEscaped = []byte(`{"HelloWorld": "\uC548\uB155, \uC138\uC0C1(\u4E16\u4E0A). \u263a"}`) // "\\u263a"
    jsonRawUnescaped, _ = _UnescapeUnicodeCharactersInJSON(jsonRawEscaped)                        // "☺"

    fmt.Println(string(jsonRawEscaped))   // {"HelloWorld": "\uC548\uB155, \uC138\uC0C1(\u4E16\u4E0A). \u263a"}
    fmt.Println(string(jsonRawUnescaped)) // {"HelloWorld": "안녕, 세상(世上). ☺"}
}

https://play.golang.org/p/pUsrzrrcDG-
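
A note on the Quote/Unquote trick above: as far as I can tell it has two edge cases. strconv.Unquote rejects surrogate halves such as \ud83d, so characters outside the Basic Multilingual Plane (most emoji) make it return an error, and an escaped quote written as \u0022 gets turned into a bare ", which no longer parses as JSON. The sketch below is a different technique, not the one from this answer: it decodes the raw message into a generic value and re-encodes it with HTML escaping switched off. The helper name unescapeJSON is made up for illustration, and the approach assumes you do not mind object keys coming back sorted and whitespace being dropped.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// unescapeJSON is a hypothetical helper (not part of the answer above).
// It decodes the raw JSON and re-encodes it, so \uXXXX escapes,
// including surrogate pairs, come back as plain UTF-8.
func unescapeJSON(raw []byte) ([]byte, error) {
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.UseNumber() // keep numbers as written instead of converting to float64

	var v interface{}
	if err := dec.Decode(&v); err != nil {
		return nil, err
	}

	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false) // do not re-escape <, > and & as \u003c etc.
	if err := enc.Encode(v); err != nil {
		return nil, err
	}
	// Encode appends a trailing newline; trim it.
	return bytes.TrimSpace(buf.Bytes()), nil
}

func main() {
	raw := []byte(`{"HelloWorld": "\uC548\uB155 \u263a \ud83d\ude00", "html": "\u003cb\u003e"}`)
	out, err := unescapeJSON(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(out))
}
```

The trade-off is that this round trip fully parses the value, which is exactly what json.RawMessage is meant to avoid, so it only makes sense when readable output matters more than skipping the parse.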

Hope this helps :D

Answer 2

Score: 3

You decided to use json.RawMessage, which prevents parsing of the value stored under the key "value" in your JSON message.

The string literal "\u003chtml\u003e" is a valid JSON representation of "<html>".

Since you told json.Unmarshal not to parse this part, it does not parse it and returns it to you as-is.

If you want to have it parsed into a UTF-8 string, then change the definition of MyStruct to:

type MyStruct struct {
    Value string `json:"value"`
}
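
To make the difference concrete, here is a minimal sketch that unmarshals the same payload into both shapes. The type names rawValue and stringValue, the helper decodeBoth, and the sample payload are made up for illustration; only the field definitions mirror the ones discussed above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rawValue keeps the JSON text of "value" untouched, escapes and quotes included.
type rawValue struct {
	Value json.RawMessage `json:"value"`
}

// stringValue lets encoding/json decode the escapes into a Go string.
type stringValue struct {
	Value string `json:"value"`
}

// decodeBoth unmarshals the same payload into both struct shapes.
func decodeBoth(payload []byte) (raw string, decoded string, err error) {
	var r rawValue
	if err = json.Unmarshal(payload, &r); err != nil {
		return "", "", err
	}
	var s stringValue
	if err = json.Unmarshal(payload, &s); err != nil {
		return "", "", err
	}
	return string(r.Value), s.Value, nil
}

func main() {
	payload := []byte(`{"value": "\u003chead\u003e\n\u003c/head\u003e"}`)
	raw, decoded, err := decodeBoth(payload)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("raw:     %s\n", raw)     // still escaped, still wrapped in quotes
	fmt.Printf("decoded: %q\n", decoded) // escapes resolved to <, > and a newline
}
```

The json.RawMessage field holds the bytes exactly as they appeared in the payload, while the string field holds the decoded text, which is what the question expected to see.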

Posted by huangapple on March 27, 2015, 23:26:47
Please keep this link when reposting: https://go.coder-hub.com/29304338.html