Convert unicode code point to literal character in Go


Question


Let's say I have a text file like this.

\u0053
\u0075
\u006E

Is there a way I can convert that to this?

S
u
n

Currently, I'm using ioutil.ReadFile("data.txt"), but when I print the data, I get the Unicode code points instead of the string literals. I realize this is the correct behavior for ReadFile; it's just not what I want.

I'm aiming for a substitution of the code points with their literal characters.

Answer 1

Score: 7


You can use the strconv.Unquote() and strconv.UnquoteChar() functions to do the conversion.

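One thing you should be aware of is that strconv.Unquote() can only unquote strings that are themselves quoted, i.e. that start and end with a double quote ", a single quote ' or a back quote `. The lines in the file are not, so we have to add the quotes manually. Use double quotes for this, because escape sequences are not interpreted inside back quotes.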

See this example:

lines := []string{
	`\u0053`,
	`\u0075`,
	`\u006E`,
}
fmt.Println(lines)

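for i, v := range lines {
	// Unquote needs a quoted literal, so wrap each line in double quotes.
	s, err := strconv.Unquote(`"` + v + `"`)
	if err != nil {
		fmt.Println(err)
		continue // keep the original line if it cannot be unquoted
	}
	lines[i] = s
}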
fmt.Println(lines)

fmt.Println(strconv.Unquote(`"Go\u0070\x68\x65\x72"`))

Output:

[\u0053 \u0075 \u006E]
[S u n]
Gopher <nil>

If the strings you want to unquote contain the escape sequence of a single rune (or you just want to unquote the first rune), you may use strconv.UnquoteChar(). This is what it looks like (note: unlike with strconv.Unquote(), the input does not need to be quoted in this case):

runes := []string{
	`\u0053`,
	`\u0075`,
	`\u006E`,
}
fmt.Println(runes)

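for _, v := range runes {
	// UnquoteChar decodes only the first character or escape sequence in v.
	value, _, _, err := strconv.UnquoteChar(v, 0)
	if err != nil {
		fmt.Println(err)
		continue // do not print a value that failed to decode
	}
	fmt.Printf("%c\n", value)
}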

This will output:

[\u0053 \u0075 \u006E]
S
u
n
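
strconv.UnquoteChar() also returns the unconsumed rest of the string, so it can be called in a loop to decode a string that holds several escapes. The sketch below illustrates that; the function name unquoteAll is an assumption made for this example. It follows the documented meaning of the multibyte return value: a \x escape yields a single byte, while a \u escape yields a rune that has to be UTF-8 encoded.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode/utf8"
)

// unquoteAll (hypothetical helper) decodes s one character or escape
// sequence at a time, feeding the unconsumed tail back into UnquoteChar.
func unquoteAll(s string) (string, error) {
	var b strings.Builder
	for len(s) > 0 {
		r, multibyte, tail, err := strconv.UnquoteChar(s, 0)
		if err != nil {
			return "", err
		}
		if r < utf8.RuneSelf || !multibyte {
			// ASCII, or a raw byte produced by a \x or octal escape.
			b.WriteByte(byte(r))
		} else {
			// A full code point: write its UTF-8 encoding.
			b.WriteRune(r)
		}
		s = tail
	}
	return b.String(), nil
}

func main() {
	for _, in := range []string{`\u0053\u0075\u006E`, `Go\u0070\x68\x65\x72`} {
		out, err := unquoteAll(in)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(out)
	}
}
```

Passing 0 as the quote argument means both kinds of quote characters may appear unescaped in the input, so no wrapping in quotes is needed here either.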

Answer 2

Score: 3


A slightly different approach is to use strconv.ParseInt. It generates less garbage and goes through less internal logic than Unquote (which does a lot of other checks) when parsing the lines:

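for i, v := range lines {
	// Only handle lines of the exact form \uXXXX; leave everything else as is.
	if len(v) != 6 || v[:2] != `\u` {
		continue
	}

	if r, err := strconv.ParseInt(v[2:], 16, 32); err == nil {
		// Convert via rune: string(r) on an int64 is flagged by go vet.
		lines[i] = string(rune(r))
	}
}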
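
The same idea can be packaged as a small function. This is only a sketch under a few assumptions: the name parseEscape is made up, each line is expected to be exactly one \uXXXX escape, and strconv.ParseUint with a 16-bit size is swapped in for ParseInt so that a sign character in the hex digits is rejected.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseEscape (hypothetical helper) converts a line of the exact form
// \uXXXX into the character it names. ok is false for any other input.
func parseEscape(line string) (s string, ok bool) {
	if len(line) != 6 || !strings.HasPrefix(line, `\u`) {
		return "", false
	}
	// Four hex digits always fit in 16 bits.
	n, err := strconv.ParseUint(line[2:], 16, 16)
	if err != nil {
		return "", false
	}
	// Converting a rune to string yields its UTF-8 encoding.
	return string(rune(n)), true
}

func main() {
	for _, line := range []string{`\u0053`, `\u0075`, `\u006E`, `plain text`} {
		if s, ok := parseEscape(line); ok {
			fmt.Println(s)
		} else {
			fmt.Println(line) // not an escape: print unchanged
		}
	}
}
```

Note that a surrogate value such as \uD800 is not a valid code point on its own; the string conversion turns it into the replacement character U+FFFD, whereas strconv.Unquote() would report a syntax error.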


Answer 3

Score: 1


You can use this:

import (
	"regexp"
	"strconv"
	"unicode/utf8"
)

// Compiled once rather than on every call.
var unicodeEscapeRx = regexp.MustCompile(`\\u[0-9a-fA-F]{4}`)

// unicodeUnquote converts Unicode escapes such as \u0053 to their UTF-8 encoding.
func unicodeUnquote(bs []byte) []byte {
	return unicodeEscapeRx.ReplaceAllFunc(bs, func(code []byte) []byte {
		r, _, _, err := strconv.UnquoteChar(string(code), 0)
		if err != nil {
			return code // leave anything that fails to parse untouched
		}
		// Size the buffer by encoded byte length (utf8.RuneLen), not display width.
		buf := make([]byte, utf8.RuneLen(r))
		utf8.EncodeRune(buf, r)
		return buf
	})
}

A full example is at https://go.dev/play/p/ElIGZvJNyEF.

huangapple
  • Published on December 7, 2015 at 12:59:35
  • When reposting, please keep the link to this article: https://go.coder-hub.com/34126749.html