Runes within strings

Question

I am going through Go By Example, and the strings and runes section is terribly confusing.

Running this:

    sample := "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"
    fmt.Println(sample)
    fmt.Printf("%%q: %q\n", sample)
    fmt.Printf("%%+q: %+q\n", sample)

yields this:

��=� ⌘
%q: "\xbd\xb2=\xbc ⌘"
%+q: "\xbd\xb2=\xbc \u2318"

...which is fine. The 1st, 2nd and 4th runes seem to be non-printable, which I guess means that \xbd, \xb2 and \xbc are simply not supported by Unicode or something (correct me if I'm wrong), so they show up as �. Both %q and %+q also correctly escape those 3 non-printable runes.

But now when I iterate over the string like so:

    for _, runeValue := range sample {
        fmt.Printf("% x, %q, %+q\n", runeValue, runeValue, runeValue)
    }

Suddenly the 3 non-printable runes are not escaped by %q and remain as �, and %+q attempts to reveal their underlying code point, which is obviously incorrect:

 fffd, '�', '\ufffd'
 fffd, '�', '\ufffd'
 3d, '=', '='
 fffd, '�', '\ufffd'
 20, ' ', ' '
 2318, '⌘', '\u2318'

Even more strangely, if I iterate over the string as a byte slice:

    for _, runeValue := range []byte(sample) {
        fmt.Printf("% x, %q, %+q\n", runeValue, runeValue, runeValue)
    }

Suddenly these runes are no longer non-printable, and their underlying code points are correct:

 bd, '½', '\u00bd'
 b2, '²', '\u00b2'
 3d, '=', '='
 bc, '¼', '\u00bc'
 20, ' ', ' '
 e2, 'â', '\u00e2'
 8c, '\u008c', '\u008c'
 98, '\u0098', '\u0098'

Can someone explain what's happening here?

Answer 1

Score: 0

fmt.Printf does a lot of magic under the covers to render as much useful information as it can, via type inspection etc. If you want to verify whether a string (or a byte slice) is valid UTF-8, use the standard library package unicode/utf8.

For example:

import "unicode/utf8"

var sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

fmt.Printf("%q valid? %v\n", sample, utf8.ValidString(sample)) // reports "false"
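As a rough sketch of how that type inspection plays out in the question's two loops: ranging over a string decodes UTF-8 and yields rune values, so an invalid byte comes out as U+FFFD, while ranging over a []byte yields the raw bytes, and %q formats any integer as if it were a code point. That would explain why the lone byte 0xbd prints as '½' (U+00BD) in the byte loop even though it is not valid UTF-8 by itself. The helper names firstRuneOf and firstByteOf are made up for this illustration.

```go
package main

import "fmt"

// firstRuneOf formats the first value produced by ranging over the string.
// Ranging over a string decodes UTF-8; an invalid byte is reported as U+FFFD.
// (Hypothetical helper, for illustration only.)
func firstRuneOf(s string) string {
	for _, r := range s {
		return fmt.Sprintf("%U %q", r, r)
	}
	return ""
}

// firstByteOf formats the first raw byte of the string.
// %U and %q treat that integer as if it were a code point.
// (Hypothetical helper, for illustration only.)
func firstByteOf(s string) string {
	for _, b := range []byte(s) {
		return fmt.Sprintf("%U %q", b, b)
	}
	return ""
}

func main() {
	s := "\xbd" // a lone byte that is not valid UTF-8
	fmt.Println("range string:", firstRuneOf(s)) // U+FFFD '�'
	fmt.Println("range []byte:", firstByteOf(s)) // U+00BD '½'
}
```

In other words, neither loop is wrong: the first reports the result of UTF-8 decoding, and the second reinterprets each byte value as a code point of the same number.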

Scanning the individual runes of the string, we can identify what makes this string invalid (from a UTF-8 encoding perspective). Note: the hex value 0xfffd indicates that a bad rune was encountered. This error value is defined as the package constant utf8.RuneError:

for _, r := range sample {
	validRune := r != utf8.RuneError // is 0xfffd? i.e. bad rune?
	if validRune {
		fmt.Printf("'%c' validRune: true   hex: %4x\n", r, r)
	} else {
		fmt.Printf("'%c' validRune: false\n", r)
	}
}

https://go.dev/play/p/9NO9xMvcxCp

produces:

'�' validRune: false
'�' validRune: false
'=' validRune: true   hex:   3d
'�' validRune: false
' ' validRune: true   hex:   20
'⌘' validRune: true   hex: 2318

huangapple
  • Posted on March 26, 2023, 01:24:11
  • When reposting, please be sure to keep the link to this article: https://go.coder-hub.com/75843374.html