Remove invalid UTF-8 characters from a string
Question
I get this on json.Marshal of a list of strings:
json: invalid UTF-8 in string: "...ole\xc5\"
The reason is obvious, but how can I delete or replace such characters in Go? I've been reading the docs for the unicode and unicode/utf8 packages, and there seems to be no obvious or quick way to do it.
In Python, for example, there are methods for this: invalid characters can be deleted, replaced by a specified character, or handled with a strict setting that raises an exception on invalid characters. How can I do the equivalent in Go?
UPDATE: I meant the reason for getting an exception (panic?): an illegal character in what json.Marshal expects to be a valid UTF-8 string.
(How the illegal byte sequence got into that string is not important; the usual ways are bugs, file corruption, other programs that do not conform to Unicode, etc.)
Answer 1
Score: 26
In Go 1.13+, you can do this:
strings.ToValidUTF8("a\xc5z", "")
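In Go 1.11+, it's also very easy to do the same using the strings.Map function and utf8.RuneError, like this:
fixUtf := func(r rune) rune {
    if r == utf8.RuneError {
        // A negative return value makes strings.Map drop the rune.
        return -1
    }
    return r
}
// Invalid bytes decode to utf8.RuneError, so they are removed.
fmt.Println(strings.Map(fixUtf, "a\xc5z"))
// A literal U+FFFD in the input is removed as well.
fmt.Println(strings.Map(fixUtf, "posic�o"))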
Output:
az
posico
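The three Python behaviours mentioned in the question (ignore, replace, strict) can be approximated with the standard library. The following is only a sketch, assuming Go 1.13+ for strings.ToValidUTF8; the helper names dropInvalid, replaceInvalid, strictUTF8 and the errInvalidUTF8 value are made up for illustration and are not part of any package.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"unicode/utf8"
)

// errInvalidUTF8 is a hypothetical sentinel error used by the strict mode.
var errInvalidUTF8 = errors.New("invalid UTF-8")

// dropInvalid removes invalid byte sequences (similar to Python's "ignore").
func dropInvalid(s string) string {
	return strings.ToValidUTF8(s, "")
}

// replaceInvalid substitutes repl for each run of invalid bytes
// (similar to Python's "replace").
func replaceInvalid(s, repl string) string {
	return strings.ToValidUTF8(s, repl)
}

// strictUTF8 returns an error when s is not valid UTF-8
// (similar to Python's "strict").
func strictUTF8(s string) (string, error) {
	if !utf8.ValidString(s) {
		return "", errInvalidUTF8
	}
	return s, nil
}

func main() {
	// Clean every string in the list before marshalling it.
	words := []string{"a\xc5z", "...ole\xc5"}
	for i, w := range words {
		words[i] = dropInvalid(w)
	}
	out, err := json.Marshal(words)
	fmt.Println(string(out), err) // ["az","...ole"] <nil>

	fmt.Println(replaceInvalid("a\xc5z", "?")) // a?z

	_, err = strictUTF8("a\xc5z")
	fmt.Println(err) // invalid UTF-8
}
```

Unlike the strings.Map version, strings.ToValidUTF8 only touches bytes that are actually invalid, so a correctly encoded U+FFFD already present in the input is left alone.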
Answer 2
Score: 23
For example,
package main
import (
    "fmt"
    "unicode/utf8"
)
func main() {
    s := "a\xc5z"
    fmt.Printf("%q\n", s)
    if !utf8.ValidString(s) {
        v := make([]rune, 0, len(s))
        for i, r := range s {
            if r == utf8.RuneError {
                // Size 1 means an invalid byte; a real U+FFFD is 3 bytes long.
                _, size := utf8.DecodeRuneInString(s[i:])
                if size == 1 {
                    continue
                }
            }
            v = append(v, r)
        }
        s = string(v)
    }
    fmt.Printf("%q\n", s)
}
Output:
"a\xc5z"
"az"
> Unicode Standard
>
> FAQ - UTF-8, UTF-16, UTF-32 & BOM
>
> Q: Are there any byte sequences that are not generated by a UTF? How
> should I interpret them?
>
> A: None of the UTFs can generate every arbitrary byte sequence. For
> example, in UTF-8 every byte of the form 110xxxxx₂ must be followed
> with a byte of the form 10xxxxxx₂. A sequence such as <110xxxxx₂
> 0xxxxxxx₂> is illegal, and must never be generated. When faced with
> this illegal byte sequence while transforming or interpreting, a UTF-8
> conformant process must treat the first byte 110xxxxx₂ as an illegal
> termination error: for example, either signaling an error, filtering
> the byte out, or representing the byte with a marker such as FFFD
> (REPLACEMENT CHARACTER). In the latter two cases, it will continue
> processing at the second byte 0xxxxxxx₂.
>
> A conformant process must not interpret illegal or ill-formed byte
> sequences as characters, however, it may take error recovery actions.
> No conformant process may use irregular byte sequences to encode
> out-of-band information.
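The two recovery actions named in the FAQ (filtering the byte out, or marking it with FFFD and continuing at the next byte) could be sketched as a single byte-level loop. This is an illustration rather than the code from this answer; the function name sanitize and its repl parameter are assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitize decodes s one rune at a time. DecodeRuneInString reports an
// invalid byte as (utf8.RuneError, 1); that byte is replaced by repl
// (or dropped when repl is empty) and decoding resumes at the next byte.
func sanitize(s, repl string) string {
	var b strings.Builder
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		if r == utf8.RuneError && size == 1 {
			b.WriteString(repl)
		} else {
			// Valid rune, including a correctly encoded U+FFFD (size 3).
			b.WriteString(s[i : i+size])
		}
		i += size
	}
	return b.String()
}

func main() {
	// 0xC5 has the form 110xxxxx, but 'z' is not a 10xxxxxx continuation byte.
	fmt.Printf("%q\n", sanitize("a\xc5z", "")) // "az"
	// Same input, with the invalid byte marked by U+FFFD instead of dropped.
	fmt.Println(sanitize("a\xc5z", "\uFFFD") == "a\uFFFDz") // true
}
```

Writing the valid parts as substrings of s avoids building an intermediate []rune slice.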
Answer 3
Score: 1
Another way to do this, according to this answer, could be
s = string([]rune(s))
Example:
package main
import (
    "fmt"
    "unicode/utf8"
)
func main() {
    s := "...ole\xc5"
    fmt.Println(s, utf8.Valid([]byte(s)))
    // Output: ...ole� false
    s = string([]rune(s))
    fmt.Println(s, utf8.Valid([]byte(s)))
    // Output: ...ole� true
}
Even though the result doesn't look "pretty", it is valid UTF-8: the conversion turns each invalid byte into U+FFFD, the replacement character.
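One detail worth knowing when choosing between this conversion and strings.ToValidUTF8 (Go 1.13+): the []rune round trip emits one U+FFFD per invalid byte, while ToValidUTF8 emits a single replacement per run of invalid bytes. A small sketch, with the helper names viaRunes and viaToValid invented for the comparison:

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// viaRunes converts through []rune: every invalid byte becomes its own U+FFFD.
func viaRunes(s string) string {
	return string([]rune(s))
}

// viaToValid replaces each run of invalid bytes with a single U+FFFD.
func viaToValid(s string) string {
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	s := "a\xc5\xc5z" // two consecutive invalid bytes
	fmt.Println(utf8.RuneCountInString(viaRunes(s)))   // 4
	fmt.Println(utf8.RuneCountInString(viaToValid(s))) // 3
}
```

Both results pass utf8.ValidString; they differ only in how many replacement characters end up in the output.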