Remove invalid UTF-8 characters from a string

Question

I get this error when calling json.Marshal on a list of strings:

json: invalid UTF-8 in string: "...ole\xc5\"

The reason is obvious, but how can I delete or replace such characters in Go? I've been reading the docs for the unicode and unicode/utf8 packages and there seems to be no obvious or quick way to do it.

In Python, for example, there are methods for this: invalid characters can be deleted, replaced by a specified character, or handled with a strict setting that raises an exception on invalid characters. How can I do the equivalent in Go?

UPDATE: I meant the reason for getting an exception (panic?): an illegal character in what json.Marshal expects to be a valid UTF-8 string.

(How the illegal byte sequence got into that string is not important. The usual ways are bugs, file corruption, other programs that do not conform to Unicode, etc.)

Answer 1

Score: 26

In Go 1.13+, you can do this:

strings.ToValidUTF8("a\xc5z", "")
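
As a runnable sketch of the one-liner above: strings.ToValidUTF8 takes the input and a replacement string, and swaps each run of invalid bytes for that replacement. The helper names removeInvalid and replaceInvalid are hypothetical, introduced only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// removeInvalid drops every run of invalid UTF-8 bytes (hypothetical helper).
func removeInvalid(s string) string {
	return strings.ToValidUTF8(s, "")
}

// replaceInvalid swaps every run of invalid UTF-8 bytes for U+FFFD (hypothetical helper).
func replaceInvalid(s string) string {
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	// %+q prints non-ASCII runes as escapes, so the result is unambiguous.
	fmt.Printf("%+q\n", removeInvalid("a\xc5z"))  // "az"
	fmt.Printf("%+q\n", replaceInvalid("a\xc5z")) // "a\ufffdz"
}
```

Unlike a filter that tests for utf8.RuneError alone, strings.ToValidUTF8 leaves a correctly encoded U+FFFD in the input untouched.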

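In Go 1.11+, you can get the same result with strings.Map and utf8.RuneError. Note that this also drops any correctly encoded U+FFFD replacement character already present in the string, as the second example shows: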

fixUtf := func(r rune) rune {
	if r == utf8.RuneError {
		return -1
	}
	return r
}

fmt.Println(strings.Map(fixUtf, "a\xc5z"))
fmt.Println(strings.Map(fixUtf, "posic�o"))

Output:

az
posico

Playground: Here.
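
To mirror the three Python behaviours mentioned in the question (ignore, replace, strict), one possible sketch follows. The function name sanitize, the mode strings, and the sentinel error are hypothetical; only utf8.ValidString and strings.ToValidUTF8 (Go 1.13+) come from the standard library.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode/utf8"
)

// errInvalidUTF8 is a hypothetical sentinel error for the "strict" mode.
var errInvalidUTF8 = errors.New("invalid UTF-8 in string")

// sanitize is a hypothetical helper that imitates Python's error modes.
func sanitize(s, mode string) (string, error) {
	switch mode {
	case "ignore": // delete invalid bytes
		return strings.ToValidUTF8(s, ""), nil
	case "replace": // substitute U+FFFD for each run of invalid bytes
		return strings.ToValidUTF8(s, "\uFFFD"), nil
	case "strict": // report an error instead of repairing
		if !utf8.ValidString(s) {
			return "", errInvalidUTF8
		}
		return s, nil
	}
	return "", fmt.Errorf("unknown mode %q", mode)
}

func main() {
	for _, mode := range []string{"ignore", "replace", "strict"} {
		out, err := sanitize("a\xc5z", mode)
		fmt.Printf("%-7s %+q %v\n", mode, out, err)
	}
}
```

Returning an error in strict mode, rather than panicking, follows the usual Go convention and lets the caller decide whether to repair or reject the input.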

Answer 2

Score: 23

For example,

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "a\xc5z"
	fmt.Printf("%q\n", s)
	if !utf8.ValidString(s) {
		v := make([]rune, 0, len(s))
		for i, r := range s {
			if r == utf8.RuneError {
				// A width of 1 means an invalid byte, not an encoded U+FFFD.
				_, size := utf8.DecodeRuneInString(s[i:])
				if size == 1 {
					continue // skip the invalid byte
				}
			}
			v = append(v, r)
		}
		s = string(v)
	}
	fmt.Printf("%q\n", s)
}

Output:

"a\xc5z"
"az"

> Unicode Standard
>
> FAQ - UTF-8, UTF-16, UTF-32 & BOM
>
> Q: Are there any byte sequences that are not generated by a UTF? How
> should I interpret them?
>
> A: None of the UTFs can generate every arbitrary byte sequence. For
> example, in UTF-8 every byte of the form 110xxxxx₂ must be followed
> with a byte of the form 10xxxxxx₂. A sequence such as <110xxxxx₂
> 0xxxxxxx₂> is illegal, and must never be generated. When faced with
> this illegal byte sequence while transforming or interpreting, a UTF-8
> conformant process must treat the first byte 110xxxxx₂ as an illegal
> termination error: for example, either signaling an error, filtering
> the byte out, or representing the byte with a marker such as FFFD
> (REPLACEMENT CHARACTER). In the latter two cases, it will continue
> processing at the second byte 0xxxxxxx₂.
>
> A conformant process must not interpret illegal or ill-formed byte
> sequences as characters, however, it may take error recovery actions.
> No conformant process may use irregular byte sequences to encode
> out-of-band information.

Answer 3

Score: 1

Another way to do this, according to this answer, could be

s = string([]rune(s))

Example:

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "...ole\xc5"
	fmt.Println(s, utf8.Valid([]byte(s)))
	// Output: ...ole� false

	s = string([]rune(s))
	fmt.Println(s, utf8.Valid([]byte(s)))
	// Output: ...ole� true
}

The result may not look pretty, since each invalid byte becomes the replacement character U+FFFD, but the string is now valid UTF-8.

huangapple
  • Posted on December 5, 2013, 21:56:57
  • When reposting, please keep the link to this article: https://go.coder-hub.com/20401873.html