Why does utf8.ValidString not detect invalid Unicode characters?

Question

From https://en.wikipedia.org/wiki/UTF-8#Invalid_code_points I learned that the code points U+D800 through U+DFFF are invalid. In decimal, that is 55296 through 57343.

The maximum valid Unicode code point is '\U0010FFFF', which is 1114111 in decimal.

My code:

package main

import "fmt"
import "unicode/utf8"

func main() {

	fmt.Println("Case 1(Invalid Range)")
	str := fmt.Sprintf("%c", rune(55296+1))
	if !utf8.ValidString(str) {
		fmt.Print(str, " is not a valid Unicode")
	} else {
		fmt.Println(str, " is valid unicode character")
	}

	fmt.Println("Case 2(More than maximum valid range)")
	str = fmt.Sprintf("%c", rune(1114111+1))
	if !utf8.ValidString(str) {
		fmt.Print(str, " is not a valid Unicode")
	} else {
		fmt.Println(str, " is valid unicode character")
	}
}

Why does ValidString not return false for the invalid Unicode characters given as input? I am sure my understanding is wrong; could someone explain?

Answer 1

Score: 4

The problem is in Sprintf. Because you pass it an invalid code point, Sprintf replaces it with rune(65533), that is U+FFFD, the Unicode replacement character that stands in for invalid characters. U+FFFD is itself a valid code point, so the resulting string is valid UTF-8.

The same thing happens with str := string([]rune{55297}), so the substitution takes place whenever an invalid rune is encoded into a string, not only in Sprintf. This is not immediately obvious from https://blog.golang.org/strings

If you want to force the string to contain invalid UTF-8, build it from raw bytes, for example:

str := string([]byte{237, 159, 193})

Answer 2

Score: 2

You take an invalid value and convert it with Sprintf, which turns it into the error value, utf8.RuneError (U+FFFD). You then validate that error value, which is itself a valid Unicode code point.

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {

	fmt.Println("Case 1: Invalid Range")
	str := fmt.Sprintf("%c", rune(55296+1))
	fmt.Printf("%q %X %d %d\n", str, str, []rune(str)[0], utf8.RuneError)
	if !utf8.ValidString(str) {
		fmt.Print(str, " is not a valid Unicode")
	} else {
		fmt.Println(str, " is valid unicode character")
	}

	fmt.Println("Case 2: More than maximum valid range")
	str = fmt.Sprintf("%c", rune(1114111+1))
	fmt.Printf("%q %X %d %d\n", str, str, []rune(str)[0], utf8.RuneError)
	if !utf8.ValidString(str) {
		fmt.Print(str, " is not a valid Unicode")
	} else {
		fmt.Println(str, " is valid unicode character")
	}

}

Output:

Case 1: Invalid Range
"�" EFBFBD 65533 65533
�  is valid unicode character
Case 2: More than maximum valid range
"�" EFBFBD 65533 65533
�  is valid unicode character

huangapple
  • Posted on April 5, 2016 at 20:25:13
  • When reposting, please keep the link to this article: https://go.coder-hub.com/36426327.html