Why does the utf8.ValidString function not detect invalid Unicode characters?
Question
From https://en.wikipedia.org/wiki/UTF-8#Invalid_code_points, I learned that U+D800 through U+DFFF are invalid code points. In decimal, that is 55296 through 57343.
The maximum valid Unicode code point is '\U0010FFFF', which is 1114111 in decimal.
My code:
package main

import "fmt"
import "unicode/utf8"

func main() {
	fmt.Println("Case 1(Invalid Range)")
	str := fmt.Sprintf("%c", rune(55296+1))
	if !utf8.ValidString(str) {
		fmt.Print(str, " is not a valid Unicode")
	} else {
		fmt.Println(str, " is valid unicode character")
	}
	fmt.Println("Case 2(More than maximum valid range)")
	str = fmt.Sprintf("%c", rune(1114111+1))
	if !utf8.ValidString(str) {
		fmt.Print(str, " is not a valid Unicode")
	} else {
		fmt.Println(str, " is valid unicode character")
	}
}
Why does ValidString not return false for the invalid Unicode characters I pass in? I am sure my understanding is wrong somewhere; could someone explain?
Answer 1
Score: 4
Your problem happens in Sprintf. Since you give it an invalid character, Sprintf replaces it with rune(65533) (U+FFFD),
the Unicode replacement character used in place of invalid characters. So your string is valid UTF-8.
The same thing happens if you do something like this: str := string([]rune{ 55297 })
so the replacement is not specific to Sprintf: it happens when the invalid rune is encoded as UTF-8. This is not immediately obvious from https://blog.golang.org/strings
If you want to force your string to contain invalid UTF-8, you can build it from raw bytes, for example:
str := string([]byte{237, 159, 193})
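To make the difference between encoding a rune and supplying raw bytes concrete, here is a minimal sketch. The helper names encodeRune, surrogateBytes, and isValid are hypothetical and exist only for this illustration; the bytes 0xED 0xA0 0x81 are what a naive encoder would emit for U+D801, which Go itself does not produce when encoding a rune.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodeRune (hypothetical helper) converts one rune to a string.
// Invalid runes are encoded as U+FFFD, the replacement character.
func encodeRune(r rune) string {
	return string(r)
}

// surrogateBytes (hypothetical helper) returns the bytes a naive
// UTF-8 encoder would emit for the surrogate U+D801.
func surrogateBytes() []byte {
	return []byte{0xED, 0xA0, 0x81}
}

// isValid (hypothetical helper) wraps utf8.ValidString.
func isValid(s string) bool {
	return utf8.ValidString(s)
}

func main() {
	// Encoding an invalid rune yields the replacement character,
	// so the resulting string is valid UTF-8.
	s := encodeRune(0xD801)
	fmt.Println(s == "\uFFFD", isValid(s)) // true true

	// A byte-to-string conversion copies the bytes unchanged,
	// so the surrogate encoding survives and validation fails.
	raw := string(surrogateBytes())
	fmt.Println(isValid(raw)) // false
}
```

The byte slice is the only path here that bypasses the encoder, which is why it is the one case where ValidString reports false.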
Answer 2
Score: 2
You take an invalid value and convert it with Sprintf, which turns it into the error value utf8.RuneError (U+FFFD). You then validate the string holding that error value, and U+FFFD is itself a valid Unicode code point.
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	fmt.Println("Case 1: Invalid Range")
	str := fmt.Sprintf("%c", rune(55296+1))
	fmt.Printf("%q %X %d %d\n", str, str, []rune(str)[0], utf8.RuneError)
	if !utf8.ValidString(str) {
		fmt.Print(str, " is not a valid Unicode")
	} else {
		fmt.Println(str, " is valid unicode character")
	}
	fmt.Println("Case 2: More than maximum valid range")
	str = fmt.Sprintf("%c", rune(1114111+1))
	fmt.Printf("%q %X %d %d\n", str, str, []rune(str)[0], utf8.RuneError)
	if !utf8.ValidString(str) {
		fmt.Print(str, " is not a valid Unicode")
	} else {
		fmt.Println(str, " is valid unicode character")
	}
}
Output:
Case 1: Invalid Range
"�" EFBFBD 65533 65533
� is valid unicode character
Case 2: More than maximum valid range
"�" EFBFBD 65533 65533
� is valid unicode character
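If the goal is to test the code point itself rather than a string built from it, one option is to check the rune before it is ever encoded. The sketch below assumes that is what the question was after; checkCodePoint is a hypothetical wrapper around utf8.ValidRune, which is part of the standard unicode/utf8 package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// checkCodePoint (hypothetical helper) reports whether r can be
// legally encoded as UTF-8. It inspects the rune value directly,
// before any encoding step can substitute U+FFFD.
func checkCodePoint(r rune) bool {
	return utf8.ValidRune(r)
}

func main() {
	// Boundary values around the surrogate range and the maximum rune.
	candidates := []rune{'A', 0xD7FF, 0xD801, 0xDFFF, 0xE000, utf8.MaxRune, utf8.MaxRune + 1}
	for _, r := range candidates {
		fmt.Printf("U+%04X valid=%t\n", r, checkCodePoint(r))
	}
}
```

With this check, the two values from the question (55297 and 1114112) are both reported as invalid, while the values just outside the surrogate range and the maximum code point are reported as valid.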