2016年4月5日 20:25:13go评论107阅读模式

英文:

Why utf8.Validstring function not detecting invalid unicode characters?

问题

从https://en.wikipedia.org/wiki/UTF-8#Invalid_code_points上了解到，U+D800到U+DFFF是无效的。所以在十进制系统中，它是55296到57343。

而最大有效的Unicode是'\U0010FFFF'。在十进制系统中，它是1114111。

我的代码：

package main
import "fmt"
import "unicode/utf8"
func main() {
    fmt.Println("Case 1(Invalid Range)")
    str := fmt.Sprintf("%c", rune(55296+1))
    if !utf8.ValidString(str) {
        fmt.Print(str, " is not a valid Unicode")
    } else {
        fmt.Println(str, " is valid unicode character")
    }
    fmt.Println("Case 2(More than maximum valid range)")
    str = fmt.Sprintf("%c", rune(1114111+1))
    if !utf8.ValidString(str) {
        fmt.Print(str, " is not a valid Unicode")
    } else {
        fmt.Println(str, " is valid unicode character")
    }
}

为什么ValidString函数对于输入的无效Unicode字符不返回false？我确定我的理解是错误的，有人能解释一下吗？

英文:

From https://en.wikipedia.org/wiki/UTF-8#Invalid_code_points, I got to know that U+D800 through U+DFFF are invalid. So in decimal system, it is 55296 through 57343.

And Maximum valid Unicode is '\U0010FFFF'. In decimal system, it is 1114111

My code:

package main
import &quot;fmt&quot;
import &quot;unicode/utf8&quot;
func main() {
	fmt.Println(&quot;Case 1(Invalid Range)&quot;)
	str := fmt.Sprintf(&quot;%c&quot;, rune(55296+1))
	if !utf8.ValidString(str) {
		fmt.Print(str, &quot; is not a valid Unicode&quot;)
	} else {
		fmt.Println(str, &quot; is valid unicode character&quot;)
	}
	fmt.Println(&quot;Case 2(More than maximum valid range)&quot;)
	str = fmt.Sprintf(&quot;%c&quot;, rune(1114111+1))
	if !utf8.ValidString(str) {
		fmt.Print(str, &quot; is not a valid Unicode&quot;)
	} else {
		fmt.Println(str, &quot; is valid unicode character&quot;)
	}
}

Why ValidString function is not returning false for invalid unicode characters given as input ? I am sure my understanding is wrong, could some one explain??

答案1

得分: 4

你的问题发生在 Sprintf 中。由于你给了一个无效的字符，Sprintf 会用 rune(65533) 替换它，这是一个用来代替无效字符的 Unicode 替换字符。所以你的字符串是有效的 UTF8。

如果你像这样做：str := string([]rune{ 55297 })，也会发生这种情况，所以这可能是在创建 rune 时发生的事情。从这个链接中并不立即明显：https://blog.golang.org/strings

如果你想强制让你的字符串包含无效的 UTF8，你可以像这样编写第一个字符串：

str := string([]byte{237, 159, 193})

英文:

Your problem happens in Sprintf. Since you give it an invalid character Sprintf replaces with with rune(65533) which is the unicode replacement character used instead of invalid characters. So your string is valid UTF8.

This will also happen if you do something like this: str := string([]rune{ 55297 }) so this might be something that happens when creating runes. It's not immediately obvious from: https://blog.golang.org/strings

If you want to force your string to contain invalid UTF8 you can write the first string like this:

str := string([]byte{237, 159, 193})

答案2

得分: 2

你将一个无效的值使用Sprintf进行转换，它被转换为错误值。然后你检查错误值，它是一个有效的Unicode码点。

package main
import (
	"fmt"
	"unicode/utf8"
)
func main() {
	fmt.Println("Case 1: Invalid Range")
	str := fmt.Sprintf("%c", rune(55296+1))
	fmt.Printf("%q %X %d %d\n", str, str, []rune(str)[0], utf8.RuneError)
	if !utf8.ValidString(str) {
		fmt.Print(str, " is not a valid Unicode")
	} else {
		fmt.Println(str, " is valid unicode character")
	}
	fmt.Println("Case 2: More than maximum valid range")
	str = fmt.Sprintf("%c", rune(1114111+1))
	fmt.Printf("%q %X %d %d\n", str, str, []rune(str)[0], utf8.RuneError)
	if !utf8.ValidString(str) {
		fmt.Print(str, " is not a valid Unicode")
	} else {
		fmt.Println(str, " is valid unicode character")
}

输出：

Case 1: Invalid Range
"�" EFBFBD 65533 65533
�  is valid unicode character
Case 2: More than maximum valid range
"�" EFBFBD 65533 65533
�  is valid unicode character

英文:

You take an invalid value and convert it using Sprintf. It's converted to the error value. You then check the error value, which is a valid Unicode code point.

package main
import (
	&quot;fmt&quot;
	&quot;unicode/utf8&quot;
)
func main() {
	fmt.Println(&quot;Case 1: Invalid Range&quot;)
	str := fmt.Sprintf(&quot;%c&quot;, rune(55296+1))
	fmt.Printf(&quot;%q %X %d %d\n&quot;, str, str, []rune(str)[0], utf8.RuneError)
	if !utf8.ValidString(str) {
		fmt.Print(str, &quot; is not a valid Unicode&quot;)
	} else {
		fmt.Println(str, &quot; is valid unicode character&quot;)
	}
	fmt.Println(&quot;Case 2: More than maximum valid range&quot;)
	str = fmt.Sprintf(&quot;%c&quot;, rune(1114111+1))
	fmt.Printf(&quot;%q %X %d %d\n&quot;, str, str, []rune(str)[0], utf8.RuneError)
	if !utf8.ValidString(str) {
		fmt.Print(str, &quot; is not a valid Unicode&quot;)
	} else {
		fmt.Println(str, &quot; is valid unicode character&quot;)
	}
}

Output:

Case 1: Invalid Range
&quot;�&quot; EFBFBD 65533 65533
�  is valid unicode character
Case 2: More than maximum valid range
&quot;�&quot; EFBFBD 65533 65533
�  is valid unicode character

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

为什么utf8.ValidString函数无法检测到无效的Unicode字符？

问题

答案1

答案2

Go语言中类似于Python的fileinput.input()的函数是什么？

在Go程序中，时间始终为0。

Redis Pub/Sub Ack/Nack（确认/否认）

Python: 无法找到正确的编码以打印Tableau结果

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。