UTF-8 range table in Go

Question

I have been reading the Go unicode package page and I'm wondering what the use case for the range tables is. What can they be used for? Is there a function to retrieve the range that a single character can be found in?

Answer 1

Score: 3

A range table is an efficient way to describe a set of characters. Because of the way characters are added to the Unicode standard, characters with similar properties are often found next to each other. So it is usually more space-efficient to list the ranges that a specific set of characters occupies than to list every individual character.

This lets you check whether a given character belongs to a specific character set by performing a series of range checks. If the character's Unicode code point is within any of the ranges in the range table, then that character is considered to be an element of the character set that the range table describes.
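As a rough illustration of those range checks, here is a minimal sketch that walks a table's ranges by hand. The table asciiUpper and the helper contains are hypothetical names made up for this example; in real code you would call unicode.Is or unicode.In, which perform the same membership test for you. The sketch assumes every range has a Stride of at least 1, as the standard tables do.

```go
package main

import (
	"fmt"
	"unicode"
)

// asciiUpper is a hypothetical hand-built table covering only 'A'..'Z'.
var asciiUpper = &unicode.RangeTable{
	R16: []unicode.Range16{
		{Lo: 'A', Hi: 'Z', Stride: 1},
	},
	LatinOffset: 1, // number of R16 entries with Hi <= unicode.MaxLatin1
}

// contains reports whether r falls inside any range of tab, using a plain
// linear scan. It assumes every range has Stride >= 1.
func contains(tab *unicode.RangeTable, r rune) bool {
	// R16 holds the ranges whose code points fit in 16 bits.
	for _, rg := range tab.R16 {
		lo, hi, stride := rune(rg.Lo), rune(rg.Hi), rune(rg.Stride)
		if r >= lo && r <= hi && (r-lo)%stride == 0 {
			return true
		}
	}
	// R32 holds the ranges that need 32-bit code points.
	for _, rg := range tab.R32 {
		lo, hi, stride := rune(rg.Lo), rune(rg.Hi), rune(rg.Stride)
		if r >= lo && r <= hi && (r-lo)%stride == 0 {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(contains(asciiUpper, 'A')) // true
	fmt.Println(contains(asciiUpper, 'a')) // false
	// The hand-written scan should agree with the library's own check.
	fmt.Println(contains(unicode.Latin, 'é') == unicode.Is(unicode.Latin, 'é'))
}
```

The Stride field is why the check is more than a simple lower/upper bound comparison: a range can describe every second (or every n-th) code point between Lo and Hi.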

There isn't a general function to retrieve the range that a single character is found in, because the mapping from character to range isn't unique, or particularly useful, in the general case. For example, take the letter A. It lies in the range [65, 90] (ASCII uppercase letters), but also in the range [0, 127] (all ASCII characters), and in arbitrary ranges such as [9, 9999] and [60, 70].
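To see the same non-uniqueness with the tables that ship in the unicode package, the sketch below lists every standard category and script table that contains a given rune. The helper tablesContaining is a hypothetical name for this example; unicode.Categories and unicode.Scripts are the package's own maps from names to range tables. The exact list printed may vary between Go versions, since the tables follow the Unicode version the release was built against.

```go
package main

import (
	"fmt"
	"sort"
	"unicode"
)

// tablesContaining returns the names of the standard category and script
// tables (unicode.Categories and unicode.Scripts) that contain r.
func tablesContaining(r rune) []string {
	var names []string
	for name, tab := range unicode.Categories {
		if unicode.Is(tab, r) {
			names = append(names, name)
		}
	}
	for name, tab := range unicode.Scripts {
		if unicode.Is(tab, r) {
			names = append(names, name)
		}
	}
	// Map iteration order is random, so sort for stable output.
	sort.Strings(names)
	return names
}

func main() {
	// 'A' belongs to several tables at once, e.g. "L", "Lu" and "Latin".
	fmt.Println(tablesContaining('A'))
}
```

So a more answerable form of the question is "which of these tables contain this character?", asked against a set of tables you choose.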

If you want to know whether a character is in any of a particular set of range tables, you can use the unicode.In function.

Example:

package main

import (
	"fmt"
	"unicode"
)

func main() {
	// unicode.In reports whether the rune is a member of any of the given tables.
	found := unicode.In('A', unicode.Latin)
	fmt.Println(found)
}

Output:

true

This checks whether A is in the given range table, unicode.Latin, which the package documents as "the set of Unicode characters in script Latin".
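Since unicode.In is variadic, you can pass several tables and test membership in their union with one call. The following is a small sketch of that; the helper name isHanOrKana and the choice of scripts are assumptions made for illustration only.

```go
package main

import (
	"fmt"
	"unicode"
)

// isHanOrKana reports whether r belongs to the Han, Hiragana or Katakana
// script tables. It is a hypothetical helper for this example.
func isHanOrKana(r rune) bool {
	return unicode.In(r, unicode.Han, unicode.Hiragana, unicode.Katakana)
}

func main() {
	for _, r := range "Go言語のカタカナ" {
		fmt.Printf("%c %v\n", r, isHanOrKana(r))
	}
}
```

For a single table, unicode.Is(unicode.Latin, 'A') performs the same check as unicode.In('A', unicode.Latin).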

huangapple
  • Posted on July 6, 2021 at 10:13:10
  • When reposting, please keep the link to this article: https://go.coder-hub.com/68263737.html