Go: How to find out a rune's Unicode properties?
Question
I want to find out a rune's Unicode properties, particularly the value of its script property. Unicode has this to say (in http://www.unicode.org/reports/tr24/ Section 1.5):
The script property assigns a single value to each character, either
explicitly associating it with a particular script, or assigning one
of several specail [sic] values.
Go's unicode package provides me with a way to ask, "Is this rune in script x?", but has no way for me to ask, "In what script is this rune?". I could obviously iterate over all scripts, but that would be wasteful. Is there a cleverer way to find out a rune's script? (I could always implement a self-organising list, but I'm looking for something in the standard Go libraries that already does what I want, and that I have overlooked.)
Thanks all!
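For reference, the "iterate over all scripts" approach mentioned above could be sketched as follows. This is only a minimal illustration, not a recommendation: scriptOf is a hypothetical helper name, and it relies on the exported unicode.Scripts map and unicode.Is from the standard library.

```go
package main

import (
	"fmt"
	"unicode"
)

// scriptOf is a hypothetical helper: it tests r against every table in
// unicode.Scripts and returns the name of the first table that contains it.
// It returns "" when no table contains r.
func scriptOf(r rune) string {
	for name, table := range unicode.Scripts {
		if unicode.Is(table, r) {
			return name
		}
	}
	return ""
}

func main() {
	for _, c := range []rune{'a', 'α', 'ㄱ'} {
		fmt.Printf("%q %s\n", c, scriptOf(c))
	}
}
```

Each call may test every script table in turn, which is the cost the question wants to avoid.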
Answer 1
Score: 5
The easiest and quickest solution is to write the function. For example,
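package main

import (
	"fmt"
	"unicode"
)

// runeScript maps every rune listed in unicode.Scripts to its script name.
var runeScript map[rune]string

func init() {
	const nChar = 128172 // Version 9.0.0
	// Pre-size the map with 25% headroom to limit rehashing while filling it.
	runeScript = make(map[rune]string, nChar*125/100)
	for s, rt := range unicode.Scripts {
		for _, r := range rt.R16 {
			// Count in rune so the loop variable cannot wrap around at 0xFFFF.
			for c := rune(r.Lo); c <= rune(r.Hi); c += rune(r.Stride) {
				runeScript[c] = s
			}
		}
		for _, r := range rt.R32 {
			for c := rune(r.Lo); c <= rune(r.Hi); c += rune(r.Stride) {
				runeScript[c] = s
			}
		}
	}
}

// script returns the script name of r, or "" if r is in no script table.
func script(r rune) string {
	return runeScript[r]
}

func main() {
	chars := []rune{' ', '0', 'a', 'α', 'А', 'ㄱ'}
	for _, c := range chars {
		s := script(c)
		fmt.Printf("%q %s\n", c, s)
	}
}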
Output:
$ go run script.go
' ' Common
'0' Common
'a' Latin
'α' Greek
'А' Cyrillic
'ㄱ' Hangul
$
Answer 2
Score: 3
Improving PeterSO's answer
PeterSO's answer is nice and clear. It doesn't go easy on memory usage though, as it stores more than a hundred thousand entries in a map whose values are of type string. Even though a string value is just a header storing a pointer and a length (see reflect.StringHeader), having so many of them in a map still takes multiple MB (roughly 6 MB)!
Since the number of distinct string values (the different script names) is small (137), we may opt to use byte as the value type, which will just be an index into a slice storing the real script names. This works as long as there are fewer than 256 script names.
This is how it could look:
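package main

import (
	"fmt"
	"unicode"
)

// runeScript maps a rune to an index into names.
var runeScript map[rune]byte

// names[0] is "" so that runes missing from the map yield an empty name.
var names = []string{""}

func init() {
	const nChar = 128172 // Version 9.0.0
	runeScript = make(map[rune]byte, nChar*125/100)
	for s, rt := range unicode.Scripts {
		// A byte index is enough only while there are fewer than 256 scripts.
		idx := byte(len(names))
		names = append(names, s)
		for _, r := range rt.R16 {
			// Count in rune so the loop variable cannot wrap around at 0xFFFF.
			for c := rune(r.Lo); c <= rune(r.Hi); c += rune(r.Stride) {
				runeScript[c] = idx
			}
		}
		for _, r := range rt.R32 {
			for c := rune(r.Lo); c <= rune(r.Hi); c += rune(r.Stride) {
				runeScript[c] = idx
			}
		}
	}
}

// script returns the script name of r, or "" if r is in no script table.
func script(r rune) string {
	return names[runeScript[r]]
}

func main() {
	chars := []rune{' ', '0', 'a', 'α', 'А', 'ㄱ'}
	for _, c := range chars {
		s := script(c)
		fmt.Printf("%q %s\n", c, s)
	}
}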
This simple improvement requires only about one third of the memory compared to using map[rune]string. The output is the same:
' ' Common
'0' Common
'a' Latin
'α' Greek
'А' Cyrillic
'ㄱ' Hangul
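The memory figures quoted here can be checked rather than taken on faith. The following is a rough sketch of one way to do so, using runtime.ReadMemStats before and after building each map. The helper names heapGrowth and forEachRune are hypothetical, and the numbers it prints will vary with the Go version and platform, so treat them as estimates only.

```go
package main

import (
	"fmt"
	"runtime"
	"unicode"
)

// heapGrowth (hypothetical helper) reports approximately how many bytes of
// live heap are held by the value that build returns.
func heapGrowth(build func() interface{}) int64 {
	var before, after runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&before)
	v := build()
	runtime.GC() // drop temporary garbage created while building
	runtime.ReadMemStats(&after)
	runtime.KeepAlive(v) // keep v reachable until after the measurement
	return int64(after.HeapAlloc) - int64(before.HeapAlloc)
}

// forEachRune (hypothetical helper) calls fn for every rune in unicode.Scripts.
func forEachRune(fn func(r rune, script string)) {
	for s, rt := range unicode.Scripts {
		for _, r := range rt.R16 {
			for c := rune(r.Lo); c <= rune(r.Hi); c += rune(r.Stride) {
				fn(c, s)
			}
		}
		for _, r := range rt.R32 {
			for c := rune(r.Lo); c <= rune(r.Hi); c += rune(r.Stride) {
				fn(c, s)
			}
		}
	}
}

func main() {
	stringMap := heapGrowth(func() interface{} {
		m := make(map[rune]string)
		forEachRune(func(r rune, s string) { m[r] = s })
		return m
	})
	byteMap := heapGrowth(func() interface{} {
		m := make(map[rune]byte)
		index := map[string]byte{} // script name -> small index, starting at 1
		forEachRune(func(r rune, s string) {
			if _, ok := index[s]; !ok {
				index[s] = byte(len(index) + 1)
			}
			m[r] = index[s]
		})
		return m
	})
	fmt.Printf("map[rune]string: ~%d KB\n", stringMap/1024)
	fmt.Printf("map[rune]byte:   ~%d KB\n", byteMap/1024)
}
```

HeapAlloc counts all live heap objects, so anything else the program allocates between the two readings is included in the difference.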
Building merged range slices
Using map[rune]byte will result in roughly 2 MB of RAM usage, and it takes some time to build this map, which may or may not be acceptable.
There's another approach. Instead of building a map of all runes, we can store only a slice of all ranges (actually 2 slices of ranges, one with 16-bit Unicode values, and another with 32-bit Unicode code points).
The benefit comes from the fact that the number of ranges is much smaller than the number of runes: only 852 (compared to 100,000+ runes). The memory usage of 2 slices having a total of 852 elements is negligible compared to solution #1.
In our ranges we also store the script (name), so we can return this info. We could also store only a name index (as in solution #1), but since we only have 852 ranges, it's not worth it.
We'll sort the range slices so we can use binary search on them (~400 elements in a slice, so a binary search needs about 9 steps at most; in the worst case, repeating the search on both slices, about 18 steps).
Ok, so let's see. We're using these range wrappers:
type myR16 struct {
	r16    unicode.Range16
	script string
}

type myR32 struct {
	r32    unicode.Range32
	script string
}
And store them in:
var allR16 = []*myR16{}
var allR32 = []*myR32{}
We initialize / fill them like this:
// Requires importing "sort" in addition to "unicode".
func init() {
	// Collect the ranges of every script, tagged with the script name.
	for script, rt := range unicode.Scripts {
		for _, r16 := range rt.R16 {
			allR16 = append(allR16, &myR16{r16, script})
		}
		for _, r32 := range rt.R32 {
			allR32 = append(allR32, &myR32{r32, script})
		}
	}
	// Sort by lower bound so the slices can be binary searched.
	sort.Slice(allR16, func(i int, j int) bool {
		return allR16[i].r16.Lo < allR16[j].r16.Lo
	})
	sort.Slice(allR32, func(i int, j int) bool {
		return allR32[i].r32.Lo < allR32[j].r32.Lo
	})
}
And finally the search in the sorted range slices:
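func script(r rune) string {
	// Binary search over the 16-bit ranges. The r >= 0 check keeps
	// negative (invalid) runes from wrapping around in uint16(r).
	if 0 <= r && r <= 0xffff {
		r16 := uint16(r)
		// Find the first range whose upper bound is not below r16.
		i := sort.Search(len(allR16), func(i int) bool {
			return allR16[i].r16.Hi >= r16
		})
		if i < len(allR16) && allR16[i].r16.Lo <= r16 && r16 <= allR16[i].r16.Hi {
			return allR16[i].script
		}
	}
	// Otherwise search the 32-bit ranges the same way.
	r32 := uint32(r)
	i := sort.Search(len(allR32), func(i int) bool {
		return allR32[i].r32.Hi >= r32
	})
	if i < len(allR32) && allR32[i].r32.Lo <= r32 && r32 <= allR32[i].r32.Hi {
		return allR32[i].script
	}
	return "" // r is in no script table
}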
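Note: the Stride is always 1 in all scripts in the unicode package, which I took advantage of (and did not include it in the algorithm). This is an assumption worth verifying against your Go version: a range with a larger stride would span runes that do not belong to the script, and the search above would report them incorrectly.
Testing with the same code, we get the same output.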