Go: How to find out a rune's Unicode properties?

Question

I want to find out a rune's Unicode properties, particularly the value of its script property. Unicode has this to say (in http://www.unicode.org/reports/tr24/ Section 1.5):

The script property assigns a single value to each character, either
explicitly associating it with a particular script, or assigning one
of several specail [sic] values.

Go's unicode package gives me a way to ask, "Is this rune in script x?", but no way to ask, "In what script is this rune?". I could obviously iterate over all scripts, but that would be wasteful. Is there a cleverer way to find out a rune's script? (I could always implement a self-organising list, but I'm looking for something in the standard Go libraries that already does what I want and that I have overlooked.)

Thanks all!
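
For reference, the "iterate over all scripts" baseline mentioned above might look like the sketch below. The function name scriptOf is made up for this illustration; unicode.Scripts and unicode.Is are the standard library pieces it relies on. It assumes each rune appears in at most one script table, so the random map iteration order does not affect the result.

```go
package main

import (
	"fmt"
	"unicode"
)

// scriptOf (hypothetical name) returns the name of the script table
// that contains r, or "" if no table in unicode.Scripts contains it.
func scriptOf(r rune) string {
	for name, table := range unicode.Scripts {
		if unicode.Is(table, r) {
			return name
		}
	}
	return ""
}

func main() {
	for _, c := range []rune{' ', 'a', 'α'} {
		fmt.Printf("%q %s\n", c, scriptOf(c))
	}
}
```

Every call walks the script tables one by one, which is the cost the question would like to avoid.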

Answer 1

Score: 5

The easiest and quickest solution is to write the function. For example,

package main

import (
	"fmt"
	"unicode"
)

// runeScript maps every rune listed in unicode.Scripts to its script name.
var runeScript map[rune]string

func init() {
	const nChar = 128172 // Version 9.0.0
	// Pre-size the map with some headroom to avoid repeated growth.
	runeScript = make(map[rune]string, nChar*125/100)
	for s, rt := range unicode.Scripts {
		// 16-bit ranges
		for _, r := range rt.R16 {
			for i := r.Lo; i <= r.Hi; i += r.Stride {
				runeScript[rune(i)] = s
			}
		}
		// 32-bit ranges
		for _, r := range rt.R32 {
			for i := r.Lo; i <= r.Hi; i += r.Stride {
				runeScript[rune(i)] = s
			}
		}
	}
}

// script returns the script name of r, or "" if r is in no script table.
func script(r rune) string {
	return runeScript[r]
}

func main() {
	chars := []rune{' ', '0', 'a', 'α', 'А', 'ㄱ'}
	for _, c := range chars {
		s := script(c)
		fmt.Printf("%q %s\n", c, s)
	}
}

Output:

$ go run script.go
' ' Common
'0' Common
'a' Latin
'α' Greek
'А' Cyrillic
'ㄱ' Hangul
$ 

Answer 2

Score: 3

Improving PeterSO's answer

PeterSO's answer is nice and clear. It doesn't go easy on memory usage though, as it stores more than a hundred thousand entries in a map, values being of string type. Even though a string value is just a header storing a pointer and a length (see reflect.StringHeader), having so many of them in a map is still multiple MB (like 6 MB)!

Since the number of possible different string values (the different script names) is small (137), we may opt to use a value type byte, which will just be an index in a slice storing the real script names.

This is what it could look like:

package main

import (
	"fmt"
	"unicode"
)

// runeScript maps a rune to an index into names.
var runeScript map[rune]byte

// names holds the script names; index 0 is "" for runes in no script table.
var names = []string{""}

func init() {
	const nChar = 128172 // Version 9.0.0
	runeScript = make(map[rune]byte, nChar*125/100)
	for s, rt := range unicode.Scripts {
		// The next free slot in names becomes this script's index.
		idx := byte(len(names))
		names = append(names, s)
		for _, r := range rt.R16 {
			for i := r.Lo; i <= r.Hi; i += r.Stride {
				runeScript[rune(i)] = idx
			}
		}
		for _, r := range rt.R32 {
			for i := r.Lo; i <= r.Hi; i += r.Stride {
				runeScript[rune(i)] = idx
			}
		}
	}
}

// script returns the script name of r, or "" if r is in no script table.
func script(r rune) string {
	return names[runeScript[r]]
}

func main() {
	chars := []rune{' ', '0', 'a', 'α', 'А', 'ㄱ'}
	for _, c := range chars {
		s := script(c)
		fmt.Printf("%q %s\n", c, s)
	}
}

This simple improvement requires only one third of the memory compared to using map[rune]string. Output is the same (try it on the Go Playground):

' ' Common
'0' Common
'a' Latin
'α' Greek
'А' Cyrillic
'ㄱ' Hangul
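
The memory figures above are estimates. One rough way to check them on a given Go version is to compare the live heap before and after building each map. This is only a sketch: the helper names heapAfterGC, eachScriptRune and measure are made up for the illustration, the byte map stores a placeholder index of 1 instead of real indices, and the numbers depend on the Go version and map implementation.

```go
package main

import (
	"fmt"
	"runtime"
	"unicode"
)

// heapAfterGC returns the live heap size in bytes after a forced collection.
func heapAfterGC() int64 {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return int64(m.HeapAlloc)
}

// eachScriptRune calls f for every rune listed in unicode.Scripts.
func eachScriptRune(f func(r rune, script string)) {
	for name, rt := range unicode.Scripts {
		for _, rg := range rt.R16 {
			// Widen to uint32 so the loop counter cannot wrap around.
			for i := uint32(rg.Lo); i <= uint32(rg.Hi); i += uint32(rg.Stride) {
				f(rune(i), name)
			}
		}
		for _, rg := range rt.R32 {
			for i := rg.Lo; i <= rg.Hi; i += rg.Stride {
				f(rune(i), name)
			}
		}
	}
}

// measure reports the approximate heap growth caused by each map.
func measure() (stringMapBytes, byteMapBytes int64) {
	base := heapAfterGC()

	byName := make(map[rune]string)
	eachScriptRune(func(r rune, s string) { byName[r] = s })
	afterString := heapAfterGC()

	byIndex := make(map[rune]byte)
	eachScriptRune(func(r rune, _ string) { byIndex[r] = 1 }) // placeholder index
	afterByte := heapAfterGC()

	// Keep both maps alive until all measurements have been taken.
	runtime.KeepAlive(byName)
	runtime.KeepAlive(byIndex)
	return afterString - base, afterByte - afterString
}

func main() {
	s, b := measure()
	fmt.Printf("map[rune]string: ~%d KiB, map[rune]byte: ~%d KiB\n", s/1024, b/1024)
}
```

The absolute values will differ from machine to machine; the ratio between the two is the interesting part.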

Building merged range slices

Using map[rune]byte still results in roughly 2 MB of RAM usage, and building the map takes some time, which may or may not be acceptable.

There's another approach. Instead of building a map of all runes, we can store only a slice of all ranges (actually two slices of ranges: one with 16-bit Unicode values, the other with 32-bit Unicode code points).

The benefit of this originates from the fact that the number of ranges is much less than the number of runes: only 852 (compared to 100,000+ runes). Memory usage of 2 slices having a total of 852 elements will be negligible compared to solution #1.
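
The range count depends on the Unicode version bundled with the Go release in use, so it may no longer be 852. A small sketch to check it locally (countRanges is a name made up for this illustration):

```go
package main

import (
	"fmt"
	"unicode"
)

// countRanges returns how many 16-bit and 32-bit ranges
// the tables in unicode.Scripts contain in total.
func countRanges() (n16, n32 int) {
	for _, rt := range unicode.Scripts {
		n16 += len(rt.R16)
		n32 += len(rt.R32)
	}
	return n16, n32
}

func main() {
	n16, n32 := countRanges()
	fmt.Printf("R16: %d, R32: %d, total: %d\n", n16, n32, n16+n32)
}
```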

In our ranges we also store the script (name), so we can return this info. We could also store only a name index (as in solution #1), but since we only have 852 ranges, it's not worth it.

We'll sort the range slices so we can binary-search them. With roughly 400 elements per slice, a binary search needs about 9 comparisons (log2 of 400 is about 8.6), so even the worst case of searching both slices stays under 20.

Ok, so let's see. We're using these range wrappers:

type myR16 struct {
	r16    unicode.Range16
	script string
}

type myR32 struct {
	r32    unicode.Range32
	script string
}

And store them in:

var allR16 = []*myR16{}
var allR32 = []*myR32{}

We initialize / fill them like this:

func init() {
	for script, rt := range unicode.Scripts {
		for _, r16 := range rt.R16 {
			allR16 = append(allR16, &myR16{r16, script})
		}
		for _, r32 := range rt.R32 {
			allR32 = append(allR32, &myR32{r32, script})
		}
	}

	// sort
	sort.Slice(allR16, func(i int, j int) bool {
		return allR16[i].r16.Lo < allR16[j].r16.Lo
	})
	sort.Slice(allR32, func(i int, j int) bool {
		return allR32[i].r32.Lo < allR32[j].r32.Lo
	})
}

And finally the search in the sorted range slices:

func script(r rune) string {
	// binary search over ranges
	if r <= 0xffff {
		r16 := uint16(r)
		i := sort.Search(len(allR16), func(i int) bool {
			return allR16[i].r16.Hi >= r16
		})

		if i < len(allR16) && allR16[i].r16.Lo <= r16 && r16 <= allR16[i].r16.Hi {
			return allR16[i].script
		}
	}

	r32 := uint32(r)
	i := sort.Search(len(allR32), func(i int) bool {
		return allR32[i].r32.Hi >= r32
	})

	if i < len(allR32) && allR32[i].r32.Lo <= r32 && r32 <= allR32[i].r32.Hi {
		return allR32[i].script
	}

	return ""
}

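Note: the Stride was 1 in all scripts of the unicode package when I wrote this, which I took advantage of (the algorithm ignores Stride). If a range had a larger Stride, the Lo/Hi check would also accept code points that lie between its members, and ranges of different scripts could interleave, which the binary search does not expect.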
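
If the tables in your Go version do contain ranges with a Stride greater than 1, one possible adjustment is to expand each such range into one-rune entries while building the slice, so that all stored ranges are contiguous and disjoint. The sketch below is an assumption-laden variant, not the code above: the span type and the scriptOf name are made up for the illustration, and it uses a single slice of rune bounds instead of separate 16-bit and 32-bit slices.

```go
package main

import (
	"fmt"
	"sort"
	"unicode"
)

// span (hypothetical type) is a contiguous block of code points in one script.
type span struct {
	lo, hi rune
	script string
}

var spans []span

func init() {
	add := func(lo, hi, stride rune, script string) {
		if stride <= 1 {
			// Contiguous range: store it as is.
			spans = append(spans, span{lo, hi, script})
			return
		}
		// Strided range: store each member as its own one-rune span.
		for r := lo; r <= hi; r += stride {
			spans = append(spans, span{r, r, script})
		}
	}
	for script, rt := range unicode.Scripts {
		for _, r := range rt.R16 {
			add(rune(r.Lo), rune(r.Hi), rune(r.Stride), script)
		}
		for _, r := range rt.R32 {
			add(rune(r.Lo), rune(r.Hi), rune(r.Stride), script)
		}
	}
	// The spans are disjoint, so sorting by lo also orders them by hi.
	sort.Slice(spans, func(i, j int) bool { return spans[i].lo < spans[j].lo })
}

// scriptOf returns the script name of r, or "" if r is in no span.
func scriptOf(r rune) string {
	// Find the first span that ends at or after r.
	i := sort.Search(len(spans), func(i int) bool { return spans[i].hi >= r })
	if i < len(spans) && spans[i].lo <= r {
		return spans[i].script
	}
	return ""
}

func main() {
	for _, c := range []rune{' ', '0', 'a', 'α', 'А', 'ㄱ'} {
		fmt.Printf("%q %s\n", c, scriptOf(c))
	}
}
```

Expanding strided ranges costs a few extra entries but keeps the lookup a single binary search with a plain bounds check.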

Testing with the same code, we get the same output. Try it on the Go Playground.

huangapple
  • Posted on March 27, 2017 at 18:39:53
  • When reposting, please keep the link to this article: https://go.coder-hub.com/43044164.html