2017-03-27 18:39:53 · go

Go: How to find out a rune's Unicode properties?

Question

I want to find out a rune's Unicode properties, particularly the value of its script property. Unicode has this to say (in http://www.unicode.org/reports/tr24/ Section 1.5):

The script property assigns a single value to each character, either
explicitly associating it with a particular script, or assigning one
of several specail [sic] values.

Go's unicode package gives me a way to ask "Is this rune in script x?", but no way to ask "In what script is this rune?". I could obviously iterate over all scripts, but that would be wasteful. Is there a cleverer way to find out a rune's script? (I could always implement a self-organising list, but I'm looking for something in the standard Go libraries that already does what I want and that I may have overlooked.)
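
For reference, the brute-force iteration mentioned above could look like the following minimal sketch. The helper name scriptByScan is hypothetical; it relies only on unicode.Scripts and unicode.Is from the standard library, and it tests every script table until one matches.

```go
package main

import (
	"fmt"
	"unicode"
)

// scriptByScan (hypothetical helper) returns the name of the script table
// that contains r, or "" if none of the tables in unicode.Scripts contains it.
// Each call may test every script table, which is the cost the question
// would like to avoid.
func scriptByScan(r rune) string {
	for name, table := range unicode.Scripts {
		if unicode.Is(table, r) {
			return name
		}
	}
	return ""
}

func main() {
	for _, c := range []rune{' ', '0', 'a', 'α', 'А', 'ㄱ'} {
		fmt.Printf("%q %s\n", c, scriptByScan(c))
	}
}
```

Because each character has a single script value, the random iteration order of the map does not affect the result, only how many tables are tested before the match.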

Thanks all!

Answer 1

Score: 5

The easiest and quickest solution is to write the function. For example,

package main

import (
	"fmt"
	"unicode"
)

// runeScript maps every code point listed in unicode.Scripts to its script name.
var runeScript map[rune]string

func init() {
	const nChar = 128172 // character count in Unicode version 9.0.0
	// Pre-size the map with 25% headroom.
	runeScript = make(map[rune]string, nChar*125/100)
	for s, rt := range unicode.Scripts {
		// Count in rune (int32) space so the loop variable cannot wrap
		// around when a 16-bit range ends near 0xFFFF.
		for _, r := range rt.R16 {
			for c := rune(r.Lo); c <= rune(r.Hi); c += rune(r.Stride) {
				runeScript[c] = s
			}
		}
		for _, r := range rt.R32 {
			for c := rune(r.Lo); c <= rune(r.Hi); c += rune(r.Stride) {
				runeScript[c] = s
			}
		}
	}
}

// script returns the script name of r, or "" if r is in no script table.
func script(r rune) string {
	return runeScript[r]
}

func main() {
	chars := []rune{' ', '0', 'a', 'α', 'А', 'ㄱ'}
	for _, c := range chars {
		s := script(c)
		fmt.Printf("%q %s\n", c, s)
	}
}

Output:

$ go run script.go
' ' Common
'0' Common
'a' Latin
'α' Greek
'А' Cyrillic
'ㄱ' Hangul
$
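
One detail worth spelling out: a rune that appears in none of the script tables comes back from the map as the empty string. UAX #24 gives unassigned, private-use, noncharacter, and surrogate code points the script value Unknown. The sketch below is an assumption-laden variation on the answer, not part of it: it uses the comma-ok form of the map lookup to report "Unknown" in that case, assuming unicode.Scripts lists no table for such code points. The name scriptOrUnknown is hypothetical.

```go
package main

import (
	"fmt"
	"unicode"
)

// runeScript maps every code point listed in unicode.Scripts to its script name.
var runeScript = map[rune]string{}

func init() {
	for name, table := range unicode.Scripts {
		for _, r := range table.R16 {
			for c := rune(r.Lo); c <= rune(r.Hi); c += rune(r.Stride) {
				runeScript[c] = name
			}
		}
		for _, r := range table.R32 {
			for c := rune(r.Lo); c <= rune(r.Hi); c += rune(r.Stride) {
				runeScript[c] = name
			}
		}
	}
}

// scriptOrUnknown (hypothetical helper) returns the script name of r, or
// "Unknown" when r is not listed in any table of unicode.Scripts.
func scriptOrUnknown(r rune) string {
	if name, ok := runeScript[r]; ok {
		return name
	}
	return "Unknown"
}

func main() {
	// U+E000 is a private-use code point.
	for _, c := range []rune{'a', 'α', '\uE000'} {
		fmt.Printf("%U %s\n", c, scriptOrUnknown(c))
	}
}
```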

Answer 2

Score: 3

Improving PeterSO's answer

PeterSO's answer is nice and clear. It is not frugal with memory, though: it stores more than a hundred thousand entries in a map whose values are of type string. Even though a string value is just a header holding a pointer and a length (see reflect.StringHeader), that many of them in a map still add up to several MB (around 6 MB)!

Since the number of distinct string values (the script names) is small (137), we can use byte as the value type instead; each value is then just an index into a slice that stores the actual script names.

This is how it could look:

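package main

import (
	"fmt"
	"unicode"
)

// runeScript maps a code point to an index into names.
var runeScript map[rune]byte

// names holds the script names. Index 0 is reserved for "no script", so the
// map's zero value for a missing rune resolves to the empty string.
var names = []string{""}

func init() {
	const nChar = 128172 // character count in Unicode version 9.0.0
	runeScript = make(map[rune]byte, nChar*125/100)
	for s, rt := range unicode.Scripts {
		// A byte index is enough as long as there are fewer than 256 scripts.
		idx := byte(len(names))
		names = append(names, s)
		// Count in rune (int32) space so the loop variable cannot wrap
		// around when a 16-bit range ends near 0xFFFF.
		for _, r := range rt.R16 {
			for c := rune(r.Lo); c <= rune(r.Hi); c += rune(r.Stride) {
				runeScript[c] = idx
			}
		}
		for _, r := range rt.R32 {
			for c := rune(r.Lo); c <= rune(r.Hi); c += rune(r.Stride) {
				runeScript[c] = idx
			}
		}
	}
}

// script returns the script name of r, or "" if r is in no script table.
func script(r rune) string {
	return names[runeScript[r]]
}

func main() {
	chars := []rune{' ', '0', 'a', 'α', 'А', 'ㄱ'}
	for _, c := range chars {
		s := script(c)
		fmt.Printf("%q %s\n", c, s)
	}
}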

This simple change needs only about one third of the memory used by map[rune]string. The output is the same (try it on the Go Playground):

' ' Common
'0' Common
'a' Latin
'α' Greek
'А' Cyrillic
'ㄱ' Hangul
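
The memory figures quoted here will vary with the Go version and platform, so it may be worth measuring them yourself. The following is a rough sketch, not a benchmark: it compares the heap bytes allocated while building each map variant, using runtime.ReadMemStats and the TotalAlloc counter. The helper names (forEachScriptRune, allocatedBy, measure) are hypothetical, and the byte variant stores a constant placeholder instead of a real name index, since only the size matters for the comparison.

```go
package main

import (
	"fmt"
	"runtime"
	"unicode"
)

// forEachScriptRune calls fn once for every code point listed in unicode.Scripts.
func forEachScriptRune(fn func(r rune, name string)) {
	for name, table := range unicode.Scripts {
		for _, r := range table.R16 {
			for c := rune(r.Lo); c <= rune(r.Hi); c += rune(r.Stride) {
				fn(c, name)
			}
		}
		for _, r := range table.R32 {
			for c := rune(r.Lo); c <= rune(r.Hi); c += rune(r.Stride) {
				fn(c, name)
			}
		}
	}
}

// allocatedBy reports the heap bytes allocated while build runs.
func allocatedBy(build func()) uint64 {
	var before, after runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&before)
	build()
	runtime.ReadMemStats(&after)
	return after.TotalAlloc - before.TotalAlloc
}

// Package-level variables keep both maps reachable after they are built.
var (
	byName  map[rune]string
	byIndex map[rune]byte
)

// measure builds both map variants and returns the bytes each one allocated.
func measure() (stringMap, byteMap uint64) {
	// Count the runes first so both maps can be sized exactly up front.
	n := 0
	forEachScriptRune(func(rune, string) { n++ })

	stringMap = allocatedBy(func() {
		byName = make(map[rune]string, n)
		forEachScriptRune(func(r rune, name string) { byName[r] = name })
	})
	byteMap = allocatedBy(func() {
		byIndex = make(map[rune]byte, n)
		// Placeholder value: a constant stands in for the real name index.
		forEachScriptRune(func(r rune, _ string) { byIndex[r] = 1 })
	})
	return stringMap, byteMap
}

func main() {
	s, b := measure()
	fmt.Printf("map[rune]string: %d KiB\nmap[rune]byte:   %d KiB\n", s/1024, b/1024)
}
```

TotalAlloc counts every byte allocated during the build, so the numbers are an upper bound on what stays resident; the ratio between the two variants is the more useful figure.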

Building merged range slices

Using map[rune]byte still costs around 2 MB of RAM, and building the map takes some time at startup, which may or may not be acceptable.

There's another approach. Instead of building a map of every rune, we can store just a slice of all ranges (actually 2 slices of ranges: one with 16-bit Unicode code points, another with 32-bit Unicode code points).

The benefit comes from the fact that there are far fewer ranges than runes: only 852 (compared to 100,000+ runes). The memory used by 2 slices with a total of 852 elements is negligible compared to solution #1.

In our ranges we also store the script (name), so we can return it. We could store only a name index (as in solution #1), but with only 852 ranges it's not worth it.

We'll sort the range slices so that we can binary-search them. With about 400 elements per slice, one binary search takes about 9 steps (log2 of 400 is roughly 8.6); in the worst case both slices are searched, for about 18 steps.

OK, let's see. We're using these range wrappers:

type myR16 struct {
	r16    unicode.Range16
	script string
}

type myR32 struct {
	r32    unicode.Range32
	script string
}

And store them in:

var allR16 = []*myR16{}
var allR32 = []*myR32{}

We initialize / fill them like this:

func init() {
	// Collect the ranges of every script, tagged with the script name.
	for script, rt := range unicode.Scripts {
		for _, r16 := range rt.R16 {
			allR16 = append(allR16, &myR16{r16, script})
		}
		for _, r32 := range rt.R32 {
			allR32 = append(allR32, &myR32{r32, script})
		}
	}

	// Sort by lower bound so the slices can be binary-searched.
	sort.Slice(allR16, func(i int, j int) bool {
		return allR16[i].r16.Lo < allR16[j].r16.Lo
	})
	sort.Slice(allR32, func(i int, j int) bool {
		return allR32[i].r32.Lo < allR32[j].r32.Lo
	})
}

And finally the search in the sorted range slices:

func script(r rune) string {
	// binary search over ranges
	if r <= 0xffff {
		r16 := uint16(r)
		// Find the first range whose upper bound is not below r16.
		i := sort.Search(len(allR16), func(i int) bool {
			return allR16[i].r16.Hi >= r16
		})

		if i < len(allR16) && allR16[i].r16.Lo <= r16 && r16 <= allR16[i].r16.Hi {
			return allR16[i].script
		}
	}

	// Not found among the 16-bit ranges: search the 32-bit ranges.
	r32 := uint32(r)
	i := sort.Search(len(allR32), func(i int) bool {
		return allR32[i].r32.Hi >= r32
	})

	if i < len(allR32) && allR32[i].r32.Lo <= r32 && r32 <= allR32[i].r32.Hi {
		return allR32[i].script
	}

	// r is in no script table.
	return ""
}

Note: this relies on Stride being 1 for every range in every script table of the unicode package, which was the case when this was written. The algorithm ignores Stride, so it would need adjusting if that ever stops being true.
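
The Stride observation describes the generated tables rather than anything the unicode package documents as a guarantee, so it is worth re-checking on the Go release you build with. A minimal sketch of such a check follows; countStrides is a hypothetical helper that only reads the exported RangeTable fields.

```go
package main

import (
	"fmt"
	"unicode"
)

// countStrides (hypothetical helper) returns the number of ranges in
// unicode.Scripts and how many of them have a Stride other than 1.
func countStrides() (total, nonUnit int) {
	for _, table := range unicode.Scripts {
		for _, r := range table.R16 {
			total++
			if r.Stride != 1 {
				nonUnit++
			}
		}
		for _, r := range table.R32 {
			total++
			if r.Stride != 1 {
				nonUnit++
			}
		}
	}
	return total, nonUnit
}

func main() {
	total, nonUnit := countStrides()
	fmt.Printf("%d ranges, %d with Stride != 1\n", total, nonUnit)
	if nonUnit > 0 {
		fmt.Println("the range-slice lookup would need to take Stride into account")
	}
}
```

If the check reports any strided ranges, one option is to expand each of them into single-rune ranges before sorting, so the binary search still sees only contiguous, non-overlapping ranges.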

Testing with the same code, we get the same output. Try it on the Go Playground.
