March 14, 2015 01:20:01 · go

Determining whitespace in Go

Question

From the documentation of Go's unicode package:

> func IsSpace
>
> func IsSpace(r rune) bool
>
> IsSpace reports whether the rune is a space character as defined by Unicode's White Space property; in the Latin-1 space this is
>
> '\t', '\n', '\v', '\f', '\r', ' ', U+0085 (NEL), U+00A0 (NBSP).
>
> Other definitions of spacing characters are set by category Z and property Pattern_White_Space.

My question is: What does it mean that "other definitions" are set by the Z category and Pattern_White_Space? Does this mean that calling unicode.IsSpace(), checking whether a character is in the Z category, and checking whether a character is in Pattern_White_Space will all yield different results? If so, what are the differences? And why are there differences?

Answer 1

Score: 5

The IsSpace function first checks whether your rune is in the Latin-1 range. If it is, it uses the space characters you listed to decide whether the rune is white space.

If not, isExcludingLatin (http://golang.org/src/unicode/letter.go?h=isExcludingLatin#L170) is called, which looks like this:

    func isExcludingLatin(rangeTab *RangeTable, r rune) bool {
        r16 := rangeTab.R16
        if off := rangeTab.LatinOffset; len(r16) > off && r <= rune(r16[len(r16)-1].Hi) {
            return is16(r16[off:], uint16(r))
        }
        r32 := rangeTab.R32
        if len(r32) > 0 && r >= rune(r32[0].Lo) {
            return is32(r32, uint32(r))
        }
        return false
    }

The *RangeTable being passed in is White_Space, which is defined here:

http://golang.org/src/unicode/tables.go?h=White_Space#L6069

    var _White_Space = &RangeTable{
        R16: []Range16{
            {0x0009, 0x000d, 1},
            {0x0020, 0x0020, 1},
            {0x0085, 0x0085, 1},
            {0x00a0, 0x00a0, 1},
            {0x1680, 0x1680, 1},
            {0x2000, 0x200a, 1},
            {0x2028, 0x2029, 1},
            {0x202f, 0x202f, 1},
            {0x205f, 0x205f, 1},
            {0x3000, 0x3000, 1},
        },
        LatinOffset: 4,
    }

To answer your main question, the IsSpace check is not limited to Latin-1.
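To make the "other definitions" part of the question concrete, here is a minimal sketch that runs the three checks side by side. The names spaceKinds and classify are made up for this illustration; unicode.IsSpace, unicode.Is, unicode.Z and unicode.Pattern_White_Space come from the standard unicode package. The sample runes are just ones where the definitions are likely to disagree.

```go
package main

import (
	"fmt"
	"unicode"
)

// spaceKinds (hypothetical helper type) records how each of the
// three definitions classifies a single rune.
type spaceKinds struct {
	IsSpace, InZ, InPattern bool
}

// classify (hypothetical helper) applies all three checks to r.
func classify(r rune) spaceKinds {
	return spaceKinds{
		IsSpace:   unicode.IsSpace(r),                            // White_Space property
		InZ:       unicode.Is(unicode.Z, r),                      // category Z (Zs, Zl, Zp)
		InPattern: unicode.Is(unicode.Pattern_White_Space, r),    // Pattern_White_Space property
	}
}

func main() {
	// Tab, space, NEL, NBSP, left-to-right mark, line separator,
	// ideographic space.
	samples := []rune{'\t', ' ', 0x0085, 0x00A0, 0x200E, 0x2028, 0x3000}
	for _, r := range samples {
		k := classify(r)
		fmt.Printf("%U  IsSpace=%-5t  Z=%-5t  Pattern_White_Space=%t\n",
			r, k.IsSpace, k.InZ, k.InPattern)
	}
}
```

So yes, the three checks can disagree. A tab is a control character, so it is not in category Z even though IsSpace accepts it. NBSP is in Z and is white space but is not in Pattern_White_Space, and U+200E (left-to-right mark) is in Pattern_White_Space only.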

EDIT
For clarification, if the character you are testing is not in the Latin-1 charset, then the range table lookup is used. The Range16 values in the table represent ranges of 16-bit numbers {Lo, Hi, Stride}. isExcludingLatin calls is16 with a sub-section of that range table's R16 slice and determines whether the rune provided falls in any of the ranges from index LatinOffset onward (which is 4 in this case).

So, that is checking these ranges:

 {0x1680, 0x1680, 1},
 {0x2000, 0x200a, 1},
 {0x2028, 0x2029, 1},
 {0x202f, 0x202f, 1},
 {0x205f, 0x205f, 1},
 {0x3000, 0x3000, 1},

All of the above are Unicode code points that are considered "white space".
