Determining whitespace in Go

Question

From the documentation of Go's unicode package:

> func IsSpace
>
> func IsSpace(r rune) bool
>
> IsSpace reports whether the rune is a space character as defined by Unicode's White Space property; in the Latin-1 space this is
>
> '\t', '\n', '\v', '\f', '\r', ' ', U+0085 (NEL), U+00A0 (NBSP).
>
> Other definitions of spacing characters are set by category Z and property Pattern_White_Space.

My question is: What does it mean that "other definitions" are set by the Z category and Pattern_White_Space? Does this mean that calling unicode.IsSpace(), checking whether a character is in the Z category, and checking whether a character is in Pattern_White_Space will all yield different results? If so, what are the differences? And why are there differences?

Answer 1

Score: 5

IsSpace first checks whether your rune is in the Latin-1 range. If it is, the rune counts as white space only if it is one of the space characters you listed.

If not, isExcludingLatin (http://golang.org/src/unicode/letter.go?h=isExcludingLatin#L170) is called which looks like:

    func isExcludingLatin(rangeTab *RangeTable, r rune) bool {
        r16 := rangeTab.R16
        if off := rangeTab.LatinOffset; len(r16) > off && r <= rune(r16[len(r16)-1].Hi) {
            return is16(r16[off:], uint16(r))
        }
        r32 := rangeTab.R32
        if len(r32) > 0 && r >= rune(r32[0].Lo) {
            return is32(r32, uint32(r))
        }
        return false
    }

The *RangeTable being passed in is White_Space, which is defined here:

http://golang.org/src/unicode/tables.go?h=White_Space#L6069

    var _White_Space = &RangeTable{
        R16: []Range16{
            {0x0009, 0x000d, 1},
            {0x0020, 0x0020, 1},
            {0x0085, 0x0085, 1},
            {0x00a0, 0x00a0, 1},
            {0x1680, 0x1680, 1},
            {0x2000, 0x200a, 1},
            {0x2028, 0x2029, 1},
            {0x202f, 0x202f, 1},
            {0x205f, 0x205f, 1},
            {0x3000, 0x3000, 1},
        },
        LatinOffset: 4,
    }

To answer your main question, the IsSpace check is not limited to Latin-1.
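
The two-step check described above can be sketched roughly like this. It is a simplified illustration and not the library source: isSpaceSketch is a made-up name, and it uses the exported unicode.Is in place of the unexported isExcludingLatin.

```go
package main

import (
	"fmt"
	"unicode"
)

// isSpaceSketch is a hypothetical helper that mirrors the two-step check
// described in the answer. It is not part of the unicode package.
func isSpaceSketch(r rune) bool {
	if r >= 0 && r <= unicode.MaxLatin1 {
		// Latin-1 path: only the eight characters from the docs count.
		switch r {
		case '\t', '\n', '\v', '\f', '\r', ' ', 0x85, 0xA0:
			return true
		}
		return false
	}
	// Everything else is looked up in the White_Space range table.
	return unicode.Is(unicode.White_Space, r)
}

func main() {
	// Print the sketch next to the real function for a few sample runes.
	for _, r := range []rune{' ', 'a', 0x00A0, 0x2003, 0x3000, 0x200B} {
		fmt.Printf("U+%04X sketch=%t IsSpace=%t\n", r, isSpaceSketch(r), unicode.IsSpace(r))
	}
}
```

Running it prints both results side by side, so you can check that runes outside Latin-1 such as U+2003 and U+3000 are handled by the table lookup.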

EDIT
For clarification: if the character you are testing is not in the Latin-1 charset, the range table lookup is used. Each Range16 value in the table describes a range of 16-bit code points as {Lo, Hi, Stride}. isExcludingLatin calls is16 with the part of R16 that starts at index LatinOffset (which is 4 in this case), so the four Latin-1 entries are skipped, and reports whether the rune falls in any of the remaining ranges.

So, that is checking these ranges:

    {0x1680, 0x1680, 1},
    {0x2000, 0x200a, 1},
    {0x2028, 0x2029, 1},
    {0x202f, 0x202f, 1},
    {0x205f, 0x205f, 1},
    {0x3000, 0x3000, 1},

The Unicode code points covered by those ranges are:

http://www.fileformat.info/info/unicode/char/1680/index.htm
http://www.fileformat.info/info/unicode/char/2000/index.htm
http://www.fileformat.info/info/unicode/char/2001/index.htm
http://www.fileformat.info/info/unicode/char/2002/index.htm
http://www.fileformat.info/info/unicode/char/2003/index.htm
http://www.fileformat.info/info/unicode/char/2004/index.htm
http://www.fileformat.info/info/unicode/char/2005/index.htm
http://www.fileformat.info/info/unicode/char/2006/index.htm
http://www.fileformat.info/info/unicode/char/2007/index.htm
http://www.fileformat.info/info/unicode/char/2008/index.htm
http://www.fileformat.info/info/unicode/char/2009/index.htm
http://www.fileformat.info/info/unicode/char/200a/index.htm
http://www.fileformat.info/info/unicode/char/2028/index.htm
http://www.fileformat.info/info/unicode/char/2029/index.htm
http://www.fileformat.info/info/unicode/char/202f/index.htm
http://www.fileformat.info/info/unicode/char/205f/index.htm
http://www.fileformat.info/info/unicode/char/3000/index.htm

All of the above are considered "white space".
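
As for whether the three definitions give different results: they do for some characters, and a small program can show it. This is only a sketch. The spaceClasses type and the classify function are names I made up, while unicode.Z and unicode.Pattern_White_Space are the range tables exported by the unicode package.

```go
package main

import (
	"fmt"
	"unicode"
)

// spaceClasses records how one rune fares under each definition.
// The type is hypothetical and exists only for this example.
type spaceClasses struct {
	WhiteSpace bool // unicode.IsSpace (White_Space property)
	CategoryZ  bool // general category Z (Zs, Zl, Zp)
	PatternWS  bool // Pattern_White_Space property
}

// classify tests a rune against all three definitions.
func classify(r rune) spaceClasses {
	return spaceClasses{
		WhiteSpace: unicode.IsSpace(r),
		CategoryZ:  unicode.Is(unicode.Z, r),
		PatternWS:  unicode.Is(unicode.Pattern_White_Space, r),
	}
}

func main() {
	samples := []rune{'\t', ' ', 0x0085, 0x00A0, 0x2003, 0x200E, 0x2028}
	for _, r := range samples {
		c := classify(r)
		fmt.Printf("U+%04X IsSpace=%t Z=%t Pattern_White_Space=%t\n",
			r, c.WhiteSpace, c.CategoryZ, c.PatternWS)
	}
}
```

Control characters such as tab and U+0085 are White_Space and Pattern_White_Space but are not in category Z. NBSP and U+2003 are White_Space and in Z but are not Pattern_White_Space. U+200E (LEFT-TO-RIGHT MARK) is Pattern_White_Space only. The plain space and U+2028 are in all three.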

Posted by huangapple on March 14, 2015, 01:20:01
Original link: https://go.coder-hub.com/29038314.html