Determining whitespace in Go
Question
From the documentation of Go's unicode package:
> func IsSpace
>
> func IsSpace(r rune) bool
>
> IsSpace reports whether the rune is a space character as defined by Unicode's White Space property; in the Latin-1 space this is
>
> '\t', '\n', '\v', '\f', '\r', ' ', U+0085 (NEL), U+00A0 (NBSP).
>
> Other definitions of spacing characters are set by category Z and property Pattern_White_Space.
My question is: what does it mean that "other definitions" are set by the Z category and Pattern_White_Space? Does this mean that calling unicode.IsSpace(), checking whether a character is in the Z category, and checking whether a character is in Pattern_White_Space will all yield different results? If so, what are the differences? And why are there differences?
Answer 1
Score: 5
The IsSpace function first checks whether your rune is in the Latin-1 range. If it is, it uses the space characters you listed to decide whether the rune is white space.
If not, isExcludingLatin (http://golang.org/src/unicode/letter.go?h=isExcludingLatin#L170) is called, which looks like this:
    func isExcludingLatin(rangeTab *RangeTable, r rune) bool {
        r16 := rangeTab.R16
        if off := rangeTab.LatinOffset; len(r16) > off && r <= rune(r16[len(r16)-1].Hi) {
            return is16(r16[off:], uint16(r))
        }
        r32 := rangeTab.R32
        if len(r32) > 0 && r >= rune(r32[0].Lo) {
            return is32(r32, uint32(r))
        }
        return false
    }
The *RangeTable being passed in is White_Space, which is defined here:
http://golang.org/src/unicode/tables.go?h=White_Space#L6069
    var _White_Space = &RangeTable{
        R16: []Range16{
            {0x0009, 0x000d, 1},
            {0x0020, 0x0020, 1},
            {0x0085, 0x0085, 1},
            {0x00a0, 0x00a0, 1},
            {0x1680, 0x1680, 1},
            {0x2000, 0x200a, 1},
            {0x2028, 0x2029, 1},
            {0x202f, 0x202f, 1},
            {0x205f, 0x205f, 1},
            {0x3000, 0x3000, 1},
        },
        LatinOffset: 4,
    }
To answer your main question, the IsSpace check is not limited to Latin-1.
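To see how the three definitions mentioned in the documentation differ in practice, here is a minimal sketch that classifies a few runes with unicode.IsSpace, the unicode.Z category table, and the unicode.Pattern_White_Space property table. The helper name classify and the choice of sample runes are mine, for illustration only; the exact results depend on the Unicode tables shipped with your Go version.

```go
package main

import (
	"fmt"
	"unicode"
)

// classify (a hypothetical helper, not part of the unicode package) reports
// how each of the three definitions treats r: the White_Space property
// (unicode.IsSpace), general category Z, and the Pattern_White_Space property.
func classify(r rune) (space, catZ, pattern bool) {
	return unicode.IsSpace(r),
		unicode.Is(unicode.Z, r),
		unicode.Is(unicode.Pattern_White_Space, r)
}

func main() {
	// Sample runes chosen for illustration: tab, space, NEL, NBSP,
	// LEFT-TO-RIGHT MARK, EM SPACE, LINE SEPARATOR, IDEOGRAPHIC SPACE.
	samples := []rune{'\t', ' ', 0x0085, 0x00A0, 0x200E, 0x2003, 0x2028, 0x3000}

	fmt.Println("rune    IsSpace  Z      Pattern_White_Space")
	for _, r := range samples {
		s, z, p := classify(r)
		fmt.Printf("U+%04X  %-7t  %-5t  %t\n", r, s, z, p)
	}
}
```

With the tables in recent Go releases, the tab counts as White_Space and Pattern_White_Space but is not in category Z (it is a control character), U+00A0 is White_Space and in category Z but not Pattern_White_Space, and U+200E (LEFT-TO-RIGHT MARK) is only in Pattern_White_Space. So the three checks overlap but are not interchangeable.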
EDIT
For clarification: if the character you are testing is not in the Latin-1 charset, the range table lookup is used. The Range16 values in the table represent ranges of 16-bit numbers as {Lo, Hi, Stride}. isExcludingLatin calls is16 with a sub-slice of the R16 table and determines whether the rune provided falls in any of the ranges at or after index LatinOffset (which is 4 in this case).
So, that is checking these ranges:
    {0x1680, 0x1680, 1},
    {0x2000, 0x200a, 1},
    {0x2028, 0x2029, 1},
    {0x202f, 0x202f, 1},
    {0x205f, 0x205f, 1},
    {0x3000, 0x3000, 1},
These are the Unicode code points in those ranges:
http://www.fileformat.info/info/unicode/char/1680/index.htm
http://www.fileformat.info/info/unicode/char/2000/index.htm
http://www.fileformat.info/info/unicode/char/2001/index.htm
http://www.fileformat.info/info/unicode/char/2002/index.htm
http://www.fileformat.info/info/unicode/char/2003/index.htm
http://www.fileformat.info/info/unicode/char/2004/index.htm
http://www.fileformat.info/info/unicode/char/2005/index.htm
http://www.fileformat.info/info/unicode/char/2006/index.htm
http://www.fileformat.info/info/unicode/char/2007/index.htm
http://www.fileformat.info/info/unicode/char/2008/index.htm
http://www.fileformat.info/info/unicode/char/2009/index.htm
http://www.fileformat.info/info/unicode/char/200a/index.htm
http://www.fileformat.info/info/unicode/char/2028/index.htm
http://www.fileformat.info/info/unicode/char/2029/index.htm
http://www.fileformat.info/info/unicode/char/202f/index.htm
http://www.fileformat.info/info/unicode/char/205f/index.htm
http://www.fileformat.info/info/unicode/char/3000/index.htm
All of the above are considered "white space".