In Go, how can I state I want to match all kinds of space, including the non-breaking one?

Question

I have to match a given pattern that looks like this:

Place *: *(.*)

In other words, I have a label, some spaces, a colon, some spaces, and the value I want.

However, in some places my data contains not the usual ASCII space character (0x20) but non-breaking spaces (Unicode character \u00A0). How can I match them? I thought of using

Place\s*:\s*(.*)

but it does not seem to match the \u00A0 whitespace. Is this a bug in the regexp module, or is it intended behavior? If it is the latter, how can I match all kinds of spaces without listing them all?

Answer 1

Score: 7

The RE2 syntax does limit \s to ASCII whitespace (≡ [\t\n\f\r ]), which seems pretty much standard, so this is expected behavior rather than a bug.
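To make the behavior concrete, here is a minimal sketch that runs the question's pattern against a string containing U+00A0. The second pattern is an assumption on my part and not something the answer proposes: it widens the class with `\p{Zs}` (the Unicode "space separator" category, which Go's regexp syntax accepts), so it covers the non-breaking space without listing it explicitly. Note that `\p{Zs}` is not the same set as unicode.IsSpace; the variable names are illustrative only.

```go
package main

import (
	"fmt"
	"regexp"
)

// asciiSpace is the pattern from the question. In RE2 syntax, \s only
// covers the ASCII whitespace set [\t\n\f\r ].
var asciiSpace = regexp.MustCompile(`Place\s*:\s*(.*)`)

// unicodeSpace is an illustrative alternative (an assumption, not from the
// answer): \p{Zs} adds the Unicode space-separator category, which
// includes U+00A0.
var unicodeSpace = regexp.MustCompile(`Place[\s\p{Zs}]*:[\s\p{Zs}]*(.*)`)

func main() {
	input := "Place\u00A0:\u00A0Paris"

	// \s does not match the non-breaking space, so the pattern fails.
	fmt.Println(asciiSpace.MatchString(input)) // false

	// The widened class matches, and the capture group holds the value.
	if m := unicodeSpace.FindStringSubmatch(input); m != nil {
		fmt.Printf("%q\n", m[1]) // "Paris"
	}
}
```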

This may be a case where pre-processing the string before applying a regexp is easier.
For example, strings.Fields() splits the string around whitespace, including Unicode space runes.

// Fields splits the string s around each instance of one or more consecutive white space
// characters, as defined by unicode.IsSpace, returning an array of substrings of s or an
// empty list if s contains only white space.
func Fields(s string) []string {
    return FieldsFunc(s, unicode.IsSpace)
}

That takes care of the non-breaking space, since unicode.IsSpace() reports whether the rune is a space character as defined by Unicode's White Space property; in the Latin-1 range these are:

'\t', '\n', '\v', '\f', '\r', ' ', U+0085 (NEL), U+00A0 (NBSP).
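
The pre-processing idea could be sketched as follows. This is one possible arrangement, not necessarily what the answer had in mind: the helper names `normalizeSpaces` and `extractPlace` are hypothetical, and the sketch assumes the input is handled one line at a time, because strings.Fields also treats newlines as whitespace.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// placePattern is the question's pattern, left unchanged.
var placePattern = regexp.MustCompile(`Place\s*:\s*(.*)`)

// normalizeSpaces (hypothetical helper) collapses every run of whitespace,
// as defined by unicode.IsSpace, into a single ASCII space and drops
// leading and trailing whitespace.
func normalizeSpaces(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

// extractPlace (hypothetical helper) normalizes one line and then applies
// the ASCII-only pattern. It returns the captured value and whether the
// line matched.
func extractPlace(line string) (string, bool) {
	m := placePattern.FindStringSubmatch(normalizeSpaces(line))
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	value, ok := extractPlace("Place\u00A0:\u00A0New\u00A0York")
	fmt.Printf("%q %v\n", value, ok) // "New York" true
}
```

One trade-off of this approach is that whitespace inside the captured value is normalized as well, so a non-breaking space in the value comes back as a plain space.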

huangapple
  • Posted on February 20, 2015 at 17:41:57
  • When republishing, please keep the link to this article: https://go.coder-hub.com/28625810.html