In Go, how can I state I want to match all kinds of space, including the non-breaking one?
Question
I have to match a pattern that looks like this:
Place *: *(.*)
In other words, I have a label, some spaces, a colon, some more spaces, and then the value I want.
However, in some places my data contains non-breaking spaces (Unicode character \u00A0) instead of the usual ASCII space (0x20). How can I match them? I thought of using
Place\s*:\s*(.*)
but it does not seem to match the \u00A0 whitespace. Is this a bug in the regexp module, or is it the intended behavior? If it is the latter, how can I match all kinds of spaces without listing them all?
Answer 1
Score: 7
The RE2 syntax does limit \s to [\t\n\f\r ], i.e. ASCII whitespace only, which seems pretty much standard. So this is the intended behavior, not a bug.
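To make that limitation concrete, here is a minimal sketch comparing \s with a character class that also includes \p{Zs}, the Unicode "space separator" category that U+00A0 belongs to. The names asciiOnly, unicodeAware and extract are made up for this illustration, and the sample input is an assumption about what the data looks like.

```go
package main

import (
	"fmt"
	"regexp"
)

// asciiOnly uses \s, which RE2 defines as [\t\n\f\r ], so it does not see U+00A0.
var asciiOnly = regexp.MustCompile(`Place\s*:\s*(.*)`)

// unicodeAware adds \p{Zs} (Unicode space separators, which include U+00A0)
// to the character class. This is one possible workaround, not the only one.
var unicodeAware = regexp.MustCompile(`Place[\s\p{Zs}]*:[\s\p{Zs}]*(.*)`)

// extract is a hypothetical helper: it returns the first capture group
// and whether the pattern matched at all.
func extract(re *regexp.Regexp, s string) (string, bool) {
	m := re.FindStringSubmatch(s)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	// Assumed sample input: non-breaking spaces on both sides of the colon.
	input := "Place\u00a0:\u00a0Paris"

	v, ok := extract(asciiOnly, input)
	fmt.Printf("\\s only:       %q matched=%v\n", v, ok)

	v, ok = extract(unicodeAware, input)
	fmt.Printf("with \\p{Zs}:   %q matched=%v\n", v, ok)
}
```

Note that \p{Zs} does not cover every rune that unicode.IsSpace accepts (U+0085, for instance, is not in Zs), so this class is slightly narrower than the IsSpace definition.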
This may be a case where pre-processing the string before applying the regexp is easier.
For example, strings.Fields() splits a string around whitespace, including Unicode space runes:
// Fields splits the string s around each instance of one or more consecutive white space
// characters, as defined by unicode.IsSpace, returning an array of substrings of s or an
// empty list if s contains only white space.
func Fields(s string) []string {
	return FieldsFunc(s, unicode.IsSpace)
}
That takes care of the non-breaking space, since unicode.IsSpace() reports whether the rune is a space character as defined by Unicode's White Space property; in the Latin-1 range these are:
'\t', '\n', '\v', '\f', '\r', ' ', U+0085 (NEL), U+00A0 (NBSP).
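As a rough sketch of that pre-processing idea: split each line with strings.Fields, join the pieces back with a single ASCII space, and then apply the original pattern unchanged. The helper normalizeSpaces and the sample input are assumptions made for this example, not part of any library.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// placeRe is the pattern from the question, using plain ASCII spaces only.
var placeRe = regexp.MustCompile(`Place *: *(.*)`)

// normalizeSpaces is a hypothetical helper: it replaces every run of
// whitespace recognised by unicode.IsSpace (including U+00A0) with a
// single ASCII space, and trims leading and trailing whitespace.
func normalizeSpaces(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	// Assumed sample input containing non-breaking spaces.
	raw := "Place\u00a0:\u00a0New\u00a0York"

	clean := normalizeSpaces(raw)
	if m := placeRe.FindStringSubmatch(clean); m != nil {
		fmt.Printf("%q\n", m[1])
	}
}
```

Because strings.Fields also treats newlines as whitespace, this is best applied one line at a time. It also collapses whitespace inside the captured value, which may or may not be what you want.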