英文:
Regex of a non-greedy match different behavior
问题
我发现非贪婪正则表达式只有在锚定到开头时才变得非贪婪,而不是锚定到结尾:
$ echo abcabcabc | perl -ne 'print $1 if /^(a.*c)/'
abcabcabc
# OK,贪婪匹配
$ echo abcabcabc | perl -ne 'print $1 if /^(a.*?c)/'
abc
# 是的!非贪婪匹配
现在看看当锚定到结尾时:
$ echo abcabcabc | perl -ne 'print $1 if /(a.*c)$/'
abcabcabc
# OK,贪婪匹配
$ echo abcabcabc | perl -ne 'print $1 if /(a.*?c)$/'
abcabcabc
# 什么,非贪婪变成贪婪了吗?
为什么会这样?为什么它不像之前那样打印abc
?
(这个问题是在我的 Go 代码中发现的,但为了简单起见在 Perl 中进行了说明)。
英文:
I found that non-greedy regex match only become non-greedy when anchoring to the front, not to the end:
$ echo abcabcabc | perl -ne 'print $1 if /^(a.*c)/'
abcabcabc
# OK, greedy match
$ echo abcabcabc | perl -ne 'print $1 if /^(a.*?c)/'
abc
# YES! non-greedy match
Now look at this, when anchoring to the end:
$ echo abcabcabc | perl -ne 'print $1 if /(a.*c)$/'
abcabcabc
# OK, greedy match
$ echo abcabcabc | perl -ne 'print $1 if /(a.*?c)$/'
abcabcabc
# what, non-greedy become greedy?
why is that? how come it doesn't print abc
as before?
(The problem was found in my Go code, but illustrated in Perl for simplicity).
答案1
得分: 7
> $ echo abcabcabc | perl -ne 'print $1 if /(a.*?c)$/'
> abcabcabc
> # 什么,非贪婪变成贪婪了?
非贪婪意味着它会尽可能匹配当前位置下最少的字符,以使整个模式匹配成功。
在位置0匹配到a
后,bcabcab
是在位置1时.*?
可以匹配的最少字符,同时仍满足剩余的模式。
"abcabcabc" = /a.*?c$/
的详细匹配过程如下:
- 在位置0,
a
匹配1个字符(a
)。- 在位置1,
.*?
匹配0个字符(空字符串)。- 在位置1,
c
无法匹配。回溯!
- 在位置1,
- 在位置1,
.*?
匹配1个字符(b
)。- 在位置2,
c
匹配1个字符(c
)。- 在位置3,
$
无法匹配。回溯!
- 在位置3,
- 在位置2,
- 在位置1,
.*?
匹配2个字符(bc
)。- 在位置1,
c
无法匹配。回溯!
- 在位置1,
- ...
- 在位置1,
.*?
匹配7个字符(bcabcab
)。- 在位置8,
c
匹配1个字符(c
)。- 在位置9,
$
匹配0个字符(空字符串)。匹配成功!
- 在位置9,
- 在位置8,
- 在位置1,
"abcabcabc" = /a.*c$/
的详细匹配过程(作为对比)如下:
- 在位置0,
a
匹配1个字符(a
)。- 在位置1,
.*
匹配8个字符(abcabcabc
)。- 在位置9,
c
无法匹配。回溯!
- 在位置9,
- 在位置1,
.*
匹配7个字符(abcabcab
)。- 在位置8,
c
匹配1个字符(c
)。- 在位置9,
$
匹配0个字符(空字符串)。匹配成功!
- 在位置9,
- 在位置8,
- 在位置1,
提示:避免在模式中使用两个非贪婪修饰符。除非你将它们用作优化,否则很有可能会匹配到你不想匹配的内容。这在这里是相关的,因为模式隐式地以\G(?s:.*?)\K
开头(除非被^
、\A
或\G
取消)。
你想要的是以下之一:
/a[^a]*c$/
/a[^c]*c$/
/a[^ac]*c$/
你也可以使用以下之一:
/a(?:(?!a).)c$/s
/a(?:(?!c).)c$/s
/a(?:(?!a|c).)c$/s
在这种情况下,使用后三种模式会效率低下且难以阅读,但它们适用于边界长度超过一个字符的情况。
英文:
> $ echo abcabcabc | perl -ne 'print $1 if /(a.*?c)$/'
> abcabcabc
> # what, non-greedy become greedy?
Non-greedy means it'll match the fewest characters possible at the current location such that the entire pattern matches.
After matching a
at position 0
, bcabcab
is the least .*?
can match at position 1
while still satisfying the rest of the pattern.
"abcabcabc" = /a.*?c$/
in detail:
- At pos 0,
a
matches 1 char (a
).- At pos 1,
.*?
matches 0 chars (empty string).- At pos 1,
c
fails to match. Backtrack!
- At pos 1,
- At pos 1,
.*?
matches 1 char (b
).- At pos 2,
c
matches 1 char (c
).- At pos 3,
$
fails to match. Backtrack!
- At pos 3,
- At pos 2,
- At pos 1,
.*?
matches 2 chars (bc
).- At pos 1,
c
fails to match. Backtrack!
- At pos 1,
- ...
- At pos 1,
.*?
matches 7 chars (bcabcab
).- At pos 8,
c
matches 1 char (c
).- At pos 9,
$
matches 0 chars (empty string). Match successful!
- At pos 9,
- At pos 8,
- At pos 1,
"abcabcabc" = /a.*c$/
in detail (for contrast):
- At pos 0,
a
matches 1 char (a
).- At pos 1,
.*
matches 8 chars (abcabcabc
).- At pos 9,
c
fails to match. Backtrack!
- At pos 9,
- At pos 1,
.*
matches 7 chars (abcabcab
).- At pos 8,
c
matches 1 char (c
).- At pos 9,
$
matches 0 chars (empty string). Match successful!
- At pos 9,
- At pos 8,
- At pos 1,
Tip: Avoid patterns with two instances of a non-greediness modifier. Unless you are using them as an optimization, there's a good chance they can match something you don't want them to match. This is relevant here because patterns implicitly start with \G(?s:.*?)\K
(unless cancelled out by a leading ^
, \A
or \G
).
What you want is one of the following:
/a[^a]*c$/
/a[^c]*c$/
/a[^ac]*c$/
You could also use one of the following:
/a(?:(?!a).)c$/s
/a(?:(?!c).)c$/s
/a(?:(?!a|c).)c$/s
It would be inefficient and unreadable to use these latter three in this situation, but they will work with boundaries that are longer than one character.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论