非贪婪匹配的正则表达式行为不同

huangapple go评论126阅读模式
英文:

Regex of a non-greedy match different behavior

问题

我发现非贪婪正则表达式只有在锚定到开头时才变得非贪婪,而不是锚定到结尾:

$ echo abcabcabc | perl -ne 'print $1 if /^(a.*c)/'
abcabcabc
# OK,贪婪匹配

$ echo abcabcabc | perl -ne 'print $1 if /^(a.*?c)/'
abc
# 是的!非贪婪匹配

现在看看当锚定到结尾时:

$ echo abcabcabc | perl -ne 'print $1 if /(a.*c)$/'
abcabcabc
# OK,贪婪匹配

$ echo abcabcabc | perl -ne 'print $1 if /(a.*?c)$/'
abcabcabc
# 什么,非贪婪变成贪婪了吗?

为什么会这样?为什么它不像之前那样打印abc

(这个问题是在我的 Go 代码中发现的,但为了简单起见在 Perl 中进行了说明)。

英文:

I found that non-greedy regex match only become non-greedy when anchoring to the front, not to the end:

$ echo abcabcabc | perl -ne 'print $1 if /^(a.*c)/'
abcabcabc
# OK, greedy match

$ echo abcabcabc | perl -ne 'print $1 if /^(a.*?c)/'
abc
# YES! non-greedy match

Now look at this, when anchoring to the end:

$ echo abcabcabc | perl -ne 'print $1 if /(a.*c)$/'
abcabcabc
# OK, greedy match

$ echo abcabcabc | perl -ne 'print $1 if /(a.*?c)$/'
abcabcabc
# what, non-greedy become greedy?

why is that? how come it doesn't print abc as before?

(The problem was found in my Go code, but illustrated in Perl for simplicity).

答案1

得分: 7

> $ echo abcabcabc | perl -ne 'print $1 if /(a.*?c)$/'
> abcabcabc
> # 什么,非贪婪变成贪婪了?

非贪婪意味着它会尽可能匹配当前位置下最少的字符,以使整个模式匹配成功。

在位置0匹配到a后,bcabcab是在位置1时.*?可以匹配的最少字符,同时仍满足剩余的模式。

"abcabcabc" = /a.*?c$/ 的详细匹配过程如下:

  1. 在位置0,a匹配1个字符(a)。
    1. 在位置1,.*?匹配0个字符(空字符串)。
      1. 在位置1,c无法匹配。回溯!
    2. 在位置1,.*?匹配1个字符(b)。
      1. 在位置2,c匹配1个字符(c)。
        1. 在位置3,$无法匹配。回溯!
    3. 在位置1,.*?匹配2个字符(bc)。
      1. 在位置1,c无法匹配。回溯!
    4. ...
    5. 在位置1,.*?匹配7个字符(bcabcab)。
      1. 在位置8,c匹配1个字符(c)。
        1. 在位置9,$匹配0个字符(空字符串)。匹配成功!

"abcabcabc" = /a.*c$/ 的详细匹配过程(作为对比)如下:

  1. 在位置0,a匹配1个字符(a)。
    1. 在位置1,.*匹配8个字符(abcabcabc)。
      1. 在位置9,c无法匹配。回溯!
    2. 在位置1,.*匹配7个字符(abcabcab)。
      1. 在位置8,c匹配1个字符(c)。
        1. 在位置9,$匹配0个字符(空字符串)。匹配成功!

提示:避免在模式中使用两个非贪婪修饰符。除非你将它们用作优化,否则很有可能会匹配到你不想匹配的内容。这在这里是相关的,因为模式隐式地以\G(?s:.*?)\K开头(除非被^\A\G取消)。

你想要的是以下之一:

/a[^a]*c$/
/a[^c]*c$/
/a[^ac]*c$/

你也可以使用以下之一:

/a(?:(?!a).)c$/s
/a(?:(?!c).)c$/s
/a(?:(?!a|c).)c$/s

在这种情况下,使用后三种模式会效率低下且难以阅读,但它们适用于边界长度超过一个字符的情况。

英文:

> $ echo abcabcabc | perl -ne 'print $1 if /(a.*?c)$/'
> abcabcabc
> # what, non-greedy become greedy?

Non-greedy means it'll match the fewest characters possible at the current location such that the entire pattern matches.

After matching a at position 0, bcabcab is the least .*? can match at position 1 while still satisfying the rest of the pattern.

"abcabcabc" = /a.*?c$/ in detail:

  1. At pos 0, a matches 1 char (a).
    1. At pos 1, .*? matches 0 chars (empty string).
      1. At pos 1, c fails to match. Backtrack!
    2. At pos 1, .*? matches 1 char (b).
      1. At pos 2, c matches 1 char (c).
        1. At pos 3, $ fails to match. Backtrack!
    3. At pos 1, .*? matches 2 chars (bc).
      1. At pos 1, c fails to match. Backtrack!
    4. ...
    5. At pos 1, .*? matches 7 chars (bcabcab).
      1. At pos 8, c matches 1 char (c).
        1. At pos 9, $ matches 0 chars (empty string). Match successful!

"abcabcabc" = /a.*c$/ in detail (for contrast):

  1. At pos 0, a matches 1 char (a).
    1. At pos 1, .* matches 8 chars (abcabcabc).
      1. At pos 9, c fails to match. Backtrack!
    2. At pos 1, .* matches 7 chars (abcabcab).
      1. At pos 8, c matches 1 char (c).
        1. At pos 9, $ matches 0 chars (empty string). Match successful!

Tip: Avoid patterns with two instances of a non-greediness modifier. Unless you are using them as an optimization, there's a good chance they can match something you don't want them to match. This is relevant here because patterns implicitly start with \G(?s:.*?)\K (unless cancelled out by a leading ^, \A or \G).

What you want is one of the following:

/a[^a]*c$/
/a[^c]*c$/
/a[^ac]*c$/

You could also use one of the following:

/a(?:(?!a).)c$/s
/a(?:(?!c).)c$/s
/a(?:(?!a|c).)c$/s

It would be inefficient and unreadable to use these latter three in this situation, but they will work with boundaries that are longer than one character.

huangapple
  • 本文由 发表于 2016年12月6日 09:52:34
  • 转载请务必保留本文链接:https://go.coder-hub.com/40986527.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定