在Go正则表达式中使用负向前瞻(Negative look-ahead)

huangapple go评论88阅读模式
英文:

Negative look-ahead in Go regular expressions

问题

我正在尝试在Go中使用负向前瞻。

以下是正则表达式:BBB(((?!BBB).)*)EEE

然而,在Go中我得到了以下错误:

error parsing regexp: invalid or unsupported Perl syntax: `(?!`

是否有其他替代方法?

英文:

I'm trying to use negative look-aheads in Go.

The following regular expression: BBB(((?!BBB).)*)EEE

http://rubular.com/r/Zw1vopp1MF

However, in Go I get:

error parsing regexp: invalid or unsupported Perl syntax: `(?!`

Are there any alternatives?

答案1

得分: 49

负向先行断言由于技术原因不受支持,具体原因是它与库的O(n)时间保证冲突。请参阅golang-nuts组讨论以及Regular Expression Matching in the Wild中的注意事项部分。

您可以使用不带负向先行断言的正则表达式来表示您所描述的内容:

BBB([^B]|B[^B]|BB[^B])*EEE

这里有一个示例来演示:

package main

import (
	"fmt"
	"regexp"
)

func main() {
	re := regexp.MustCompile(`BBB([^B]|B[^B]|BB[^B])*EEE`)
	fmt.Printf("%#v\n", re.FindAllString("BBB EEE BBB..BBB...EEE", -1))
}
英文:

Negative lookahead isn't supported for technical reasons, specifically because it conflicts with the O(n)-time guarantees of the library. See the golang-nuts group discussion about this, as well as the Caveats section in Regular Expression Matching in the Wild.

You can express the regular expression you've described without negative lookahead:

BBB([^B]|B[^B]|BB[^B])*EEE

Here's an example to demonstrate:

package main

import (
	"fmt"
	"regexp"
)

func main() {
	re := regexp.MustCompile(`BBB([^B]|B[^B]|BB[^B])*EEE`)
	fmt.Printf("%#v\n", re.FindAllString("BBB EEE BBB..BBB...EEE", -1))
}

答案2

得分: 5

dlclark/regexp2 是 .NET 框架的 System.Text.RegularExpressions.Regex 引擎的一个移植版本。

.NET 字符串和 Go 字符串之间存在一些基本差异,因此在移植过程中需要从 Go 框架的正则表达式引擎中借用一些内容。在移植过程中,我清理了一些比较混乱的部分(例如 regexcharclass.cs),但解析树、生成的代码以及匹配的模式应该是相同的。

它在关于 O(n) 正则表达式的长篇讨论中被提及,并附有警告:

然而,我建议谨慎使用,因为基于 re2 的引擎具有一些其他功能更全面的引擎(包括 lookaround)所不具备的优点。如果有选择的话,建议使用标准库。

功能的代价是实现速度较慢。

英文:

dlclark/regexp2 is a port of the .NET framework's System.Text.RegularExpressions.Regex engine.

> There are some fundamental differences between .NET strings and Go strings that required a bit of borrowing from the Go framework regex engine as well. I cleaned up a couple of the dirtier bits during the port (regexcharclass.cs was terrible), but the parse tree, code emmitted, and therefore patterns matched should be identical.

It's name dropped at the end of the lengthy discussion about O(n) regular expressions, and is caveated:
> However, I would advise caution as there are benefits to the re2-based engine that are not provided by more full featured engines with lookarounds. If you have the option then stick with the stdlib.

The cost of features is a slower implementation.

答案3

得分: 4

根据你的示例和期望的输出,以下代码可以实现:

re := regexp.MustCompile(`BBB([^B]*)EEE`)

GoPlay

英文:

Based off your examples and your expected output, the following would work.

re := regexp.MustCompile(`BBB([^B]*)EEE`)

<kbd>GoPlay</kbd>

答案4

得分: 2

根据dyoo的建议,可以将以下代码段进行填充(其中*char n表示string*的第n个字符):

<pre>
<i>string</i>(?:(?!<i>endString</i>).)*?<i>endString</i></code>
</pre>

填充后的代码如下(其中*char n表示string*的第n个字符):

<pre>
<i>string</i>
(?: # 匹配0个或多个组,每个组要么
  # 不以<i>string</i>的第一个字符开头,
  [^<i>&amp;lt;char 1&amp;gt;</i>]
|
  # 共享第一个字符,但不共享第二个字符,
  <i>&amp;lt;char 1&amp;gt;</i>[^<i>&amp;lt;char 2&amp;gt;</i>]
|
  # 共享前两个字符,但不共享第三个字符,
  <i>&amp;lt;char 1&amp;gt;</i><i>&amp;lt;char 2&amp;gt;</i>[^<i>&amp;lt;char 3&amp;gt;</i>]
|
  # ...
|
  # 共享前n-1个字符,但不共享第n个字符,
  <i>&amp;lt;char 1&amp;gt;</i>...<i>&amp;lt;char n-1&amp;gt;</i>[^<i>&amp;lt;char n&amp;gt;</i>]
)*?
<i>endString</i>
</pre>

非ASCII艺术结构表示如下(其中X表示差异):

<pre>
┌──┬──┬──┬─────────────┬────┬──┐
│C1│C2│C3│     ...     │Cn-1│Cn│
└──┴──┴──┴─────────────┴────┴──┘
├──┼──┼──┼─────────────┼────┼X   <i>C1C2C3</i>...<i>Cn-1</i>[^<i>Cn</i>]
├──┼──┼──┼─────────────┼X   │    <i>C1C2C3</i>...[^<i>Cn-1</i>]
│  │  │  │     ...     │    │          ...
├──┼──┼X │             │    │    <i>C1C2</i>[^<i>C3</i>]
├──┼X │  │             │    │    <i>C1</i>[^<i>C2</i>]
├X │  │  │             │    │    [^<i>C1</i>]
</pre>

另一种使用嵌套非捕获组的表示方法如下:

<pre>
(?: # 匹配0个或多个组,每个组由
  # 第一个字符,后跟要么
  <i>&amp;lt;char 1&amp;gt;</i>
  (?:
    # 第二个字符,后跟要么
    <i>&amp;lt;char 2&amp;gt;</i>
    (?:
      # 第三个字符,后跟要么
      <i>&amp;lt;char 3&amp;gt;</i>
      # ...
        (?:
          # 第(n-1)个字符,后跟
          <i>&amp;lt;char n-1&amp;gt;</i>
          (?:
            # 不是第n个字符的内容
            [^<i>&amp;lt;char n&amp;gt;</i>]
          )
        |
          # 不是第(n-1)个字符的内容
          [^<i>&amp;lt;char n-1&amp;gt;</i>]
        )
      # ...
    |
      # 或者不是第三个字符的内容
      [^<i>&amp;lt;char 3&amp;gt;</i>]
    )
  |
    # 或者不是第二个字符的内容
    [^<i>&amp;lt;char 2&amp;gt;</i>]
  )
|
  # 或者不是第一个字符的内容
  [^<i>&amp;lt;char1&amp;gt;</i>]
)*?
</pre>

非ASCII艺术结构表示如下(其中X表示差异,表示分支):

<pre>
┌──┬──┬──┬─────────────┬────┬──┐
│C1│C2│C3│     ...     │Cn-1│Cn│
└──┴──┴──┴─────────────┴────┴──┘
├┬─┼┬─┼┬─┼─────────────┼┬───┼─X  <i>C1C2C3</i>...<i>Cn-1</i>[^<i>Cn</i>]
││ ││ ││ │             │└X       <i>C1C2C3</i>...[^<i>Cn-1</i>]
               ...                     ...
││ ││ │└X│                       <i>C1C2</i>[^<i>C3</i>]
││ │└X│  │                       <i>C1</i>[^<i>C2</i>]
│└X│  │  │                       [^<i>C1</i>]
</pre>

这两种方法实现的功能是相同的。对于较长的字符串,第二种解决方案会更短。例如,假设<div></div>分别是开始和结束标记。例如:

<div  <span>Foo</span>   div>

将被分隔为(其中_表示空格;带有颜色):

┌───────┬───┬────┬───┬───┬───┬───┬───┬───┬───┬────┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
│ <div_ │ _ │ <s │ p │ a │ n │ > │ F │ o │ o │ </ │ s │ p │ a │ n │ > │ _ │ _ │ _ │ d │ i │ v │ > |
└───────┴───┴────┴───┴───┴───┴───┴───┴───┴───┴────┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘

从上面的两个结构中可以观察到一个重复的模式,这意味着可以通过编程方式生成正则表达式。以下是一个伪代码编写的程序:

function escape(string)
  return string.replace(/[-/\\^$*+?.()|[\]{}]/g, '\$&')
end

function escapeBracket(char)
  return char == ']' ? '\\]' : char
end

function polyfill(string)
  regex = []

  for index from 0 to string.length do
    # 包括string[0],不包括string[index]
    branch = escape(string.slice(0, index)) + '[^' + escapeBracket(string[index]) + ']'
    regex.append(branch)
  end
  
  return '(?:' + regex.join('|') + ')*?'
end

polyfill('BBB')    # '(?:[^B]|B[^B]|BB[^B])*?'
polyfill('<div>')  # '(?:[^<]|<[^d]|<d[^i]|<di[^v]|<div[^>])*?'

...或者:

function polyfill2(string)
  reversedString = string.reverse()
  regex = ''

  for index from 0 to string.length do
    char = reversedString[index]
    firstBranch = regex ? escape(char) + regex + '|' : ''
    secondBranch = '[^' + escapeBracket(char) + ']'
    regex = '(?:' + firstBranch + secondBranch + ')'
  end

  return regex + '*?'
end

polyfill2('BBB')    # '(?:B(?:B(?:[^B])|[^B])|[^B])*?'
polyfill2('<div>')  # '(?:<(?:d(?:i(?:v(?:[^>])|[^v])|[^i])|[^d])|[^<])*?'

在regex101上尝试它们:

  • polyfill
    • BBB(PCRE2:711步)
    • <div>(PCRE2:4021步)
  • polyfill2
    • BBB(PCRE2:814步)
    • <div>(PCRE2:4734步)
  • 负向先行断言:
    • BBB(PCRE2:991步)
    • <div>(PCRE2:5887步)

Playground:

<!-- begin snippet: js hide: true console: false babel: false -->

<!-- language: lang-js -->
const [p1, p2, p3] = document.querySelectorAll('div');
const [start, end] = document.querySelectorAll('input');
const data = document.querySelector('#data');

start.addEventListener('input', inputHandler);
end.addEventListener('input', inputHandler);

start.dispatchEvent(new Event('input'));


function inputHandler() {
  const escaped = [escape(start.value), escape(end.value)];
  
  [
    [p1, polyfill],
    [p2, polyfill2],
    [p3, lookahead]
  ].forEach(([element, fn]) => {
    element.textContent = escaped.join(fn(start.value));
  });
  
  data.textContent = generateData(start.value, end.value);
}

function generateData(start, end) {
  const data = [];

  const middleLength = 10 + Math.random() * 30 | 0;
  const middle = [...Array(middleLength)].map(() => {
    const original = (Math.random() > 0.5 ? end : start);
    return original[Math.random() * original.length | 0];
  }).join('');
  
  for (let i = 0; i < start.length; i++) {
    for (let j = 0; j < end.length; j++) {
      data.push(
        start + start.slice(0, i + 1) +
        middle +
        end.slice(-j - 1) + end
      );
    }
  }
  
  return data.join('\n');
}

function escape(string) {
	return string.replace(/[-/\\^$*+?.()|[\]{}]/g, '\\$&');
}

function escapeBracket(char) {
	return char === ']' ? '\\]' : char;
}


function polyfill(string) {
	const regex = [];

	for (let i = 0; i < string.length; i++) {
		const branch = escape(string.slice(0, i)) + `[^${escapeBracket(string[i])}]`;
		regex.push(branch);
	}

	return `(?:${regex.join('|')})*?`;
}

function polyfill2(string) {
	const reversedString = [...string].reverse();
	let regex = '';

	for (let i = 0; i < string.length; i++) {
		const char = reversedString[i];
		const firstBranch = regex ? escape(char) + regex + '|' : '';
		const secondBranch = `[^${escapeBracket(char)}]`;
		regex = `(?:${firstBranch}${secondBranch})`;
	}

	return regex + '*?';
}

function lookahead(string) {
  return `(?:(?!${escape(string)}).)*?`;
}

<!-- language: lang-css -->
label {
  display: flex;
  align-items: start;
  flex-direction: column;
  gap: 1em;
  margin-top: 2em;
}

label > :nth-child(n + 2) {
  align-self: stretch;
  font-family: monospace;
}

input {
  padding: 1.5em 1em;
  height: 0;
}

div {
  word-break: break-all;
}

#data {
  height: 5em;
  white-space: pre;
}

<!-- language: lang-html -->
<label>
  <span>Start/end:</span>
  <input type="text" value="[foo-bar(baz)qux]">
  <input type="text" value="[foo-bar(baz)qux]">
</label>

<label>
  <span>Polyfill:</span>
  <div id="p1"></div>
</label>

<label>
  <span>Polyfill2:</span>
  <div id="p2"></div>
</label>

<label>
  <span>Negative lookahead:</span>
  <div id="p3"></div>
</label>

<label>
  <span>Random text:</span>
  <textarea id="data"></textarea>
</label>

<!-- end snippet -->

(免责声明:我不了解Go。这个答案是根据@markalex在q/76035959上的建议编写的)

英文:

As suggested by dyoo,

<pre>
<i>string</i>(?:(?!<i>endString</i>).)*?<i>endString</i></code>
</pre>

can be polyfilled as (where char n means the nth char of string):

<pre>
<i>string</i>
(?: # Match 0+ groups, each either

doesn't start with the first char of <i>string</i>,

[^<i>&lt;char 1&gt;</i>]
|

share the first, but not the second,

<i>&lt;char 1&gt;</i>[^<i>&lt;char 2&gt;</i>]
|

share the first two, but not the third,

<i>&lt;char 1&gt;</i><i>&lt;char 2&gt;</i>[^<i>&lt;char 3&gt;</i>]
|

...

|

share the first n - 1, but not the nth,

<i>&lt;char 1&gt;</i>...<i>&lt;char n-1&gt;</i>[^<i>&lt;char n&gt;</i>]
)*?
<i>endString</i>
</pre>

Structure in non-ASCII art (where X denotes a difference):

<pre>
┌──┬──┬──┬─────────────┬────┬──┐
│C1│C2│C3│ ... │Cn-1│Cn│
└──┴──┴──┴─────────────┴────┴──┘
├──┼──┼──┼─────────────┼────┼X <i>C1C2C3</i>...<i>Cn-1</i>[^<i>Cn</i>]
├──┼──┼──┼─────────────┼X │ <i>C1C2C3</i>...[^<i>Cn-1</i>]
│ │ │ │ ... │ │ ...
├──┼──┼X │ │ │ <i>C1C2</i>[^<i>C3</i>]
├──┼X │ │ │ │ <i>C1</i>[^<i>C2</i>]
├X │ │ │ │ │ [^<i>C1</i>]
</pre>

Another representation using nested non-capturing groups:

<pre>
(?: # Match 0+ groups, each consists of

the first char, followed by either

<i>&lt;char 1&gt;</i>
(?:
# the second char, followed by either
<i>&lt;char 2&gt;</i>
(?:
# the third char, followed by either
<i>&lt;char 3&gt;</i>
# ...
(?:
# the (n-1)th char, followed by
<i>&lt;char n-1&gt;</i>
(?:
# something that is not the nth char
[^<i>&lt;char n&gt;</i>]
)
|
# something that is not the (n-1)th char
[^<i>&lt;char n-1&gt;</i>]
)
# ...
|
# or something that is not the third char
[^<i>&lt;char 3&gt;</i>]
)
|
# or something that is not the second char
[^<i>&lt;char 2&gt;</i>]
)
|

or something that is not the first char

[^<i>&lt;char1&gt;</i>]
)*?
</pre>

Structure in non-ASCII art (where X denotes a difference and a branch):

<pre>
┌──┬──┬──┬─────────────┬────┬──┐
│C1│C2│C3│ ... │Cn-1│Cn│
└──┴──┴──┴─────────────┴────┴──┘
├┬─┼┬─┼┬─┼─────────────┼┬───┼─X <i>C1C2C3</i>...<i>Cn-1</i>[^<i>Cn</i>]
││ ││ ││ │ │└X <i>C1C2C3</i>...[^<i>Cn-1</i>]
... ...
││ ││ │└X│ <i>C1C2</i>[^<i>C3</i>]
││ │└X│ │ <i>C1</i>[^<i>C2</i>]
│└X│ │ │ [^<i>C1</i>]
</pre>

These two do the same thing. For long strings, the second solution will be substantially shorter. For example, let &lt;div&gt; and &lt;/div&gt; be the start and end mark, correspondingly. For example:

&lt;div  &lt;span&gt;Foo&lt;/span&gt;   div&gt;

will be separated as (where _ denotes a space; with colors):

┌───────┬───┬────┬───┬───┬───┬───┬───┬───┬───┬────┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
│ &lt;div_ │ _ │ &lt;s │ p │ a │ n │ &gt; │ F │ o │ o │ &lt;/ │ s │ p │ a │ n │ &gt; │ _ │ _ │ _ │ d │ i │ v │ &gt; |
└───────┴───┴────┴───┴───┴───┴───┴───┴───┴───┴────┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘

As one can observe from the two structures above, there is a repeating pattern, which means the regex can be generated programmatically. Here's such a program, written in pseudocode:

function escape(string)
return string.replace(/[-/\\^$*+?.()|[\]{}]/g, &#39;\\$&amp;&#39;)
end
function escapeBracket(char)
return char == &#39;]&#39; ? &#39;\\]&#39; : char
end
function polyfill(string)
regex = []
for index from 0 to string.length do
# string[0] included, string[index] excluded
branch = escape(string.slice(0, index)) + &#39;[^{escapeBracket(string[index])}]&#39;
regex.append(branch)
end
return &#39;(?:{regex.join(&#39;|&#39;)})*?&#39;
end
polyfill(&#39;BBB&#39;)    # &#39;(?:[^B]|B[^B]|BB[^B])*?&#39;
polyfill(&#39;&lt;div&gt;&#39;)  # &#39;(?:[^&lt;]|&lt;[^d]|&lt;d[^i]|&lt;di[^v]|&lt;div[^&gt;])*?&#39;

...or:

function polyfill2(string)
reversedString = string.reverse()
regex = &#39;&#39;
for index from 0 to string.length do
char = reversedString[index]
firstBranch = regex ? escape(char) + regex + &#39;|&#39; : &#39;&#39;
secondBranch = &#39;[^{escapeBracket(char)}]&#39;
regex = &#39;(?:{firstBranch}{secondBranch})&#39;
end
return regex + &#39;*?&#39;
end
polyfill2(&#39;BBB&#39;)    # (?:B(?:B(?:[^B])|[^B])|[^B])*?
polyfill2(&#39;&lt;div&gt;&#39;)  # (?:&lt;(?:d(?:i(?:v(?:[^&gt;])|[^v])|[^i])|[^d])|[^&lt;])*?

Try them on regex101:

Playground:

<!-- begin snippet: js hide: true console: false babel: false -->

<!-- language: lang-js -->

const [p1, p2, p3] = document.querySelectorAll(&#39;div&#39;);
const [start, end] = document.querySelectorAll(&#39;input&#39;);
const data = document.querySelector(&#39;#data&#39;);
start.addEventListener(&#39;input&#39;, inputHandler);
end.addEventListener(&#39;input&#39;, inputHandler);
start.dispatchEvent(new Event(&#39;input&#39;));
function inputHandler() {
const escaped = [escape(start.value), escape(end.value)];
[
[p1, polyfill],
[p2, polyfill2],
[p3, lookahead]
].forEach(([element, fn]) =&gt; {
element.textContent = escaped.join(fn(start.value));
});
data.textContent = generateData(start.value, end.value);
}
function generateData(start, end) {
const data = [];
const middleLength = 10 + Math.random() * 30 | 0;
const middle = [...Array(middleLength)].map(() =&gt; {
const original = (Math.random() &gt; 0.5 ? end : start);
return original[Math.random() * original.length | 0];
}).join(&#39;&#39;);
for (let i = 0; i &lt; start.length; i++) {
for (let j = 0; j &lt; end.length; j++) {
data.push(
start + start.slice(0, i + 1) +
middle +
end.slice(-j - 1) + end
);
}
}
return data.join(&#39;\n&#39;);
}
function escape(string) {
return string.replace(/[-/\\^$*+?.()|[\]{}]/g, &#39;\\$&amp;&#39;);
}
function escapeBracket(char) {
return char === &#39;]&#39; ? &#39;\\]&#39; : char;
}
function polyfill(string) {
const regex = [];
for (let i = 0; i &lt; string.length; i++) {
const branch = escape(string.slice(0, i)) + `[^${escapeBracket(string[i])}]`;
regex.push(branch);
}
return `(?:${regex.join(&#39;|&#39;)})*?`;
}
function polyfill2(string) {
const reversedString = [...string].reverse();
let regex = &#39;&#39;;
for (let i = 0; i &lt; string.length; i++) {
const char = reversedString[i];
const firstBranch = regex ? escape(char) + regex + &#39;|&#39; : &#39;&#39;;
const secondBranch = `[^${escapeBracket(char)}]`;
regex = `(?:${firstBranch}${secondBranch})`;
}
return regex + &#39;*?&#39;;
}
function lookahead(string) {
return `(?:(?!${escape(string)}).)*?`;
}

<!-- language: lang-css -->

label {
display: flex;
align-items: start;
flex-direction: column;
gap: 1em;
margin-top: 2em;
}
label &gt; :nth-child(n + 2) {
align-self: stretch;
font-family: monospace;
}
input {
padding: 1.5em 1em;
height: 0;
}
div {
word-break: break-all;
}
#data {
height: 5em;
white-space: pre;
}

<!-- language: lang-html -->

&lt;label&gt;
&lt;span&gt;Start/end:&lt;/span&gt;
&lt;input type=&quot;text&quot; value=&quot;[foo-bar(baz)qux]&quot;&gt;
&lt;input type=&quot;text&quot; value=&quot;[foo-bar(baz)qux]&quot;&gt;
&lt;/label&gt;
&lt;label&gt;
&lt;span&gt;Polyfill:&lt;/span&gt;
&lt;div id=&quot;p1&quot;&gt;&lt;/div&gt;
&lt;/label&gt;
&lt;label&gt;
&lt;span&gt;Polyfill2:&lt;/span&gt;
&lt;div id=&quot;p2&quot;&gt;&lt;/div&gt;
&lt;/label&gt;
&lt;label&gt;
&lt;span&gt;Negative lookahead:&lt;/span&gt;
&lt;div id=&quot;p3&quot;&gt;&lt;/div&gt;
&lt;/label&gt;
&lt;label&gt;
&lt;span&gt;Random text:&lt;/span&gt;
&lt;textarea id=&quot;data&quot;&gt;&lt;/textarea&gt;
&lt;/label&gt;

<!-- end snippet -->

(Disclaimer: I don't know Go. This answer was written as suggested by @markalex at q/76035959)

答案5

得分: 0

根据Andy的回答,还有一个由h2s05开发的模块实现了扩展正则表达式功能https://pkg.go.dev/github.com/h2so5/goback,并且它实现了与普通regexp库相同的API接口,如果你不想重构以支持dlclark库中的API差异,这个模块非常方便。

英文:

Based on Andy's answer. There is another mod by h2s05 that implements extended regex https://pkg.go.dev/github.com/h2so5/goback AND it implements the same api interface as the normal regexp library, handy if you don't want refactor to support the api differences in dlclark's library

huangapple
  • 本文由 发表于 2014年11月6日 12:13:46
  • 转载请务必保留本文链接:https://go.coder-hub.com/26771592.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定