Is there a way to match everything except a constant string using Go.Regexp?


Question

I have found many similar questions, but their answers do not work with the Go regex syntax.

The strings I am trying to match have the form `anything/anything/somestring`. With the pattern `\/.*\/.*\/(.*)` I can match `somestring`, but I want to match anything except strings that contain `somestring`.

Most answers propose something like `\/.*\/.*\/((?!somestring).*)`, but with Go's regexp I get the error: `? The preceding token is not quantifiable`.

For clarification: `/test/test/MATCH` should produce a match, while `/test/test/somestring` should not. Is this possible with the (limited) Go regex syntax?

Answer 1

Score: 9

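**Update**

Go's `regexp` package does not support lookaheads because it guarantees O(n) running time, and its authors have not found a way to add lookarounds without breaking that guarantee.

**However**, there are workarounds. One of them is the web service at http://www.formauri.es/personal/pgimeno/misc/non-match-regex, which generates POSIX-compatible negated patterns. For `somestring`, it generates this regex:

    ^([^s]|s(s|o(s|m(s|es(omes)*(s|t(s|r(s|i(s|ns)))|o(s|ms)))))*([^os]|o([^ms]|m([^es]|e([^s]|s(omes)*([^ost]|t([^rs]|r([^is]|i([^ns]|n[^gs])))|o([^ms]|m([^es]|e[^s]))))))))*(s(s|o(s|m(s|es(omes)*(s|t(s|r(s|i(s|ns)))|o(s|ms)))))*(o((me?)?|mes(omes)*(t(r?|rin?)|o(me?)?)?))?)?$

To use it in your original regex, replace the last `(.*)` with the generated pattern minus its leading `^`, keeping the trailing `$` outside the capturing group. The regex then looks like this:

    /[^/]*/[^/]*/(([^s]|s(s|o(s|m(s|es(omes)*(s|t(s|r(s|i(s|ns)))|o(s|ms)))))*([^os]|o([^ms]|m([^es]|e([^s]|s(omes)*([^ost]|t([^rs]|r([^is]|i([^ns]|n[^gs])))|o([^ms]|m([^es]|e[^s]))))))))*(s(s|o(s|m(s|es(omes)*(s|t(s|r(s|i(s|ns)))|o(s|ms)))))*(o((me?)?|mes(omes)*(t(r?|rin?)|o(me?)?)?))?)?)$

See the [regex demo][1].

To make sure the regex only captures the part after the third slash, the first two `.*` patterns are replaced with `[^/]*`, which matches zero or more characters other than `/`. (In the demo, I also added `\n`, to avoid matching across lines, because the demo uses a single multiline string.)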
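As a rough check of the workaround above, the sketch below compiles the combined pattern with Go's standard `regexp` package and tries it on a few paths. The names `negated` and `lastPart` are made up for this illustration, and the pattern is copied as-is from the answer rather than regenerated.

```go
package main

import (
	"fmt"
	"regexp"
)

// negated is the combined pattern from the answer: two [^/]* segments followed
// by a capture group that only matches text not containing "somestring".
// The variable name is hypothetical; the pattern text is copied verbatim.
var negated = regexp.MustCompile(`/[^/]*/[^/]*/(([^s]|s(s|o(s|m(s|es(omes)*(s|t(s|r(s|i(s|ns)))|o(s|ms)))))*([^os]|o([^ms]|m([^es]|e([^s]|s(omes)*([^ost]|t([^rs]|r([^is]|i([^ns]|n[^gs])))|o([^ms]|m([^es]|e[^s]))))))))*(s(s|o(s|m(s|es(omes)*(s|t(s|r(s|i(s|ns)))|o(s|ms)))))*(o((me?)?|mes(omes)*(t(r?|rin?)|o(me?)?)?))?)?)$`)

// lastPart (a hypothetical helper) returns capture group 1 and true when the
// path matches, or "" and false when the pattern rejects it.
func lastPart(path string) (string, bool) {
	m := negated.FindStringSubmatch(path)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	paths := []string{
		"/test/test/MATCH",
		"/test/test/something",
		"/test/test/somestring",
	}
	for _, p := range paths {
		part, ok := lastPart(p)
		fmt.Printf("%q matched=%t captured=%q\n", p, ok, part)
	}
}
```

The pattern is hard to read and has to be regenerated whenever the excluded word changes, so it is mainly worth considering when the filtering really must happen inside a single regex.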

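**Originally accepted answer**

`anything/anything/somestring` should not be expressed as `\/.*\/.*\/(.*)`, because the first `.*` matches everything up to the last but one `/` in the string. Use a negated character class, `[^/]`, instead (note that `/` does not need to be escaped in a Go regex).

Since RE2, the engine Go uses, does not support lookaheads, you need to match all three parts, capture the one you are interested in (as [JimB mentions in the comments][2]), and decide what to return after checking the value of capture group #1:

    package main

    import (
    	"fmt"
    	"regexp"
    )

    func main() {
    	s := "anything/anything/somestring"
    	r := regexp.MustCompile(`^[^/]+/[^/]+/(.*)`)
    	val := r.FindStringSubmatch(s)
    	// fmt.Println(val[1]) // -> somestring
    	if len(val) > 1 && val[1] != "somestring" { // Did the regex match, and is group 1 something other than somestring?
    		fmt.Println(val[1]) // Use val[1]
    	} else {
    		fmt.Println("No match") // Otherwise, report no match
    	}
    }

See the [Go demo][3].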
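The same capture-then-check idea can be wrapped in a small helper so it is reusable for several inputs. This is only a sketch: `thirdPartExcluding` and `segments` are hypothetical names, and it assumes the question's wording ("strings that contain somestring"), so it uses `strings.Contains`. Swap that for `==` if only an exact match should be rejected, as in the code above.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// segments matches "first/second/rest" and captures everything after the
// second slash. The trailing $ makes the intent explicit for one-line input.
var segments = regexp.MustCompile(`^[^/]+/[^/]+/(.*)$`)

// thirdPartExcluding (a hypothetical helper) returns the captured third part
// and true, unless the input has the wrong shape or the captured part
// contains the excluded text.
func thirdPartExcluding(s, excluded string) (string, bool) {
	m := segments.FindStringSubmatch(s)
	if m == nil || strings.Contains(m[1], excluded) {
		return "", false
	}
	return m[1], true
}

func main() {
	inputs := []string{
		"anything/anything/MATCH",
		"anything/anything/somestring",
		"no-slashes-here",
	}
	for _, in := range inputs {
		if part, ok := thirdPartExcluding(in, "somestring"); ok {
			fmt.Println(part)
		} else {
			fmt.Println("No match")
		}
	}
}
```

Keeping the exclusion in ordinary Go code rather than in the pattern means the excluded word can change at run time without rebuilding the regex.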

  [1]: https://regex101.com/r/JLkVQc/1/
  [2]: https://stackoverflow.com/questions/42515407/is-there-a-way-to-match-everything-except-a-constant-string-using-go-regexp#comment72168422_42515407
  [3]: https://play.golang.org/p/3vlJ5zq6l7


Answer 2

Score: 8

Go intentionally leaves this feature out because there is no known way to implement it in O(n) time, which the package guarantees for true regular expressions. According to Russ Cox:

> The lack of generalized assertions, like the lack of backreferences,
> is not a statement on our part about regular expression style. It is
> a consequence of not knowing how to implement them efficiently. If
> you can implement them while preserving the guarantees made by the
> current package regexp, namely that it makes a single scan over the
> input and runs in O(n) time, then I would be happy to review and
> approve that CL. However, I have pondered how to do this for five
> years, off and on, and gotten nowhere.

It looks like the best way to do this is to check the match manually afterwards, as JimB mentions above.
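To see this limitation first-hand, the sketch below asks the standard `regexp` package to compile the lookahead pattern from the question alongside a lookahead-free one. `compileError` is a hypothetical helper name; `regexp.Compile` is the only library call used, and it returns an error instead of panicking.

```go
package main

import (
	"fmt"
	"regexp"
)

// compileError (a hypothetical helper) returns the error reported by
// regexp.Compile for the pattern, or nil when the pattern is accepted.
func compileError(pattern string) error {
	_, err := regexp.Compile(pattern)
	return err
}

func main() {
	patterns := []string{
		`/.*/.*/((?!somestring).*)`, // uses a negative lookahead
		`/[^/]*/[^/]*/(.*)`,         // plain RE2 syntax
	}
	for _, p := range patterns {
		if err := compileError(p); err != nil {
			fmt.Printf("%q rejected: %v\n", p, err)
		} else {
			fmt.Printf("%q compiles\n", p)
		}
	}
}
```

Using `regexp.Compile` rather than `regexp.MustCompile` is a reasonable choice whenever the pattern comes from user input or from answers written for other regex flavors, since unsupported syntax then surfaces as an ordinary error.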

Answer 3

Score: 1

There is a library called regexp2 that implements a feature-rich regex engine for Go. It does not offer the linear-time guarantee of the built-in `regexp` package, but because it is a backtracking engine it supports lookaheads, so you can use something like `(?!somestring)` to solve your problem.

huangapple
  • Posted on March 1, 2017 at 01:42:17
  • When reposting, please keep the link to this article: https://go.coder-hub.com/42515407.html