Match until character, but don't include that character

Question

I am trying to match against inputs like:

foo=bar baz foo:1  foo:234.mds32  notfoo:baz  foo:bak foo:nospace foo:bar

and output 6 matches: everything but the notfoo. The matches should be like foo:bar (i.e. not including leading or trailing spaces).

In general, the rules I am trying to match are:

  • Find any key-value (kv) pair whose key is foo, where key and value are delimited by = or :.
  • Pairs are separated from each other by whitespace. There may be multiple spaces, or arbitrary strings, in between kv pairs.
  • As a result of the above, a kv pair must have a space or the line start/end on either side.

The best regex I have so far is '(?:\s|^)(?P<primary>foo[:=].+?)\s', from which I then extract the primary group.

The problem is that, because the \s is part of the match, matches that would need to overlap are lost: foo:bak foo:nospace foo:bar breaks because the whitespace between two pairs would have to be matched twice (once as the trailing \s of one pair and once as the leading \s of the next), and Go's regexp doesn't return overlapping matches.

In other regex engines I think lookahead can be used, but as far as I can tell this is not allowed with golang regex.

Is there any way to accomplish this?

Go playground link: https://play.golang.org/p/n8gnWwpiBSR
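
To make the failure concrete, here is a minimal sketch that runs the pattern from the question through FindAllStringSubmatch. The helper name findWithTrailingSpace is only illustrative and does not come from the original post.

```go
package main

import (
	"fmt"
	"regexp"
)

// input is the sample line from the question.
const input = `foo=bar baz foo:1  foo:234.mds32  notfoo:baz  foo:bak foo:nospace foo:bar`

// findWithTrailingSpace (hypothetical helper name) applies the question's
// pattern and returns the contents of the "primary" group for every match.
func findWithTrailingSpace(s string) []string {
	re := regexp.MustCompile(`(?:\s|^)(?P<primary>foo[:=].+?)\s`)
	idx := re.SubexpIndex("primary") // requires Go 1.15 or later
	var out []string
	for _, m := range re.FindAllStringSubmatch(s, -1) {
		out = append(out, m[idx])
	}
	return out
}

func main() {
	for _, m := range findWithTrailingSpace(input) {
		fmt.Printf("%q\n", m)
	}
}
```

On the sample input this returns only four of the six expected pairs. foo:nospace is skipped because the space in front of it was already consumed by the foo:bak match, and foo:bar is skipped because no whitespace follows it.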

Answer 1

Score: 2

Unfortunately, Go's regexp package has no lookaround support. You can work around this limitation by doubling every whitespace character (e.g. with regexp.MustCompile(`\s`).ReplaceAllString(d, "$0$0")) and then matching with (?:\s|^)(?P<primary>foo[:=]\S+(?:\s+[^:\s]+)*)(?:\s|$):

package main

import (
	"fmt"
	"regexp"
)

func main() {
	var d = `foo=bar baz foo:1  foo:234.mds32  notfoo:baz  foo:bak foo:nospace foo:bar`
	// Double every whitespace character so that adjacent matches
	// no longer compete for the same separator.
	d = regexp.MustCompile(`\s`).ReplaceAllString(d, "$0$0")
	r := regexp.MustCompile(`(?:\s|^)(?P<primary>foo[:=]\S+(?:\s+[^:\s]+)*)(?:\s|$)`)
	// Look up the index of the named group once, then print it for each match.
	idx := r.SubexpIndex("primary")
	for _, m := range r.FindAllStringSubmatch(d, -1) {
		fmt.Printf("%q\n", m[idx])
	}
}

See the Go demo. Output:

"foo=bar  baz"
"foo:1"
"foo:234.mds32"
"foo:bak"
"foo:nospace"
"foo:bar"

Details:

  • (?:\s|^) - a whitespace or start of string
  • (?P<primary>foo[:=]\S+(?:\s+[^:\s]+)*) - Group "primary": foo, a colon or = char, one or more non-whitespace chars, and then zero or more occurrences of one or more whitespaces followed by one or more chars other than a whitespace or colon
  • (?:\s|$) - a whitespace or end of string.
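
Because the input is doubled before matching, a captured value that contains whitespace comes back with doubled whitespace too (for example "foo=bar  baz"). The following is a minimal sketch of one way to restore the original spacing, assuming that is wanted; the helper name findFooPairs and the variable names are illustrative, not part of the answer above.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	anySpace    = regexp.MustCompile(`\s`)
	doubledPair = regexp.MustCompile(`(\s)\s`)
	fooPair     = regexp.MustCompile(`(?:\s|^)(?P<primary>foo[:=]\S+(?:\s+[^:\s]+)*)(?:\s|$)`)
)

// findFooPairs (hypothetical helper name) doubles the whitespace, matches,
// and then halves the whitespace inside each captured value again.
func findFooPairs(s string) []string {
	doubled := anySpace.ReplaceAllString(s, "$0$0")
	idx := fooPair.SubexpIndex("primary")
	var out []string
	for _, m := range fooPair.FindAllStringSubmatch(doubled, -1) {
		// A captured value starts and ends with a non-whitespace character,
		// so every whitespace run inside it is made of complete doubled
		// pairs. Keeping the first character of each pair undoes the doubling.
		out = append(out, doubledPair.ReplaceAllString(m[idx], "$1"))
	}
	return out
}

func main() {
	d := `foo=bar baz foo:1  foo:234.mds32  notfoo:baz  foo:bak foo:nospace foo:bar`
	for _, m := range findFooPairs(d) {
		fmt.Printf("%q\n", m)
	}
}
```

With this change the first value prints as "foo=bar baz" and the other five are unchanged.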

Answer 2

Score: 2

There are several approaches you could take:

  1. Just change your pattern to (?:\s|^)(?P<primary>foo[:=]\S+), as Wiktor Stribiżew mentions in a comment, instead of matching .+? up to \s. This solves the problem with no shenanigans, but I will list a few more options that might apply to similar problems where the terminator cannot be so easily expressed as a negated character class.

  2. Since the problem is with the FindAll functions not allowing the overlap, don't use them! Instead, roll your own, using FindStringSubmatchIndex to get the boundaries of one match, extract the matched text by slicing the string, then do d = d[endIndex-1:] and loop until FindStringSubmatchIndex returns nil.

  3. Use the Split method of a regexp compiled from \s+ to break the input string into whitespace-separated components, then discard the ones that don't match ^foo[:=]. You could even use strings.HasPrefix(s, "foo:") || strings.HasPrefix(s, "foo=") instead of a regexp for that check. The remaining components are your desired matches, and the whitespace around them has already been discarded by the split. In my opinion this version conveys intent more clearly than trying to use a match.
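
The three approaches can be sketched roughly as follows. The function names are illustrative. For approach 2 this sketch makes two assumptions that differ from the text: it resumes at the end of the captured group (equivalent to stepping back one character whenever a trailing whitespace was consumed), and it uses (?:\s|$) at the end of the pattern so that the last pair on the line is also found.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

const input = `foo=bar baz foo:1  foo:234.mds32  notfoo:baz  foo:bak foo:nospace foo:bar`

// Approach 1: stop before the first whitespace with \S+ instead of consuming it.
func matchNonSpace(s string) []string {
	re := regexp.MustCompile(`(?:\s|^)(?P<primary>foo[:=]\S+)`)
	idx := re.SubexpIndex("primary")
	var out []string
	for _, m := range re.FindAllStringSubmatch(s, -1) {
		out = append(out, m[idx])
	}
	return out
}

// Approach 2: find one pair at a time and resume right after the captured
// group, so the trailing whitespace can act as the next leading whitespace.
func matchManually(s string) []string {
	re := regexp.MustCompile(`(?:\s|^)(?P<primary>foo[:=].+?)(?:\s|$)`)
	idx := re.SubexpIndex("primary")
	var out []string
	for {
		loc := re.FindStringSubmatchIndex(s)
		if loc == nil {
			return out
		}
		start, end := loc[2*idx], loc[2*idx+1]
		out = append(out, s[start:end])
		s = s[end:] // the remainder begins with whitespace or is empty
	}
}

// Approach 3: split on whitespace and keep the components with a foo key.
func matchBySplitting(s string) []string {
	var out []string
	for _, part := range regexp.MustCompile(`\s+`).Split(s, -1) {
		if strings.HasPrefix(part, "foo:") || strings.HasPrefix(part, "foo=") {
			out = append(out, part)
		}
	}
	return out
}

func main() {
	fmt.Printf("%q\n", matchNonSpace(input))
	fmt.Printf("%q\n", matchManually(input))
	fmt.Printf("%q\n", matchBySplitting(input))
}
```

One difference to keep in mind: the prefix check in approach 3 also accepts a bare "foo:" with an empty value, which the \S+ in approach 1 would reject.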

Answer 3

Score: 1

Other people have given excellent answers using regular expressions as requested. Might I be so bold as to suggest a non-regex answer?

I find that regexes are not the best solution for this situation. It is better to split the string using strings.Fields(original) to get a list of substrings. Then split each substring depending on whether it contains a =, a :, or neither. Fields() splits much like the default field splitting in awk: it treats a run of consecutive whitespace as a single separator.

Working example here: https://play.golang.org/p/xXaA9skdplz


package main

import (
	"fmt"
	"strings"
)

func main() {
	original := `foo=bar baz foo:1  foo:234.mds32  notfoo:baz  foo:bak foo:nospace foo:bar`

	// Fields splits on runs of whitespace; each item is then split on "=" or ":".
	for _, item := range strings.Fields(original) {
		if kv := strings.SplitN(item, "=", 2); len(kv) == 2 {
			fmt.Printf("key/value: %q -> %q\n", kv[0], kv[1])
		} else if kv := strings.SplitN(item, ":", 2); len(kv) == 2 {
			fmt.Printf("key/value: %q -> %q\n", kv[0], kv[1])
		} else {
			fmt.Printf("key: %q\n", item)
		}
	}
}

Obviously you'll need to modify this code to collect the answers rather than print them.

If you have to use regexes, then please use the other answers.
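
As a rough sketch of the "collect rather than print" variant mentioned above: the pair type and the collectPairs function are hypothetical names introduced for illustration, and this version splits at the first = or : via strings.IndexAny instead of trying = and : in turn.

```go
package main

import (
	"fmt"
	"strings"
)

const input = `foo=bar baz foo:1  foo:234.mds32  notfoo:baz  foo:bak foo:nospace foo:bar`

// pair holds one key/value item; the type name is illustrative.
type pair struct {
	key, value string
}

// collectPairs splits s on whitespace and returns the key/value items
// whose key equals wantKey, in the order they appear.
func collectPairs(s, wantKey string) []pair {
	var out []pair
	for _, item := range strings.Fields(s) {
		i := strings.IndexAny(item, "=:")
		if i < 0 {
			continue // no delimiter, so this is not a key/value item
		}
		if item[:i] == wantKey {
			out = append(out, pair{key: item[:i], value: item[i+1:]})
		}
	}
	return out
}

func main() {
	for _, p := range collectPairs(input, "foo") {
		fmt.Printf("%s -> %q\n", p.key, p.value)
	}
}
```

Comparing the whole key against wantKey is what excludes notfoo:baz here, without any need for whitespace anchors.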

huangapple
  • Posted on August 4, 2021 at 02:57:59
  • When reposting, please keep the link to this article: https://go.coder-hub.com/68641467.html