How to match a regex with backreference in Go?

Question

I need to match a regex that uses backreferences (e.g. \1) in my Go code.

That's not so easy, because Go's official regexp package uses the RE2 engine, which deliberately does not support backreferences (or some other lesser-known features) so that it can guarantee linear-time execution and thereby avoid regex denial-of-service attacks. Enabling backreference support is not an option with RE2.
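
To make the limitation concrete, here is a minimal sketch showing that the standard regexp package rejects a pattern containing \1 at compile time. The helper name supportsPattern is only illustrative and is not part of any library.

```go
package main

import (
	"fmt"
	"regexp"
)

// supportsPattern reports whether Go's regexp package accepts the pattern.
// (Hypothetical helper, introduced only for this illustration.)
func supportsPattern(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	fmt.Println(supportsPattern(`(a)b`))  // an ordinary capture group compiles
	fmt.Println(supportsPattern(`(a)\1`)) // a backreference is a parse error
}
```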

In my code, there is no risk of malicious exploitation by attackers, and I need backreferences.

What should I do?

Answer 1

Score: 13

Answering my own question here: I solved this using golang-pkg-pcre. It uses libpcre++, which provides Perl-style regexes that do support backreferences. Its API is not the same as that of the standard regexp package.

Answer 2

Score: 9

Regular expressions are great for working with regular grammars, but if your grammar isn't regular (i.e. it requires backreferences and the like), you should probably switch to a better tool. There are many good tools for parsing context-free grammars, including yacc, which shipped with the Go distribution by default at the time of writing. Alternatively, you can write your own parser; recursive descent parsers, for example, are easy to write by hand.

I think regular expressions are overused in scripting languages (like Perl, Python, Ruby, ...) because their C/ASM-powered implementations are usually better optimized than code written in those languages themselves, but Go isn't such a language. Regular expressions are usually quite slow and are often not suited to the problem at all.
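
As a rough illustration of the hand-written approach, the sketch below does by hand what the backreference pattern (['"])(.*?)\1 would do: it finds the first run of text enclosed in a matching pair of quote characters. The function name firstQuoted and the sample input are assumptions made for this example only.

```go
package main

import (
	"fmt"
	"strings"
)

// firstQuoted returns the text between the first quote character (' or ")
// in s and the next occurrence of that same character.
func firstQuoted(s string) (string, bool) {
	for i := 0; i < len(s); i++ {
		c := s[i]
		if c != '"' && c != '\'' {
			continue
		}
		// This lookup plays the role of \1: the closing quote must be
		// the same character as the opening one.
		if j := strings.IndexByte(s[i+1:], c); j >= 0 {
			return s[i+1 : i+1+j], true
		}
	}
	return "", false
}

func main() {
	fmt.Println(firstQuoted(`say "it's fine" now`))
}
```

For this input the call yields it's fine, because the apostrophe inside the double quotes does not close them.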

Answer 3

Score: 3

When I had the same problem, I solved it using a two-step regular expression match. The original code, which relied on the backreference \2 to match the repeated separator character, was:

if m := match(pkgname, `^(.*)\$\{DISTNAME:S(.)(\\^?)([^:]*)(\$?)\2([^:]*)\2(g?)\}(.*)$`); m != nil {
    before, _, left, from, right, to, mod, after := m[1], m[2], m[3], m[4], m[5], m[6], m[7], m[8]
    // ...
}

The code is supposed to parse a string of the form ${DISTNAME:S|from|to|g}, which itself is a little pattern language using the familiar substitution syntax S|replace|with|.

The two-stage code looks like this:

if m, before, sep, subst, after := match4(pkgname, `^(.*)\$\{DISTNAME:S(.)([^\\}:]+)\}(.*)$`); m {
	// Quote the captured separator so it can be spliced into the second pattern.
	qsep := regexp.QuoteMeta(sep)
	if m, left, from, right, to, mod := match5(subst, `^(\^?)([^:]*)(\$?)`+qsep+`([^:]*)`+qsep+`(g?)$`); m {
		// ...
	}
}

match, match4 and match5 are my own wrappers around the regexp package; they cache the compiled regular expressions so that at least the compilation time is not wasted.
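
The wrappers themselves are not shown, so the following is only a sketch of how the two-step idea with a compile cache might look using the standard library alone. The names compiled and parseSubst are hypothetical, and the patterns are simplified: the optional ^ and $ groups and the before/after captures are left out.

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
)

var (
	cacheMu sync.Mutex
	cache   = map[string]*regexp.Regexp{}
)

// compiled returns the compiled form of pattern, compiling it only on first use.
func compiled(pattern string) *regexp.Regexp {
	cacheMu.Lock()
	defer cacheMu.Unlock()
	re, ok := cache[pattern]
	if !ok {
		re = regexp.MustCompile(pattern)
		cache[pattern] = re
	}
	return re
}

// parseSubst extracts from, to and the modifier from text such as
// "${DISTNAME:S|from|to|g}" without using a backreference.
func parseSubst(s string) (from, to, mod string, ok bool) {
	// Step 1: capture the separator character and the raw substitution body.
	outer := compiled(`\$\{DISTNAME:S(.)([^\\}:]+)\}`).FindStringSubmatch(s)
	if outer == nil {
		return "", "", "", false
	}
	// Step 2: splice the quoted separator in where \2 would have been.
	qsep := regexp.QuoteMeta(outer[1])
	inner := compiled(`^([^:]*)` + qsep + `([^:]*)` + qsep + `(g?)$`).FindStringSubmatch(outer[2])
	if inner == nil {
		return "", "", "", false
	}
	return inner[1], inner[2], inner[3], true
}

func main() {
	fmt.Println(parseSubst("pkg-${DISTNAME:S|from|to|g}-1.0"))
}
```

Because the second pattern depends on the separator, the cache ends up with one entry per distinct separator character, which stays small in practice.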

Answer 4

Score: 1

The regexp package functions FindSubmatchIndex (or FindAllSubmatchIndex, as used below) and Expand let you refer back to captured content by group name or number in a template. It isn't very convenient, but it is still possible. Example:

package main

import (
	"fmt"
	"regexp"
)

func main() {
	content := []byte(`
	# comment line
	option1: value1
	option2: value2

	# another comment line
	option3: value3
`)

	pattern := regexp.MustCompile(`(?m)(?P<key>\w+):\s+(?P<value>\w+)$`)

	template := []byte("$key=$value\n")
	result := []byte{}
	for _, submatches := range pattern.FindAllSubmatchIndex(content, -1) {
		result = pattern.Expand(result, template, content, submatches)
	}
	fmt.Println(string(result))
}

Output:

option1=value1
option2=value2
option3=value3
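
When the goal is only to rewrite the matched text, the same template syntax also works directly in ReplaceAllString, which may be more convenient than the explicit loop. This is a small sketch; the function name toKeyValue is just for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// toKeyValue rewrites every "key: value" pair in s as "key=value".
func toKeyValue(s string) string {
	re := regexp.MustCompile(`(?P<key>\w+):\s+(?P<value>\w+)`)
	// ${key} and ${value} refer to the named groups, as in Expand.
	return re.ReplaceAllString(s, "${key}=${value}")
}

func main() {
	fmt.Println(toKeyValue("option1: value1"))
}
```

Note that these references only work in the replacement template; they still cannot be used inside the pattern itself.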

Answer 5

Score: 1

I know this is an old question, but I did not find a simple solution among the answers above.

In addition, "golang-pkg-pcre" does not work on macOS with M1.

Therefore, I would like to contribute my idea.

As an example, replace <u> or <i> with <b>, and </u> or </i> with </b>, searching case-insensitively.

Let me compare how to do it in Python and in Go.

In Python, it is as easy as this:

import re
content = "<u>test1</u> <i>test2</i>\n<U>test3</U> <I>test4</I>"
# \1 requires the closing tag to repeat the opening tag; \2 is the tag's content
content = re.sub(r"<(u|i)>([^<>]+?)</\1>", r"<b>\2</b>", content, flags=re.IGNORECASE)
print(content)

In Go, I do it this way:

package main

import (
    "fmt"
    "regexp"
    "strings"
)

// uiTagPattern matches <u>...</u> or <i>...</i> case-insensitively. The closing
// tag gets its own group because Go's regexp has no \1 backreference.
var uiTagPattern = regexp.MustCompile(`(?i)<(u|i)>([^<>]+?)</(u|i)>`)

func main() {
    content := "<u>test1</u> <i>test2</i>\n<U>test3</U> <I>test4</I>"
    content = changeUITagToBTag(content)
    fmt.Println(content)
}

// change <u> or <i> to <b> and </u> or </i> to </b>
// case-insensitive search
func changeUITagToBTag(content string) string {
    return uiTagPattern.ReplaceAllStringFunc(content, func(text string) string {
        m := uiTagPattern.FindStringSubmatch(text)
        // Emulate the backreference: the opening and closing tag names must
        // agree, ignoring case as the Python version does.
        if strings.EqualFold(m[1], m[3]) {
            return fmt.Sprintf(`<b>%s</b>`, m[2])
        }
        return text
    })
}

huangapple
  • Published on May 31, 2014, 18:33:43
  • When reposting, please keep the link to this article: https://go.coder-hub.com/23968992.html