Match any string also containing escaped characters and newlines with Go

Question

A tool written in Go needs to find every (format) string in a C or C++ source file, including strings that contain escaped characters or newlines. Examples:

..."foo"...
...`foo:"foo"`...
..."foo
foo"...
..."foo\r\nfoo"...
...`foo"foo-

lish`

Strings inside comments or deactivated code may be matched as well, so those parts do not need to be excluded.

While searching for a solution, I got the following pattern to work on https://regex101.com/r/FDhldb/1:

/(["'`])(?:(?=(\\?))\2.)*?\1/gms

Unfortunately this does not compile in Go:

const (
patFmtString = `(?Us)(["'])(?:(?=(\\?)).)*?`
)
var (
matchFmtString = regexp.MustCompile(patFmtString)
) 

Even the simplified pattern (?Us)(["'])(?:(\\?).)*?\1 fails with "error parsing regexp: invalid escape sequence: \1".

How do I implement this correctly in Go, ideally so that it also runs fast?

Answer 1

Score: 1

You can use a reasonably simple Scanner to accomplish this instead of using PCRE:

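import "bufio"

// stringLiterals is a bufio.SplitFunc that emits each quoted literal
// (delimited by ", ' or `) as one token, delimiters included.
var stringLiterals bufio.SplitFunc = func(data []byte, atEOF bool) (advance int, token []byte, err error) {
	scanning := false // true while inside a literal
	var delim byte    // delimiter that opened the current literal
	var start int     // index of the opening delimiter
	i := 0
	for i < len(data) {
		b := data[i]
		switch b {
		case '\\': // skip escape sequences
			i += 2
			continue
		case '"', '\'', '`':
			if scanning && delim == b {
				// Closing delimiter found: emit the whole literal.
				return i + 1, data[start : i+1], nil
			} else if !scanning {
				scanning = true
				start = i
				delim = b
			}
		}
		i++
	}
	if atEOF {
		// Nothing more will arrive: drop the rest, including an unterminated literal.
		return len(data), nil, nil
	}
	if !scanning {
		// No literal is open: discard the scanned bytes so the buffer does not
		// fill up with text that contains no quotes. Keep a trailing backslash
		// whose escaped byte has not arrived yet.
		start = len(data)
		if i > len(data) {
			start = len(data) - 1
		}
	}
	// Skip everything before the open literal (if any) and request more data.
	return start, nil, nil
}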

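and use it like this:

func main() {
    input := /* some reader */
    scanner := bufio.NewScanner(input)
    scanner.Split(stringLiterals)
    for scanner.Scan() {
        stringLit := scanner.Text()
        // do something with `stringLit`
    }
    if err := scanner.Err(); err != nil {
        // handle read errors, e.g. bufio.ErrTooLong for a literal
        // that does not fit into the scanner's buffer
    }
}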

For your examples, this returns exactly the matches that your regex does, though I'm not sure that this actually corresponds to how the C++ grammar defines string literals.

You can try it out on the playground.

huangapple
  • Posted on June 30, 2023, 16:34:56
  • Please keep this link when reposting: https://go.coder-hub.com/76587323.html