Match any string also containing escaped characters and newlines with Go
Question
A tool written in Go needs to find every (format) string in a file (C or C++ code), even if the string contains escaped characters or newlines. Examples:
..."foo"...
...`foo:"foo"`...
..."foo
foo"...
..."foo\r\nfoo"...
...`foo"foo-
lish`
The C/C++ parsing may also cover comments and deactivated code, so there is no need to exclude those parts.
While searching for a solution, I succeeded with
/(["'`])(?:(?=(\?))\2.)*?\1/gms
on https://regex101.com/r/FDhldb/1.
Unfortunately this does not compile in Go:
const (
	patFmtString = `(?Us)(["'])(?:(?=(\\?)).)*?`
)
var (
	matchFmtString = regexp.MustCompile(patFmtString)
)
Even the simplified pattern (?Us)(["'])(?:(\\?).)*?\1 fails with "error parsing regexp: invalid escape sequence: \1".
How do I implement this correctly in Go, ideally so that it also runs fast?
Answer 1
Score: 1
You can accomplish this with a reasonably simple bufio.Scanner instead of using PCRE:
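import "bufio"

// stringLiterals is a bufio.SplitFunc that yields every span delimited by
// ", ' or `, delimiters included. A backslash always skips the next byte.
var stringLiterals bufio.SplitFunc = func(data []byte, atEOF bool) (advance int, token []byte, err error) {
	scanning := false // true while inside a literal
	var delim byte    // delimiter that opened the current literal
	var start int     // index of the opening delimiter
	i := 0
	for i < len(data) {
		switch b := data[i]; b {
		case '\\': // skip escape sequences
			i += 2
			continue
		case '"', '\'', '`':
			if scanning && delim == b {
				// closing delimiter found: emit the literal
				return i + 1, data[start : i+1], nil
			} else if !scanning {
				scanning = true
				start = i
				delim = b
			}
		}
		i++
	}
	if atEOF {
		// nothing more to read: drop the rest, including an unterminated literal
		return len(data), nil, nil
	}
	if !scanning {
		// No literal is open, so discard what was scanned instead of letting
		// the buffer grow. Keep a trailing backslash so that the byte it
		// escapes is still skipped on the next call.
		if i > len(data) {
			return len(data) - 1, nil, nil
		}
		return len(data), nil, nil
	}
	// a literal is open: keep it and request more data
	return start, nil, nil
}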
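and use it like this:
// imports needed: "bufio", "fmt", "log", "os"
func main() {
	input := os.Stdin // some reader; any io.Reader works
	scanner := bufio.NewScanner(input)
	scanner.Split(stringLiterals)
	for scanner.Scan() {
		stringLit := scanner.Text()
		fmt.Println(stringLit) // do something with `stringLit`
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err) // e.g. a read error or a token that is too long
	}
}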
For your examples, this returns exactly the same matches as your regex, though I'm not sure that actually corresponds to the way the C++ grammar defines string literals.
You can try it out on the playground.