A regex for Go, that matches content including balanced parentheses
Question
My use case is as follows: I am parsing an SQL query, trying to grab a function name and the parameters passed to that function. This requires my regex to find the name, the opening parenthesis, the content, and the closing parenthesis. Unfortunately, testing showed that it is sometimes too greedy, grabbing an additional parenthesis, and at other times it misses the closing one.
Here's my test code on the playground:
package main

import (
	"fmt"
	"regexp"
)

func getRegex(name string) string {
	return fmt.Sprintf("\\$__%s\\b(?:\\((.*?\\)?)\\))?", name)
}

func main() {
	var rawSQL = "(select min(time) from table where $__timeFilter(time))"
	rgx, err := regexp.Compile(getRegex("timeFilter"))
	if err != nil {
		fmt.Println(err)
		return
	}
	var match = rgx.FindAllStringSubmatch(rawSQL, -1)
	fmt.Println(match)
}
Live example: https://go.dev/play/p/4FpZblia7Ks
The 4 cases I am testing are as follows:
(select min(time) from table where $__timeFilter(time) ) OK
(select min(time) from table where $__timeFilter(time)) NOK
select * from foo where $__timeFilter(cast(sth as timestamp)) OK
select * from foo where $__timeFilter(cast(sth as timestamp) ) NOK
Here's a live regexr version: https://regexr.com/700oh
I come from the JavaScript world and have never used recursive regexes; it looks like this might be a case for one?
Answer 1
Score: 2
It appears that your regex has two main problems, one of which is easier to deal with than the other:
- Regular expressions are inherently bad at handling recursive matching, such as grouping opening and closing parentheses, because they have no memory. In your case, I think you've tried to work around this issue by restricting yourself to a few particular cases, but the greedy nature of regular expressions is working against you here.
- You don't match for the case where there might be whitespace before a closing parenthesis.
These two issues are together causing your regex to fail on those two cases but also causing your first case to match.
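On the first point, it may help to confirm that recursion is not an option here at all. Go's regexp package follows RE2 syntax, which does not support PCRE-style recursion such as (?R). The sketch below (supportsRecursion is a hypothetical helper name, not part of the question's code) simply tries to compile a recursive pattern and reports whether the package accepts it:

```go
package main

import (
	"fmt"
	"regexp"
)

// supportsRecursion (hypothetical helper) reports whether Go's regexp
// package accepts a PCRE-style recursive pattern for balanced parentheses.
func supportsRecursion() bool {
	// (?R) means "recurse into the whole pattern" in PCRE.
	_, err := regexp.Compile(`\((?:[^()]|(?R))*\)`)
	return err == nil
}

func main() {
	fmt.Println("recursive pattern accepted:", supportsRecursion())
}
```

Since the pattern is rejected at compile time, balanced parentheses have to be handled either by restricting the input or by parsing manually.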
To fix this, you'll have to do some preprocessing on the string before sending it to the regex:
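// Strip one pair of outer parentheses, but only if both are present.
if strings.HasPrefix(rawSQL, "(") && strings.HasSuffix(rawSQL, ")") {
	rawSQL = rawSQL[1 : len(rawSQL)-1]
}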
This will strip off any outer parentheses, which a regex would not be able to ignore without memory or extra clauses.
Next, you'll want to modify your regex to handle the case where whitespace could exist between the inner function call and the closing parenthesis of the $__timeFilter call:
func getRegex(name string) string {
	return fmt.Sprintf("\\$__%s\\b(\\((.*?\\)?)\\s*\\))?", name)
}
After doing this, your regex should work. You can find a full example on this playground link.
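To see how the two steps fit together, here is a minimal sketch that runs them against the four test cases from the question. The findMacro wrapper is a hypothetical name, the check for a trailing ")" before stripping is an added assumption, and the macro name is assumed to be a plain identifier with no regex metacharacters:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// getRegex builds the pattern from this answer for the macro called name.
// Group 1 is the whole parenthesised part, group 2 the argument text.
func getRegex(name string) string {
	return fmt.Sprintf(`\$__%s\b(\((.*?\)?)\s*\))?`, name)
}

// findMacro (hypothetical wrapper) strips one pair of outer parentheses,
// then returns the first macro match and its argument text.
func findMacro(rawSQL, name string) (macro, args string) {
	if strings.HasPrefix(rawSQL, "(") && strings.HasSuffix(rawSQL, ")") {
		rawSQL = rawSQL[1 : len(rawSQL)-1]
	}
	m := regexp.MustCompile(getRegex(name)).FindStringSubmatch(rawSQL)
	if m == nil {
		return "", ""
	}
	return m[0], m[2]
}

func main() {
	cases := []string{
		"(select min(time) from table where $__timeFilter(time) )",
		"(select min(time) from table where $__timeFilter(time))",
		"select * from foo where $__timeFilter(cast(sth as timestamp))",
		"select * from foo where $__timeFilter(cast(sth as timestamp) )",
	}
	for _, c := range cases {
		macro, args := findMacro(c, "timeFilter")
		fmt.Printf("macro=%q args=%q\n", macro, args)
	}
}
```

This still only covers one level of nesting inside the macro and one pair of outer parentheses; deeper nesting needs a manual scan.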
Answer 2
Score: 0
I selected Woody's answer as the correct one, even though I finally had to go a different route. The attached test cases didn't cover some scenarios, and it turned out I also had to be able to extract the arguments inside the parentheses. So here's my final solution, where I manually parse the text, find the bounding parentheses, and extract whatever is between them:
// getMacroMatches extracts macro strings with their respective arguments from the sql input given
// It manually parses the string to find the closing parenthesis of the macro (because regex has no memory)
func getMacroMatches(input string, name string) ([][]string, error) {
	macroName := fmt.Sprintf("\\$__%s\\b", name)
	matchedMacros := [][]string{}
	rgx, err := regexp.Compile(macroName)
	if err != nil {
		return nil, err
	}
	// get all matching macro instances
	matched := rgx.FindAllStringIndex(input, -1)
	if matched == nil {
		return nil, nil
	}
	for matchedIndex := 0; matchedIndex < len(matched); matchedIndex++ {
		var macroEnd = 0
		var argStart = 0
		macroStart := matched[matchedIndex][0]
		inputCopy := input[macroStart:]
		cache := make([]rune, 0)
		// find the opening and closing arguments brackets
		for idx, r := range inputCopy {
			if len(cache) == 0 && macroEnd > 0 {
				break
			}
			switch r {
			case '(':
				cache = append(cache, r)
				if argStart == 0 {
					argStart = idx + 1
				}
			case ')':
				l := len(cache)
				if l == 0 {
					break
				}
				cache = cache[:l-1]
				macroEnd = idx + 1
			default:
				continue
			}
		}
		// macroEnd equals to 0 means there are no parentheses, so just set it
		// to the end of the regex match
		if macroEnd == 0 {
			macroEnd = matched[matchedIndex][1] - macroStart
		}
		macroString := inputCopy[0:macroEnd]
		macroMatch := []string{macroString}
		args := ""
		// if opening parenthesis was found, extract contents as arguments
		if argStart > 0 {
			args = inputCopy[argStart : macroEnd-1]
		}
		macroMatch = append(macroMatch, args)
		matchedMacros = append(matchedMacros, macroMatch)
	}
	return matchedMacros, nil
}
Go playground link: https://go.dev/play/p/-odWKMBLCBv
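For readers who only need the core idea, the bracket-matching step can be reduced to a depth counter. The following is a simplified sketch, not the solution above: macroArgs is a hypothetical helper that looks only at the first occurrence of the macro, requires "(" to follow it immediately, does no word-boundary check, and trims surrounding whitespace from the arguments.

```go
package main

import (
	"fmt"
	"strings"
)

// macroArgs (hypothetical helper) returns the text between the parentheses
// that directly follow the first occurrence of macro in input.
// ok is false when the macro is absent, is not immediately followed by "(",
// or its parentheses are unbalanced.
func macroArgs(input, macro string) (args string, ok bool) {
	start := strings.Index(input, macro)
	if start < 0 {
		return "", false
	}
	rest := input[start+len(macro):]
	if !strings.HasPrefix(rest, "(") {
		return "", false
	}
	depth := 0
	for i, r := range rest {
		switch r {
		case '(':
			depth++
		case ')':
			depth--
			if depth == 0 {
				// i is the index of the parenthesis closing the macro call.
				return strings.TrimSpace(rest[1:i]), true
			}
		}
	}
	return "", false
}

func main() {
	sql := "select * from foo where $__timeFilter(cast(sth as timestamp) )"
	args, ok := macroArgs(sql, "$__timeFilter")
	fmt.Printf("args=%q ok=%v\n", args, ok)
}
```

A counter is enough here because only one kind of bracket is tracked; a stack like the cache slice above would matter only if several bracket types had to be matched against each other.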