Regex repeated capturing group captures only the last iteration, but I need all of them

Question

Example code:

	var reStr = `"(?:\\"|[^"])*"`
	var reStrSum = regexp.MustCompile(`(?m)(` + reStr + `)\s*\+\s*(` + reStr + `)\s*\+\s*(` + reStr + `)`)
	var str = `"This\nis\ta\\string" + 
	"Another\"string" + 
	"Third string"
`

	for i, match := range reStrSum.FindAllStringSubmatch(str, -1) {
		fmt.Println(match, "found at index", i)
		for i, str := range match {
			fmt.Println(i, str)
		}
	}

Output:

["This\nis\ta\\string" + 
	"Another\"string" + 
	"Third string" "This\nis\ta\\string" "Another\"string" "Third string"] found at index 0
0 "This\nis\ta\\string" + 
	"Another\"string" + 
	"Third string"
1 "This\nis\ta\\string"
2 "Another\"string"
3 "Third string"

That is, it matches the "sum of strings" and captures all three strings correctly.

My problem is that I do not want to match a sum of exactly three strings. I want to match every "sum of strings", where a sum can consist of one or more string literals. I tried to express this with a {0,} repetition:

	var reStr = `"(?:\\"|[^"])*"`
	var reStrSum = regexp.MustCompile(`(?m)(` + reStr + `)` + `(?:\s*\+\s*(` + reStr + `)){0,}`)
	var str = `
test1("This\nis\ta\\string" + 
	"Another\"string" + 
	"Third string summed");

test2("Second string " + "sum");
`

	for i, match := range reStrSum.FindAllStringSubmatch(str, -1) {
		fmt.Println(match, "found at index", i)
		for i, str := range match {
			fmt.Println(i, str)
		}
	}

Then I get this result:

["This\nis\ta\\string" + 
	"Another\"string" + 
	"Third string summed" "This\nis\ta\\string" "Third string summed"] found at index 0
0 "This\nis\ta\\string" + 
	"Another\"string" + 
	"Third string summed"
1 "This\nis\ta\\string"
2 "Third string summed"
["Second string " + "sum" "Second string " "sum"] found at index 1
0 "Second string " + "sum"
1 "Second string "
2 "sum"

Group 0 of the first match contains all three strings (the regexp matches correctly), but there are only two capturing groups in the expression, and the second group contains only the last iteration of the repetition. For example, "Another\"string" is lost in the process and cannot be accessed.

Would it be possible to get all iterations (all repetitions) of group 2 somehow?

I would also accept any workaround that uses nested loops. But please be aware that I cannot simply replace the {0,} repetition with an outer FindAllStringSubmatch call, because the FindAllStringSubmatch call is already used for iterating over "sums of strings". In other words, I must find the first string sum and also the "Second string sum".

Answer 1

Score: 2

I just found a workaround that works: two passes. In the first pass, I match all string literals and replace them with unique placeholders in the original text. The transformed text then contains no strings, which makes further processing in a second pass much easier.

Something like this:

	// Needs the fmt, regexp and strings packages; strContent holds the source text.
	type javaString struct {
		value  string // the literal, including its surrounding quotes
		lineno int    // 1-based line on which the literal starts
	}

	// First we find all string literals.
	var placeholder = "JSTR"
	var reJavaStringLiteral = regexp.MustCompile(`(?m)("(?:\\"|[^"])*")`)
	javaStringLiterals := make([]javaString, 0)
	for _, loc := range reJavaStringLiteral.FindAllStringIndex(strContent, -1) {
		// loc holds the byte offsets of this match. Using them instead of
		// strings.Index keeps line numbers correct for repeated literals.
		lineno := strings.Count(strContent[:loc[0]], "\n") + 1
		javaStringLiterals = append(javaStringLiterals, javaString{value: strContent[loc[0]:loc[1]], lineno: lineno})
	}
	// Next, we replace all string literals with placeholders, in order of appearance.
	for i, jstr := range javaStringLiterals {
		strContent = strings.Replace(strContent, jstr.value, fmt.Sprintf("%v(%v)", placeholder, i), 1)
	}
	// Now the transformed text does not contain any string literals.

After the first pass, the original text becomes:

		test1(JSTR(0) +
			JSTR(1) +
			JSTR(2));

		test2(JSTR(3) + JSTR(4));

After this step, I can easily look for "JSTR(\d+) + JSTR(\d+) + JSTR(\d+)..." expressions. They are easy to find now, because the text no longer contains any strings (which could otherwise contain practically anything and interfere with the regular expressions). Each of these "sum of strings" matches can then be re-matched with another FindAllStringSubmatch call (in an inner loop), which gives me all the information I need.
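The second pass is not shown above, so here is a minimal sketch of what it could look like. The names rePlaceholderSum, rePlaceholder and sumsOfLiterals are hypothetical, and the sketch assumes the placeholders are numbered from 0 in the order the literals were collected. The outer loop finds each sum of placeholders; the inner loop walks every placeholder inside that sum, so no iteration is lost.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Hypothetical patterns for the second pass (not part of the original answer).
var (
	// One or more placeholders joined by "+", with optional whitespace.
	rePlaceholderSum = regexp.MustCompile(`JSTR\(\d+\)(?:\s*\+\s*JSTR\(\d+\))*`)
	// A single placeholder; group 1 is its index.
	rePlaceholder = regexp.MustCompile(`JSTR\((\d+)\)`)
)

// sumsOfLiterals returns, for each sum found in the transformed text,
// the original literals it consists of, in order.
func sumsOfLiterals(transformed string, literals []string) [][]string {
	var sums [][]string
	// Outer loop: one match per sum of placeholders.
	for _, sum := range rePlaceholderSum.FindAllString(transformed, -1) {
		var parts []string
		// Inner loop: every placeholder inside this sum.
		for _, m := range rePlaceholder.FindAllStringSubmatch(sum, -1) {
			idx, err := strconv.Atoi(m[1])
			if err != nil || idx >= len(literals) {
				continue // not a placeholder generated by the first pass
			}
			parts = append(parts, literals[idx])
		}
		sums = append(sums, parts)
	}
	return sums
}

func main() {
	literals := []string{`"This\nis\ta\\string"`, `"Another\"string"`, `"Third string summed"`, `"Second string "`, `"sum"`}
	transformed := "test1(JSTR(0) +\n\tJSTR(1) +\n\tJSTR(2));\n\ntest2(JSTR(3) + JSTR(4));\n"
	for i, parts := range sumsOfLiterals(transformed, literals) {
		fmt.Println("sum", i, "has", len(parts), "literals:", parts)
	}
}
```

Keeping the literals in a slice indexed by placeholder number means the inner loop only needs an integer lookup to recover the original text (and, in the struct version, the line number).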

This is not a real solution: it requires writing a lot of code, it is specific to my concrete use case, and it does not really answer the original question of how to access all iterations of a repeated capturing group.

But the general idea of the workaround might be beneficial for somebody who is facing a similar problem.
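For comparison, the nested-loop idea mentioned in the question can also be applied directly to the original text, without placeholders. This is only a sketch under assumptions: splitSums and reLiteral are hypothetical names, and it relies on the same string-literal pattern as the question, so it inherits that pattern's limits (for example, it does not skip quotes inside comments). The outer pattern matches a whole sum without any capture groups; the inner pattern is then run over each matched sum to pull out every literal.

```go
package main

import (
	"fmt"
	"regexp"
)

// The string-literal pattern from the question.
const reStr = `"(?:\\"|[^"])*"`

var (
	// Outer pattern: one literal followed by zero or more "+ literal" parts.
	reStrSum = regexp.MustCompile(reStr + `(?:\s*\+\s*` + reStr + `)*`)
	// Inner pattern: a single literal.
	reLiteral = regexp.MustCompile(reStr)
)

// splitSums returns one slice of literals per "sum of strings" found in src.
func splitSums(src string) [][]string {
	var sums [][]string
	for _, sum := range reStrSum.FindAllString(src, -1) {
		// Re-match inside the sum to recover every iteration.
		sums = append(sums, reLiteral.FindAllString(sum, -1))
	}
	return sums
}

func main() {
	src := `
test1("This\nis\ta\\string" + 
	"Another\"string" + 
	"Third string summed");

test2("Second string " + "sum");
`
	for i, parts := range splitSums(src) {
		fmt.Println("sum", i)
		for j, lit := range parts {
			fmt.Println(" ", j, lit)
		}
	}
}
```

This still uses a single outer FindAll call to iterate over the sums, so the constraint from the question is respected; only the per-literal extraction moves into the inner call.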

Posted by huangapple on July 28, 2022, 14:12:46
Source: https://go.coder-hub.com/73148024.html