unsupported Perl syntax: `(?<`

Question

I want to parse the output of the command 'gpg --list-keys' and display it in the browser.
The command output looks like this:


    pub   rsa3072 2021-08-03 [SC] [expires: 2023-08-03]
          07C47E284765D5593171C18F00B11D51A071CB55
    uid           [ultimate] user1 <user1@example.com>
    sub   rsa3072 2021-08-03 [E] [expires: 2023-08-03]
    
    pub   rsa3072 2021-08-04 [SC]
          37709ABD4D96324AB8CBFC3B441812AFBCE7A013
    uid           [ultimate] user2 <user2@example.com>
    sub   rsa3072 2021-08-04 [E]

I expect something like this:


    {
    	{uid : user1@example.com},
    	{uid : user2@example.com},
    }

Here is the code:

    type GPGList struct {
    	uid string
    }

    // find list keys
    func Findlistkeys() {
    	pathexec, _ := exec.LookPath("gpg")
    	cmd := exec.Command(pathexec, "--list-keys")
    	cmdOutput := &bytes.Buffer{}
    	cmd.Stdout = cmdOutput
    	printCommand(cmd)
    	err := cmd.Run()
    	printError(err)
    	output := cmdOutput.Bytes()
    	printOutput(output)
    	GPG := GPGList{}
    	parseOutput(output, &GPG)
    	fmt.Println(GPG)
    }

    func printCommand(cmd *exec.Cmd) {
    	fmt.Printf("==> Executing: %s\n", strings.Join(cmd.Args, " "))
    }

    func printError(err error) {
    	if err != nil {
    		os.Stderr.WriteString(fmt.Sprintf("==> Error: %s\n", err.Error()))
    	}
    }

    func printOutput(outs []byte) {
    	if len(outs) > 0 {
    		fmt.Printf("==> Output: %s\n", string(outs))
    	}
    }

    func parseOutput(outs []byte, GPG *GPGList) {
    	var uid = regexp.MustCompile(`(?<=\<)(.*?)(?=\>)`)
    	fmt.Println(uid)
    }

It ends with the following message:

    panic: regexp: Compile(`(?<=\<)(.*?)(?=\>)`): error parsing regexp: invalid or unsupported Perl syntax: `(?<`

So far I'm stuck on the regex.
I don't understand why it doesn't compile.
What is wrong with it?

I've tested the regex in an online simulator and it looks OK there, yet something is wrong with it here.
Any suggestions, please?

Answer 1

Score: 3

The regexp package uses the syntax accepted by RE2. From https://github.com/google/re2/wiki/Syntax

> (?<=re) after text matching re (NOT SUPPORTED)

Hence the error message:

> error parsing regexp: invalid or unsupported Perl syntax: (?<

The online simulator is likely testing a different regular expression syntax. You will need to find an alternative way to write the regular expression, or use a different regular expression package.

An alternative you can try is \<([^\>]*)\>. This is quite simple and may not match your original intent.

Answer 2

Score: 1

Here is another solution based on the machine-readable output of gpg --list-keys --with-colons.

It is still a slow solution, but it is easy to write, easy to update, and does not use regular expressions.

Someone clever could come up with an even faster solution without adding a wall of complexity: just loop over the string until you reach <, then capture everything up to >.

This is based on a simple CSV reader, so you can plug it onto the output stream of an exec.Cmd instance, or anything else.

The big advantage is that it does not need to buffer the whole data in memory; it can decode it as a stream.

    package main

    import (
    	"encoding/csv"
    	"fmt"
    	"io"
    	"regexp"
    	"strings"
    )

    func main() {
    	fmt.Printf("%#v\n", extractEmailsCSV(csvInput))
    }

    // uid matches anything enclosed in angle brackets.
    var uid = regexp.MustCompile(`\<(.*?)\>`)

    // extractEmailsRegexp returns the text between each < and > in input.
    func extractEmailsRegexp(input string) (out []string) {
    	submatchall := uid.FindAllString(input, -1)
    	for _, element := range submatchall {
    		element = strings.Trim(element, "<")
    		element = strings.Trim(element, ">")
    		out = append(out, element)
    	}
    	return
    }

    // extractEmailsCSV reads colon-separated records and returns the address
    // found in the user ID field (index 9) of each record.
    func extractEmailsCSV(input string) (out []string) {
    	reader := csv.NewReader(strings.NewReader(input))
    	reader.Comma = ':'
    	reader.ReuseRecord = true
    	reader.FieldsPerRecord = -1

    	for {
    		records, err := reader.Read()
    		if err == io.EOF {
    			break
    		} else if err != nil {
    			panic(err)
    		}

    		if len(records) < 10 {
    			continue
    		}

    		userID := records[9]
    		if strings.Contains(userID, "@") {
    			begin := strings.Index(userID, "<")
    			end := strings.Index(userID, ">")
    			// Require both brackets, in order, before slicing.
    			if begin >= 0 && end > begin {
    				out = append(out, userID[begin+1:end])
    			}
    		}
    	}
    	return
    }

    var regexpInput = `
        pub   rsa3072 2021-08-03 [SC] [expires: 2023-08-03]
              07C47E284765D5593171C18F00B11D51A071CB55
        uid           [ultimate] user1 <user1@example.com>
        sub   rsa3072 2021-08-03 [E] [expires: 2023-08-03]

        pub   rsa3072 2021-08-04 [SC]
              37709ABD4D96324AB8CBFC3B441812AFBCE7A013
        uid           [ultimate] user2 <user2@example.com>
        sub   rsa3072 2021-08-04 [E]
    `

    var csvInput = `pub:u:1024:17:51FF9A17136C5B87:1999-04-24::59:-:Tony Nelson <tnelson@techie.com>:
    uid:u::::::::Tony Nelson <tnelson@conceptech.com>:
    `

The two functions are not benchmarked on exactly the same input, but it gives an idea anyway. If you think that skews the comparison, feel free to provide a better benchmark setup.

Here is the benchmark setup:

    package main

    import (
    	"strings"
    	"testing"
    )

    func BenchmarkCSV_1(b *testing.B) {
    	input := strings.Repeat(csvInput, 1)
    	b.ResetTimer()
    	for i := 0; i < b.N; i++ {
    		_ = extractEmailsCSV(input)
    	}
    }
    func BenchmarkRegExp_1(b *testing.B) {
    	input := strings.Repeat(regexpInput, 1)
    	b.ResetTimer()
    	for i := 0; i < b.N; i++ {
    		_ = extractEmailsRegexp(input)
    	}
    }

    func BenchmarkCSV_10(b *testing.B) {
    	input := strings.Repeat(csvInput, 10)
    	b.ResetTimer()
    	for i := 0; i < b.N; i++ {
    		_ = extractEmailsCSV(input)
    	}
    }
    func BenchmarkRegExp_10(b *testing.B) {
    	input := strings.Repeat(regexpInput, 10)
    	b.ResetTimer()
    	for i := 0; i < b.N; i++ {
    		_ = extractEmailsRegexp(input)
    	}
    }

    func BenchmarkCSV_100(b *testing.B) {
    	input := strings.Repeat(csvInput, 100)
    	b.ResetTimer()
    	for i := 0; i < b.N; i++ {
    		_ = extractEmailsCSV(input)
    	}
    }
    func BenchmarkRegExp_100(b *testing.B) {
    	input := strings.Repeat(regexpInput, 100)
    	b.ResetTimer()
    	for i := 0; i < b.N; i++ {
    		_ = extractEmailsRegexp(input)
    	}
    }

And here is the result:

    BenchmarkCSV_1-4        	  242736	      4200 ns/op	    5072 B/op	      18 allocs/op
    BenchmarkRegExp_1-4     	  252232	      4466 ns/op	     400 B/op	       9 allocs/op
    BenchmarkCSV_10-4       	   68257	     17335 ns/op	    7184 B/op	      40 allocs/op
    BenchmarkRegExp_10-4    	   29871	     39947 ns/op	    3414 B/op	      68 allocs/op
    BenchmarkCSV_100-4      	    7538	    141609 ns/op	   25872 B/op	     223 allocs/op
    BenchmarkRegExp_100-4   	    1726	    674718 ns/op	   37858 B/op	     615 allocs/op

On the smallest input the two approaches run at about the same speed, and the regular expression allocates far less memory. As soon as there is a bit more data, though, the regular expression is slower by a significant factor (about 4.8x at 100 repetitions) and performs more allocations.

See also https://pkg.go.dev/testing

My conclusion is: don't use regular expressions. Also, optimizing a regular expression is hard if not impossible, whereas optimizing an algorithm that parses some text input is doable, if not easy.

To summarize, even the fastest and best runtime is nothing without a thoughtful programmer to drive it.

Answer 3

Score: 0

So I updated the regex... but since (?<=\<)(.*?)(?=\>) was working in the online simulator, I was really surprised.
Why can't regex work the same in all languages...

    func parseOutput(outs []byte, GPG *GPGList) {
    	var uid = regexp.MustCompile(`\<(.*?)\>`)
    	submatchall := uid.FindAllString(string(outs), -1)
    	for _, element := range submatchall {
    		element = strings.Trim(element, "<")
    		element = strings.Trim(element, ">")
    		fmt.Println(element)
    	}
    }
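The version above prints the addresses, whereas the question asked for a list of uid entries. A possible way to collect them is sketched below; the GPGKey type and the changed parseOutput signature are assumptions made for illustration, not part of the original code.

```go
package main

import (
	"fmt"
	"regexp"
)

// GPGKey is a hypothetical stand-in for the question's GPGList type,
// with an exported field so it can be passed to other packages.
type GPGKey struct {
	UID string
}

// uidRe captures the text between '<' and '>' in group 1.
var uidRe = regexp.MustCompile(`<([^>]*)>`)

// parseOutput returns one GPGKey per <...> pair found in the raw output.
func parseOutput(outs []byte) []GPGKey {
	var keys []GPGKey
	for _, m := range uidRe.FindAllSubmatch(outs, -1) {
		keys = append(keys, GPGKey{UID: string(m[1])})
	}
	return keys
}

func main() {
	out := []byte("uid [ultimate] user1 <user1@example.com>\n" +
		"uid [ultimate] user2 <user2@example.com>\n")
	fmt.Printf("%+v\n", parseOutput(out))
}
```

Returning a slice rather than filling a single struct through a pointer matches the expected result in the question, which contains one entry per key.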

huangapple
  • Published on 2021-08-05 09:51:36
  • Link to this post: https://go.coder-hub.com/68659706.html