How to get capturing group functionality in Go regular expressions

Question

I'm porting a library from Ruby to Go, and have just discovered that regular expressions in Ruby are not compatible with Go, which uses Google's RE2 syntax. It has come to my attention that Ruby and Java (plus other languages) use PCRE-style (Perl-compatible) regular expressions, which support named capturing groups, so I need to rewrite my expressions so that they compile in Go.

For example, I have the following regex:

    (?<Year>\d{4})-(?<Month>\d{2})-(?<Day>\d{2})

This should accept input such as:

    2001-01-20

The capturing groups allow the year, month and day to be captured into variables. Getting the value of each group is easy: you index into the returned match data with the group name and get the value back. For example, to get the year, something like this pseudocode:

    m = expression.Match("2001-01-20")
    year = m["Year"]

This is a pattern I use a lot in my expressions, so I have a lot of rewriting to do.

So, is there a way to get this kind of functionality in Go's regexp package? How should I rewrite these expressions?

Answer 1

Score: 135

> how should I re-write these expressions?

Add some Ps. Go's regexp syntax writes a named group as (?P<name>re):

    (?P<Year>\d{4})-(?P<Month>\d{2})-(?P<Day>\d{2})

Cross-reference the capture group names with re.SubexpNames(): the name at index i belongs to the submatch at index i, and index 0 (the whole match) has an empty name.

And use as follows:

    package main

    import (
        "fmt"
        "regexp"
    )

    func main() {
        r := regexp.MustCompile(`(?P<Year>\d{4})-(?P<Month>\d{2})-(?P<Day>\d{2})`)
        // Submatches: index 0 is the whole match, then one entry per group.
        fmt.Printf("%#v\n", r.FindStringSubmatch(`2015-05-27`))
        // Group names, at the same indexes as the submatches.
        fmt.Printf("%#v\n", r.SubexpNames())
    }
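Named groups can also be referenced directly in a replacement template, which sometimes removes the need to look values up by hand. The following is a minimal sketch of that idea using the standard ReplaceAllString method; the helper name reorderDate and the package-level variable dateRe are made up for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// dateRe uses Go's (?P<name>...) syntax for named groups.
// The variable name is an assumption for this sketch.
var dateRe = regexp.MustCompile(`(?P<Year>\d{4})-(?P<Month>\d{2})-(?P<Day>\d{2})`)

// reorderDate (hypothetical helper) rewrites every YYYY-MM-DD date in s
// as DD/MM/YYYY and leaves the rest of the string untouched.
func reorderDate(s string) string {
	// ${Name} in the template expands to the text captured by that named group.
	return dateRe.ReplaceAllString(s, "${Day}/${Month}/${Year}")
}

func main() {
	fmt.Println(reorderDate("released on 2015-05-27"))
}
```

The braces in ${Day} matter: without them, the template parser takes the longest run of letters, digits and underscores as the group name.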

Answer 2

Score: 42

I created a function for handling URL expressions, but it suits your needs too. It works like this:

    // getParams matches url against the regular expression regEx and
    // returns the values of the capture groups, keyed by group name.
    func getParams(regEx, url string) (paramsMap map[string]string) {
        compRegEx := regexp.MustCompile(regEx)
        match := compRegEx.FindStringSubmatch(url)

        paramsMap = make(map[string]string)
        for i, name := range compRegEx.SubexpNames() {
            // Skip index 0 (the whole match); match is nil when nothing matched.
            if i > 0 && i < len(match) {
                paramsMap[name] = match[i]
            }
        }
        return paramsMap
    }

You can use this function like this:

    params := getParams(`(?P<Year>\d{4})-(?P<Month>\d{2})-(?P<Day>\d{2})`, `2015-05-27`)
    fmt.Println(params)

and the output will be (since Go 1.12, fmt prints maps sorted by key):

    map[Day:27 Month:05 Year:2015]
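One limitation of a map-only return value is that a failed match and a pattern without named groups both yield an empty map. A possible variant, sketched here under the assumption that the caller already holds a compiled *regexp.Regexp, reports the match status separately and skips unnamed groups. The names namedGroups and dateRe are invented for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// dateRe is compiled once and reused; the name is an assumption for this sketch.
var dateRe = regexp.MustCompile(`(?P<Year>\d{4})-(?P<Month>\d{2})-(?P<Day>\d{2})`)

// namedGroups (hypothetical helper) returns the named capture groups of the
// first match of re in s. The boolean is false when s does not match at all.
func namedGroups(re *regexp.Regexp, s string) (map[string]string, bool) {
	match := re.FindStringSubmatch(s)
	if match == nil {
		return nil, false
	}
	groups := make(map[string]string)
	for i, name := range re.SubexpNames() {
		if name == "" {
			continue // index 0 (whole match) and unnamed groups have no name
		}
		groups[name] = match[i]
	}
	return groups, true
}

func main() {
	if m, ok := namedGroups(dateRe, "2015-05-27"); ok {
		fmt.Println(m["Year"], m["Month"], m["Day"])
	}
	_, ok := namedGroups(dateRe, "not a date")
	fmt.Println(ok)
}
```

Passing a compiled expression instead of a pattern string also avoids recompiling the pattern on every call.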

Answer 3

Score: 31

As of Go 1.15, you can simplify the process by using Regexp.SubexpIndex. You can check the release notes at https://golang.org/doc/go1.15#regexp.

Based on your example, you'd have something like the following:

    re := regexp.MustCompile(`(?P<Year>\d{4})-(?P<Month>\d{2})-(?P<Day>\d{2})`)
    matches := re.FindStringSubmatch("Some random date: 2001-01-20")
    // SubexpIndex returns the index of the group named "Year", or -1 if there is none.
    yearIndex := re.SubexpIndex("Year")
    fmt.Println(matches[yearIndex])

You can check and execute this example at https://play.golang.org/p/ImJ7i_ZQ3Hu.
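Indexing the result directly panics if the input does not match (FindStringSubmatch returns nil) or if the group name is misspelled (SubexpIndex returns -1). A small wrapper can guard both cases. This is only a sketch and requires Go 1.15 or later; groupValue and dateRe are names chosen for this example, not part of the regexp package.

```go
package main

import (
	"fmt"
	"regexp"
)

// dateRe is the pattern from the example above; the variable name is an assumption.
var dateRe = regexp.MustCompile(`(?P<Year>\d{4})-(?P<Month>\d{2})-(?P<Day>\d{2})`)

// groupValue (hypothetical helper) returns the text captured by the named
// group in the first match of re in s. ok is false if the pattern has no
// group with that name or if s does not match.
func groupValue(re *regexp.Regexp, s, name string) (value string, ok bool) {
	idx := re.SubexpIndex(name) // -1 when no group has this name
	if idx < 0 {
		return "", false
	}
	m := re.FindStringSubmatch(s) // nil when there is no match
	if m == nil {
		return "", false
	}
	return m[idx], true
}

func main() {
	if year, ok := groupValue(dateRe, "Some random date: 2001-01-20", "Year"); ok {
		fmt.Println(year)
	}
	_, ok := groupValue(dateRe, "Some random date: 2001-01-20", "Hour")
	fmt.Println(ok)
}
```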

Answer 4

Score: 27

To keep RAM and CPU usage down, avoid calling anonymous functions inside the loop and avoid growing slices with "append" inside the loop. See the next example.

You can capture several groups per line from multiline text, without concatenating strings with '+' and without nesting a for loop inside another for loop (unlike other examples posted here).

    txt := `2001-01-20
    2009-03-22
    2018-02-25
    2018-06-07`
    // No named groups: the submatches are read by position.
    regex := regexp.MustCompile(`(\d{4})-(\d{2})-(\d{2})`)
    res := regex.FindAllStringSubmatch(txt, -1)
    for i := range res {
        // like Java: match.group(1), match.group(2), etc.
        fmt.Printf("year: %s, month: %s, day: %s\n", res[i][1], res[i][2], res[i][3])
    }

Output:

    year: 2001, month: 01, day: 20
    year: 2009, month: 03, day: 22
    year: 2018, month: 02, day: 25
    year: 2018, month: 06, day: 07

Note: res[i][0] holds the whole match, like match.group(0) in Java.

If you want to store this information, use a struct type:

    type date struct {
        y, m, d int
    }
    ...
    func main() {
        ...
        // Allocate one element per match up front, then fill by index.
        dates := make([]date, len(res))
        for ... {
            // Submatches are strings, so convert them (needs "strconv").
            y, _ := strconv.Atoi(res[index][1])
            m, _ := strconv.Atoi(res[index][2])
            d, _ := strconv.Atoi(res[index][3])
            dates[index] = date{y: y, m: m, d: d}
        }
    }

It's better to use unnamed groups here (a small performance improvement, since no name lookup is needed).

Using the "ReplaceAllGroupFunc" posted on GitHub is a bad idea because it:

  1. uses a loop inside a loop
  2. calls an anonymous function inside the loop
  3. has a lot of code
  4. uses the "append" function inside the loop, which is costly here.
    The slice starts empty, so "append" has to copy the array to a new memory location whenever its capacity is exceeded.
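The storage step can be written out as a complete program. This is a minimal sketch under a few assumptions: the function name parseDates and the variable dateRe are invented here, and conversion errors are ignored because \d in Go matches only ASCII digits, so strconv.Atoi cannot fail on these groups.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// date holds one parsed YYYY-MM-DD value.
type date struct {
	y, m, d int
}

// dateRe uses unnamed groups, read by position; the name is an assumption.
var dateRe = regexp.MustCompile(`(\d{4})-(\d{2})-(\d{2})`)

// parseDates (hypothetical helper) extracts every YYYY-MM-DD date found in txt.
func parseDates(txt string) []date {
	res := dateRe.FindAllStringSubmatch(txt, -1)
	dates := make([]date, len(res)) // one zeroed element per match, allocated once
	for i, g := range res {
		// g[0] is the whole match; g[1..3] are the three groups.
		dates[i].y, _ = strconv.Atoi(g[1])
		dates[i].m, _ = strconv.Atoi(g[2])
		dates[i].d, _ = strconv.Atoi(g[3])
	}
	return dates
}

func main() {
	dates := parseDates("2001-01-20\n2009-03-22\n2018-02-25\n2018-06-07")
	fmt.Printf("%+v\n", dates)
}
```

Sizing the slice from len(res) means the backing array is allocated exactly once, which is the point the answer makes about avoiding repeated growth.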

Answer 5

Score: 8

A simple way to determine group names, based on @VasileM's answer.

Disclaimer: this is not about memory/CPU/time optimization.

    package main

    import (
        "fmt"
        "regexp"
    )

    func main() {
        r := regexp.MustCompile(`^(?P<Year>\d{4})-(?P<Month>\d{2})-(?P<Day>\d{2})$`)
        res := r.FindStringSubmatch(`2015-05-27`)
        names := r.SubexpNames()
        for i := range res {
            // Index 0 is the whole match and has no name, so skip it.
            if i != 0 {
                fmt.Println(names[i], res[i])
            }
        }
    }

https://play.golang.org/p/Y9cIVhMa2pU

Answer 6

Score: 2

If you need to replace based on a function while capturing groups, you can use this:

    import "regexp"

    // ReplaceAllGroupFunc replaces each match of re in str with the result of
    // repl, which receives the whole match at index 0 followed by the groups.
    func ReplaceAllGroupFunc(re *regexp.Regexp, str string, repl func([]string) string) string {
        result := ""
        lastIndex := 0
        for _, v := range re.FindAllStringSubmatchIndex(str, -1) {
            // v holds start/end index pairs: the whole match, then each group.
            groups := []string{}
            for i := 0; i < len(v); i += 2 {
                groups = append(groups, str[v[i]:v[i+1]])
            }
            result += str[lastIndex:v[0]] + repl(groups)
            lastIndex = v[1]
        }
        return result + str[lastIndex:]
    }

Example:

    str := "abc foo:bar def baz:qux ghi"
    re := regexp.MustCompile("([a-z]+):([a-z]+)")
    result := ReplaceAllGroupFunc(re, str, func(groups []string) string {
        return groups[1] + "." + groups[2]
    })
    fmt.Printf("'%s'\n", result)

https://gist.github.com/elliotchance/d419395aa776d632d897
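One case this function does not cover is an optional group that takes no part in a match: the index pair for that group is -1, -1, and slicing the string with it panics. The following sketch shows one way to handle that, and it builds the result with strings.Builder instead of repeated string concatenation. The names replaceAllGroups and pairRe are hypothetical and do not come from the gist.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// pairRe has an optional second group; the name is an assumption for this sketch.
var pairRe = regexp.MustCompile(`([a-z]+):([a-z]+)?`)

// replaceAllGroups (hypothetical variant of ReplaceAllGroupFunc) replaces each
// match of re in s with repl(groups). groups[0] is the whole match; groups
// that did not participate in the match are passed as "".
func replaceAllGroups(re *regexp.Regexp, s string, repl func(groups []string) string) string {
	var b strings.Builder
	last := 0
	for _, loc := range re.FindAllStringSubmatchIndex(s, -1) {
		groups := make([]string, 0, len(loc)/2)
		for i := 0; i < len(loc); i += 2 {
			if loc[i] < 0 { // -1 marks a group that did not match
				groups = append(groups, "")
				continue
			}
			groups = append(groups, s[loc[i]:loc[i+1]])
		}
		b.WriteString(s[last:loc[0]]) // text between the previous match and this one
		b.WriteString(repl(groups))
		last = loc[1]
	}
	b.WriteString(s[last:]) // trailing text after the final match
	return b.String()
}

func main() {
	out := replaceAllGroups(pairRe, "abc foo:bar def baz: ghi", func(g []string) string {
		return g[1] + "." + g[2]
	})
	fmt.Printf("'%s'\n", out)
}
```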

Answer 7

Score: 1

You can use the regroup library for that:
https://github.com/oriser/regroup

Example:

    package main

    import (
        "fmt"

        "github.com/oriser/regroup"
    )

    func main() {
        r := regroup.MustCompile(`(?P<Year>\d{4})-(?P<Month>\d{2})-(?P<Day>\d{2})`)
        matches, err := r.Groups("2015-05-27")
        if err != nil {
            panic(err)
        }
        fmt.Printf("%+v\n", matches)
    }

Will print: map[Year:2015 Month:05 Day:27]

Alternatively, you can use it like this:

    package main

    import (
        "fmt"

        "github.com/oriser/regroup"
    )

    // Date receives the groups whose names match the regroup struct tags.
    type Date struct {
        Year  int `regroup:"Year"`
        Month int `regroup:"Month"`
        Day   int `regroup:"Day"`
    }

    func main() {
        date := &Date{}
        r := regroup.MustCompile(`(?P<Year>\d{4})-(?P<Month>\d{2})-(?P<Day>\d{2})`)
        if err := r.MatchToTarget("2015-05-27", date); err != nil {
            panic(err)
        }
        fmt.Printf("%+v\n", date)
    }

Will print: &{Year:2015 Month:5 Day:27}

Answer 8

Score: -2

A function for getting regexp parameters, with a check for the no-match case. It returns a nil map if the string does not match.

    // GetRxParams returns the non-empty named groups of the first match of
    // rx in str, or nil if str does not match.
    func GetRxParams(rx *regexp.Regexp, str string) (pm map[string]string) {
        p := rx.FindStringSubmatch(str)
        if p == nil {
            return nil
        }
        n := rx.SubexpNames()
        pm = map[string]string{}
        for i := range n {
            // Index 0 is the whole match, not a group.
            if i == 0 {
                continue
            }
            // Keep only named groups that captured some text.
            if n[i] != "" && p[i] != "" {
                pm[n[i]] = p[i]
            }
        }
        return
    }

huangapple
  • Published on May 27, 2015, 21:17:06
  • When reposting, please keep the link to this article: https://go.coder-hub.com/30483652.html