Get a named list of subgroups in a Golang regex

Question

I'm looking for a function that returns a map[string]interface{}, where each interface{} value can be a slice, a nested map[string]interface{}, or a plain value.

My use case is to parse WKT geometry like the following and retrieve the point values. Example for a donut polygon:

POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3))

The regex (I deliberately used \d, which matches only digits, for readability):

(POLYGON \(
    (?P<polygons>\(
        (?P<points>(?P<point>(\d \d), ){3,})
        (?P<last_point>\d \d )\),)*
    (?P<last_polygon>\(
        (?P<points>(?P<point>(\d \d), ){3,})
        (?P<last_point>\d \d)\))\)
)

I have a function (copied from SO) that retrieves some of the information, but it doesn't handle nested groups or lists of groups well:

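// getRegexMatchParams maps each capture group name of reg to the text that
// group captured in url. Requires the standard "regexp" package.
func getRegexMatchParams(reg *regexp.Regexp, url string) (paramsMap map[string]string) {
    match := reg.FindStringSubmatch(url)
    paramsMap = make(map[string]string)
    for i, name := range reg.SubexpNames() {
        // Index 0 is the whole match, so start at 1 and stay within bounds.
        if i > 0 && i < len(match) {
            paramsMap[name] = match[i]
        }
    }
    return paramsMap
}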

It seems that the point group captures only one point.
Example on playground

[EDIT] The result I want is something like this:

map[string]interface{}{
    "polygons": map[string]interface{}{
        "points": []interface{}{
            map[string]string{"point": "0 0"},
            map[string]string{"point": "0 10"},
            map[string]string{"point": "10 10"},
            map[string]string{"point": "10 0"},
        },
        "last_point": "0 0",
    },
    "last_polygon": map[string]interface{}{
        "points": []interface{}{
            map[string]string{"point": "3 3"},
            map[string]string{"point": "3 7"},
            map[string]string{"point": "7 7"},
            map[string]string{"point": "7 3"},
        },
        "last_point": "3 3",
    },
}

That way I can use it for other purposes, such as querying databases and validating that last_point = points[0] for each polygon.

Answer 1

Score: 2

Try allowing for whitespace in the regex.

Also note that this engine won't retain every value captured by a group that sits
inside a quantified outer group, such as (a|b|c)+, where the group will contain only the last a, b, or c it matched.
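
A minimal Go sketch of that behaviour, assuming the standard regexp package. The helper name lastCapture is made up for illustration and is not part of the question's code:

```go
package main

import (
	"fmt"
	"regexp"
)

// lastCapture (hypothetical helper) matches s against ^(a|b|c)+$ and returns
// whatever group 1 holds once the match is complete.
func lastCapture(s string) string {
	re := regexp.MustCompile(`^(a|b|c)+$`)
	m := re.FindStringSubmatch(s)
	if m == nil {
		return "" // no match at all
	}
	// Group 1 was overwritten on every iteration of the + quantifier,
	// so only the final iteration's text is left.
	return m[1]
}

func main() {
	fmt.Println(lastCapture("abc")) // prints "c"
}
```

This is why a single pass with a repeated point group cannot return every point: the earlier iterations are simply not kept.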

And your regex can be reduced to this:

(POLYGON\s*\((?P<polygons>\(\s*(?P<points>(?P<point>\s*(\d+\s+\d+)\s*,){3,})\s*(?P<last_point>\d+\s+\d+)\s*\)(?:\s*,\s*|\s*\)))+)

https://play.golang.org/p/rLaaEa_7GX


The original:

(POLYGON\s*\((?P<polygons>\(\s*(?P<points>(?P<point>\s*(\d+\s+\d+)\s*,){3,})\s*(?P<last_point>\d+\s+\d+)\s*\),)*(?P<last_polygon>\(\s*(?P<points>(?P<point>\s*(\d+\s+\d+)\s*,){3,})\s*(?P<last_point>\d+\s+\d+)\s*\))\s*\))

https://play.golang.org/p/rZgJYPDMzl

See below for what the groups contain.

 (                             # (1 start)
      POLYGON \s* \(
      (?P<polygons>                 # (2 start)
           \( \s* 
           (?P<points>                   # (3 start)
                (?P<point>                    # (4 start)
                     \s* 
                     ( \d+ \s+ \d+ )               # (5)
                     \s* 
                     , 
                ){3,}                         # (4 end)
           )                             # (3 end)
           \s*            
           (?P<last_point> \d+ \s+ \d+ )  # (6)
           \s* \),
      )*                            # (2 end)
      (?P<last_polygon>             # (7 start)
           \( \s* 
           (?P<points>                   # (8 start)
                (?P<point>                    # (9 start)
                     \s* 
                     ( \d+ \s+ \d+ )               # (10)
                     \s* 
                     , 
                ){3,}                         # (9 end)
           )                             # (8 end)
           \s* 
           (?P<last_point> \d+ \s+ \d+ )  # (11)
           \s* \)
      )                             # (7 end)
      \s* \)
 )                             # (1 end)

Input

POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3))

Output

 **  Grp 0                -  ( pos 0 , len 65 ) 
POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3))  
 **  Grp 1                -  ( pos 0 , len 65 ) 
POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3))  
 **  Grp 2 [polygons]     -  ( pos 9 , len 30 ) 
(0 0, 0 10, 10 10, 10 0, 0 0),  
 **  Grp 3 [points]       -  ( pos 10 , len 23 ) 
0 0, 0 10, 10 10, 10 0,  
 **  Grp 4 [point]        -  ( pos 27 , len 6 ) 
 10 0,  
 **  Grp 5                -  ( pos 28 , len 4 ) 
10 0  
 **  Grp 6 [last_point]   -  ( pos 34 , len 3 ) 
0 0  
 **  Grp 7 [last_polygon] -  ( pos 39 , len 25 ) 
(3 3, 3 7, 7 7, 7 3, 3 3)  
 **  Grp 8 [points]       -  ( pos 40 , len 19 ) 
3 3, 3 7, 7 7, 7 3,  
 **  Grp 9 [point]        -  ( pos 54 , len 5 ) 
 7 3,  
 **  Grp 10                -  ( pos 55 , len 3 ) 
7 3  
 **  Grp 11 [last_point]   -  ( pos 60 , len 3 ) 
3 3  

Possible Solution

It's not impossible. It just takes a few extra steps.
(As an aside, isn't there a WKT library that can parse this for you?)

Now, I don't know what your language is capable of, so this is just a general approach.

1. Validate the form you're parsing.
This validates the input and returns all polygon sets as a single string in the All_Polygons group.

Target POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3))

POLYGON\s*\((?P<All_Polygons>(?:\(\s*\d+\s+\d+(?:\s*,\s*\d+\s+\d+){2,}\s*\))(?:\s*,\(\s*\d+\s+\d+(?:\s*,\s*\d+\s+\d+){2,}\s*\))*)\s*\)

 **  Grp 1 [All_Polygons] -  ( pos 9 , len 55 ) 
(0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3)
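
In Go, step 1 might look like the sketch below. It reuses the validation regex unchanged except for the ^ and $ anchors, which are an addition so that trailing garbage is rejected. The function name allPolygons is hypothetical, and Regexp.SubexpIndex needs Go 1.15 or later:

```go
package main

import (
	"fmt"
	"regexp"
)

// formRe is the step 1 regex, anchored with ^ and $ (the anchors are an
// assumption added here, not part of the regex given in the answer).
var formRe = regexp.MustCompile(`^POLYGON\s*\((?P<All_Polygons>(?:\(\s*\d+\s+\d+(?:\s*,\s*\d+\s+\d+){2,}\s*\))(?:\s*,\(\s*\d+\s+\d+(?:\s*,\s*\d+\s+\d+){2,}\s*\))*)\s*\)$`)

// allPolygons (hypothetical helper) validates wkt and, on success, returns
// the text captured by the All_Polygons group.
func allPolygons(wkt string) (string, bool) {
	m := formRe.FindStringSubmatch(wkt)
	if m == nil {
		return "", false // input does not have the expected form
	}
	// Look the group up by name rather than by position.
	return m[formRe.SubexpIndex("All_Polygons")], true
}

func main() {
	all, ok := allPolygons("POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3))")
	fmt.Printf("%v %q\n", ok, all)
}
```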

2. If step 1 was successful, set up a loop match over the All_Polygons string.

Target (0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3)

(?:\(\s*(?P<Single_Poly_All_Pts>\d+\s+\d+(?:\s*,\s*\d+\s+\d+){2,})\s*\))

This step is the equivalent of a find-all match. Each match holds all the points of a single polygon, returned as a string in the Single_Poly_All_Pts group.

This gives you these 2 separate matches, which can be put into a temporary array of 2 strings:

 **  Grp 1 [Single_Poly_All_Pts] -  ( pos 1 , len 27 ) 
0 0, 0 10, 10 10, 10 0, 0 0  

 **  Grp 1 [Single_Poly_All_Pts] -  ( pos 31 , len 23 ) 
3 3, 3 7, 7 7, 7 3, 3 3  
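
A Go sketch of step 2, assuming the standard regexp package, where FindAllStringSubmatch plays the role of the loop match. The function name splitPolygons is hypothetical:

```go
package main

import (
	"fmt"
	"regexp"
)

// ringRe is the step 2 regex: one parenthesised polygon, with its point list
// captured in the Single_Poly_All_Pts group (group 1).
var ringRe = regexp.MustCompile(`(?:\(\s*(?P<Single_Poly_All_Pts>\d+\s+\d+(?:\s*,\s*\d+\s+\d+){2,})\s*\))`)

// splitPolygons (hypothetical helper) returns one string per polygon, each
// holding that polygon's comma-separated points.
func splitPolygons(all string) []string {
	var out []string
	// -1 asks for every non-overlapping match, i.e. a find-all.
	for _, m := range ringRe.FindAllStringSubmatch(all, -1) {
		out = append(out, m[1])
	}
	return out
}

func main() {
	for i, p := range splitPolygons("(0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3)") {
		fmt.Printf("polygon %d: %q\n", i, p)
	}
}
```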

3. If step 2 was successful, set up a loop match over the temporary array from step 2.
This gives you the individual points of each polygon.

(?P<Single_Point>\d+\s+\d+)

Again, this is a loop match (or a find-all match). For each array element
(polygon), it produces the individual points.

Target[element 1] 0 0, 0 10, 10 10, 10 0, 0 0

 **  Grp 1 [Single_Point] -  ( pos 0 , len 3 ) 
0 0  
 **  Grp 1 [Single_Point] -  ( pos 5 , len 4 ) 
0 10  
 **  Grp 1 [Single_Point] -  ( pos 11 , len 5 ) 
10 10  
 **  Grp 1 [Single_Point] -  ( pos 18 , len 4 ) 
10 0  
 **  Grp 1 [Single_Point] -  ( pos 24 , len 3 ) 
0 0  

And,

Target[element 2] 3 3, 3 7, 7 7, 7 3, 3 3

 **  Grp 1 [Single_Point] -  ( pos 0 , len 3 ) 
3 3  
 **  Grp 1 [Single_Point] -  ( pos 5 , len 3 ) 
3 7  
 **  Grp 1 [Single_Point] -  ( pos 10 , len 3 ) 
7 7  
 **  Grp 1 [Single_Point] -  ( pos 15 , len 3 ) 
7 3  
 **  Grp 1 [Single_Point] -  ( pos 20 , len 3 ) 
3 3  
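
Putting the three steps together, a Go sketch could look like this. It is only one way to do it, under these assumptions: the Ring type, parsePolygon, and Closed are hypothetical names, the ^ and $ anchors are added to the step 1 regex, and a typed struct stands in for the map[string]interface{} asked for in the question:

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Step 1: validate the whole form (anchors added here as an assumption).
	formRe = regexp.MustCompile(`^POLYGON\s*\(((?:\(\s*\d+\s+\d+(?:\s*,\s*\d+\s+\d+){2,}\s*\))(?:\s*,\(\s*\d+\s+\d+(?:\s*,\s*\d+\s+\d+){2,}\s*\))*)\s*\)$`)
	// Step 2: one polygon per match, point list in group 1.
	ringRe = regexp.MustCompile(`\(\s*(\d+\s+\d+(?:\s*,\s*\d+\s+\d+){2,})\s*\)`)
	// Step 3: one point per match.
	pointRe = regexp.MustCompile(`\d+\s+\d+`)
)

// Ring (hypothetical type) mirrors the "points" / "last_point" split from the question.
type Ring struct {
	Points    []string
	LastPoint string
}

// Closed reports whether the last point repeats the first one. It compares
// the raw strings, so "0 0" and "0  0" would count as different.
func (r Ring) Closed() bool {
	return len(r.Points) > 0 && r.Points[0] == r.LastPoint
}

// parsePolygon runs the three steps in order and returns one Ring per polygon.
func parsePolygon(wkt string) ([]Ring, error) {
	form := formRe.FindStringSubmatch(wkt)
	if form == nil {
		return nil, fmt.Errorf("not a valid POLYGON: %q", wkt)
	}
	var rings []Ring
	for _, m := range ringRe.FindAllStringSubmatch(form[1], -1) {
		pts := pointRe.FindAllString(m[1], -1)
		last := len(pts) - 1 // step 1 guarantees at least 3 points per ring
		rings = append(rings, Ring{Points: pts[:last], LastPoint: pts[last]})
	}
	return rings, nil
}

func main() {
	rings, err := parsePolygon("POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3))")
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, r := range rings {
		fmt.Println(r.Points, r.LastPoint, r.Closed())
	}
}
```

A struct keeps the sketch short. If the nested map from the question is really needed, each Ring can be converted into a map with "points" and "last_point" keys afterwards.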

huangapple
  • Posted on September 5, 2017, 23:16:18
  • When reposting, please keep the link to this article: https://go.coder-hub.com/46058388.html