Get a named list of subgroups in a Golang regex

Question

I'm looking for a function that returns a map[string]interface{}, where interface{} can be a slice, a map[string]interface{}, or a value.

My use case is to parse WKT geometry like the following and retrieve the point values. Example for a donut polygon:

POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3))

The regex (I deliberately used \d, which matches only single digits, for readability):

    (POLYGON \(
    (?P<polygons>\(
    (?P<points>(?P<point>(\d \d), ){3,})
    (?P<last_point>\d \d )\),)*
    (?P<last_polygon>\(
    (?P<points>(?P<point>(\d \d), ){3,})
    (?P<last_point>\d \d)\))\)
    )

I have a function (copied from SO) that retrieves some information, but it does not handle nested groups or lists of groups well:

    func getRegexMatchParams(reg *regexp.Regexp, url string) (paramsMap map[string]string) {
        match := reg.FindStringSubmatch(url)
        paramsMap = make(map[string]string)
        for i, name := range reg.SubexpNames() {
            // Skip group 0 (the whole match) and stay within the bounds of match.
            if i > 0 && i < len(match) {
                paramsMap[name] = match[i]
            }
        }
        return paramsMap
    }

It seems that the point group captures only one point.
Example on the playground.

[EDIT] The result I want is something like this:

    map[string]interface{}{
        "polygons": map[string]interface{}{
            "points": []interface{}{
                map[string]string{"point": "0 0"},
                map[string]string{"point": "0 10"},
                map[string]string{"point": "10 10"},
                map[string]string{"point": "10 0"},
            },
            "last_point": "0 0",
        },
        "last_polygon": map[string]interface{}{
            "points": []interface{}{
                map[string]string{"point": "3 3"},
                map[string]string{"point": "3 7"},
                map[string]string{"point": "7 7"},
                map[string]string{"point": "7 3"},
            },
            "last_point": "3 3",
        },
    }

That way I can use it for different purposes, such as querying databases and validating that last_point equals points[0] for each polygon.

Answer 1

Score: 2

Try allowing for some whitespace in the regex.

Also note that this engine does not retain every value captured by a group that sits inside a quantified outer grouping. In (a|b|c)+, for example, the group only contains the last a, b, or c it matched.
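To see that behaviour in isolation, here is a minimal sketch using Go's standard regexp package. The helper name lastCapture is made up for this illustration; only the (a|b|c)+ pattern comes from the text above.

```go
package main

import (
	"fmt"
	"regexp"
)

// lastCapture (hypothetical helper) returns what capture group 1 holds
// after matching s against (a|b|c)+, or "" when there is no match.
func lastCapture(s string) string {
	re := regexp.MustCompile(`(a|b|c)+`)
	m := re.FindStringSubmatch(s)
	if m == nil {
		return ""
	}
	return m[1]
}

func main() {
	// The whole match is "abc", but group 1 only remembers the final iteration.
	fmt.Println(lastCapture("abc")) // prints "c"
}
```

This is why the point group in the question appears to hold a single point: every repetition overwrites the previous capture.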

Also, your regex can be reduced to this:

    (POLYGON\s*\((?P<polygons>\(\s*(?P<points>(?P<point>\s*(\d+\s+\d+)\s*,){3,})\s*(?P<last_point>\d+\s+\d+)\s*\)(?:\s*,\s*|\s*\)))+)

https://play.golang.org/p/rLaaEa_7GX

The original:

    (POLYGON\s*\((?P<polygons>\(\s*(?P<points>(?P<point>\s*(\d+\s+\d+)\s*,){3,})\s*(?P<last_point>\d+\s+\d+)\s*\),)*(?P<last_polygon>\(\s*(?P<points>(?P<point>\s*(\d+\s+\d+)\s*,){3,})\s*(?P<last_point>\d+\s+\d+)\s*\))\s*\))

https://play.golang.org/p/rZgJYPDMzl

See below for what the groups contain.

    (                                          # (1 start)
      POLYGON \s* \(
      (?P<polygons>                            # (2 start)
        \( \s*
        (?P<points>                            # (3 start)
          (?P<point>                           # (4 start)
            \s*
            ( \d+ \s+ \d+ )                    # (5)
            \s*
            ,
          ){3,}                                # (4 end)
        )                                      # (3 end)
        \s*
        (?P<last_point> \d+ \s+ \d+ )          # (6)
        \s* \),
      )*                                       # (2 end)
      (?P<last_polygon>                        # (7 start)
        \( \s*
        (?P<points>                            # (8 start)
          (?P<point>                           # (9 start)
            \s*
            ( \d+ \s+ \d+ )                    # (10)
            \s*
            ,
          ){3,}                                # (9 end)
        )                                      # (8 end)
        \s*
        (?P<last_point> \d+ \s+ \d+ )          # (11)
        \s* \)
      )                                        # (7 end)
      \s* \)
    )                                          # (1 end)

Input:

    POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3))

Output:

    ** Grp 0 - ( pos 0 , len 65 )
    POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3))
    ** Grp 1 - ( pos 0 , len 65 )
    POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3))
    ** Grp 2 [polygons] - ( pos 9 , len 30 )
    (0 0, 0 10, 10 10, 10 0, 0 0),
    ** Grp 3 [points] - ( pos 10 , len 23 )
    0 0, 0 10, 10 10, 10 0,
    ** Grp 4 [point] - ( pos 27 , len 6 )
    10 0,
    ** Grp 5 - ( pos 28 , len 4 )
    10 0
    ** Grp 6 [last_point] - ( pos 34 , len 3 )
    0 0
    ** Grp 7 [last_polygon] - ( pos 39 , len 25 )
    (3 3, 3 7, 7 7, 7 3, 3 3)
    ** Grp 8 [points] - ( pos 40 , len 19 )
    3 3, 3 7, 7 7, 7 3,
    ** Grp 9 [point] - ( pos 54 , len 5 )
    7 3,
    ** Grp 10 - ( pos 55 , len 3 )
    7 3
    ** Grp 11 [last_point] - ( pos 60 , len 3 )
    3 3

Possible Solution

It's not impossible. It just takes a few extra steps.
(As an aside, isn't there a WKT library that can parse this for you?)

I don't know what your language's regex capabilities are, so this is just a general approach.

1. Validate the form you're parsing.
This validates the input and returns all polygon sets as a single string in the All_Polygons group.

Target: POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3))

    POLYGON\s*\((?P<All_Polygons>(?:\(\s*\d+\s+\d+(?:\s*,\s*\d+\s+\d+){2,}\s*\))(?:\s*,\(\s*\d+\s+\d+(?:\s*,\s*\d+\s+\d+){2,}\s*\))*)\s*\)

    ** Grp 1 [All_Polygons] - ( pos 9 , len 55 )
    (0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3)
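In Go, step 1 could look roughly like the sketch below. The function name allPolygons is invented for the example, the pattern is the step 1 regex copied unchanged, and SubexpIndex assumes Go 1.15 or later (on older versions, index 1 can be used directly since it is the only capturing group).

```go
package main

import (
	"fmt"
	"regexp"
)

// reAll is the step 1 pattern from the answer, unchanged.
var reAll = regexp.MustCompile(`POLYGON\s*\((?P<All_Polygons>(?:\(\s*\d+\s+\d+(?:\s*,\s*\d+\s+\d+){2,}\s*\))(?:\s*,\(\s*\d+\s+\d+(?:\s*,\s*\d+\s+\d+){2,}\s*\))*)\s*\)`)

// allPolygons (hypothetical helper) validates wkt and returns the contents
// of the All_Polygons group. The boolean is false when the input does not
// have the expected form.
func allPolygons(wkt string) (string, bool) {
	m := reAll.FindStringSubmatch(wkt)
	if m == nil {
		return "", false
	}
	// SubexpIndex looks the group up by name (Go 1.15+).
	return m[reAll.SubexpIndex("All_Polygons")], true
}

func main() {
	all, ok := allPolygons("POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3))")
	fmt.Println(ok, all)
}
```

The pattern is not anchored, so it also accepts a valid polygon embedded in a longer string; wrapping it in ^ and $ would be one way to tighten that if needed.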

2. If step 1 was successful, set up a loop match over the All_Polygons string.

Target: (0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3)

    (?:\(\s*(?P<Single_Poly_All_Pts>\d+\s+\d+(?:\s*,\s*\d+\s+\d+){2,})\s*\))

This step is the equivalent of a find-all type of match. Each successive match contains all the points of a single polygon, returned as a string in the Single_Poly_All_Pts group.

This gives you these 2 separate matches, which can be put into a temp array holding 2 strings:

    ** Grp 1 [Single_Poly_All_Pts] - ( pos 1 , len 27 )
    0 0, 0 10, 10 10, 10 0, 0 0
    ** Grp 1 [Single_Poly_All_Pts] - ( pos 31 , len 23 )
    3 3, 3 7, 7 7, 7 3, 3 3
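The "loop match" of step 2 maps onto FindAllStringSubmatch in Go. A small sketch, with splitPolygons as an invented name and the step 2 pattern copied unchanged:

```go
package main

import (
	"fmt"
	"regexp"
)

// reSinglePoly is the step 2 pattern from the answer, unchanged.
var reSinglePoly = regexp.MustCompile(`(?:\(\s*(?P<Single_Poly_All_Pts>\d+\s+\d+(?:\s*,\s*\d+\s+\d+){2,})\s*\))`)

// splitPolygons (hypothetical helper) returns the point list of every
// parenthesised polygon found in allPolygons, in order of appearance.
func splitPolygons(allPolygons string) []string {
	var out []string
	// -1 asks for every non-overlapping match instead of only the first.
	for _, m := range reSinglePoly.FindAllStringSubmatch(allPolygons, -1) {
		out = append(out, m[1]) // m[1] is the Single_Poly_All_Pts group
	}
	return out
}

func main() {
	for _, p := range splitPolygons("(0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3)") {
		fmt.Println(p)
	}
}
```

Because each polygon is now its own match rather than a repetition inside one match, nothing gets overwritten.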

3. If step 2 was successful, set up a loop match over each element of the temp array from step 2.
This gives you the individual points of each polygon.

    (?P<Single_Point>\d+\s+\d+)

Again, this is a loop match (or a find-all type of match). For each array element (polygon), it produces the individual points.

Target[element 1]: 0 0, 0 10, 10 10, 10 0, 0 0

    ** Grp 1 [Single_Point] - ( pos 0 , len 3 )
    0 0
    ** Grp 1 [Single_Point] - ( pos 5 , len 4 )
    0 10
    ** Grp 1 [Single_Point] - ( pos 11 , len 5 )
    10 10
    ** Grp 1 [Single_Point] - ( pos 18 , len 4 )
    10 0
    ** Grp 1 [Single_Point] - ( pos 24 , len 3 )
    0 0

And,

Target[element 2]: 3 3, 3 7, 7 7, 7 3, 3 3

    ** Grp 1 [Single_Point] - ( pos 0 , len 3 )
    3 3
    ** Grp 1 [Single_Point] - ( pos 5 , len 3 )
    3 7
    ** Grp 1 [Single_Point] - ( pos 10 , len 3 )
    7 7
    ** Grp 1 [Single_Point] - ( pos 15 , len 3 )
    7 3
    ** Grp 1 [Single_Point] - ( pos 20 , len 3 )
    3 3
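Step 3 and the last_point check from the question could be sketched in Go as follows. The ring type, toRing and closed are names made up for this example, and a small struct stands in for the nested map[string]interface{} purely to keep the sketch short; the input is the temp array that step 2 produces.

```go
package main

import (
	"fmt"
	"regexp"
)

// rePoint is the step 3 pattern from the answer, unchanged.
var rePoint = regexp.MustCompile(`(?P<Single_Point>\d+\s+\d+)`)

// ring is a hypothetical container mirroring the "points" and "last_point"
// keys the question asks for.
type ring struct {
	points    []string // every point except the closing one
	lastPoint string   // the closing point
}

// toRing splits one polygon string from step 2 into its points.
// It reports false when fewer than two points are found.
func toRing(poly string) (ring, bool) {
	pts := rePoint.FindAllString(poly, -1)
	if len(pts) < 2 {
		return ring{}, false
	}
	return ring{points: pts[:len(pts)-1], lastPoint: pts[len(pts)-1]}, true
}

// closed reports whether the ring ends on the point it started from.
// It compares raw strings, so "0 0" and "0  0" would count as different.
func (r ring) closed() bool {
	return r.lastPoint == r.points[0]
}

func main() {
	// The temp array from step 2.
	polys := []string{"0 0, 0 10, 10 10, 10 0, 0 0", "3 3, 3 7, 7 7, 7 3, 3 3"}
	for _, p := range polys {
		if r, ok := toRing(p); ok {
			fmt.Println(r.points, r.lastPoint, r.closed())
		}
	}
}
```

If the map[string]interface{} shape is really needed, the same loop can fill it instead of the struct; the matching logic stays the same.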

huangapple
  • Posted on September 5, 2017, 23:16:18
  • When reposting, please keep the link to this article: https://go.coder-hub.com/46058388.html