golang regular expression to extract pairs of quantities and their units

Question

I have a set of human readable strings expressing a duration of time. Here are four examples:

1 days 40 hrs 23 min 50 sec

3 hrs 1 min 30 sec

10 days 23 min 11 sec

52 sec

I am trying to convert these strings into a number of seconds. The math is simple once the string is broken down into its components: it's just multiplication and addition. However, I am having trouble writing the regular expression that parses the string into [<quantity>, <unit>] pairs. As an example, the output I would like for the string:

1 days 40 hrs 23 min 50 sec

is an array (or slice) like:

[[1, "days"], [40, "hrs"], [23, "min"], [50, "sec"]].

Below is the code for what I've tried so far and its output (executable at http://play.golang.org/p/iR-xfc8MVQ). segs was my first attempt, which breaks the string down into 4 components correctly, but each component is just a string like 1 days rather than a 2-element array like [1, days]. segs2 was my second attempt, which does something stranger: each component appears to be repeated twice.

// time unit tokenizer
package main

import "fmt"
import "regexp"

func main() {
	s := "1 days 40 hrs 23 min 50 sec"
	re := regexp.MustCompile("(?P<quant>\\d+) (?P<unit>\\w+)+")

	segs := re.FindAllString(s, -1)
	fmt.Println("segs:", segs)
	fmt.Println(segs[0], ",", segs[1], ",", segs[2], ",", segs[3])
	fmt.Println("length segs:", len(segs))

	segs2 := re.FindAllStringSubmatch(s, -1)
	fmt.Println("segs2:", segs2)
	fmt.Println(segs2[0], ",", segs2[1], ",", segs2[2], ",", segs2[3])
	fmt.Println("length segs2:", len(segs2))
}

Output:

segs: [1 days 40 hrs 23 min 50 sec]
1 days , 40 hrs , 23 min , 50 sec
length segs: 4
segs2: [[1 days 1 days] [40 hrs 40 hrs] [23 min 23 min] [50 sec 50 sec]]
[1 days 1 days] , [40 hrs 40 hrs] , [23 min 23 min] , [50 sec 50 sec]
length segs2: 4

I've written a similar regex in Python which works OK, so I'm not sure whether I am using Go's regular expression syntax incorrectly or making the wrong call on the re object.

Answer 1

Score: 8

Regexp.FindAllStringSubmatch returns a [][]string, but its contents differ slightly from the return value of the Python function re.findall (I assume that you used re.findall in Python).

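  • return_value[i][0] contains the whole matched string.
  • return_value[i][1] contains captured group 1.
  • return_value[i][2] contains captured group 2, and so on.

Printing return_value[i] causes all items in return_value[i] to be printed (return_value[i][0], return_value[i][1], return_value[i][2], ...). That is why each component appears to be repeated twice: the whole match "1 days" is printed first, followed by the two groups "1" and "days".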
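
To make that layout visible, here is a minimal sketch that prints each match with %q, so the three strings in every inner slice are quoted separately. The helper name submatches is made up for this illustration; the pattern is the one from the question.

```go
package main

import (
	"fmt"
	"regexp"
)

// submatches is a hypothetical helper: for every match it returns the whole
// match at index 0, followed by the capture groups at indexes 1 and 2.
func submatches(s string) [][]string {
	re := regexp.MustCompile(`(?P<quant>\d+) (?P<unit>\w+)+`)
	return re.FindAllStringSubmatch(s, -1)
}

func main() {
	for _, m := range submatches("1 days 40 hrs 23 min 50 sec") {
		// %q quotes each element, so "1 days", "1" and "days" are easy to tell apart.
		fmt.Printf("%q\n", m)
	}
}
```

The first line printed should be ["1 days" "1" "days"], which Println renders without quotes as [1 days 1 days].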


You can get what you expected by printing only the captured groups (excluding [0]), as follows:

segs2 := re.FindAllStringSubmatch(s, -1)
for i := 0; i < len(segs2); i++ {
	fmt.Println(segs2[i][1], ",", segs2[i][2])
}
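
Once the pairs are available, the conversion to seconds that the question is ultimately after could look roughly like the sketch below. The function totalSeconds and the secondsPerUnit table are assumptions introduced for this example; the table only covers the unit spellings that appear in the question's sample strings.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Same pattern as in the question, written as a raw string literal.
var pairRe = regexp.MustCompile(`(?P<quant>\d+) (?P<unit>\w+)+`)

// secondsPerUnit is an assumed lookup table for the units seen in the
// question's examples; other spellings would need to be added.
var secondsPerUnit = map[string]int{
	"days": 86400,
	"hrs":  3600,
	"min":  60,
	"sec":  1,
}

// totalSeconds sums quantity * seconds-per-unit over every pair found in s.
func totalSeconds(s string) (int, error) {
	total := 0
	for _, m := range pairRe.FindAllStringSubmatch(s, -1) {
		n, err := strconv.Atoi(m[1]) // m[1] is the quantity group
		if err != nil {
			return 0, err
		}
		per, ok := secondsPerUnit[m[2]] // m[2] is the unit group
		if !ok {
			return 0, fmt.Errorf("unknown unit %q", m[2])
		}
		total += n * per
	}
	return total, nil
}

func main() {
	total, err := totalSeconds("1 days 40 hrs 23 min 50 sec")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(total) // 86400 + 40*3600 + 23*60 + 50 = 231830
}
```

Returning an error for an unknown unit is a design choice made here; silently skipping such pairs would also be possible but would hide typos in the input.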

Side Note

The following string literal:

"(?P<quant>\\d+) (?P<unit>\\w+)+"

can be expressed as the following raw string literal, which needs no doubled backslashes:

`(?P<quant>\d+) (?P<unit>\w+)+`

See String literals in the Go language specification.
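
As a quick check of the side note, the sketch below compares the two literals and, since the pattern uses named groups, also looks the captures up by name through Regexp.SubexpNames. The constant names and the namedGroups helper are invented for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

const (
	interpreted = "(?P<quant>\\d+) (?P<unit>\\w+)+" // backslashes must be escaped
	raw         = `(?P<quant>\d+) (?P<unit>\w+)+`   // backslashes are taken literally
)

// namedGroups is a hypothetical helper that returns the first match in s
// as a map from group name to captured text.
func namedGroups(s string) map[string]string {
	re := regexp.MustCompile(raw)
	m := re.FindStringSubmatch(s)
	out := map[string]string{}
	if m == nil {
		return out
	}
	for i, name := range re.SubexpNames() {
		if name != "" { // index 0 is the whole match and has no name
			out[name] = m[i]
		}
	}
	return out
}

func main() {
	fmt.Println(interpreted == raw) // true: both literals denote the same string
	g := namedGroups("40 hrs")
	fmt.Println(g["quant"], g["unit"]) // 40 hrs
}
```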

huangapple
  • Posted on January 5, 2014 at 13:55:04
  • If you repost this article, please keep the link to it: https://go.coder-hub.com/20930643.html