Go parsing dates from substring within string
Question
I am writing a log file parser and have written some test code in C to parse the log lines.
The string to be parsed looks like this:
s := `10.0.0.1 Jan 11 2014 10:00:00 hello`
In C, parsing this in place is quite easy. First I find the pointer to the date within the string, then consume as much as possible using strptime(). This works because strptime() returns a pointer to the first character it did not consume.
Eventually I decided to go with Go instead of C, but I ran into some issues while porting the code. As far as I can tell, time.Parse() gives me no way to parse from within an existing string (though this can be solved with slices), nor any indication of how much of the original string it has consumed when parsing the date.
Is there an elegant way in Go to parse the date/time right out of the string without first having to extract the datetime into an exact slice, e.g. by having the parser return the number of characters it consumed?
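For reference, the slicing workaround mentioned above might look roughly like the sketch below. It assumes the timestamp is fixed-width (every field zero-padded, so it is exactly as long as the layout string) and that it starts right after the first space. parseAt is a hypothetical helper, not part of the time package.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// layout is assumed to be fixed-width: every field is zero-padded, so a
// matching timestamp always occupies exactly len(layout) bytes.
const layout = "Jan 02 2006 15:04:05"

// parseAt (hypothetical helper) parses a timestamp that starts at byte
// offset start of s. It returns the time and the offset just past the
// timestamp, which is roughly what strptime() reports in C.
func parseAt(s string, start int) (time.Time, int, error) {
	end := start + len(layout)
	if start < 0 || end > len(s) {
		return time.Time{}, start, fmt.Errorf("no room for a timestamp at offset %d", start)
	}
	t, err := time.Parse(layout, s[start:end])
	if err != nil {
		return time.Time{}, start, err
	}
	return t, end, nil
}

func main() {
	s := `10.0.0.1 Jan 11 2014 10:00:00 hello`
	start := strings.IndexByte(s, ' ') + 1 // the date begins after the first space
	t, end, err := parseAt(s, start)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("time:", t)
	fmt.Printf("rest: %q\n", s[end:])
}
```

Slicing a string in Go does not copy its bytes, so this stays close to the in-place C approach, but it breaks as soon as the timestamp width can vary.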
Answer 1
Score: 3
Unfortunately, time.Parse can't tell you how many characters it parsed, so we need to look at other solutions. For parsing log statements like yours, regular expressions, as @rob74 suggested, are a reasonably elegant strategy. The example below ignores errors for brevity:
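// r captures the IP address, the timestamp, and the rest of the line.
var r = regexp.MustCompile(`^((?:\d{1,3}\.){3}\d{1,3}) ([a-zA-Z]{3} \d{1,2} \d{4} \d{1,2}:\d{2}:\d{2}) (.*)`)
// longForm is the time.Parse layout for timestamps such as "Jan 11 2014 10:00:00".
const longForm = "Jan 02 2006 15:04:05"
// parseRegex splits a log line into its IP, message, and timestamp.
// Errors are ignored for brevity: a line that does not match leaves m nil,
// so indexing it panics.
func parseRegex(s string) (ip, msg string, t time.Time) {
    m := r.FindStringSubmatch(s)
    t, _ = time.Parse(longForm, m[2])
    ip, msg = m[1], m[3]
    return ip, msg, t
}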
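Since the example skips error handling, here is a minimal sketch of a variant that reports failures instead of panicking. It reuses the same pattern and layout; the name parseRegexChecked and the error wording are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

// Same pattern and layout as in the answer.
var r = regexp.MustCompile(`^((?:\d{1,3}\.){3}\d{1,3}) ([a-zA-Z]{3} \d{1,2} \d{4} \d{1,2}:\d{2}:\d{2}) (.*)`)

const longForm = "Jan 02 2006 15:04:05"

// parseRegexChecked is a hypothetical error-reporting variant of parseRegex.
func parseRegexChecked(s string) (ip, msg string, t time.Time, err error) {
	m := r.FindStringSubmatch(s)
	if m == nil {
		// FindStringSubmatch returns nil when the line does not match.
		return "", "", time.Time{}, fmt.Errorf("unrecognized log line: %q", s)
	}
	t, err = time.Parse(longForm, m[2])
	if err != nil {
		return "", "", time.Time{}, fmt.Errorf("bad timestamp %q: %w", m[2], err)
	}
	return m[1], m[3], t, nil
}

func main() {
	lines := []string{
		`10.0.0.1 Jan 11 2014 10:00:00 hello`,
		`not a log line`,
	}
	for _, line := range lines {
		ip, msg, t, err := parseRegexChecked(line)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(ip, t, msg)
	}
}
```

In a real log parser, malformed lines are common enough that returning an error is usually worth the extra few lines.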
Benchmarks show the above regular expression to be about twice as fast as @rob74's example on my machine, parsing roughly 58,000 lines per second (17,130 ns per line):
BenchmarkParseRegex 100000 17130 ns/op
BenchmarkParseRegexRob74 50000 32788 ns/op
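The benchmark code itself is not shown in the answer. As a rough sketch of how a figure like this might be measured, the standard testing.Benchmark function can time parseRegex from an ordinary program. The sample line and the loop body are assumptions, and the numbers will vary with the machine and Go version.

```go
package main

import (
	"fmt"
	"regexp"
	"testing"
	"time"
)

var r = regexp.MustCompile(`^((?:\d{1,3}\.){3}\d{1,3}) ([a-zA-Z]{3} \d{1,2} \d{4} \d{1,2}:\d{2}:\d{2}) (.*)`)

const longForm = "Jan 02 2006 15:04:05"

// parseRegex is the function from the answer, errors ignored as there.
func parseRegex(s string) (ip, msg string, t time.Time) {
	m := r.FindStringSubmatch(s)
	t, _ = time.Parse(longForm, m[2])
	ip, msg = m[1], m[3]
	return ip, msg, t
}

func main() {
	// Assumed sample input: the line from the question.
	const line = `10.0.0.1 Jan 11 2014 10:00:00 hello`

	// testing.Benchmark picks b.N itself and reports the time per iteration.
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			parseRegex(line)
		}
	})
	fmt.Println("BenchmarkParseRegex", res)

	// Convert ns/op into lines per second.
	if ns := res.NsPerOp(); ns > 0 {
		fmt.Printf("about %d lines per second\n", int64(time.Second)/ns)
	}
}
```

The same loop can also live in a BenchmarkParseRegex function in a _test.go file and be run with go test -bench, which produces output in the format shown above.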
We can, however, make the solution both shorter and more efficient by using strings.SplitN instead. For example:
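// parseSplit relies on the timestamp occupying exactly four space-separated fields.
// Errors are ignored for brevity: a line with fewer than six fields causes a panic.
func parseSplit(s string) (ip, msg string, t time.Time) {
    // parts: the IP, four timestamp fields, then the untouched rest of the line.
    parts := strings.SplitN(s, " ", 6)
    t, _ = time.Parse(longForm, strings.Join(parts[1:5], " "))
    ip, msg = parts[0], parts[5]
    return ip, msg, t
}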
This splits the string at the first 5 spaces and puts the remainder (the message part) in the final element of the parts slice. This is not very elegant, since it relies on the number of spaces in the date format, but we could count the spaces in the date format string programmatically for a more general solution. Let's see how this compares to our regular expression solution:
BenchmarkParseRegex 100000 17130 ns/op
BenchmarkParseSplit 500000 3557 ns/op
It compares quite favorably, as it turns out. Using SplitN is about five times faster than using regular expressions and still results in concise, readable code. It does this at the cost of slightly more memory for the slice allocation.
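The more general variant mentioned above, which counts the spaces in the layout string instead of hard-coding 6, could be sketched as follows. It assumes fields are separated by exactly one space and that a timestamp contains the same number of spaces as its layout. parseSplitLayout is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

const longForm = "Jan 02 2006 15:04:05"

// parseSplitLayout (hypothetical) splits a line of the form
// "<ip> <timestamp> <message>", deriving the number of timestamp fields
// from the number of spaces in layout.
func parseSplitLayout(s, layout string) (ip, msg string, t time.Time, err error) {
	// A layout with k spaces spans k+1 space-separated fields.
	dateFields := strings.Count(layout, " ") + 1
	// One field for the IP, dateFields for the timestamp, one for the message.
	want := dateFields + 2
	parts := strings.SplitN(s, " ", want)
	if len(parts) < want {
		return "", "", time.Time{}, fmt.Errorf("expected %d fields, got %d", want, len(parts))
	}
	t, err = time.Parse(layout, strings.Join(parts[1:1+dateFields], " "))
	if err != nil {
		return "", "", time.Time{}, err
	}
	return parts[0], parts[want-1], t, nil
}

func main() {
	ip, msg, t, err := parseSplitLayout(`10.0.0.1 Jan 11 2014 10:00:00 hello world`, longForm)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(ip, t, msg)
}
```

With longForm this yields the same SplitN limit of 6 as the hand-written version, and the message keeps any spaces it contains because it is always the final element.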
Answer 2
Score: 0
The code in this answer can be run here: http://play.golang.org/p/EP-waAPGB4
Maybe you should consider using a regular expression to split the log line, e.g.:
package main
import (
    "fmt"
    "regexp"
    "time"
)
func main() {
    s := "10.0.0.1 Jan 11 2014 10:00:00 hello"
    // Groups: 1 = IP or host (everything up to the first space), 2 = timestamp, 3 = message.
    r := regexp.MustCompile(`^(\S+) ([a-zA-Z]+ [0-9]{1,2} [0-9]{4} [0-9]{1,2}:[0-9]{2}:[0-9]{2}) (.*)`)
    m := r.FindStringSubmatch(s)
    if len(m) >= 4 {
        fmt.Println("IP:", m[1])
        fmt.Println("Timestamp:", m[2])
        fmt.Println("Message:", m[3])
        // Parse only the captured timestamp substring.
        t, err := time.Parse("Jan 02 2006 15:04:05", m[2])
        if err != nil {
            fmt.Println(err.Error())
        } else {
            fmt.Println("Parsed Time:", t)
        }
    } else {
        fmt.Println("Regexp mismatch!")
    }
}