Parse Date from String: hidden runes

Question

I'm parsing dates directly from an HTML file and attempting to convert them to dates. However, doing so always results in an error; the strangest part is that parsing succeeds if I paste the string directly into the code.

layout := "02-01-2006 15:04:05"
t, err := time.Parse(layout, *date)

if err != nil {
	fmt.Println(err)
}

Yields:

parsing time "12-06-2021   00:00:31" as "02-01-2006 15:04:05": cannot parse "  00:00:31" as "15"

If, however, I parse directly from a string literal, it works:

layout := "02-01-2006 15:04:05"
date := "12-06-2021   00:00:31"
t, err := time.Parse(layout, date)

if err != nil {
	fmt.Println(err)
}

That works just fine. I tried removing whitespace in every way I could think of, but the problem persists:

date2 := *date
date2 = strings.TrimSpace(date2)
date2 = strings.TrimRight(date2, "\r\n")
date2 = strings.TrimRight(date2, "\n")

space := regexp.MustCompile(`\s+`)
date2 = space.ReplaceAllString(date2, "")
date2 = strings.ReplaceAll(date2, " ", "")
date2 = strings.ReplaceAll(date2, "\r", "")
date2 = strings.ReplaceAll(date2, "\n", "")

This suggests there are hidden runes. Finally, I resorted to printing the actual runes in both the original string (from pointer) and my pasted version, and this is what I get.

Original string:

0: U+0031 '1'
1: U+0033 '3'
2: U+002D '-'
3: U+0030 '0'
4: U+0037 '7'
5: U+002D '-'
6: U+0032 '2'
7: U+0030 '0'
8: U+0032 '2'
9: U+0031 '1'
10: U+00A0
12: U+0031 '1'
13: U+0030 '0'
14: U+003A ':'
15: U+0030 '0'
16: U+0030 '0'
17: U+003A ':'
18: U+0030 '0'
19: U+0030 '0'

Hand-pasted string:

0: U+0031 '1'
1: U+0032 '2'
2: U+002D '-'
3: U+0030 '0'
4: U+0036 '6'
5: U+002D '-'
6: U+0032 '2'
7: U+0030 '0'
8: U+0032 '2'
9: U+0031 '1'
10: U+0020 ' '
11: U+0030 '0'
12: U+0030 '0'
13: U+003A ':'
14: U+0030 '0'
15: U+0030 '0'
16: U+003A ':'
17: U+0033 '3'
18: U+0032 '2'

Immediately we can spot the problem: there is an additional rune in position 11, but for some reason, it is not shown. How come? What is it, and how do I remove it?

Answer 1

Score: 2

U+00A0 is the non-breaking space character. It's often used in datetimes formatted for human use to ensure the datetime won't be wrapped by the displaying program.
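
This also seems to explain why the rune dump in the question jumps from index 10 to index 12: ranging over a Go string yields the byte offset of each rune, and U+00A0 takes two bytes in UTF-8. A minimal sketch follows; the helper name runeOffsets and the sample string are made up for illustration and are not from the question.

```go
package main

import "fmt"

// runeOffsets returns the byte offset at which each rune of s starts.
// Ranging over a string yields byte offsets, not rune counts, so a
// multi-byte rune leaves a gap in the sequence of indices.
// (Hypothetical helper, for illustration only.)
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	// Assumed sample: a year and an hour separated by a non-breaking space.
	s := "2021\u00a010"

	// Dump every rune with its byte offset, similar to the question's dump.
	for i, r := range s {
		fmt.Printf("%d: %U %q\n", i, r, r)
	}

	// The offset after the non-breaking space skips one value.
	fmt.Println(runeOffsets(s))
}
```

So the "missing" index is not a second hidden rune; it is the second byte of the non-breaking space.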

You might want to just try replacing any U+00A0 characters (written "\u00a0" in a Go string literal) with a regular space first.

As for why the regexp won't do anything: U+00A0 isn't matched by your \s pattern, since the docs define \s strictly as

\s             whitespace (== [\t\n\f\r ])
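
If a regexp is still preferred, one option is to add the Unicode space-separator category \p{Zs}, which includes U+00A0, to the character class. This is only a sketch; the variable names and the helper collapseSpaces are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// asciiSpace is the pattern from the question; \s covers only
	// tab, newline, form feed, carriage return and the ASCII space.
	asciiSpace = regexp.MustCompile(`\s+`)

	// unicodeSpace additionally matches the Unicode "space separator"
	// category (Zs), which contains the non-breaking space U+00A0.
	unicodeSpace = regexp.MustCompile(`[\s\p{Zs}]+`)
)

// collapseSpaces replaces each run of ASCII or Unicode spaces with a
// single regular space. (Hypothetical helper, for illustration only.)
func collapseSpaces(s string) string {
	return unicodeSpace.ReplaceAllString(s, " ")
}

func main() {
	// Assumed sample: the date and time are separated by U+00A0.
	raw := "13-07-2021\u00a010:00:00"

	fmt.Println(asciiSpace.MatchString(raw))   // does \s see the separator?
	fmt.Println(unicodeSpace.MatchString(raw)) // does the wider class see it?
	fmt.Printf("%q\n", collapseSpaces(raw))
}
```

Replacing with a single space rather than an empty string keeps the separator that the layout expects between the date and the time.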

huangapple
  • Posted on July 12, 2021, 22:55:46
  • When reposting, please keep this link: https://go.coder-hub.com/68349540.html