英文:
Validate correct date time format fetched from CSV
问题
我需要读取一个CSV文件,其中一个时间戳列中有不同的时间格式。它可以是以下提到的5种格式中的任何一种。我需要匹配提取的日期并根据每行进行相应解析。
如何验证和解析它?
public static final String DEFAULT_DATE_FORMAT_PATTERN = "yyyy-MM-dd";
public static final String DEFAULT_DATE_TIME_FORMAT_PATTERN = "yyyy-MM-dd HH:mm:ss.SSS";
public static final String DATE_TIME_MINUTES_ONLY_FORMAT_PATTERN = "yyyy-MM-dd HH:mm";
public static final String DATE_TIME_WITHOUT_MILLIS_FORMAT_PATTERN = "yyyy-MM-dd HH:mm:ss";
// 以毫秒为单位的Epoch
英文:
I need to read a CSV file which has different time format in one timestamp column. It can be anything from below mentioned 5 formats. I need to match the fetched date and parse accordingly on each row.
How can I validate and parse it?
public static final String DEFAULT_DATE_FORMAT_PATTERN = "yyyy-MM-dd";
public static final String DEFAULT_DATE_TIME_FORMAT_PATTERN = "yyyy-MM-dd HH:mm:ss.SSS";
public static final String DATE_TIME_MINUTES_ONLY_FORMAT_PATTERN = "yyyy-MM-dd HH:mm";
public static final String DATE_TIME_WITHOUT_MILLIS_FORMAT_PATTERN = "yyyy-MM-dd HH:mm:ss";
// Epoch in milli
答案1
得分: 4
你所需要的是一个带有可选部分的格式化程序。模式可以包含方括号,以表示可选部分,例如 HH:mm[:ss]
。然后格式化程序需要解析 HH:mm
,并尝试解析接下来的文本作为 :ss
,如果失败则跳过。yyyy-MM-dd[ HH:mm[:ss[.SSS]]]
将会是该模式。
这里只有一个问题 - 当你尝试使用 LocalDateTime::parse
解析一个没有时间部分的 yyyy-MM-dd
模式字符串时,它会抛出一个带有消息 无法从TemporalAccessor获取LocalDateTime 的 DateTimeFormatException
。显然,至少要有一个时间部分才能成功。
幸运的是,我们可以使用 DateTimeFormatterBuilder
来构建一个模式,指示格式化程序在解析的文本中缺少信息时使用一些默认值。以下是代码示例:
DateTimeFormatter formatter = new DateTimeFormatterBuilder()
.appendPattern("yyyy-MM-dd[ HH:mm[:ss[.SSS]]]")
.parseDefaulting(ChronoField.HOUR_OF_DAY, 0)
.parseDefaulting(ChronoField.MINUTE_OF_HOUR, 0)
.parseDefaulting(ChronoField.SECOND_OF_MINUTE, 0)
.toFormatter();
LocalDateTime dateTime = LocalDateTime.parse(input, formatter);
测试:
String[] inputs = {
"2020-10-22", // 正常
"2020-10-22 14:55", // 正常
"2020-10-22T14:55", // 失败:不正确的格式
"2020-10-22 14:55:23",
"2020-10-22 14:55:23.9", // 失败:不正确的秒分数
"2020-10-22 14:55:23.91", // 失败:不正确的秒分数
"2020-10-22 14:55:23.917", // 正常
"2020-10-22 14:55:23.9174", // 失败:不正确的秒分数
"2020-10-22 14:55:23.917428511" // 失败:不正确的秒分数
};
那么毫秒级的 epoch 时间呢?
嗯,这不能直接由 DateTimeFormatter
解析。而且更重要的是:毫秒级的 epoch 时间有一个隐式的时区:UTC。其他的模式缺少时区信息。因此,毫秒级的 epoch 时间是一种基本不同的信息。你可以假设输入缺少时区的情况下使用一个时区。
然而,如果你仍然想解析这个时间戳,你可以尝试将它作为 long
使用 Long::parseLong
解析,如果失败,再尝试使用格式化程序解析。另外,你可以使用正则表达式(例如 -?\d+
或类似的)来尝试匹配这个时间戳,如果匹配成功,就解析为时间戳,如果失败,再尝试使用上述提到的格式化程序解析。
英文:
What you need is a formatter with optional parts. A pattern can contain square brackets to denote an optional part, for example HH:mm[:ss]
. The formatter then is required to parse HH:mm
, and tries to parse the following text as :ss
, or skips it if that fails. yyyy-MM-dd[ HH:mm[:ss[.SSS]]]
would then be the pattern.
There is only one issue here – when you try to parse a string with the pattern yyyy-MM-dd
(so without time part) using LocalDateTime::parse
, it will throw a DateTimeFormatException
with the message Unable to obtain LocalDateTime from TemporalAccessor. Apparently, at least one time part must be available to succeed.
Luckily, we can use a DateTimeFormatterBuilder
to build a pattern, instructing the formatter to use some defaults if information is missing from the parsed text. Here it is:
DateTimeFormatter formatter = new DateTimeFormatterBuilder()
.appendPattern("yyyy-MM-dd[ HH:mm[:ss[.SSS]]]")
.parseDefaulting(ChronoField.HOUR_OF_DAY, 0)
.parseDefaulting(ChronoField.MINUTE_OF_HOUR, 0)
.parseDefaulting(ChronoField.SECOND_OF_MINUTE, 0)
.toFormatter();
LocalDateTime dateTime = LocalDateTime.parse(input, formatter);
Tests:
String[] inputs = {
"2020-10-22", // OK
"2020-10-22 14:55", // OK
"2020-10-22T14:55", // Fails: incorrect format
"2020-10-22 14:55:23",
"2020-10-22 14:55:23.9", // Fails: incorrect fraction of second
"2020-10-22 14:55:23.91", // Fails: incorrect fraction of second
"2020-10-22 14:55:23.917", // OK
"2020-10-22 14:55:23.9174", // Fails: incorrect fraction of second
"2020-10-22 14:55:23.917428511" // Fails: incorrect fraction of second
};
And what about epoch in milli?
Well, this cannot be parsed directly by the DateTimeFormatter
. But what's more: an epoch in milli has an implicit timezone: UTC. The other patterns lack a timezone. So an epoch is a fundamentally different piece of information. One thing you could do is assume a timezone for the inputs missing one.
However, if you nevertheless want to parse the instant, you could try to parse it as a long
using Long::parseLong
, and if it fails, then try to parse with the formatter. Alternatively, you could use a regular expression (like -?\d+
or something) to try to match the instant, and if it does, then parse as instant, and if it fails, then try to parse with the abovementioned formatter.
答案2
得分: 1
蛮力方法:
- 依次尝试4种格式,逐一解析传入的字符串
- 如果解析抛出异常,尝试下一个格式
- 如果解析通过,那么就匹配了该格式
当然,如果我们讨论更大的表格,这是相当低效的。可能的优化方法:
- 显然,不同的模式有微小的差异,所以你可以首先使用
indexOf()
检查。比如:如果要解析的值不包含':'
字符,那么它只能是第一个模式。 - 你可以手动查看数据,以确定实际使用的模式分布。然后,根据模式在数据中被使用的可能性来调整尝试模式的顺序。
或者:你可以定义自己的正则表达式。唯一让它稍微不太美观的是,你的输入使用的是月份名称,而不是月份数字。但我认为编写一个覆盖所有情况的单个正则表达式应该不会太难。
英文:
The brute force approach:
- simply try your 4 formats, one after the other to parse the incoming string
- if parsing throws an exception, try the next one
- if parsing passes, well, that format just matched
Of course, if we are talking about larger tables, that is quite inefficient. Possible optimisations:
- obviously, the different patterns have subtle differences, so you could use
indexOf()
checks first. Like: if the value to be parsed contains no':'
char, then it can only be the first pattern. - you can look at your data manually to figure the actual distribution of patterns that are used. then you adapt the order of patterns to try to the likelihood of the pattern being used in your data
Alternatively: you could define your own regex. The only thing that makes it slightly ugly is the fact that your input uses month names, not month number. But I think it shouldn't be too hard to write up a single regex that covers all your cases.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论