验证从CSV获取的正确日期时间格式

huangapple go评论86阅读模式
英文:

Validate correct date time format fetched from CSV

问题

我需要读取一个CSV文件,其中一个时间戳列中有不同的时间格式。它可以是以下提到的5种格式中的任何一种。我需要匹配提取的日期并根据每行进行相应解析。
如何验证和解析它?

public static final String DEFAULT_DATE_FORMAT_PATTERN = "yyyy-MM-dd";
public static final String DEFAULT_DATE_TIME_FORMAT_PATTERN = "yyyy-MM-dd HH:mm:ss.SSS";
public static final String DATE_TIME_MINUTES_ONLY_FORMAT_PATTERN = "yyyy-MM-dd HH:mm";
public static final String DATE_TIME_WITHOUT_MILLIS_FORMAT_PATTERN = "yyyy-MM-dd HH:mm:ss";
// 以毫秒为单位的Epoch
英文:

I need to read a CSV file which has different time format in one timestamp column. It can be anything from below mentioned 5 formats. I need to match the fetched date and parse accordingly on each row.
How can I validate and parse it?

public static final String DEFAULT_DATE_FORMAT_PATTERN = "yyyy-MM-dd";
public static final String DEFAULT_DATE_TIME_FORMAT_PATTERN = "yyyy-MM-dd HH:mm:ss.SSS";
public static final String DATE_TIME_MINUTES_ONLY_FORMAT_PATTERN = "yyyy-MM-dd HH:mm";
public static final String DATE_TIME_WITHOUT_MILLIS_FORMAT_PATTERN = "yyyy-MM-dd HH:mm:ss";
// Epoch in milli

答案1

得分: 4

你所需要的是一个带有可选部分的格式化程序。模式可以包含方括号,以表示可选部分,例如 HH:mm[:ss]。然后格式化程序需要解析 HH:mm,并尝试解析接下来的文本作为 :ss,如果失败则跳过。yyyy-MM-dd[ HH:mm[:ss[.SSS]]] 将会是该模式。

这里只有一个问题 - 当你尝试使用 LocalDateTime::parse 解析一个没有时间部分的 yyyy-MM-dd 模式字符串时,它会抛出一个带有消息 无法从TemporalAccessor获取LocalDateTimeDateTimeFormatException。显然,至少要有一个时间部分才能成功。

幸运的是,我们可以使用 DateTimeFormatterBuilder 来构建一个模式,指示格式化程序在解析的文本中缺少信息时使用一些默认值。以下是代码示例:

DateTimeFormatter formatter = new DateTimeFormatterBuilder()
    .appendPattern("yyyy-MM-dd[ HH:mm[:ss[.SSS]]]")
    .parseDefaulting(ChronoField.HOUR_OF_DAY, 0)
    .parseDefaulting(ChronoField.MINUTE_OF_HOUR, 0)
    .parseDefaulting(ChronoField.SECOND_OF_MINUTE, 0)
    .toFormatter();
LocalDateTime dateTime = LocalDateTime.parse(input, formatter);

测试:

String[] inputs = {
    "2020-10-22", // 正常
    "2020-10-22 14:55", // 正常
    "2020-10-22T14:55", // 失败:不正确的格式
    "2020-10-22 14:55:23",
    "2020-10-22 14:55:23.9", // 失败:不正确的秒分数
    "2020-10-22 14:55:23.91", // 失败:不正确的秒分数
    "2020-10-22 14:55:23.917", // 正常
    "2020-10-22 14:55:23.9174", // 失败:不正确的秒分数
    "2020-10-22 14:55:23.917428511" // 失败:不正确的秒分数 
};

那么毫秒级的 epoch 时间呢?

嗯,这不能直接由 DateTimeFormatter 解析。而且更重要的是:毫秒级的 epoch 时间有一个隐式的时区:UTC。其他的模式缺少时区信息。因此,毫秒级的 epoch 时间是一种基本不同的信息。你可以假设输入缺少时区的情况下使用一个时区。

然而,如果你仍然想解析这个时间戳,你可以尝试将它作为 long 使用 Long::parseLong 解析,如果失败,再尝试使用格式化程序解析。另外,你可以使用正则表达式(例如 -?\d+ 或类似的)来尝试匹配这个时间戳,如果匹配成功,就解析为时间戳,如果失败,再尝试使用上述提到的格式化程序解析。

英文:

What you need is a formatter with optional parts. A pattern can contain square brackets to denote an optional part, for example HH:mm[:ss]. The formatter then is required to parse HH:mm, and tries to parse the following text as :ss, or skips it if that fails. yyyy-MM-dd[ HH:mm[:ss[.SSS]]] would then be the pattern.

There is only one issue here – when you try to parse a string with the pattern yyyy-MM-dd (so without time part) using LocalDateTime::parse, it will throw a DateTimeFormatException with the message Unable to obtain LocalDateTime from TemporalAccessor. Apparently, at least one time part must be available to succeed.

Luckily, we can use a DateTimeFormatterBuilder to build a pattern, instructing the formatter to use some defaults if information is missing from the parsed text. Here it is:

DateTimeFormatter formatter = new DateTimeFormatterBuilder()
    .appendPattern("yyyy-MM-dd[ HH:mm[:ss[.SSS]]]")
    .parseDefaulting(ChronoField.HOUR_OF_DAY, 0)
    .parseDefaulting(ChronoField.MINUTE_OF_HOUR, 0)
    .parseDefaulting(ChronoField.SECOND_OF_MINUTE, 0)
    .toFormatter();
LocalDateTime dateTime = LocalDateTime.parse(input, formatter);

Tests:

String[] inputs = {
    "2020-10-22", // OK
    "2020-10-22 14:55", // OK
    "2020-10-22T14:55", // Fails: incorrect format
    "2020-10-22 14:55:23",
    "2020-10-22 14:55:23.9", // Fails: incorrect fraction of second
    "2020-10-22 14:55:23.91", // Fails: incorrect fraction of second
    "2020-10-22 14:55:23.917", // OK
    "2020-10-22 14:55:23.9174", // Fails: incorrect fraction of second
    "2020-10-22 14:55:23.917428511" // Fails: incorrect fraction of second 
};

And what about epoch in milli?

Well, this cannot be parsed directly by the DateTimeFormatter. But what's more: an epoch in milli has an implicit timezone: UTC. The other patterns lack a timezone. So an epoch is a fundamentally different piece of information. One thing you could do is assume a timezone for the inputs missing one.

However, if you nevertheless want to parse the instant, you could try to parse it as a long using Long::parseLong, and if it fails, then try to parse with the formatter. Alternatively, you could use a regular expression (like -?\d+ or something) to try to match the instant, and if it does, then parse as instant, and if it fails, then try to parse with the abovementioned formatter.

答案2

得分: 1

蛮力方法:

  • 依次尝试4种格式,逐一解析传入的字符串
  • 如果解析抛出异常,尝试下一个格式
  • 如果解析通过,那么就匹配了该格式

当然,如果我们讨论更大的表格,这是相当低效的。可能的优化方法:

  • 显然,不同的模式有微小的差异,所以你可以首先使用indexOf()检查。比如:如果要解析的值不包含':'字符,那么它只能是第一个模式。
  • 你可以手动查看数据,以确定实际使用的模式分布。然后,根据模式在数据中被使用的可能性来调整尝试模式的顺序

或者:你可以定义自己的正则表达式。唯一让它稍微不太美观的是,你的输入使用的是月份名称,而不是月份数字。但我认为编写一个覆盖所有情况的单个正则表达式应该不会太难。

英文:

The brute force approach:

  • simply try your 4 formats, one after the other to parse the incoming string
  • if parsing throws an exception, try the next one
  • if parsing passes, well, that format just matched

Of course, if we are talking about larger tables, that is quite inefficient. Possible optimisations:

  • obviously, the different patterns have subtle differences, so you could use indexOf() checks first. Like: if the value to be parsed contains no ':' char, then it can only be the first pattern.
  • you can look at your data manually to figure the actual distribution of patterns that are used. then you adapt the order of patterns to try to the likelihood of the pattern being used in your data

Alternatively: you could define your own regex. The only thing that makes it slightly ugly is the fact that your input uses month names, not month number. But I think it shouldn't be too hard to write up a single regex that covers all your cases.

huangapple
  • 本文由 发表于 2020年10月22日 14:35:37
  • 转载请务必保留本文链接:https://go.coder-hub.com/64476573.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定