SimpleDateFormat succeeds or throws depending on year number

Question

I'm working with legacy code that tries to parse dates that include an optional time component, usually padded with zeroes, using a format string ddMMyy that doesn't really match the input. In the spirit of true legacy code, nobody ever bothered to clean it up because it accidentally does what it's supposed to do. Except, in 2023, it no longer does.

Here's a (drastically simplified) version of the code:

import java.text.ParseException;
import java.text.SimpleDateFormat;

public class WeirdDateFormat {
    public static void main(String...args) throws ParseException {
        var df = new SimpleDateFormat("ddMMyy");
        df.setLenient(false);

        System.out.println(df.parse("09012023000000000"));
        System.out.println(df.parse("09012022000000000"));
    }
}

It prints:

Mon Jan 09 00:00:00 CET 70403584
Exception in thread "main" java.text.ParseException: Unparseable date: "09012022000000000"
	at java.base/java.text.DateFormat.parse(DateFormat.java:399)
	at WeirdDateFormat.main(WeirdDateFormat.java:10)

In other words, the first date (9 January 2023) parses fine, but gives a date with the year 70403584.
The second date (9 January 2022) fails to parse and throws an exception.

(If we set lenient to true, the second date doesn't throw but ends up in the year 239492593.)

WTF is happening here? Why does it sometimes fail to parse, and sometimes not? And where do these bizarre year numbers come from?

(Found the issue in production running Java 8, but the behaviour on Java 17 is the same.)

EDIT Yes, I know, I know, I must fix the legacy code. No need to keep telling me, you're preaching to the choir! Unfortunately, my drastically simplified version of the code doesn't show you all the other legacy defects of the code base that also have to be addressed. I just want to understand what's going on here so I'll be better informed when I actually do refactor this code.

Answer 1

Score: 11

OBVIOUS ANSWER: Get rid of this obsolete crud that never worked properly and do it right.

Let's first explain this result

I'm not sure why, but you wanted to know why this happens. I dived into the source of SimpleDateFormat for you.

Given that the yy part is the last, it takes all remaining digits. Thus, the "2022000000000" part of the string is parsed into a Long of value 2022000000000. This is then immediately converted to an int, and that's quite problematic; 2022000000000 overflows and turns into int value -929596416. A standard java.text.CalendarBuilder instance is then told to set its YEAR field to that value (-929596416). Which is fine.

When parsing is done, that builder is asked to produce a GregorianCalendar value. This doesn't work: the GregorianCalendar accepts -929596416 as the YEAR value just fine, but SimpleDateFormat then asks this GregCal instance to calculate the time in millis since the epoch, and that fails with an IllegalArgumentException. This exception is caught by the SimpleDateFormat code and results in the Unparseable date exception that you are getting.
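One way to observe this swallowed failure from the outside is the two-argument parse(String, ParsePosition) overload, which reports failure by returning null rather than throwing. This is only a minimal sketch; the class and method names (SwallowedFailureDemo, tryParse) are made up for illustration and are not part of the original code:

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;

public class SwallowedFailureDemo {
    // Hypothetical helper: same pattern and strictness as in the question.
    // parse(String, ParsePosition) returns null on failure instead of throwing.
    static Date tryParse(String input) {
        SimpleDateFormat df = new SimpleDateFormat("ddMMyy");
        df.setLenient(false);
        return df.parse(input, new ParsePosition(0));
    }

    public static void main(String[] args) {
        // The year digits overflow to a negative int, which the calendar rejects.
        System.out.println(tryParse("09012022000000000")); // prints null

        // Same overflow, but the result happens to be a positive, in-range year.
        System.out.println(tryParse("09012023000000000") != null); // prints true
    }
}
```

The one-argument parse(String) used in the question turns that null into the ParseException.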

With 2023, you get the same effect: That is turned into an int without checking if it overflows; that overflows just the same, and results in int value 70403584. GregorianCalendar DOES accept this year. This then results in what you saw: Year 70403584 - which is explained as follows:

long y = 2023000000000L;
int i = (int) y;
System.out.println(i); // prints 70403584
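For comparison, here is a small sketch that runs both year strings through the same narrowing cast and shows how Math.toIntExact would have flagged the overflow instead of wrapping around. The class and method names (NarrowingDemo, narrow, fitsInInt) are illustrative only; SimpleDateFormat itself does not use them:

```java
public class NarrowingDemo {
    // Plain narrowing cast: keeps the low 32 bits and silently drops the rest.
    static int narrow(long value) {
        return (int) value;
    }

    // Math.toIntExact throws ArithmeticException instead of wrapping around.
    static boolean fitsInInt(long value) {
        try {
            Math.toIntExact(value);
            return true;
        } catch (ArithmeticException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(narrow(2023000000000L));    // prints 70403584
        System.out.println(narrow(2022000000000L));    // prints -929596416
        System.out.println(fitsInInt(2023000000000L)); // prints false
        System.out.println(fitsInInt(2023L));          // prints true
    }
}
```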

A deeper dive then is: Why is 70403584 fine, and -929596416 isn't?

Mostly, 'because'. For the YEAR field (constant value 1), GregorianCalendar's getMinimum(field) and getMaximum(field) methods return 1 and 292278994 respectively. That means 70403584 is accepted and -929596416 is not, because you told it to be non-lenient. "Lenient" here (the old j.u.Calendar stuff) is mostly a silly concept: trying to define what is acceptable in non-lenient mode is virtually impossible, and various utterly ridiculous dates are nevertheless accepted even in non-lenient mode.

We can verify this:

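import java.util.Calendar;
import java.util.GregorianCalendar;

GregorianCalendar cal = new GregorianCalendar();
cal.setLenient(false);
cal.set(Calendar.YEAR, -5); // below the minimum of 1; set() itself does not validate
System.out.println(cal.getTime()); // the time is computed here, so this is where it throws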

gives you:

Exception in thread "main" java.lang.IllegalArgumentException: YEAR
	at java.base/java.util.GregorianCalendar.computeTime(GregorianCalendar.java:2609)
	at java.base/java.util.Calendar.updateTime(Calendar.java:3411)
	at java.base/java.util.Calendar.getTimeInMillis(Calendar.java:1805)
	at java.base/java.util.Calendar.getTime(Calendar.java:1776)
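The same check can be run against the two overflowed year values from the question. The sketch below is an illustration, not what SimpleDateFormat does verbatim: the class and helper names (YearRangeDemo, acceptsYear) are invented here, and it assumes a cleared, non-lenient GregorianCalendar set to 9 January is a fair stand-in for what the parser builds:

```java
import java.util.Calendar;
import java.util.GregorianCalendar;

public class YearRangeDemo {
    // Hypothetical helper: does a non-lenient GregorianCalendar accept this YEAR?
    static boolean acceptsYear(int year) {
        GregorianCalendar cal = new GregorianCalendar();
        cal.clear();
        cal.setLenient(false);
        cal.set(Calendar.YEAR, year);
        cal.set(Calendar.MONTH, Calendar.JANUARY);
        cal.set(Calendar.DAY_OF_MONTH, 9);
        try {
            cal.getTimeInMillis(); // forces the range validation of the set fields
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        GregorianCalendar cal = new GregorianCalendar();
        // The bounds that the range validation compares against.
        System.out.println(cal.getMinimum(Calendar.YEAR));
        System.out.println(cal.getMaximum(Calendar.YEAR));

        System.out.println(acceptsYear(70403584));   // prints true
        System.out.println(acceptsYear(-929596416)); // prints false
    }
}
```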

THE EXECUTIVE CONCLUSION: If you were expecting non-lenient mode to reject these inputs, I have some nasty news for you: non-lenient mode does not work, never did, and you should not be relying on it. Specifically here, overflows are not checked (you'd think that in non-lenient mode, any overflow of any value means the value is rejected, but, alas), and 2023000000000 so happens to overflow into a ridiculous but nevertheless acceptable (even in non-lenient mode) year, whereas 2022000000000 does not.

So how do you fix this?

You can't. SimpleDateFormat and GregorianCalendar are horrible APIs with broken implementations. The only fix is to ditch them. Use java.time. Make a new formatter using java.time.DateTimeFormatter, parse this value into a LocalDate, and go from there. You'll solve a whole host of timezone-related craziness on the fly, too! (Because java.util.Date is lying and doesn't represent dates. It represents instants, which is why .getYear() and company are deprecated: you can't ask an instant for a year without a timezone, and Date doesn't have one. Calendar is intricately interwoven with it all, so storing dates in one timezone and reading them in another causes wonkiness. LocalDate avoids all that.)
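As a rough sketch of that approach: the code below assumes the first eight characters of the input are always ddMMyyyy and that the remaining zero-padded digits are a time component that can be ignored. Neither assumption is confirmed by the question, and the names LegacyDateParser and parseDatePart are made up for illustration:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.ResolverStyle;

public class LegacyDateParser {
    // "uuuu" rather than "yyyy": with STRICT resolution, year-of-era needs an era.
    private static final DateTimeFormatter DDMMYYYY =
            DateTimeFormatter.ofPattern("ddMMuuuu").withResolverStyle(ResolverStyle.STRICT);

    // Assumption: characters 0..7 are ddMMyyyy; anything after that is ignored.
    static LocalDate parseDatePart(String input) {
        if (input.length() < 8) {
            throw new IllegalArgumentException("Expected at least 8 digits: " + input);
        }
        return LocalDate.parse(input.substring(0, 8), DDMMYYYY);
    }

    public static void main(String[] args) {
        System.out.println(parseDatePart("09012023000000000")); // prints 2023-01-09
        System.out.println(parseDatePart("09012022000000000")); // prints 2022-01-09
    }
}
```

With STRICT resolution, an impossible date such as 31 February is rejected with a DateTimeParseException rather than being silently adjusted.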

EDIT: As a fellow Dutchie, note that the most recent JDKs break the Europe/Amsterdam timezone (grumble grumble, the OpenJDK team doesn't understand what damage they are causing), which means any conversion between epoch-millis and base dates is extra problematic for software running in Dutch locales. For example, if you are storing birthdates and you dip through a conversion like this, everybody born before 1940 will break and their birthday will shift by a day. LocalDate avoids this by never storing anything as epoch-millis in the first place.

Answer 2

Score: 7

The reason is integer overflow. But seriously, fix the legacy code.

jshell> (int)9012023000000000L
$1 ==> 496985600

jshell> (int)9012022000000000L
$2 ==> -503014400

A negative year is out of range for the year component. A crazy large year is still considered valid.

Set a breakpoint on line 1543 of SimpleDateFormat.java. The exception itself is thrown in Line 2583 of GregorianCalendar.java:

for (int field = 0; field < FIELD_COUNT; field++) {
    int value = internalGet(field);
    if (isExternallySet(field)) {
        // Quick validation for any out of range values
        if (value < getMinimum(field) || value > getMaximum(field)) {
            throw new IllegalArgumentException(getFieldName(field));
        }
    }
    originalFields[field] = value;
}

The actual "year" value that ends up in this method is -929596416 ((int)2022000000000L). Year 2023 does not throw because the long value coerced into an integer ends up being 70403584, which is allowed. The GregorianCalendar rejects values smaller than 1 (< 1) and greater than 292278994 (> 292278994).

huangapple
  • Posted on 9 January 2023, 19:32:07
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75056668.html