How is a 32-bit hashCode stored in a 25-bit mark word in Java without data loss?

Question

I've been looking into the internals of Java objects and am puzzled about how hashCode values are managed. As I understand it, the hashCode method in Java returns a 32-bit integer. However, this hashCode is stored in the object's header, specifically in the 25-bit mark word.

This raises a couple of questions for me:

How is it possible to store a 32-bit hashCode in a 25-bit mark word without losing some bits of data?
Even if data loss occurs due to this bit-length discrepancy, why is it that when I call hashCode() again, it still retrieves the original hashCode value without any apparent data loss?

Any insights into how Java manages to do this would be greatly appreciated.

Answer 1

Score: 7

First of all and most importantly: all of this is very much an implementation detail and not defined by the specification. I'm specifically discussing recent OpenJDK builds (I'm testing JDK 17, but this behaviour seems to have existed for a while), and nothing says that other JDKs or even future versions of OpenJDK couldn't change any of this.

Next, it's important to distinguish an object's identity hash code from its hash code.

  • The identity hash code is a value decided by the JVM that's constant over the lifetime of an object and cannot be influenced by Java code (i.e. overriding hashCode() has no effect on it). This value can be obtained by calling System.identityHashCode(obj).

  • The hash code, on the other hand, is what Java programmers mostly interact with: the return value of hashCode() when called on an object. While this is an important value whenever anything is stored in a HashMap or HashSet (or similar structures), the JVM itself does not particularly care about it. And even if it did, it couldn't store it in the object header, as an overridden hashCode() could conceivably return a different value every time it's called.

These two definitions interact in one important way: The hashCode() method of java.lang.Object (and thus also of any other object where that method isn't overridden) will return the identity hash code. So one could say that the identity hash code is the default value for the hash code, if nothing else is defined.
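The distinction can be seen in a small sketch. The class names IdentityVsHashCode and Fixed are made up for illustration; only hashCode() and System.identityHashCode() come from the JDK.

```java
public class IdentityVsHashCode {
    // Hypothetical class for illustration: hashCode() is overridden to a constant.
    static final class Fixed {
        @Override
        public int hashCode() {
            return 42;
        }
    }

    public static void main(String[] args) {
        Object plain = new Object();
        Fixed fixed = new Fixed();

        // java.lang.Object does not override anything, so hashCode() is the identity hash code.
        System.out.println(plain.hashCode() == System.identityHashCode(plain)); // true

        // The override changes what hashCode() returns...
        System.out.println(fixed.hashCode()); // 42

        // ...but the identity hash code is still decided by the JVM and stays stable.
        System.out.println(System.identityHashCode(fixed) == System.identityHashCode(fixed)); // true
    }
}
```

Only the value returned by System.identityHashCode() is a candidate for being stored in the object header; the 42 above lives purely in Java code.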

After looking at the relevant code, it does indeed seem like there are at most 25 bits of space to store the identity hash code on 32-bit platforms.

But hashCode is defined to be 32 bits wide, so how can that be?

Simple: the identity hash code on those platforms never uses more than 25 bits, so all the bits that are not stored are known to be zero. Nothing is lost, because the bits that don't fit never carried any information in the first place.
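To illustrate why nothing is lost, here is a minimal sketch of packing a 25-bit value into a 32-bit word and reading it back. It is not HotSpot's code: the class MarkWordSketch and its methods are hypothetical, and the layout (7 low bits for lock and age state, 25 high bits for the hash) is an assumption based on the layout comment in HotSpot's mark word header for 32-bit builds.

```java
public class MarkWordSketch {
    // Assumed 32-bit layout, low to high: lock:2, biased_lock:1, age:4, hash:25.
    static final int HASH_BITS = 25;
    static final int HASH_SHIFT = 7; // 2 + 1 + 4 bits sit below the hash field
    static final int HASH_MASK = (1 << HASH_BITS) - 1;

    // Store a hash in the upper 25 bits, leaving the low 7 bits untouched.
    static int storeHash(int markWord, int hash) {
        if ((hash & ~HASH_MASK) != 0) {
            throw new IllegalArgumentException("hash does not fit in 25 bits");
        }
        return (markWord & ~(HASH_MASK << HASH_SHIFT)) | (hash << HASH_SHIFT);
    }

    // Read the hash back; the 7 bits that were never stored come back as zero.
    static int loadHash(int markWord) {
        return (markWord >>> HASH_SHIFT) & HASH_MASK;
    }

    public static void main(String[] args) {
        int hash = 0x01ABCDEF; // 25 significant bits, top 7 bits are zero
        int markWord = storeHash(0b0000101, hash);

        System.out.println(loadHash(markWord) == hash);      // true
        System.out.println((markWord & 0x7F) == 0b0000101);  // true: low bits preserved
    }
}
```

The round trip is exact only because the generator is assumed never to produce a value wider than 25 bits; storeHash rejects anything larger rather than truncating it.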

While I didn't find the specific place where that is decided (I also didn't look too closely), one can easily verify this with code like this:

public class MyClass {
    public static void main(String[] args) {
        // Track the fewest leading zero bits seen across all sampled identity hash codes.
        int minLeadingZeroes = 32;
        for (int i = 0; i < 1_000_000; i++) {
            int hash = System.identityHashCode(new Object());
            minLeadingZeroes = Math.min(minLeadingZeroes, Integer.numberOfLeadingZeros(hash));
        }

        System.out.println("Smallest number of leading zeroes in identity hash codes of 1000000 objects = " + minLeadingZeroes);
    }
}

When run on a 64-bit JVM, this prints

Smallest number of leading zeroes in identity hash codes of 1000000 objects = 1

whereas on a 32-bit JVM it prints

Smallest number of leading zeroes in identity hash codes of 1000000 objects = 7

Granted, this is not absolute proof, but it would be extremely unlikely that those values are a coincidence when testing a million objects.
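To put a rough number on "extremely unlikely", one can work out the odds under a simplifying assumption that is mine and not something the test establishes: that the hashes would otherwise be independent and uniformly distributed over 31 bits, as they appear to be on the 64-bit JVM. The class and method names below are made up for this sketch.

```java
public class CoincidenceOdds {
    // log10 of the probability that `samples` independent, uniformly random
    // values of `totalBits` bits all happen to fit in `usedBits` bits.
    // Each sample fits with probability 2^(usedBits - totalBits).
    static double log10Probability(int totalBits, int usedBits, long samples) {
        return samples * (usedBits - totalBits) * Math.log10(2.0);
    }

    public static void main(String[] args) {
        // Assumption: 31 possible bits, 25 observed, one million samples.
        double log10p = log10Probability(31, 25, 1_000_000);
        System.out.printf("log10(p) = %.0f%n", log10p);
    }
}
```

Under that assumption each sample has a 1 in 64 chance of fitting into 25 bits, so a million in a row comes out at a probability with an exponent of roughly minus 1.8 million, which is why a coincidence can be ruled out in practice.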

Also note that even in a 64-bit OpenJDK build the identity hash code uses at most 31 bits (as also noted in the comments of the implementation linked to above), despite there being plenty of spare room in the header in that case.

huangapple
  • Posted on May 10, 2023, 21:46:52
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76219209.html