Zero-Allocation-Hashing murmur3: hashChars() and hashBytes() produce different output
Question
I am not sure if I am using the murmur3 function (from OpenHFT's zero-allocation-hashing) correctly, but hashChars() and hashBytes() give different results:
// Using zero-allocation-hashing 0.16
String input = "abc123";
System.out.println(LongHashFunction.murmur_3().hashChars(input));
System.out.println(LongHashFunction.murmur_3().hashBytes(input.getBytes(StandardCharsets.UTF_8)));
Output:
-4878457159164508227
-7432123028918728600
The latter matches the output of the Guava library.
Which function should be used for String inputs? Shouldn't both functions produce the same result?
Update:
How can I get the same output as:
Hashing.murmur3_128().newHasher().putString(input, Charsets.UTF_8).hash().asLong();
Hashing.murmur3_128().newHasher().putString(input, Charsets.UTF_8).hash().toString()
using the zero-allocation-hashing library, which seems to be faster than Guava?
Answer 1
Score: 2
char and byte have different sizes in Java:
- char is 16 bits wide and holds a UTF-16 code unit.
- byte lives up to its name: it is 8 bits wide.
This difference matters as soon as we look at an actual character. Take a simple one like 'A': in Unicode it is represented by the hexadecimal number 0x0041, so in our example:
String input = "A";
byte[] bytes = input.getBytes(StandardCharsets.UTF_8);
System.out.println(LongHashFunction.murmur_3().hashChars(input));
System.out.println(LongHashFunction.murmur_3().hashBytes(bytes));
hashChars works on two bytes per char (0x00 and 0x41 for 'A'), while hashBytes works on the single UTF-8 byte (0x41). That is why you get different results.
Which function to use depends on your requirements: if you are hashing strings and want to ignore the underlying encoding, use hashChars(). If you care about the specific byte representation, for example to match another library's output, use hashBytes().
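To make the size difference concrete, here is a minimal JDK-only sketch that prints the bytes each encoding produces for the same one-character string. The class name `EncodingSizes` and the helper `hexBytes` are made up for illustration, and the sketch does not call the hashing library at all:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    // Hypothetical helper: renders the encoded bytes as space-separated hex pairs.
    static String hexBytes(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hexBytes("A", StandardCharsets.UTF_8));    // 41
        System.out.println(hexBytes("A", StandardCharsets.UTF_16LE)); // 41 00
        System.out.println(hexBytes("A", StandardCharsets.UTF_16BE)); // 00 41
    }
}
```

The UTF-8 form is one byte and both UTF-16 forms are two, so a hash over the chars and a hash over the UTF-8 bytes are fed different input.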
Answer 2
Score: 1
Your assumption regarding UTF-8 is not correct; it holds for StandardCharsets.UTF_16LE.
String input = "abc123";
System.out.println(LongHashFunction.murmur_3().hashChars(input));
System.out.println(LongHashFunction.murmur_3().hashBytes(
input.getBytes(StandardCharsets.UTF_16LE)
));
gives:
-4878457159164508227
-4878457159164508227
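If the match with UTF_16LE comes from hashChars() reading chars in the platform's native byte order, the matching charset would be UTF_16BE on a big-endian machine. That is an assumption on my part, not something shown above, so please check it against the library's Javadoc. A JDK-only sketch of picking the charset that way (the class `NativeUtf16` and helper `utf16For` are illustrative names, not library API):

```java
import java.nio.ByteOrder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class NativeUtf16 {
    // Illustrative helper: picks the BOM-less UTF-16 charset for a given byte order.
    static Charset utf16For(ByteOrder order) {
        return order == ByteOrder.LITTLE_ENDIAN
                ? StandardCharsets.UTF_16LE
                : StandardCharsets.UTF_16BE;
    }

    public static void main(String[] args) {
        Charset cs = utf16For(ByteOrder.nativeOrder());
        byte[] bytes = "abc123".getBytes(cs);
        // Two bytes per char and no byte-order mark, so 6 chars give 12 bytes.
        System.out.println(cs + " -> " + bytes.length + " bytes");
    }
}
```

The resulting array could then be passed to hashBytes() in place of the hard-coded UTF_16LE version.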
Additional Answer
For the desired:
Hashing.murmur3_128().newHasher().putString(input, Charsets.UTF_8).hash().asLong();
this:
LongHashFunction.murmur_3().hashBytes(input.getBytes(StandardCharsets.UTF_8));
seems to work (please test more!)
The (hex) string conversion is more of a problem, since the Guava hash really is 128 bits (16 bytes, 2 longs), whereas LongHashFunction gives us only 64 bits.
Thanks to:
- https://stackoverflow.com/q/4485128/592355
- https://github.com/google/guava/blob/master/guava/src/com/google/common/hash/HashCode.java#L412
With your help (sorry, this is my first time encountering this lib), I could finally reproduce it:
System.out.println("Actual: " +
    toHexString(
        LongTupleHashFunction.murmur_3().hashBytes(
            input.getBytes(StandardCharsets.UTF_8)
        )
    )
);
where:
// Renders each long as 16 hex digits, least-significant byte first.
private static String toHexString(long[] hashLongs) {
    StringBuilder sb = new StringBuilder(hashLongs.length * Long.BYTES * 2);
    for (long lng : hashLongs) {
        for (int i = 0; i < Long.BYTES; i++) {
            // Shift right by i whole bytes (Byte.SIZE bits each) and keep the low byte.
            byte b = (byte) (lng >> (i * Byte.SIZE));
            sb.append(HEX_DIGITS[(b >> 4) & 0xf]).append(HEX_DIGITS[b & 0xf]);
        }
    }
    return sb.toString();
}
private static final char[] HEX_DIGITS = "0123456789abcdef".toCharArray();
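As a cross-check on the byte order, the same hex string can be built with a little-endian `ByteBuffer`. This is a JDK-only sketch (the class `LittleEndianHex` and method `toHexLittleEndian` are illustrative names, not part of either library), and it assumes, as the helper above does, that the target format lists each long's bytes least-significant first:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class LittleEndianHex {
    // Illustrative helper: writes each long least-significant byte first, then hex-encodes.
    static String toHexLittleEndian(long[] longs) {
        ByteBuffer buf = ByteBuffer.allocate(longs.length * Long.BYTES)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        for (long v : longs) {
            buf.putLong(v);
        }
        StringBuilder sb = new StringBuilder(buf.capacity() * 2);
        for (byte b : buf.array()) {
            sb.append(Character.forDigit((b >> 4) & 0xf, 16))
              .append(Character.forDigit(b & 0xf, 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The lowest byte (0xef) comes out first.
        System.out.println(toHexLittleEndian(new long[] {0x0123456789abcdefL})); // efcdab8967452301
    }
}
```

Passing the long[] returned by LongTupleHashFunction through both helpers and comparing the strings is a quick way to confirm they agree before comparing against Guava.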