June 12, 2023, 17:33:48

Zero-Allocation-Hashing murmur3: hashChars() and hashBytes() produce different output

Question

I am not sure whether I am using the murmur3 function from OpenHFT's zero-allocation-hashing correctly, but hashChars() and hashBytes() return different results for the same string:

// Using zero-allocation-hashing 0.16
String input = "abc123";
System.out.println(LongHashFunction.murmur_3().hashChars(input));
System.out.println(LongHashFunction.murmur_3().hashBytes(input.getBytes(StandardCharsets.UTF_8)));

Output:

-4878457159164508227
-7432123028918728600

The latter value matches the output of the Guava library.

Which function should be used for String inputs?

Shouldn't both functions produce the same result?

Update:

How can I get the same output as:

Hashing.murmur3_128().newHasher().putString(input, Charsets.UTF_8).hash().asLong();
Hashing.murmur3_128().newHasher().putString(input, Charsets.UTF_8).hash().toString();

using the zero-allocation-hashing library, which seems to be faster than Guava?

Answer 1

Score: 2

In Java, char and byte have different sizes:

char is 16 bits wide and holds one UTF-16 code unit.
byte, as its name suggests, is 8 bits wide.

This difference matters as soon as the characters are turned into hash input. Take a simple character such as 'A': its Unicode code point is U+0041, so as a char it is the 16-bit value 0x0041, while its UTF-8 encoding is the single byte 0x41. In our example:

String input = "A";
byte[] bytes = input.getBytes(StandardCharsets.UTF_8);
System.out.println(LongHashFunction.murmur_3().hashChars(input));
System.out.println(LongHashFunction.murmur_3().hashBytes(bytes));

hashChars() works on the two bytes of the char (0x00 and 0x41), while hashBytes() works on the single UTF-8 byte (0x41). That is why you get different results.

Which function to use depends on your requirements: if you are hashing strings and do not need the result to match any particular byte encoding, use hashChars(). If you care about a specific byte representation (for example, to match another library that hashes UTF-8 bytes), use hashBytes().
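The size difference can be made visible with the JDK alone, without either hashing library. The following is only a sketch: the class and method names are made up for illustration, and UTF_16LE is used merely as an example of a two-bytes-per-char layout, not as a statement about how hashChars() is implemented.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of zero-allocation-hashing or Guava.
class CharVsByteView {
    // The bytes a UTF-8 based hashBytes() call would receive.
    static byte[] utf8View(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // A two-bytes-per-char view of the same string (low byte first).
    static byte[] utf16LeView(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(utf8View("A")));    // [65]
        System.out.println(Arrays.toString(utf16LeView("A"))); // [65, 0]
        System.out.println(utf8View("abc123").length);         // 6
        System.out.println(utf16LeView("abc123").length);      // 12
    }
}
```

Since the two views differ in both length and content, a byte-oriented hash cannot be expected to return the same value for them.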

Answer 2

Score: 1

Your assumption regarding UTF-8 is not correct; it holds for StandardCharsets.UTF_16LE:

String input = "abc123";
System.out.println(LongHashFunction.murmur_3().hashChars(input));
System.out.println(LongHashFunction.murmur_3().hashBytes(
  input.getBytes(StandardCharsets.UTF_16LE)
));

gives:

-4878457159164508227
-4878457159164508227

Additional Answer

For the desired:

Hashing.murmur3_128().newHasher().putString(input, Charsets.UTF_8).hash().asLong();

this:

LongHashFunction.murmur_3().hashBytes(input.getBytes(StandardCharsets.UTF_8));

seems to work (please test more!)

The (hex) string conversion is more of a problem, since the Guava hash really is 128 bits long (16 bytes, or 2 longs), whereas LongHashFunction gives us only 64 bits.
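To make the relation between the 16 bytes and the two longs concrete, here is a small JDK-only sketch. It assumes the little-endian convention that Guava documents for HashCode.asLong() (the first eight bytes, read low byte first). The class name and the sample bytes are invented for illustration and are not a real hash value.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper class for illustration only.
class HashHalves {
    // Reads the 8 bytes starting at offset as one little-endian long.
    static long littleEndianLong(byte[] hash, int offset) {
        return ByteBuffer.wrap(hash, offset, Long.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getLong();
    }

    public static void main(String[] args) {
        // Sample 16-byte "hash": 0x01, 0x02, ..., 0x10 (made-up data).
        byte[] hash = new byte[16];
        for (int i = 0; i < hash.length; i++) {
            hash[i] = (byte) (i + 1);
        }
        // First half, the part a 64-bit view would cover.
        System.out.println(Long.toHexString(littleEndianLong(hash, 0))); // 807060504030201
        // Second half, which a single long cannot carry.
        System.out.println(Long.toHexString(littleEndianLong(hash, 8))); // 100f0e0d0c0b0a09
    }
}
```

A single long can only carry one of these halves, which is why a 64-bit result cannot reproduce the full 32-digit hex string.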

Half of the digits I can reproduce with:
...

Thanks to:

With your help (sorry, this is my first time encountering this library), I could finally reproduce the full string:

System.out.println("Actual:   " +
    toHexString(
        LongTupleHashFunction.murmur_3().hashBytes(
            input.getBytes(StandardCharsets.UTF_8)
        )
    )
);

where:

private static final char[] HEX_DIGITS = "0123456789abcdef".toCharArray();

// Renders each long low byte first, so the digits come out in little-endian byte order.
private static String toHexString(long[] hashLongs) {
	StringBuilder sb = new StringBuilder(hashLongs.length * Long.BYTES * 2);
	for (long lng : hashLongs) {
		for (int i = 0; i < Long.BYTES; i++) {
			// Shift by i whole bytes (Byte.SIZE bits each) to select byte i.
			byte b = (byte) (lng >> (i * Byte.SIZE));
			sb.append(HEX_DIGITS[(b >> 4) & 0xf]).append(HEX_DIGITS[b & 0xf]);
		}
	}
	return sb.toString();
}
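The byte order produced by the helper above can be checked on a hand-picked input, without either hashing library. This sketch carries its own copy of the helper so that it compiles on its own. The class name and the input values are made up for the check and are not real hash output.

```java
// Hypothetical stand-alone check; the input longs are arbitrary test values.
class HexCheck {
    private static final char[] HEX_DIGITS = "0123456789abcdef".toCharArray();

    // Same idea as the helper above: two hex digits per byte, low byte first.
    static String toHexString(long[] hashLongs) {
        StringBuilder sb = new StringBuilder(hashLongs.length * Long.BYTES * 2);
        for (long lng : hashLongs) {
            for (int i = 0; i < Long.BYTES; i++) {
                byte b = (byte) (lng >> (i * Byte.SIZE));
                sb.append(HEX_DIGITS[(b >> 4) & 0xf]).append(HEX_DIGITS[b & 0xf]);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        long[] sample = {0x0123456789abcdefL, 1L};
        // The lowest byte (0xef) is printed first, the highest (0x01) last.
        System.out.println(toHexString(sample)); // efcdab89674523010100000000000000
    }
}
```

Two longs always yield 32 hex digits, the same length as the string form of a 128-bit hash.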
