2023年6月29日 21:30:51go评论125阅读模式

英文:

Python equivalent for Java hashCode() function

问题

我有一个基于Java hashCode()函数应用于用户 id（一个字符串）结果的A/B测试分割。我想在我的数据框中模拟该分割以分析结果。
是否有Python等价的函数？
或者关于hashCode()内部哈希算法的特定文档，以便我自己编写该函数？

谢谢

我搜索了文档，但找不到具体细节。

英文:

I have an A/B test split based on the result of Java hashCode() function applied to user's id (a string). I want to emulate that split in my dataframe to analyse the results.
Is there a python equivalent for that function?
Or maybe a documentation on the specific hashing algorithm inside hashCode() so I can produce that function myself?

Thanks

I searched for the documentation but couldn't find the specific details

答案1

得分: 4

根据Java的String源代码，哈希实现如下：

public int hashCode() {
    if (cachedHashCode != 0)
        return cachedHashCode;
    
    // 使用本地变量计算哈希码以支持可重入性。
    int hashCode = 0;
    int limit = count + offset;
    for (int i = offset; i < limit; i++)
        hashCode = hashCode * 31 + value[i];
    return cachedHashCode = hashCode;
}

你可以将其转换为Python（不带缓存）：

class JavaHashStr(str):
    def __hash__(self):
        hashCode = 0
        for char in self:
            hashCode = hashCode * 31 + ord(char)
        return hashCode
j = JavaHashStr("abcd")
print(hash(j))  # 输出与Java相同的结果
j = JavaHashStr("abcdef")
print(hash(j))  # 输出与Java相同的结果

请注意，Python的整数不会像Java那样溢出，因此对于许多情况来说，这是不正确的。你需要添加一个处理溢出的模拟（更新：感谢@PresidentJamesK.Polk改进的版本，关于此主题的Stack Overflow帖子）：

class JavaHashStr(str):
    def __hash__(self):
        hashCode = 0
        for char in self:
            hashCode = (hashCode * 31 + ord(char)) & (2**32 - 1)  # 无符号
        if hashCode & 2**31:
            hashCode -= 2**32  # 使其带符号
        return hashCode
j = JavaHashStr("abc")
print(hash(j))
j = JavaHashStr("abcdef")
print(hash(j))  # 输出与Java相同的结果

即使是溢出的哈希值，现在也会表现得相同。

这可能对于后面的Unicode面板中的字符（如表情符号等）仍然不正确。但对于大多数常见的标点符号和基于拉丁字符的字符，这应该有效。

英文:

According to java String source code, the hash implementation is:

public int hashCode()
    {
      if (cachedHashCode != 0)
        return cachedHashCode;
    
      // Compute the hash code using a local variable to be reentrant.
      int hashCode = 0;
      int limit = count + offset;
      for (int i = offset; i &lt; limit; i++)
        hashCode = hashCode * 31 + value[i];
      return cachedHashCode = hashCode;
    }

You can transfer this to Python (w/o caching):

class JavaHashStr(str):
    def __hash__(self):
        hashCode = 0
        for char in self:
            hashCode = hashCode * 31 + ord(char)
        return hashCode
&gt;&gt;&gt; j = JavaHashStr(&quot;abcd&quot;)
&gt;&gt;&gt; hash(j)
2987074  # same as java
&gt;&gt;&gt; j = JavaHashStr(&quot;abcdef&quot;)
&gt;&gt;&gt; hash(j)
2870581347  # java: -1424385949

Note, Python ints do not overflow like java, so this is wrong for many cases. You would have to add a simulation for the overflow (Update: thx to @PresidentJamesK.Polk for the improved version, SO thread on the topic):

class JavaHashStr(str):
    def __hash__(self):
        hashCode = 0
        for char in self:
            hashCode = (hashCode * 31 + ord(char)) &amp; (2**32 - 1)  # unsigned
        if hashCode &amp; 2**31:
            hashCode -= 2**32  # make it signed
        return hashCode

Now, even overflowing hashes behave the same:

&gt;&gt;&gt; j = JavaHashStr(&quot;abc&quot;)
&gt;&gt;&gt; hash(j)
96354
&gt;&gt;&gt; j = JavaHashStr(&quot;abcdef&quot;)
&gt;&gt;&gt; hash(j)
-1424385949  # Java hash for &quot;abcdef&quot;

This might still be off for characters from the latter unicode panes like emojis or the like. But for the most common punctuation and latin-based characters, this should work.

答案2

得分: 1

这将为字符串"The Quick Brown Fox"生成与user2390182的答案和我尝试过的在线工具相同的结果。我认为这个方法可能更容易理解，但可能会更慢，不确定。如果性能很关键，你可能想测试一下。

def java_hasher(text):
    size = 32
    sign = 1 << size-1
    text_hashed = sum(ord(t)*31**i for i, t in enumerate(reversed(text)))
    return (text_hashed & sign-1) - (text_hashed & sign)
    
print(java_hasher("The Quick Brown Fox"))

这将给你：-732416445

英文:

This produces the same results for the string "The Quick Brown Fox" as the answer by @user2390182 and an online tool I tried. I think it is a little easier to follow but it might be slower, not sure. You might want to test it for performance if that was critical.

def java_hasher(text):
    size = 32
    sign = 1 &lt;&lt; size-1
    text_hashed = sum(ord(t)*31**i for i, t in enumerate(reversed(text)))
    return (text_hashed &amp; sign-1) - (text_hashed &amp; sign)
    
print(java_hasher(&quot;The Quick Brown Fox&quot;))

That should give you: -732416445

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Python中对应Java的hashCode()函数的等效方法是使用hash()函数。

问题

答案1

答案2

为什么在Java中进行类型转换会给我这些奇怪的结果？

使用Pybind11并通过基础指针访问C++对象。

如何将双精度浮点数列表减少为字符串？

运行Java工具从批处理文件或Jupyter Notebook中

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。