Python equivalent for Java hashCode() function

Question

I have an A/B test split based on the result of Java's hashCode() function applied to a user's id (a string). I want to emulate that split in my dataframe to analyse the results.
Is there a Python equivalent of that function?
Or is there documentation on the specific hashing algorithm inside hashCode(), so that I can write the function myself?

Thanks

I searched the documentation but couldn't find the specific details.

Answer 1

Score: 4

According to the Java String source code, the hash implementation is:

public int hashCode() {
    if (cachedHashCode != 0)
        return cachedHashCode;

    // Compute the hash code using a local variable to be reentrant.
    int hashCode = 0;
    int limit = count + offset;
    for (int i = offset; i < limit; i++)
        hashCode = hashCode * 31 + value[i];
    return cachedHashCode = hashCode;
}

You can port this to Python (without caching):

class JavaHashStr(str):
    def __hash__(self):
        hashCode = 0
        for char in self:
            hashCode = hashCode * 31 + ord(char)
        return hashCode

>>> j = JavaHashStr("abcd")
>>> hash(j)
2987074  # same as Java
>>> j = JavaHashStr("abcdef")
>>> hash(j)
2870581347  # Java: -1424385949

Note that Python ints do not overflow like Java's 32-bit int, so this is wrong for any string whose hash falls outside the signed 32-bit range. You have to simulate the overflow (update: thanks to @PresidentJamesK.Polk for the improved version; see the SO thread on the topic):

class JavaHashStr(str):
    def __hash__(self):
        hashCode = 0
        for char in self:
            hashCode = (hashCode * 31 + ord(char)) & (2**32 - 1)  # unsigned
        if hashCode & 2**31:
            hashCode -= 2**32  # make it signed
        return hashCode

Now, even overflowing hashes behave the same:

>>> j = JavaHashStr("abc")
>>> hash(j)
96354
>>> j = JavaHashStr("abcdef")
>>> hash(j)
-1424385949  # Java hash for "abcdef"
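
To reproduce the A/B split from the question, the hash still has to be mapped to a group. The actual split rule is not given in the question, so the sketch below assumes a hypothetical rule, Math.floorMod(userId.hashCode(), 2); the names java_string_hash, java_rem, and assign_group are illustrative only. It also shows why the rule matters: Java's % operator keeps the sign of the dividend, while Python's % does not.

```python
def java_string_hash(text):
    """Signed 32-bit hash, same arithmetic as the JavaHashStr class above."""
    h = 0
    for ch in text:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF  # keep the low 32 bits
    return h - 0x100000000 if h & 0x80000000 else h  # reinterpret as signed


def java_rem(a, b):
    """Emulate Java's % operator for b > 0: the result takes the sign of a."""
    r = abs(a) % b
    return -r if a < 0 else r


def assign_group(user_id, n_groups=2):
    # ASSUMPTION: the Java side uses Math.floorMod(userId.hashCode(), nGroups).
    # Python's % with a positive modulus is already non-negative, like floorMod.
    return java_string_hash(user_id) % n_groups


if __name__ == "__main__":
    for uid in ("abc", "abcdef"):
        h = java_string_hash(uid)
        print(uid, h, assign_group(uid), java_rem(h, 2))
```

If the Java code uses plain hashCode() % 2 instead, negative hashes yield -1 there, so java_rem would be the closer match. With pandas, something like df["user_id"].map(assign_group) should apply the rule to a column, assuming the ids are stored as strings.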

This might still be off for characters from the later Unicode planes, such as emojis, because Java hashes UTF-16 code units while Python iterates over code points. For the most common punctuation and Latin-based characters, it should work.
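
One way to handle characters outside the Basic Multilingual Plane is to hash UTF-16 code units instead of code points, since a Java char is a 16-bit code unit. The following is a minimal sketch under that assumption; the function name java_string_hashcode is made up for illustration and is not part of the original answer.

```python
def java_string_hashcode(text):
    """Hash a str over its UTF-16 code units with signed 32-bit wraparound."""
    data = text.encode("utf-16-be")  # two bytes per code unit, no BOM
    h = 0
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]  # rebuild one 16-bit code unit
        h = (h * 31 + unit) & 0xFFFFFFFF     # keep the low 32 bits
    return h - 0x100000000 if h & 0x80000000 else h  # reinterpret as signed


if __name__ == "__main__":
    print(java_string_hashcode("abc"))
    print(java_string_hashcode("abcdef"))
    # U+1F600 is encoded as the surrogate pair D83D DE00 in UTF-16.
    print(java_string_hashcode("\U0001F600"))
```

For strings that contain only BMP characters, this gives the same values as the ord()-based loop; the two differ only when a character needs a surrogate pair.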

Answer 2

Score: 1

This produces the same result for the string "The Quick Brown Fox" as the answer by @user2390182 and an online tool I tried. I think it is a little easier to follow, but it might be slower; I have not measured it. If performance is critical, you may want to benchmark it.

def java_hasher(text):
    size = 32
    sign = 1 << (size - 1)  # mask for the sign bit of a 32-bit int
    # Polynomial in 31: the last character gets 31**0, the one before it 31**1, ...
    text_hashed = sum(ord(t) * 31**i for i, t in enumerate(reversed(text)))
    # Low 31 bits minus the sign bit gives the signed 32-bit value
    return (text_hashed & (sign - 1)) - (text_hashed & sign)

print(java_hasher("The Quick Brown Fox"))

That should give you: -732416445
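
To check that the two approaches agree and to get a rough idea of their relative speed, a sketch along these lines could be used. The function names and sample strings are illustrative, and the timings depend on the machine and on string length, so no numbers are claimed here.

```python
import timeit


def java_hash_loop(text):
    """Character loop with 32-bit masking (the first answer's approach)."""
    h = 0
    for ch in text:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h & 0x80000000 else h


def java_hash_sum(text):
    """Sum over powers of 31, truncated once at the end (this answer's approach)."""
    sign = 1 << 31
    total = sum(ord(t) * 31**i for i, t in enumerate(reversed(text)))
    return (total & (sign - 1)) - (total & sign)


def agree(samples):
    """Return True if both implementations give the same hash for every sample."""
    return all(java_hash_loop(s) == java_hash_sum(s) for s in samples)


if __name__ == "__main__":
    samples = ["", "abc", "abcdef", "The Quick Brown Fox", "user-12345"]
    print("agree:", agree(samples))
    for func in (java_hash_loop, java_hash_sum):
        seconds = timeit.timeit(lambda: [func(s) for s in samples], number=10_000)
        print(func.__name__, seconds)
```

The sum-based version builds one large integer before truncating, so its cost is likely to grow faster with string length than the masked loop; measuring on realistic user ids is the safest way to decide.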

huangapple
  • Posted on June 29, 2023 at 21:30:51
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76581535.html