Python equivalent for Java hashCode() function

Question

I have an A/B test split based on the result of the Java hashCode() function applied to the user's ID (a string). I want to emulate that split in my dataframe to analyse the results.
Is there a Python equivalent for that function?
Or is there documentation on the specific hashing algorithm inside hashCode(), so I can write that function myself?

Thanks

I searched the documentation but couldn't find the specific details.

Answer 1

Score: 4

According to the Java String source code, the hash implementation is:

    public int hashCode()
    {
      if (cachedHashCode != 0)
        return cachedHashCode;
      // Compute the hash code using a local variable to be reentrant.
      int hashCode = 0;
      int limit = count + offset;
      for (int i = offset; i < limit; i++)
        hashCode = hashCode * 31 + value[i];
      return cachedHashCode = hashCode;
    }

You can port this to Python (without caching):

    class JavaHashStr(str):
        def __hash__(self):
            hashCode = 0
            for char in self:
                hashCode = hashCode * 31 + ord(char)
            return hashCode

    >>> j = JavaHashStr("abcd")
    >>> hash(j)
    2987074  # same as Java
    >>> j = JavaHashStr("abcdef")
    >>> hash(j)
    2870581347  # Java: -1424385949

Note that Python ints do not overflow like Java's 32-bit int, so this is wrong whenever the result leaves the signed 32-bit range. You have to simulate the overflow (Update: thanks to @PresidentJamesK.Polk for the improved version; see the SO thread on the topic):

    class JavaHashStr(str):
        def __hash__(self):
            hashCode = 0
            for char in self:
                hashCode = (hashCode * 31 + ord(char)) & (2**32 - 1)  # unsigned
            if hashCode & 2**31:
                hashCode -= 2**32  # make it signed
            return hashCode

Now, even overflowing hashes behave the same:

    >>> j = JavaHashStr("abc")
    >>> hash(j)
    96354
    >>> j = JavaHashStr("abcdef")
    >>> hash(j)
    -1424385949  # Java hash for "abcdef"
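For use on a dataframe column, a plain function may be more convenient than a str subclass. It also sidesteps a CPython detail: the built-in hash() never returns -1 (a `__hash__` result of -1 is reported as -2), so the subclass cannot reproduce a Java hash that is exactly -1. A minimal sketch of the same algorithm; the name `java_string_hash` is hypothetical:

```python
def java_string_hash(s: str) -> int:
    """Emulate Java's String.hashCode() for characters within the BMP."""
    h = 0
    for ch in s:
        # Keep only the low 32 bits, as Java's int arithmetic does.
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    # Reinterpret the unsigned 32-bit value as a signed int.
    return h - 0x100000000 if h & 0x80000000 else h


print(java_string_hash("abc"))     # 96354
print(java_string_hash("abcdef"))  # -1424385949
```

With pandas, this could be applied with something like `df["user_id"].map(java_string_hash)`, where `df` and the column name are placeholders for your own data.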

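This will still be off for characters outside the Basic Multilingual Plane, such as emojis: Java hashes UTF-16 code units, so such a character counts as two surrogate chars, while Python's ord() sees a single code point. For common punctuation and Latin-based characters, this should work.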
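If the user IDs can contain such characters, one option is to hash the UTF-16 code units instead of Python code points. The following is a sketch under that assumption, not something taken from the Java sources; the function name `java_string_hash_utf16` is hypothetical:

```python
def java_string_hash_utf16(s: str) -> int:
    """Hash UTF-16 code units, mirroring how Java iterates over char values."""
    # Encode without a BOM; each code unit becomes two little-endian bytes.
    # "surrogatepass" lets lone surrogates through instead of raising.
    data = s.encode("utf-16-le", errors="surrogatepass")
    h = 0
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], "little")
        h = (h * 31 + unit) & 0xFFFFFFFF  # wrap to 32 bits
    # Reinterpret the unsigned 32-bit value as a signed int.
    return h - 0x100000000 if h & 0x80000000 else h


print(java_string_hash_utf16("abc"))         # 96354
print(java_string_hash_utf16("\U0001F600"))  # 0xD83D * 31 + 0xDE00 = 1772899
```

For strings made only of BMP characters, this gives the same values as the ord()-based version.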

Answer 2

Score: 1

This produces the same result for the string "The Quick Brown Fox" as the answer by @user2390182 and an online tool I tried. I think it is a little easier to follow, but it might be slower; I have not measured it. Test it if performance is critical.

    def java_hasher(text):
        size = 32
        sign = 1 << (size - 1)
        text_hashed = sum(ord(t) * 31**i for i, t in enumerate(reversed(text)))
        return (text_hashed & (sign - 1)) - (text_hashed & sign)

    print(java_hasher("The Quick Brown Fox"))

That should give you: -732416445
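To reproduce the A/B split itself, the hash still has to be reduced to a bucket the same way the Java code does it, and the question does not state that rule. The sketch below assumes, purely for illustration, two common Java idioms: `Math.floorMod(hash, n)` and `hash % n`. Java's `%` keeps the sign of the dividend, while Python's `%` does not, so the two need different translations. All function names and the bucket count are hypothetical:

```python
def java_string_hash(s: str) -> int:
    """Signed 32-bit emulation of Java's String.hashCode() (BMP characters)."""
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h & 0x80000000 else h


def java_floor_mod(h: int, n: int) -> int:
    """Like Java's Math.floorMod for positive n: result is in [0, n)."""
    return h % n


def java_remainder(h: int, n: int) -> int:
    """Like Java's % operator: the result takes the sign of h."""
    r = abs(h) % abs(n)
    return -r if h < 0 else r


h = java_string_hash("abcdef")   # -1424385949, a negative hash
print(java_floor_mod(h, 2))      # 1
print(java_remainder(h, 2))      # -1
```

Which variant applies depends on the Java code that performed the split, so it is worth checking a few known user IDs against their recorded groups.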

huangapple
  • Posted on 2023-06-29 21:30:51
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76581535.html