Python equivalent for Java hashCode() function
Question
I have an A/B test split based on the result of Java's hashCode() function applied to the user's id (a string). I want to emulate that split in my dataframe to analyse the results.
Is there a Python equivalent of that function?
Or is there documentation on the specific hashing algorithm inside hashCode(), so I can write that function myself?
Thanks
I searched the documentation but couldn't find the specific details.
Answer 1
Score: 4
According to the Java String source code, hashCode() is implemented as follows:
public int hashCode()
{
    if (cachedHashCode != 0)
        return cachedHashCode;
    // Compute the hash code using a local variable to be reentrant.
    int hashCode = 0;
    int limit = count + offset;
    for (int i = offset; i < limit; i++)
        hashCode = hashCode * 31 + value[i];
    return cachedHashCode = hashCode;
}
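The loop above evaluates a polynomial in 31 by Horner's rule. The Javadoc for String.hashCode() gives the same value in closed form, for a string of n chars s_0 ... s_{n-1}, computed in 32-bit int arithmetic. The explicit reduction modulo 2^32 and the case split below are just one way of writing that wraparound, not notation taken from the Javadoc:

```latex
h(s) = \left( \sum_{i=0}^{n-1} s_i \cdot 31^{\,n-1-i} \right) \bmod 2^{32},
\qquad
\mathrm{hashCode}(s) =
\begin{cases}
h(s) & \text{if } h(s) < 2^{31} \\
h(s) - 2^{32} & \text{otherwise}
\end{cases}
```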
You can port this to Python (without caching):
class JavaHashStr(str):
    def __hash__(self):
        hashCode = 0
        for char in self:
            hashCode = hashCode * 31 + ord(char)
        return hashCode

>>> j = JavaHashStr("abcd")
>>> hash(j)
2987074  # same as Java
>>> j = JavaHashStr("abcdef")
>>> hash(j)
2870581347  # Java: -1424385949
Note that Python ints do not overflow the way Java's 32-bit int does, so this is wrong whenever the running value leaves the signed 32-bit range. You have to simulate the overflow (update: thanks to @PresidentJamesK.Polk for the improved version; see the SO thread on the topic):
class JavaHashStr(str):
    def __hash__(self):
        hashCode = 0
        for char in self:
            hashCode = (hashCode * 31 + ord(char)) & (2**32 - 1)  # unsigned
        if hashCode & 2**31:
            hashCode -= 2**32  # make it signed
        return hashCode

Now, even overflowing hashes behave the same:
>>> j = JavaHashStr("abc")
>>> hash(j)
96354
>>> j = JavaHashStr("abcdef")
>>> hash(j)
-1424385949 # Java hash for "abcdef"
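One caveat with overriding __hash__: CPython reserves -1 as an internal error code, so hash() reports -2 for any object whose __hash__ returns -1. A string whose Java hash is exactly -1 would therefore come out wrong through hash(). A minimal sketch of the same algorithm as a plain function avoids this; the names java_hash_code and AlwaysMinusOne are made up for this illustration:

```python
def java_hash_code(text):
    """Emulate Java's String.hashCode() for text made of BMP characters."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep only the low 32 bits
    # Reinterpret the unsigned 32-bit value as a signed Java int.
    return h - 0x100000000 if h & 0x80000000 else h


class AlwaysMinusOne(str):
    """Hypothetical class, only here to show what hash() does with -1."""

    def __hash__(self):
        return -1


print(java_hash_code("abcdef"))   # -1424385949
print(hash(AlwaysMinusOne("x")))  # -2 on CPython, not -1
```

Calling the function directly also makes it clearer at the call site that a Java-compatible value is being computed, not Python's own string hash.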
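This is still off for characters outside the Basic Multilingual Plane, such as emojis: Java's hashCode() iterates over UTF-16 code units, so such a character contributes two surrogate values, whereas Python's ord() yields a single code point. For the most common punctuation and Latin-based characters, it works.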
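If the user ids can contain such characters, one way to handle them is to hash the UTF-16 code units instead of the code points. The following is a sketch under that assumption, not part of the original answer; the function name java_hash_code_utf16 is invented here:

```python
def java_hash_code_utf16(text):
    """Emulate Java's String.hashCode() by hashing UTF-16 code units."""
    data = text.encode("utf-16-be")  # two bytes per code unit, no BOM
    h = 0
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]  # rebuild the 16-bit code unit
        h = (31 * h + unit) & 0xFFFFFFFF     # keep only the low 32 bits
    # Reinterpret the unsigned 32-bit value as a signed Java int.
    return h - 0x100000000 if h & 0x80000000 else h


print(java_hash_code_utf16("abcdef"))      # -1424385949
print(java_hash_code_utf16("\U0001F600"))  # 1772899 (surrogates 0xD83D, 0xDE00)
```

For strings made only of BMP characters this gives the same result as the ord()-based loop. A lone surrogate in the input would raise UnicodeEncodeError with the default error handler.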
Answer 2
Score: 1
This produces the same result for the string "The Quick Brown Fox" as the answer by @user2390182 and an online tool I tried. I think it is a little easier to follow, but it might be slower; I have not measured it. If performance is critical, you may want to benchmark it.
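def java_hasher(text):
    size = 32
    sign = 1 << (size - 1)  # mask for the sign bit of a 32-bit int
    # Unbounded polynomial sum; the last character has exponent 0.
    text_hashed = sum(ord(t) * 31**i for i, t in enumerate(reversed(text)))
    # Low 31 bits minus the sign bit gives the two's-complement value.
    return (text_hashed & (sign - 1)) - (text_hashed & sign)

print(java_hasher("The Quick Brown Fox"))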
That should give you: -732416445
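To reproduce the A/B split itself, the hash has to be turned into a bucket the same way the Java code does it. The question does not say how that is done, so the rule below (Math.abs(userId.hashCode() % nBuckets)) is purely a hypothetical example, as are the names java_remainder and assign_bucket. The point of the sketch is that Java's % truncates toward zero while Python's % floors, so negative hashes can land in different buckets if the operator is copied as-is:

```python
def java_hash_code(text):
    """Emulate Java's String.hashCode() for text made of BMP characters."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h & 0x80000000 else h


def java_remainder(a, n):
    """Java's a % n: truncated division, result has the sign of a."""
    r = abs(a) % abs(n)
    return -r if a < 0 else r


def assign_bucket(user_id, n_buckets=2):
    """Hypothetical rule: Math.abs(userId.hashCode() % nBuckets) in Java."""
    return abs(java_remainder(java_hash_code(user_id), n_buckets))


h = java_hash_code("abcdef")  # -1424385949
print(h % 3)                  # 2  (Python floors)
print(java_remainder(h, 3))   # -1 (what Java's % returns)
print(assign_bucket("abcdef", 3))  # 1
```

With pandas, something like df["user_id"].map(assign_bucket) would apply it to a column (the column name is an assumption). If the Java side uses Math.floorMod instead, Python's own % already matches for a positive divisor.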