How to compute a unique 'signature' for a list of numbers?
Question
Given a list of numbers, for example some unique integer or long IDs, what would be a good way to compute a reproducible 'signature' (preferably regardless of element order)?
The use case is to detect whether any IDs have been added to or removed from a list (of objects).
Java's array.hashCode() does not fit the bill: even if it is apparently consistent between JVM invocations, it returns a different hash if the order of the elements changes or if another instance with the same elements is created:
<!-- language: lang-java -->

    int[] ids1 = {1, 2, 3};
    System.out.println(ids1.hashCode());
    // output: 980546781

    int[] ids1Copy = {1, 2, 3};
    System.out.println(ids1Copy.hashCode());
    // output: 2061475679

    int[] ids2 = {2, 1, 3};
    System.out.println(ids2.hashCode());
    // output: 140435067

My understanding is that ids1.hashCode() returns an identity hash for the array object (often described as a hash of its memory address) and not a cumulative hash code for the primitive elements in the array.
What other approaches could be used in this case apart from hashing each element separately?
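To make the distinction concrete, here is a small sketch (the class name ArrayHashDemo is made up for illustration). It compares the identity-based hashCode() of an array with the content-based java.util.Arrays.hashCode(int[]), which is reproducible across instances but still depends on element order:

```java
import java.util.Arrays;

final class ArrayHashDemo {
    public static void main(String[] args) {
        int[] ids1 = {1, 2, 3};
        int[] ids1Copy = {1, 2, 3};
        int[] ids2 = {2, 1, 3};

        // Arrays do not override hashCode(), so this is the identity hash.
        System.out.println(ids1.hashCode() == System.identityHashCode(ids1)); // true

        // Content-based hash: equal for equal contents in the same order.
        System.out.println(Arrays.hashCode(ids1) == Arrays.hashCode(ids1Copy)); // true

        // Still order-dependent: a permutation gives a different value here.
        System.out.println(Arrays.hashCode(ids1) == Arrays.hashCode(ids2)); // false
    }
}
```

So a content-based hash alone fixes the "another instance" problem, but the order problem still has to be handled separately, for example by sorting first.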
Answer 1
Score: 1
You can first build a hash map from each number to its count in the array, and then use the hash code of that map as the signature.
However, keep in mind that it is possible (although rare) for two different hash maps to return the same hash code, as @khelwood suggested.
So if you want to reliably check whether two lists of numbers contain the same elements, create their frequency hash maps as described above and then do these checks:
- hashmap2.size() == hashmap1.size()
- for every (key, value) in hashmap2 { hashmap1[key] == value }
The time complexity is the same as for computing and comparing hash codes, namely expected linear time in the number of elements.
EDIT:
I just realized that the algorithm above is what Java's HashMap equals() uses internally. So we can simply create the frequency hash maps and check their equality with hashmap2.equals(hashmap1).
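A minimal sketch of the frequency-map idea, assuming long IDs; the class and method names (FrequencySignature, frequencies, sameElements, signature) are made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;

final class FrequencySignature {
    // Counts how often each ID occurs; the iteration order of the input does not matter.
    static Map<Long, Integer> frequencies(long[] ids) {
        Map<Long, Integer> counts = new HashMap<>();
        for (long id : ids) {
            counts.merge(id, 1, Integer::sum);
        }
        return counts;
    }

    // Reliable comparison: Map.equals compares sizes and every (key, value) pair.
    static boolean sameElements(long[] a, long[] b) {
        return frequencies(a).equals(frequencies(b));
    }

    // Quick signature: order-independent, but different lists can collide.
    static int signature(long[] ids) {
        return frequencies(ids).hashCode();
    }

    public static void main(String[] args) {
        System.out.println(sameElements(new long[] {1, 2, 3}, new long[] {3, 1, 2})); // true
        System.out.println(sameElements(new long[] {1, 1, 2}, new long[] {1, 2, 2})); // false
    }
}
```

The signature is handy as a cheap first check; when two signatures match and a false positive would matter, fall back to sameElements.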
EDIT 2:
If all the numbers in an array are distinct, you can create a hash set from each array and then just check set2.equals(set1).
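For distinct IDs, a set also makes it easy to see which IDs were added or removed, which is the use case from the question. A rough sketch; IdSetDiff and its methods are hypothetical names:

```java
import java.util.HashSet;
import java.util.Set;

final class IdSetDiff {
    static Set<Long> toSet(long[] ids) {
        Set<Long> set = new HashSet<>();
        for (long id : ids) {
            set.add(id);
        }
        return set;
    }

    // IDs present in 'after' but not in 'before'.
    static Set<Long> added(Set<Long> before, Set<Long> after) {
        Set<Long> result = new HashSet<>(after);
        result.removeAll(before);
        return result;
    }

    // IDs present in 'before' but not in 'after'.
    static Set<Long> removed(Set<Long> before, Set<Long> after) {
        return added(after, before);
    }

    public static void main(String[] args) {
        Set<Long> before = toSet(new long[] {1, 2, 3});
        Set<Long> after = toSet(new long[] {3, 2, 4});
        System.out.println(before.equals(after));   // false
        System.out.println(added(before, after));   // [4]
        System.out.println(removed(before, after)); // [1]
    }
}
```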
Answer 2
Score: 1
The constraint
> a reproducible 'signature' (preferably regardless of element order)
makes this problem challenging.
Here are two approaches off the top of my head:
Approach 1:
a. Sort your list of integers in O(N lg N) time.
b. Treat your list of integers as the digits of a base-M integer, where M is larger than the largest number in your list (so that every element is a valid digit). Suppose you have a list of integers like [A, B, C]. Then you can hash that list as: hash = A*M^0 + B*M^1 + C*M^2. This approach is reasonable if M is a small value; for large M or long lists the sum overflows a fixed-width accumulator, and the wrapped result is then only a hash, so collisions become possible. You can alternatively choose a small M as a power of 2 (e.g. 2^8) and then, for any integer larger than that, break the integer up into chunks of 8 bits and use the same algorithm.
Total time: O(N lg N) + O(N). Space: O(1), a long int accumulator.
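Approach 1 could be sketched roughly as follows. This is a variant under stated assumptions, not exactly the scheme above: the base is a fixed odd constant (BASE, chosen arbitrarily here) instead of the list maximum, the accumulator is a long that silently wraps modulo 2^64, and the length is mixed in so that an empty list and [0] differ. All names are made up for illustration.

```java
import java.util.Arrays;

final class PolynomialSignature {
    // Assumption: an arbitrary odd base; any fixed value works as long as it never changes.
    static final long BASE = 1_000_003L;

    static long signature(long[] ids) {
        long[] sorted = ids.clone();   // do not reorder the caller's array
        Arrays.sort(sorted);           // O(N lg N), makes the result order-independent
        long hash = sorted.length;     // seed with the length
        // Horner's rule from the last element down, so sorted[0] ends up
        // multiplied by BASE^0, sorted[1] by BASE^1, and so on (mod 2^64).
        for (int i = sorted.length - 1; i >= 0; i--) {
            hash = hash * BASE + sorted[i];
        }
        return hash;
    }

    public static void main(String[] args) {
        System.out.println(signature(new long[] {3, 1, 2}) == signature(new long[] {1, 2, 3})); // true
        System.out.println(signature(new long[] {1, 2, 3}) == signature(new long[] {1, 2, 4})); // false
    }
}
```

Because the arithmetic wraps, this is a hash rather than an exact encoding, so equal signatures only mean the lists are probably equal.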
Approach 2:
a. Sort your list of integers in O(N lg N) time.
b. Build a string representation of your list of integers and then hash the string. For example, for a list of integers like [1, 2, 3], create the string 1_2_3 and hash it.
Total time: O(N lg N) + O(N). Space: a string of O(N lg M) characters, where M is the largest number in the list.
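A sketch of Approach 2, assuming SHA-256 as the string hash (the approach itself does not name one; String.hashCode() would also work but collides far more easily). StringSignature is a hypothetical name:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.stream.Collectors;

final class StringSignature {
    static String signature(long[] ids) {
        // Sort, then join with a delimiter so that [1, 23] and [12, 3] stay distinct.
        String joined = Arrays.stream(ids)
                .sorted()
                .mapToObj(id -> Long.toString(id))
                .collect(Collectors.joining("_"));
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(joined.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(signature(new long[] {3, 2, 1}).equals(signature(new long[] {1, 2, 3}))); // true
    }
}
```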
Answer 3
Score: 0
Note that all hash-based solutions are unreliable in the sense that there is a chance of a collision.
Assuming that is OK, here is a simple approach.
First, build a hash function for pairs of integers. There are lots of those available.
Next, let's do a thought exercise.
Imagine arranging all of your integers into 2^64 buckets, one per possible value. Then look at the counts. So an array like [2, 0, 2] becomes a list of frequency counts like ..., 0, 0, 0, 1, 0, 2, 0, 0, 0, ...
Now pair those frequency counts up with their next neighbor. So we get ..., (0, 0), (1, 0), (2, 0), (0, 0), ... Now replace each pair with its hash. Repeat. After 64 levels, we get a single hash representing the whole frequency count.
Now we can't actually perform this operation directly. However, at each level most of the entries are the same: 0 at the bottom, then hash(0, 0), then hash(hash(0, 0), hash(0, 0)), and so on. So if the data structure is built from nodes holding a value and two pointers, most of the pointers will just point at the shared all-zero block for that level.
So we can write out a "tree" with all of the pointers for 0-blocks pointing at the same canonical values. Once we have this tree, inserting an element means navigating the path down to the appropriate leaf, creating a new node with the right value, and walking back up the tree inserting new values. This takes O(64) time. Insert all of the values, and we get a representation of the exact frequency count of the values, signed with a hash, in time O(64 n). (We create the same amount of data, but can then throw most of it away.)
But it gets better. If you have built this data structure for two lists, not only can you tell whether they are likely different, you can actually find the differences! (The rsync utility uses a similar trick to figure out just what changed between remote files so that it can restrict how much gets copied.)
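Here is a simplified sketch of that idea. It is not the full structure described above: it only computes the root hash (it does not keep the tree around for diffing), it assumes int keys and therefore 2^32 buckets and 32 levels instead of 64, and the mixing function is an arbitrary choice. All names are made up for illustration.

```java
import java.util.Arrays;

final class SparseTreeSignature {
    // Assumption: int keys, so 2^32 buckets and 32 levels (the text uses 2^64).
    static final int LEVELS = 32;
    // ZERO[k] is the canonical hash of an all-zero subtree covering 2^k buckets.
    static final long[] ZERO = new long[LEVELS + 1];
    static {
        ZERO[0] = leaf(0);
        for (int k = 1; k <= LEVELS; k++) {
            ZERO[k] = pair(ZERO[k - 1], ZERO[k - 1]);
        }
    }

    // Arbitrary 64-bit mixing step (the splitmix64 finalizer).
    static long mix(long x) {
        x ^= x >>> 30; x *= 0xbf58476d1ce4e5b9L;
        x ^= x >>> 27; x *= 0x94d049bb133111ebL;
        return x ^ (x >>> 31);
    }

    static long leaf(long count) { return mix(count + 0x9e3779b97f4a7c15L); }

    // Hash for a pair of child hashes; left and right are not interchangeable.
    static long pair(long left, long right) { return mix(mix(left) + 31 * right); }

    // Maps an int to its bucket index in [0, 2^32), preserving signed order.
    static long bucket(int id) { return (long) id - Integer.MIN_VALUE; }

    static long signature(int[] ids) {
        int[] sorted = ids.clone();
        Arrays.sort(sorted);
        return node(sorted, LEVELS, 0L, 0, sorted.length);
    }

    // Hash of the subtree covering buckets [base, base + 2^level),
    // which contains exactly the elements sorted[from..to).
    private static long node(int[] sorted, int level, long base, int from, int to) {
        if (from == to) return ZERO[level];        // empty subtree: shared zero hash
        if (level == 0) return leaf(to - from);    // single bucket: hash its count
        long mid = base + (1L << (level - 1));
        int lo = from, hi = to;
        while (lo < hi) {                          // first index whose bucket is >= mid
            int m = (lo + hi) >>> 1;
            if (bucket(sorted[m]) < mid) lo = m + 1; else hi = m;
        }
        return pair(node(sorted, level - 1, base, from, lo),
                    node(sorted, level - 1, mid, lo, to));
    }

    public static void main(String[] args) {
        System.out.println(signature(new int[] {2, 0, 2}) == signature(new int[] {0, 2, 2})); // true
        System.out.println(signature(new int[] {2, 0, 2}) == signature(new int[] {2, 0}));    // false
    }
}
```

Only the non-empty paths are ever visited, so the work is proportional to the number of elements times the number of levels, as described above. Keeping the intermediate node hashes instead of discarding them is what would make it possible to locate the differing buckets between two lists.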
Answer 4
Score: 0
Based on the comments and feedback, the following approaches have been identified. Both sort the IDs first so that the element order does not matter, and both are possibly unreliable because of potential hash collisions, as outlined by btilly:
<!-- language-all: lang-java -->

    import java.util.Arrays;
    import java.util.Collections;
    import java.util.List;

    public class NumberHash {

        public static void main(String[] args) {

            // ######## Arrays.deepHashCode() ########

            Integer[] ids1Sorted = {1, 2, 3};
            Integer[] ids1Unsorted = {3, 1, 2};

            System.out.println(Arrays.deepHashCode(ids1Sorted));
            // 30817

            Arrays.sort(ids1Unsorted);
            System.out.println(Arrays.deepHashCode(ids1Unsorted));
            // 30817

            // ######## toString() based ########

            int[] idsSorted = {1, 2, 3};
            System.out.println(Arrays.toString(idsSorted).hashCode());
            // -412129978

            int[] idsUnsorted = {3, 2, 1};
            Arrays.sort(idsUnsorted);
            System.out.println(Arrays.toString(idsUnsorted).hashCode());
            // -412129978

            List<Integer> oids = Arrays.asList(2, 3, 1);
            Collections.sort(oids);
            System.out.println(oids.toString().hashCode());
            // -412129978
        }
    }

Answer 5
Score: 0
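I would use a checksum such as CRC32 or Adler32 as the signature, wrapped in a collector ready for use:

    // imports: java.util.Arrays, java.util.stream.Collector, java.util.zip.CRC32
    int[] yourArray = {1, 2, 3};
    // Caution: CRC32.update(int) only feeds the low eight bits of each value into the checksum.
    long checksum = Arrays.stream(yourArray).boxed().collect(Collector.of(
            CRC32::new, CRC32::update, (l, r) -> l)).getValue();

Results:

    {1, 2, 3}: 0x55bc801d
    {1, 3, 2}: 0x3ba081ca
    {2, 1, 3}: 0x7cd76d87

Note that the checksum depends on the element order, so sort the array first if the order should not matter. A checksum is also not guaranteed to be unique, because different inputs can collide.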
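A variant of the CRC32 idea that sorts first and feeds all four bytes of each int into the checksum might look like this. It is a sketch, not the collector from the answer; CrcSignature is a hypothetical name, and the ByteBuffer is just one way of turning the ints into bytes:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.CRC32;

final class CrcSignature {
    static long signature(int[] ids) {
        int[] sorted = ids.clone();   // leave the caller's array untouched
        Arrays.sort(sorted);          // makes the checksum order-independent

        // Serialize every int as 4 big-endian bytes so that large IDs are fully covered.
        ByteBuffer buffer = ByteBuffer.allocate(Integer.BYTES * sorted.length);
        for (int id : sorted) {
            buffer.putInt(id);
        }

        CRC32 crc = new CRC32();
        crc.update(buffer.array());
        return crc.getValue();
    }

    public static void main(String[] args) {
        System.out.println(signature(new int[] {2, 1, 3}) == signature(new int[] {1, 2, 3}));   // true
        // 259 and 3 share the same low eight bits, but the full-width checksum tells them apart.
        System.out.println(signature(new int[] {1, 2, 259}) == signature(new int[] {1, 2, 3})); // false
    }
}
```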