如何对整数环形缓冲区进行哈希处理?

huangapple go评论69阅读模式
英文:

How to hash a ring buffer of integers?

问题

我正在使用Python。我有一些包含整数的列表,但它们实际上是环形缓冲区。

以下是示例规则:

  • 我们不添加新元素或修改任何元素。这些环是不可变的。

  • 环中没有重复的元素。

  • 如果两个列表长度不同,它们不是同一个环。

  • 在相同长度的两个列表之间,如果一个列表在任意次环移动反转后可以与另一个相同,那么这两个环是相等的。例如,[1, 7, 9, 2, 5][7, 9, 2, 5, 1](环移动)是相等的,[1, 7, 9, 2, 5][1, 5, 2, 9, 7](环移动和反转)是相等的,但[1, 7, 9, 2, 5][7, 1, 9, 2, 5]不相等。

我想快速识别两个环是否相等。

一种方法是比较它们的元素,另一种方法是找到一个好的哈希方法。我尝试将两个列表移动到它们的正常状态并比较它们的元素是否相同(或在反转后是否相同),但速度太慢。

我认为哈希是一个更好的选择。那么,哪种哈希方法适用于这种环形缓冲区?

以下是我目前的代码:

import random
from time import perf_counter
from typing import List, Tuple

class Ring:
    def __init__(self, ids:List[int]) -> None:
        self.ids = ids

    def get_shifted(self, n:int) -> 'Ring':
        result_list = self.ids.copy()
        for i in range(n):
            head = result_list[0]
            result_list.remove(head)
            result_list.append(head)
        return Ring(result_list)
    
    def get_normalized(self) -> 'Ring':
        min_i = self.ids.index(min(self.ids))
        shifted = self.get_shifted(min_i)
        return shifted
    
    def get_reversed(self) -> 'Ring':
        result_list = self.ids.copy()
        result_list.reverse()
        return Ring(result_list)

    def __eq__(self, other: 'Ring') -> bool:
        if len(self.ids) != len(other.ids):
            return False
        normalized1 = tuple(self.get_normalized().ids)
        normalized2 = tuple(self.get_reversed().get_normalized().ids)
        normalized_other = tuple(other.get_normalized().ids)
        return normalized1 == normalized_other or normalized2 == normalized_other
    
    @staticmethod
    def Random() -> 'Ring':
        unduplicated = set()
        while len(unduplicated) < ring_capacity:
            unduplicated.add(random.randint(0, 20))
        return Ring(list(unduplicated))
    

if __name__ == '__main__':
    random.seed(1)
    ring_capacity = 5
    num_rings = 2000

    ring_set = []
    random_rings = [Ring.Random() for _ in range(num_rings)]

    start = perf_counter()
    for ring in random_rings:
        if ring not in ring_set:
            ring_set.append(ring)
    end = perf_counter()

    print(end - start)
    print(f'{len(ring_set)} out of {num_rings} unduplicated rings')

如果需要更多信息或有其他问题,请告诉我。

英文:

I am using Python. I have some lists containing integers, but they are actually ring buffers.

The following are rules by examples:

  • We do not add new elements or modify any elements. These rings are immutable.

  • No repetitive elements in a ring.

  • If two lists have different lengths, they are not the same ring.

  • Between two lists of the same length, if one list, after arbitrary times of ring shift or reversion, can be identical to the other, the two rings are equal. For example, [1, 7, 9, 2, 5] and [7, 9, 2, 5, 1](ring shifted) are equal, [1, 7, 9, 2, 5] and [1, 5, 2, 9, 7](ring shifted and reversed) are equal, but [1, 7, 9, 2, 5] and [7, 1, 9, 2, 5] are not equal.

I want to quickly identify whether two rings are equal.

One method is to compare their elements, another method is to find a good hashing method. I tried shifting two lists to their normal state and compare if their elements are identical (or identical after reversed), but it's too slow.

I think hashing is a better choice. So what hashing method is good for this kind of ring buffers?

The following is what I currently have:

import random
from time import perf_counter
from typing import List, Tuple
class Ring:
def __init__(self, ids:List[int]) -&gt; None:
self.ids = ids
def get_shifted(self, n:int) -&gt; &#39;Ring&#39;:
result_list = self.ids.copy()
for i in range(n):
head = result_list[0]
result_list.remove(head)
result_list.append(head)
return Ring(result_list)
def get_normalized(self) -&gt; &#39;Ring&#39;:
min_i = self.ids.index(min(self.ids))
shifted = self.get_shifted(min_i)
return shifted
def get_reversed(self) -&gt; &#39;Ring&#39;:
result_list = self.ids.copy()
result_list.reverse()
return Ring(result_list)
def __eq__(self, other: &#39;Ring&#39;) -&gt; bool:
if len(self.ids) != len(other.ids):
return False
normalized1 = tuple(self.get_normalized().ids)
normalized2 = tuple(self.get_reversed().get_normalized().ids)
normalized_other = tuple(other.get_normalized().ids)
return normalized1 == normalized_other or normalized2 == normalized_other
@staticmethod
def Random() -&gt; &#39;Ring&#39;:
unduplicated = set()
while len(unduplicated) &lt; ring_capacity:
unduplicated.add(random.randint(0, 20))
return Ring(list(unduplicated))
if __name__ == &#39;__main__&#39;:
random.seed(1)
ring_capacity = 5
num_rings = 2000
ring_set = []
random_rings = [Ring.Random() for _ in range(num_rings)]
start = perf_counter()
for ring in random_rings:
if ring not in ring_set:
ring_set.append(ring)
end = perf_counter()
print(end - start)
print(f&#39;{len(ring_set)} out of {num_rings} unduplicated rings&#39;)

答案1

得分: 3

不必提供一个为等效列表提供相同哈希代码的特殊哈希函数,将每个环转换为规范形式可能更简单。

在所有可能的移位/反转中,例如,您可以选择字典顺序最小的那个。

转换为规范形式后,等效的环是相同的,因此如果您需要哈希,可以使用任何哈希函数,如果需要比较,只需进行简单的线性比较。

英文:

Instead of coming up with a special hash that would provide the same hash code for equivalent lists, it is probably simpler to convert each ring to a canonical form.

Of all the possible shifts/reversions, for example, you can select the one that is lexically smallest.

After conversion to canonical form, equivalent rings are identical, so if you want to hash, then you can use whatever hash you want, and if you need to compare, you can just do a simple linear comparison.

答案2

得分: 2

以下是您要翻译的代码部分:

class Ring:
    def __init__(self, ids:List[int]) -> None:
        self.ids = ids
        i = ids.index(min(ids))
        self.normed = min(
            ids[i:] + ids[:i],
            ids[i::-1] + ids[:i:-1]
        )

    def __eq__(self, other: 'Ring') -> bool:
        return self.normed == other.normed
Output with yours:

15.41106627508998
1896 out of 2000 unduplicated rings


```python
Output with mine:

0.38525977171957493
1896 out of 2000 unduplicated rings


```python
Output with the below ([Attempt This Online!](https://ato.pxeger.com/run?1=bVTBbpwwEL30Ur5ibhh1F5FIlSqUjfoBVQ9Rb6uV5eIhWAKb2kbtKsqX9JJL-1H9mnpsloU2XMDz3sy8eTb--Xs8-87ol5dfk2_3H_68eauG0VgPVmhphqy1ZgCvBoQ5PqJteWMm7dHO6HlU-vGCf1LO7-DLNPaYZU0vnIOHANcZhEdiC5wrrTznzGHf7kBJV1POUWl_KmB_D5-NxkSnh1hlIMGBqEtYpXWptMQfbFCahVVRbNO0sQPKwCR8gWK6dEdVn-Bd_KrVafcKXO9vrgxaLJx_-nTCdaELvZinydmqfRC1mh2_LZMb36GtISd78jj5V2P66-QW_WT1dpJDyprXkboqTgLm8sXrdYiR5Hx0XnjVDBi2Xy41HuKus6gmCbvWmbQMs6lG-OipQ8-uPnzvVI_Qo2ZrWgF3YEMV3ohRNMqf643Na2oppGTp0JX0CseBVTu4rVabOk9Culgfzsy2V-JlmSIrtBiQczIs53wQSnM-jzL3cIiS3aScjcQw2vsY1dPACaGjd1tVVbZKX4AjiSkX21pjgYPSRHtEtpQoTik7mB7-kcPmL2IrEcHU2dp1n0RALf_PjMhoyS3C96lDsQq3-RNty6V88Qxm8mBaeFrUPW_3NsbyIt0I88VwuSD-Ag)):

0.0007639830000698566
1896 out of 2000 unduplicated rings


最后是修改您的*使用*代码,将您的`ring_set`实际变成一个集合而不是一个命名不当的列表:
```python
start = perf_counter()
ring_set = set(random_rings)
end = perf_counter()

具有有意义的哈希值的Ring类如下:

class Ring:
    def __init__(self, ids:List[int]) -> None:
        self.ids = ids
        i = ids.index(min(ids))
        self.normed = min(
            ids[i:] + ids[:i],
            ids[i::-1] + ids[:i:-1]
        )
        self.hash = hash(tuple(self.normed))

    def __eq__(self, other: 'Ring') -> bool:
        return self.normed == other.normed
    
    def __hash__(self):
        return self.hash
英文:

It's faster and simpler to just compute a normalized form at construction:

class Ring:
def __init__(self, ids:List[int]) -&gt; None:
self.ids = ids
i = ids.index(min(ids))
self.normed = min(
ids[i:] + ids[:i],
ids[i::-1] + ids[:i:-1]
)
def __eq__(self, other: &#39;Ring&#39;) -&gt; bool:
return self.normed == other.normed

Output with yours:

15.41106627508998
1896 out of 2000 unduplicated rings

Output with mine:

0.38525977171957493
1896 out of 2000 unduplicated rings

Output with the below (Attempt This Online!):

0.0007639830000698566
1896 out of 2000 unduplicated rings

(Hmm, I just realized that by moving the normalization into construction, I moved it out of the timing. But if I include the random_rings = [Ring.Random() for _ in range(num_rings)] in the timing, the whole thing still only takes ~0.03 seconds.)

The last is modifying your using code to make your ring_set an actual set instead of a misnamed list:

    start = perf_counter()
ring_set = set(random_rings)
end = perf_counter()

With the Ring class providing meaningful hashes:

class Ring:
def __init__(self, ids:List[int]) -&gt; None:
self.ids = ids
i = ids.index(min(ids))
self.normed = min(
ids[i:] + ids[:i],
ids[i::-1] + ids[:i:-1]
)
self.hash = hash(tuple(self.normed))
def __eq__(self, other: &#39;Ring&#39;) -&gt; bool:
return self.normed == other.normed
def __hash__(self):
return self.hash

答案3

得分: 1

你的问题相当宽泛,因此任何建议可能也会相当模糊。听起来你正在处理大量数据,否则比较肯定比哈希更快。不能使用标准的加密哈希函数,因为它们不是位置无关的。而是寻找保持位置的哈希函数(LP Hashes),比如Nilsimsa哈希。其思想是具有相似输入的情况会产生相似的输出,其中相似意味着起始点是任意的,仍然会创建相同的哈希。顺便说一下,用于查找候选项的非常简单的哈希函数是传统的(SAE J1708)校验和。

英文:

Your question is rather broad so any recommendation will probably be rather vague, too. It sounds like you are processing gigantic amounts of data - otherwise comparison would certainly be faster than hashing.
You cannot use the standard cryptographic hash functions as they are not location independent. Rather look for location preserving hash functions (LP Hashes) like the Nilsimsa Hash. The idea is to have similar input create similar output where similar here means, that the starting point is arbitrary and still the set would create the same hash. Btw., a very simple hash for finding candidates would be the conventional (SAE J1708) checksum.

huangapple
  • 本文由 发表于 2023年6月11日 21:15:01
  • 转载请务必保留本文链接:https://go.coder-hub.com/76450635.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定