What are the risks of bad HashMap hashers?


Question


I was building a map that could store any generic type like this crate.

By doing so, I'm using a HashMap that takes a TypeId as a key. This TypeId is a hash built at compile time, so to improve performance I was wondering if I could use a custom Hasher that does nothing, as my input is already hashed.

That made me think: what would be the risks of a bad Hasher for any HashMap? Performance? Security?
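
Something along these lines is what I have in mind; `IdentityHasher` and `TypeMap` are just placeholder names, and I'm relying on the fact that `TypeId` already hands the hasher a well-mixed value:

```rust
use std::any::TypeId;
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// A pass-through hasher: it assumes the key (here a `TypeId`) is already a
/// good-quality hash, so it keeps the value it is given instead of mixing it.
#[derive(Default)]
struct IdentityHasher(u64);

impl Hasher for IdentityHasher {
    fn finish(&self) -> u64 {
        self.0
    }

    fn write(&mut self, bytes: &[u8]) {
        // Fallback: fold the bytes into the state without any mixing.
        for &b in bytes {
            self.0 = (self.0 << 8) | u64::from(b);
        }
    }

    fn write_u64(&mut self, n: u64) {
        self.0 = n;
    }
}

/// A map keyed by `TypeId` that skips re-hashing the already-hashed key.
type TypeMap<V> = HashMap<TypeId, V, BuildHasherDefault<IdentityHasher>>;

fn main() {
    let mut map: TypeMap<&str> = TypeMap::default();
    map.insert(TypeId::of::<u32>(), "u32");
    map.insert(TypeId::of::<String>(), "String");
    assert_eq!(map.get(&TypeId::of::<u32>()), Some(&"u32"));
}
```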

Answer 1

Score: 3


HashMap Primer

To understand the risks, you first need to understand how a HashMap works.

There are two main kinds of HashMap: Closed Addressing and Open Addressing. And each kind has many implementation variants.

The Closed Addressing hash maps are simpler, so let's start with those. The underlying data-structure within such a hash map is a dynamic array of buckets:

  • Insertion: compute the hash, derive an index as hash % array-size, iterate over all the elements in the bucket to check for a duplicate and if not, insert.
  • Search: compute the hash, derive an index as hash % array-size, iterate over the elements in the bucket until you find a match (or exhaust the bucket).

For a good hash function, it's expected that the hash % array-size function will yield a fairly uniformly spread set of indexes, and thus, that roughly all buckets will have the same size. That is, for N elements in an array of size A, each bucket is expected to have roughly N / A elements. Given that A is grown along with N so that the load factor N / A stays below some constant L, each bucket is expected to hold at most roughly L elements, a constant.

Trivially, this leads to an O(1) average complexity for Insertion & Search.
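
Here is a bare-bones sketch of such a closed-addressing map, just to make the insert/search flow above concrete; `ChainedMap` is an illustrative toy, not a real implementation (no growth/rehashing, no removal):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A toy separate-chaining map: a dynamic array of buckets, each bucket a Vec.
struct ChainedMap<K, V> {
    buckets: Vec<Vec<(K, V)>>,
    len: usize,
}

impl<K: Hash + Eq, V> ChainedMap<K, V> {
    fn with_buckets(n: usize) -> Self {
        ChainedMap { buckets: (0..n).map(|_| Vec::new()).collect(), len: 0 }
    }

    /// Compute the hash, then derive an index as hash % array-size.
    fn bucket_index(&self, key: &K) -> usize {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        (h.finish() as usize) % self.buckets.len()
    }

    fn insert(&mut self, key: K, value: V) {
        let idx = self.bucket_index(&key);
        // Iterate over the bucket to check for a duplicate first.
        for entry in &mut self.buckets[idx] {
            if entry.0 == key {
                entry.1 = value;
                return;
            }
        }
        self.buckets[idx].push((key, value));
        self.len += 1;
        // A real map would also grow `buckets` here once `len / buckets.len()`
        // exceeds the load factor, rehashing every element into the new array.
    }

    fn get(&self, key: &K) -> Option<&V> {
        let idx = self.bucket_index(key);
        self.buckets[idx].iter().find(|(k, _)| k == key).map(|(_, v)| v)
    }
}

fn main() {
    let mut map = ChainedMap::with_buckets(8);
    map.insert("answer", 42);
    assert_eq!(map.get(&"answer"), Some(&42));
}
```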

The Open Addressing hash maps are more complex. The underlying data-structure is just a dynamic array of cells:

  • Insertion: compute the hash, derive an index as hash % array-size, iterate until you find a matching cell, and if you stumble upon an empty cell, insert.
  • Search: compute the hash, derive an index as hash % array-size, iterate until you find a matching cell, and if you stumble upon an empty cell, abort.

Or at least, that's the gist, because deleting elements may poke holes in the sequence of cells to search, and there are various strategies to deal with that... let's say for now that we use backward-shifting, i.e. moving all further elements in the sequence back one cell to fill the hole.

You can treat a sequence of cells as a bucket... but due to "neighbour" buckets merging together in a single sequence you have clustering issues which make complexity analysis more difficult.
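
And here is a matching sketch of the open-addressing probe loops, using linear probing and ignoring both deletion and resizing (so it assumes the array never fills up); `probe_insert` and `probe_find` are illustrative names:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Compute the hash, then derive an index as hash % array-size.
fn slot_of<K: Hash>(key: &K, len: usize) -> usize {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    (h.finish() as usize) % len
}

fn probe_insert<K: Hash + Eq, V>(cells: &mut [Option<(K, V)>], key: K, value: V) {
    let len = cells.len();
    let mut idx = slot_of(&key, len);
    // Walk the probe sequence until we hit the key itself or an empty cell.
    while let Some((k, _)) = &cells[idx] {
        if *k == key {
            break;
        }
        idx = (idx + 1) % len;
    }
    cells[idx] = Some((key, value));
}

fn probe_find<'a, K: Hash + Eq, V>(cells: &'a [Option<(K, V)>], key: &K) -> Option<&'a V> {
    let len = cells.len();
    let mut idx = slot_of(key, len);
    // Walk the probe sequence; reaching an empty cell means the key is absent.
    while let Some((k, v)) = &cells[idx] {
        if k == key {
            return Some(v);
        }
        idx = (idx + 1) % len;
    }
    None
}

fn main() {
    let mut cells: Vec<Option<(&str, u32)>> = vec![None; 8];
    probe_insert(&mut cells, "a", 1);
    probe_insert(&mut cells, "b", 2);
    assert_eq!(probe_find(&cells, &"a"), Some(&1));
}
```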

Effects of a poor hash function

Well, let's take an absurd case, a function which always returns 4¹.

Sticking to our Closed Addressing Hash Map, the number of elements in the buckets is then highly skewed:

  • All buckets (except the one at index 4) have 0 elements.
  • The bucket at index 4 has N elements.

This means that suddenly Insertion and Search are O(N), not O(1). Oh.

¹ Obtained by the roll of a fair dice.
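
To see the degradation in action in Rust, here is a quick-and-dirty comparison with such a degenerate Hasher; `AlwaysFour` is a made-up name, and the exact timings will of course vary:

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};
use std::time::Instant;

/// The absurd hash function from above: every key hashes to 4.
#[derive(Default)]
struct AlwaysFour;

impl Hasher for AlwaysFour {
    fn finish(&self) -> u64 { 4 }
    fn write(&mut self, _bytes: &[u8]) {}
}

fn main() {
    let n: u64 = 10_000;

    // Every key collides, so each insert scans all previous entries:
    // O(N) per insert, O(N^2) overall.
    let mut bad: HashMap<u64, u64, BuildHasherDefault<AlwaysFour>> = HashMap::default();
    let start = Instant::now();
    for i in 0..n {
        bad.insert(i, i);
    }
    println!("constant hasher: {:?}", start.elapsed());

    // The default, well-distributed hasher: O(1) amortized per insert.
    let mut good: HashMap<u64, u64> = HashMap::new();
    let start = Instant::now();
    for i in 0..n {
        good.insert(i, i);
    }
    println!("default hasher:  {:?}", start.elapsed());
}
```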

Hash Flooding

In 2003, a Denial of Service (DoS) attack named Hash Flooding was first described, and in the following years it was successfully weaponized against websites.

The strategy is simple: knowing the hash function used by the website -- typically, the standard hash function provided by the language powering the website -- adversaries would pre-compute a large number of keys which would all result in the same hash (or the same hash % array-size, for a reasonably sized array), then send a GET or POST request with all those keys as parameter names. The web framework would attempt to put all the parameters in a hash map before calling the application, and since all parameter names would have the same hash, ...

... insertion of parameters would overall be O(N²) instead of O(N).

For small values of N, it's not too bad. For N = 1'000, it's the difference between 1'000 operations and 1'000'000 operations, and it starts showing! With only a few "clients" sending such requests in a loop, an entire web server would be brought to its knees.

As a response, both Python & Ruby were updated so that their default hash function would be much harder to build pre-image collisions for, thereby making Hash Flooding impractical.
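
Rust's standard HashMap ships with the same defence: its default hasher (currently SipHash, via RandomState) is seeded with random keys, so an adversary cannot precompute colliding inputs. A small illustration (the printed values will differ from run to run):

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

fn hash_with(state: &RandomState, value: &str) -> u64 {
    let mut hasher = state.build_hasher();
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    // Two independently created RandomStates use different random keys, so the
    // same input hashes to different values: without knowing the seed, an
    // attacker cannot precompute a flood of colliding parameter names.
    let a = RandomState::new();
    let b = RandomState::new();
    println!("{:#018x}", hash_with(&a, "parameter_name"));
    println!("{:#018x}", hash_with(&b, "parameter_name"));
}
```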

Note: Open Addressing adds clustering issues, since even nearby buckets can merge together into a single bucket, so that even imperfect collisions can still lead to a perfect storm.

Risks

The algorithmic complexity of inserting and searching in a hash map is between O(1) in the best case and O(N) in the worst case. Using a bad hash function may lead to the average case being much closer to the worst case. At this point, your hash map is no better than an unsorted array.

What bad means depends on the scenario:

  • If your input is under your control, or random, a bad hash function is simply a function which has many collisions. Such as a parity bit function, always returning 0 or 1.
  • If your input is potentially controlled by an adversary, a bad hash function is a function for which pre-image collisions can relatively easily be found.

As an example, we can consider FNV-1a:

  • If your input is controlled, or random, it's a fair hash function. Not necessarily the best, but it's pretty fast, and the distribution of hashes is pretty good.
  • If your input is potentially controlled by an adversary, it's a terrible hash function. It's fairly easy to craft colliding inputs using maths, so your adversary can in no time whip up 1'000 inputs which will all collide.
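
For reference, the whole of FNV-1a (64-bit) fits in a few lines, which is exactly why it is fast and why working backwards from it is feasible:

```rust
/// FNV-1a, 64-bit variant, using the published offset basis and prime.
fn fnv1a_64(data: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for &byte in data {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

fn main() {
    // Fast and decently distributed on honest input, but the structure is so
    // simple that an adversary can solve for inputs that collide.
    println!("{:#018x}", fnv1a_64(b"parameter_name"));
}
```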

And that's it.

TL;DR: you risk performance degradation, which may potentially be used for Denial of Service on your application.
