Go:高效的字符串查找,不使用哈希表(也称为映射)吗?

huangapple go评论82阅读模式
英文:

Go: Efficient string lookup without hash table (aka map)?

问题

我在Golang中遇到了一个问题,我需要能够查找字符串键,这些键大约有5,000,000个,每个键只包含小写字母a-z和数字0-9。对于uint32和uint64作为键也存在类似的问题。

对于这个问题,哈希表(map)是完美的解决方案,但它使用的内存太多。

对于这种类型的问题,肯定有已知的方法。我一直在研究B-Tree,但我不确定它是否是最有效的机制。

我问题的一些特点可能会导致更高效的解决方案:

  1. 键只需要是a-z0-9的字符串或简单的uint值。
  2. 一旦构建完成,它只需要是只读的。

考虑到它只需要是只读的,我认为将其作为一个预排序的列表,并带有一系列索引,可能会很好地工作。起初,我以为我可以将它放在切片中,每个级别(即字符)都有一个36(26个字母+10个数字)的索引...但是当然这意味着36^whatever,最终得到的是相反的效率。然后我想也许我可以每个级别只放一个36的索引,但是这样我就得到了一堆需要相交以获取结果ID的数组/切片。

我想我正在寻找一种非常特定的B-Tree实现,但更适合我的目的(没有B)。

有人知道是否存在像我所建议的这样的东西吗?

英文:

I have a problem in Golang whereby I need to be able to lookup string keys, from about 5,000,000 strings, each of which containing only a-z (lowercase) and 0-9 characters. Similar problem with uint32 and uint64 as keys.

A map (hash table) is perfect for this, but it uses much too much RAM.

There must be known methods for this type of thing, I've been looking into B-Tree but I'm not sure it would be the most efficient mechanism.

Some of the particularities of my problem, which could lead to a more efficient solution, are:

  1. The keys need only be strings of a-z0-9 or simple uint values.
  2. Once built it only needs be read-only.

Seeing as it only needs be read-only then it seems to me that having it as a pre-sorted list with a series of indexes, might work well. I thought at first I might be able to just have it in slices with a 36 (26 letters + 10 numbers) index for each level (i.e. character)... but of course that means 36^whatever which ends up being the opposite of efficient. Then I thought maybe I could put only a single index of 36 for each level, but then I end up with a load of arrays/slices that need to be intersected to get the ID of the result.

I guess I'm looking for some kind of very specific B-Tree implementation, but more tuned to my purpose (without the B.)

Does anyone know of anything that exists like I am suggesting?

答案1

得分: 1

我会尝试使用压缩字典树。它是一种适用于字典序键的数据结构。B树主要用于外部存储器,因为它们可以最小化树的深度。字典树或更节省内存的哈希是正确的选择。

英文:

I'd give a try to a Compressed Trie. It's the data structure perfectly usable in a scenario with lexicographic keys. B-Trees are mostly intended for external memories because they're minimizing the depth of a tree. A trie or a more memory efficient hashing is the right way to go.

答案2

得分: 1

你应该使用 Trie 数据结构,它被设计为非常高效地将字符串映射到值。详见 https://en.wikipedia.org/wiki/Trie 获取更多信息。

英国政府新网站的一部分正在广泛使用的基于 Go 的 Trie 实现可以在 https://github.com/alphagov/router/tree/master/trie 找到。

英文:

You should use a Trie data structure which is designed to map strings to values very efficiently. See https://en.wikipedia.org/wiki/Trie for more information.

A Go based one that is in heavy use as part of the UK Government's new website can be found at https://github.com/alphagov/router/tree/master/trie

答案3

得分: 1

我认为这取决于你最终想要实现什么。例如:

  1. 你想要(a)查找与字符串相关联的某个值(例如索引或缓存),还是(b)仅仅测试字符串是否是一组字符串的成员(例如排除列表)?
  2. 查找需要多准确 - 你可以接受误报吗?

如果问题1的答案是(a),那么字典树(trie)可能是一个很好的数据结构选择。如果问题1的答案是(b),那么最好使用位集(bitset)或布隆过滤器(bloom filter)。在这两者中,布隆过滤器将是最快且最节省内存的选择,但它是概率性的,并且会产生一些误报(但不会产生真负),这可能取决于你对问题2的回答是否可以接受。

英文:

I think it depends upon what you are trying to ultimately achieve. For example:

  1. Do you want to (a) look up some value associated with the string (e.g. an index or cache) or (b) simply test whether the string is a member of a set of strings (e.g. an exclusion list)?
  2. How accurate does the lookup need to be - can you live with false positives?

If the answer to question 1 is (a) then a trie is probably a good choice of data structure. If the answer to question 1 is (b) then you are probably best off using a bitset or a bloom filter. Of these two, a bloom filter will be the fastest and most memory efficient but is probabilistic and will yield some false positives (but no true negatives) which may or may not be okay for your use case depending upon your answer to question 2.

答案4

得分: 0

另一个优化地图和内存使用的想法是,如果你知道要使用多少个字符串,并且只需要检查它们是否存在。

  1. 使用map[string]struct{}struct{}保证具有0大小,而不像bool类型至少使用1字节。
  2. 预先分配地图:m := make(map[string]struct{}, 5000000)
  3. 使用if _, ok := m[key]; ok { ... }检查键是否存在

压缩字典树很好,但除非你非常受内存限制,否则map[string]struct{}会更快。

英文:

Another idea to optimize the map and memory usage if you know how many strings you're gonna use and you only need to check their existence.

  1. Use map[string]struct{}, struct{} is guaranteed to have 0 size, unlike bool which uses at least 1 byte.
  2. Preallocate the map : m := make(map[string]struct{}, 5000000)
  3. Check if a key exists with if _, ok := m[key]; ok { ... }

Compressed Tries are nice and all, but unless you're extremely pressed for memory, map[string]struct{} will be much faster.

huangapple
  • 本文由 发表于 2014年8月5日 02:44:44
  • 转载请务必保留本文链接:https://go.coder-hub.com/25125450.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定