In Rust, how can I pass fast always-empty and single-entry HashMaps to a third-party API?

Question

I am using a third-party API (the GCloud SDK from https://lib.rs/crates/gcloud-sdk) that uses HashMaps in its types to pass key/value pairs to the GCloud webservice and back. A field that contains such a hash map is specified like this:

pub labels: ::std::collections::HashMap<
    ::prost::alloc::string::String,
    ::prost::alloc::string::String,
>,

which is just a long way to say HashMap<String, String>. Many of these fields will contain empty hash maps or single-entry hash maps. Since the default hasher is considered "slow", I'd like to be able to optimize these maps if they become a bottleneck. My intention here, for now, is really just the ability to optimize if it becomes a problem, not prematurely optimize before I know if it is a problem.

The performance concern consists of two things:

  • that generation of a new RandomState is expensive (AFAIK it even reads from /dev/urandom), affecting even empty hash maps
  • that hashing a single key is somewhat expensive, less than the above but still total overkill for a single-element map since few of the advantages of hash maps apply to single-element maps

My initial idea was therefore to replace the hashing algorithm by a much simpler one, and for empty maps even by a no-op. However, I don't understand how this is supposed to work. I wrote a no-op implementation, wrapped by a function like this:

pub fn create_nop_hash_map<K, V>() -> HashMap<K, V, NopBuildHasher> {...}

I'm leaving out the implementation of NopBuildHasher since it is trivial and not related to the problem I'm having.

This gives me an error:

labels: create_nop_hash_map(),
        ^^^^^^^^^^^^^^^^^^^^^ expected `HashMap<String, String>`, found `HashMap<_, _, NopBuildHasher>`

The "labels" field is defined in the third-party library, so changing its type isn't possible for me. I could obviously ask the author to change that field, but what should it look like? That library cannot know in advance which maps I'm going to leave empty. It seems to me that the only thing I could realistically ask the author is to define a new type parameter for every map-typed field, and since these parameters would cascade through the enclosing structs, the whole library would be cluttered with type parameters for the hash algorithms. This does not seem realistic to me.
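
The mismatch arises because HashMap<K, V> is shorthand for HashMap<K, V, RandomState>: the hasher state is a third type parameter and therefore part of the type. The following is a minimal sketch of that point, not the gcloud-sdk API. Resource, CustomState, into_default_map and resource_from_custom are hypothetical names, and BuildHasherDefault<DefaultHasher> from std stands in for NopBuildHasher.

```rust
use std::collections::hash_map::{DefaultHasher, RandomState};
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hash};

/// Hypothetical stand-in for a third-party struct; not the real gcloud-sdk type.
pub struct Resource {
    pub labels: HashMap<String, String>,
}

/// A std-only custom hasher state, used here in place of NopBuildHasher.
pub type CustomState = BuildHasherDefault<DefaultHasher>;

/// Rebuilds a map with any hasher state `S` as a default-hasher map.
/// Every key is hashed again with RandomState, so this adds work.
pub fn into_default_map<K, V, S>(map: HashMap<K, V, S>) -> HashMap<K, V>
where
    K: Eq + Hash,
{
    map.into_iter().collect()
}

/// Builds the stand-in struct from a map that uses the custom hasher state.
pub fn resource_from_custom(map: HashMap<String, String, CustomState>) -> Resource {
    Resource {
        labels: into_default_map(map),
    }
}

fn main() {
    // Spelling out the third parameter gives exactly the field's type.
    let explicit: HashMap<String, String, RandomState> = HashMap::new();
    let _accepted = Resource { labels: explicit };

    let mut custom: HashMap<String, String, CustomState> = HashMap::default();
    custom.insert("env".to_string(), "prod".to_string());
    // `Resource { labels: custom }` would not compile: the hasher types differ.
    let resource = resource_from_custom(custom);
    println!("{:?}", resource.labels.get("env"));
}
```

Converting at the boundary like this always compiles, but it rebuilds the map with the default hasher, so it cannot make the map handed to the library any cheaper.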

The alternative approach I wanted to try is to leave the default Siphash-1-3 algorithm and just use hardcoded initial values, but I can't do this either because the RandomState fields are private and there is no constructor to specify them directly.
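
For reference, std does offer a way to get non-random hashing without touching RandomState: BuildHasherDefault<DefaultHasher> creates every hasher through DefaultHasher::default(), which always starts from the same state. The sketch below assumes that behavior; FixedState, hash_key and fixed_map are hypothetical names. Note that the algorithm behind DefaultHasher is left unspecified by std, and that the resulting map type is still not HashMap<String, String>, so it cannot be assigned to the third-party field either.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{BuildHasher, BuildHasherDefault, Hash, Hasher};

/// Hasher state without per-instance random keys (hypothetical alias).
pub type FixedState = BuildHasherDefault<DefaultHasher>;

/// Hashes a string key with a hasher built from the given state.
pub fn hash_key<S: BuildHasher>(state: &S, key: &str) -> u64 {
    let mut hasher = state.build_hasher();
    key.hash(&mut hasher);
    hasher.finish()
}

/// Creates an empty map that uses the fixed hasher state.
pub fn fixed_map() -> HashMap<String, String, FixedState> {
    HashMap::with_hasher(FixedState::default())
}

fn main() {
    // Two independently created states hash the same key identically.
    let a = FixedState::default();
    let b = FixedState::default();
    assert_eq!(hash_key(&a, "env"), hash_key(&b, "env"));

    let mut labels = fixed_map();
    labels.insert("env".to_string(), "prod".to_string());
    println!("{} label(s)", labels.len());
}
```

Fixed keys give up the protection against hash-flooding attacks that the random keys provide, so this only makes sense for keys that an attacker cannot choose.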

The last thing I could think of is to ask the author of the library not to use hash maps at all for these fields and to use, e.g., a list of key/value pairs instead. But it seems to me that I'm totally on the wrong track here, because using a custom hashing algorithm is, I think, a common use case that many people have dealt with before.

What would be the idiomatic way to pass hash maps with custom hashing algorithms to third-party APIs? Alternatively, what is the idiomatic way to optimize empty or single-element hash maps?

Answer 1

Score: 3

In current versions of Rust, creating a new RandomState instance is very cheap. Only the first RandomState instance a thread creates is initialized with random keys. All further instances reuse the cached keys and simply increment one of them by one. This should alleviate any concerns about empty hash maps. The performance problem was fixed in Rust 1.10, released in July 2016, and the potential DoS problem in Rust 1.14, released in December 2016.
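
A rough way to check this on a given machine is to create many empty maps and time it. The sketch below is not a rigorous benchmark, and build_empty_maps and hash_key are hypothetical helper names. It also relies on the documented behavior that HashMap::new() starts with capacity 0 and does not allocate until the first insert.

```rust
use std::collections::hash_map::RandomState;
use std::collections::HashMap;
use std::hash::{BuildHasher, Hash, Hasher};
use std::time::Instant;

/// Hashes a string key with a hasher built from the given RandomState.
pub fn hash_key(state: &RandomState, key: &str) -> u64 {
    let mut hasher = state.build_hasher();
    key.hash(&mut hasher);
    hasher.finish()
}

/// Creates `count` empty maps, each with its own fresh RandomState.
pub fn build_empty_maps(count: usize) -> Vec<HashMap<String, String>> {
    (0..count).map(|_| HashMap::new()).collect()
}

fn main() {
    let start = Instant::now();
    let maps = build_empty_maps(100_000);
    // The measured time also includes filling the Vec that holds the maps.
    println!("created {} empty maps in {:?}", maps.len(), start.elapsed());

    // Empty maps have not allocated a table yet.
    assert!(maps.iter().all(|m| m.capacity() == 0));

    // A cloned state carries the same keys, so it hashes identically.
    let state = RandomState::new();
    assert_eq!(hash_key(&state, "env"), hash_key(&state.clone(), "env"));

    // A second fresh state on the same thread uses different keys.
    let other = RandomState::new();
    println!(
        "fresh states hash differently: {}",
        hash_key(&state, "env") != hash_key(&other, "env")
    );
}
```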

The main cost for a hash map with a single key is most likely the allocator, not the hasher. If the hash map contains a key, it needs to allocate memory, which is most likely far more expensive than computing a single hash. Optimizing the hasher is unlikely to help in any significant way for this use case.
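
The allocation is easy to observe through the map's capacity. Here is a minimal sketch, with single_label as a hypothetical helper that builds a one-entry map of the plain HashMap<String, String> type the third-party field expects:

```rust
use std::collections::HashMap;

/// Builds a one-entry map with the default hasher.
pub fn single_label(key: &str, value: &str) -> HashMap<String, String> {
    // Reserve room up front so the table is allocated exactly once.
    let mut labels = HashMap::with_capacity(1);
    labels.insert(key.to_string(), value.to_string());
    labels
}

fn main() {
    let empty: HashMap<String, String> = HashMap::new();
    // No table has been allocated for the empty map.
    println!("empty capacity: {}", empty.capacity());

    let labels = single_label("env", "prod");
    // The table now exists; the reported capacity may exceed the request.
    println!("single-entry capacity: {}", labels.capacity());
}
```

In this sketch the single entry costs three heap allocations (the table plus the two String values) against one hash computation, which is consistent with the point that the allocator rather than the hasher is the likely bottleneck.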

huangapple
  • Published on June 6, 2023, 15:23:37
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76412271.html