Better sharding function for int64 keys in golang?

Question

I am using the concurrent map from this repo, where it is possible to select the key type when creating a map with NewWithCustomShardingFunction. I just need to provide my own sharding function for int64 keys, and that is what I am using here.

I am also using the latest version of Go, which supports generics, so I decided to use concurrent-map with int64 keys by implementing my own sharding function.

import (
    cmap "github.com/orcaman/concurrent-map/v2"
)

func shardingFunc(key int64) uint32 {
    return uint32(key) // TODO - create a better sharding function that does not rely on how uint32 type conversion works
}

func main() {
    testMap := cmap.NewWithCustomShardingFunction[int64, *definitions.CustomerProduct](shardingFunc)
    // ... use the map ...
}

I wanted to know whether my sharding function is okay here for int64 keys, or whether I should have a better one. I don't want to end up with an index out of range error or any other issues.


Answer 1

Score: 6

The sharding function is a hash function. The function should uniformly distribute the keys over a 32-bit space.

If the low four bytes of your int64 values are uniformly distributed, then uint32(key) will work as a sharding function.

An example of where uint32(key) is a bad choice is where the low bytes have constant values. For example, if the key values are like 0x00010000, 0x00020000, ..., then the low bits of uint32(key) are always zero, so every key maps to the same shard. This is not a uniform distribution.

If you don't know how your int64 keys are distributed, then it's best to use all of the key's bits in the sharding function. Here's one that uses XOR:

func shardingFunc(key int64) uint32 {
    return uint32(key) ^ uint32(key >> 32) 
}

Answer 2

Score: 1

For something robust, use crypto/sha256 to hash the key and then convert (some of) the digest to uint32:

import (
	"crypto/sha256"
	"encoding/binary"
	"math/big"
)

func shardingFunc(key int64) uint32 {
	// note: big.Int.Bytes() returns the absolute value, so -k and k hash identically
	bytes := sha256.Sum256(new(big.Int).SetInt64(key).Bytes())
	return binary.BigEndian.Uint32(bytes[0:4])
}

There are more efficient, although more verbose, ways of coding this, but hopefully you get the idea.

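One such more efficient variant (a sketch, not from the original answer) encodes the key into a fixed 8-byte buffer instead of allocating a big.Int, which also keeps negative keys distinct from their positive counterparts:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// shardingFunc hashes the key's fixed 8-byte big-endian encoding.
// Unlike big.Int.Bytes(), uint64(key) preserves the sign bit, so
// -k and k produce different digests.
func shardingFunc(key int64) uint32 {
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], uint64(key))
	sum := sha256.Sum256(buf[:])
	return binary.BigEndian.Uint32(sum[:4])
}

func main() {
	fmt.Println(shardingFunc(42), shardingFunc(-42))
}
```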

Answer 3

Score: 1

This is a more context-sensitive answer, since I know more background information through chat. It may not apply for others.

First, you won't get an "index out of range" error, because the concurrent map library always takes the remainder after dividing by the number of shards.

Second, integer product IDs are generally sequential, which means they will naturally be evenly distributed across the shards.

There are some possible exceptions to this if you happen to have pathologically antagonistic update/access patterns, but realistically it will make no difference. Even in the worst case, where only one shard is used, the performance will be effectively the same as a regular Go map, since each shard has its own regular map internally. If you do find yourself in a situation where performance really matters, you would be best off rolling your own concurrent map rather than using this library. I know it is billed as a "high-performance solution", but there is no such thing as a one-size-fits-all optimization. The term is an oxymoron.

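The remainder step described above can be sketched as follows (a shardCount of 32 matches concurrent-map's default, but the exact constant is an assumption here):

```go
package main

import "fmt"

const shardCount = 32 // concurrent-map's default shard count (assumption)

// shardIndex mirrors the remainder step: whatever uint32 the sharding
// function returns, the resulting index always lies in [0, shardCount),
// so an out-of-range access is impossible.
func shardIndex(hash uint32) int {
	return int(uint(hash) % uint(shardCount))
}

func main() {
	for _, h := range []uint32{0, 31, 32, 4294967295} {
		fmt.Println(h, "->", shardIndex(h))
		// → 0->0, 31->31, 32->0, 4294967295->31
	}
}
```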

huangapple
  • Posted on 2023-01-13 03:51:06
  • Please retain this link when reposting: https://go.coder-hub.com/75101607.html