英文:
Porting MeiYan hash function to Go
问题
我想将一种最先进的哈希函数MeiYan从C语言移植到Go语言。(据我所知,这是在速度和冲突率方面对于哈希表来说最好的哈希函数之一,至少比MurMur好。)
我对Go语言还很陌生,只花了一个周末的时间,写出了以下版本:
func meiyan(key *byte, count int) uint32 {
type P *uint32
var h uint32 = 0x811c9dc5
for count >= 8 {
a := ((*(*uint32)(unsafe.Pointer(key))) << 5)
b := ((*(*uint32)(unsafe.Pointer(key))) >> 27)
c := *(*uint32)(unsafe.Pointer(uintptr(unsafe.Pointer(key)) + 4))
h = (h ^ ((a | b) ^ c)) * 0xad3e7
count -= 8
key = (*byte)(unsafe.Pointer(uintptr(unsafe.Pointer(key)) + 8))
}
if (count & 4) != 0 {
h = (h ^ uint32(*(*uint16)(unsafe.Pointer(key)))) * 0xad3e7
key = (*byte)(unsafe.Pointer(uintptr(unsafe.Pointer(key)) + 2))
h = (h ^ uint32(*(*uint16)(unsafe.Pointer(key)))) * 0xad3e7
key = (*byte)(unsafe.Pointer(uintptr(unsafe.Pointer(key)) + 2))
}
if (count & 2) != 0 {
h = (h ^ uint32(*(*uint16)(unsafe.Pointer(key)))) * 0xad3e7
key = (*byte)(unsafe.Pointer(uintptr(unsafe.Pointer(key)) + 2))
}
if (count & 1) != 0 {
h = (h ^ uint32(*key))
h = h * 0xad3e7
}
return h ^ (h >> 16)
}
看起来有点乱,但我认为我无法让它看起来更好。现在我测量了一下速度,结果非常慢,比使用gccgo -O3
编译的C/C++版本慢3倍。有办法让它更快吗?这是编译器能做到的最好的了,还是unsafe.Pointer
转换就是最慢的?实际上,这让我感到惊讶,因为我看到其他一些类似的数值计算代码的速度与C语言相当,甚至更快。我在这里做了一些低效的事情吗?
以下是我正在移植的原始C代码:
u32 meiyan(const char *key, int count) {
typedef u32* P;
u32 h = 0x811c9dc5;
while (count >= 8) {
h = (h ^ ((((*(P)key) << 5) | ((*(P)key) >> 27)) ^ *(P)(key + 4))) * 0xad3e7;
count -= 8;
key += 8;
}
#define tmp h = (h ^ *(u16*)key) * 0xad3e7; key += 2;
if (count & 4) { tmp tmp }
if (count & 2) { tmp }
if (count & 1) { h = (h ^ *key) * 0xad3e7; }
#undef tmp
return h ^ (h >> 16);
}
这是我测量速度的方法:
func main(){
T := time.Now().UnixNano()/1e6
buf := []byte("Hello World!")
var controlSum uint64 = 0
for x := 123; x < 1e8; x++ {
controlSum += uint64(meiyan(&buf[0], 12))
}
fmt.Println(time.Now().UnixNano()/1e6 - T, "ms")
fmt.Println("controlSum:", controlSum)
}
英文:
I wanted to port a state-of-the-art hash function MeiYan from C to Go. (As far as I know this is one of the best if not just the best hash function for hash tables in terms of speed and collision rate, it beats MurMur at least.)
I am new to Go, just spent one weekend with it, and came up with this version:
func meiyan(key *byte, count int) uint32 {
type P *uint32;
var h uint32 = 0x811c9dc5;
for ;count >= 8; {
a := ((*(*uint32)(unsafe.Pointer(key))) << 5)
b := ((*(*uint32)(unsafe.Pointer(key))) >> 27)
c := *(*uint32)(unsafe.Pointer(uintptr(unsafe.Pointer(key)) + 4))
h = (h ^ ((a | b) ^ c)) * 0xad3e7
count -= 8
key = (*byte)(unsafe.Pointer(uintptr(unsafe.Pointer(key)) + 8))
}
if (count & 4) != 0 {
h = (h ^ uint32(*(*uint16)(unsafe.Pointer(key)))) * 0xad3e7
key = (*byte)(unsafe.Pointer(uintptr(unsafe.Pointer(key)) + 2))
h = (h ^ uint32(*(*uint16)(unsafe.Pointer(key)))) * 0xad3e7
key = (*byte)(unsafe.Pointer(uintptr(unsafe.Pointer(key)) + 2))
}
if (count & 2) != 0 {
h = (h ^ uint32(*(*uint16)(unsafe.Pointer(key)))) * 0xad3e7
key = (*byte)(unsafe.Pointer(uintptr(unsafe.Pointer(key)) + 2))
}
if (count & 1) != 0 {
h = (h ^ uint32(*key));
h = h * 0xad3e7
}
return h ^ (h >> 16);
}
Looks messy, but I do not think I can make it look better. Now I measure the speed and it is frustratingly slow, 3 times slower than C/C++ when compiled with gccgo -O3
. Can this be made faster? Is this just as good as compiler can make it or unsafe.Pointer
conversion is just as slow as it gets? In fact this surprised me, because I have seen that some other number crunching style code was just as fast as C or even faster. Am I doing something inneficiently here?
Here is the original C code I am porting from:
u32 meiyan(const char *key, int count) {
typedef u32* P;
u32 h = 0x811c9dc5;
while (count >= 8) {
h = (h ^ ((((*(P)key) << 5) | ((*(P)key) >> 27)) ^ *(P)(key + 4))) * 0xad3e7;
count -= 8;
key += 8;
}
#define tmp h = (h ^ *(u16*)key) * 0xad3e7; key += 2;
if (count & 4) { tmp tmp }
if (count & 2) { tmp }
if (count & 1) { h = (h ^ *key) * 0xad3e7; }
#undef tmp
return h ^ (h >> 16);
}
Here is how I measure speed:
func main(){
T := time.Now().UnixNano()/1e6
buf := []byte("Hello World!")
var controlSum uint64 = 0
for x := 123; x < 1e8; x++ {
controlSum += uint64(meiyan(&buf[0], 12))
}
fmt.Println(time.Now().UnixNano()/1e6 - T, "ms")
fmt.Println("controlSum:", controlSum)
}
答案1
得分: 3
经过仔细研究,我发现了代码运行缓慢的原因,并进行了改进,现在在我的测试中比C版本更快:
package main
import (
"fmt"
"time"
"unsafe"
)
func meiyan(key *byte, count int) uint32 {
type un unsafe.Pointer
type p32 *uint32
type p16 *uint16
type p8 *byte
var h uint32 = 0x811c9dc5
for count >= 8 {
a := *p32(un(key)) << 5
b := *p32(un(key)) >> 27
c := *p32(un(uintptr(un(key)) + 4))
h = (h ^ ((a | b) ^ c)) * 0xad3e7
count -= 8
key = p8(un(uintptr(un(key)) + 8))
}
if (count & 4) != 0 {
h = (h ^ uint32(*p16(un(key)))) * 0xad3e7
key = p8(un(uintptr(un(key)) + 2))
h = (h ^ uint32(*p16(un(key)))) * 0xad3e7
key = p8(un(uintptr(un(key)) + 2))
}
if (count & 2) != 0 {
h = (h ^ uint32(*p16(un(key)))) * 0xad3e7
key = p8(un(uintptr(un(key)) + 2))
}
if (count & 1) != 0 {
h = h ^ uint32(*key)
h = h * 0xad3e7
}
return h ^ (h >> 16)
}
func main() {
T := time.Now().UnixNano() / 1e6
buf := []byte("ABCDEFGHABCDEFGH")
var controlSum uint64 = 0
start := &buf[0]
size := len(buf)
for x := 123; x < 1e8; x++ {
controlSum += uint64(meiyan(start, size))
}
fmt.Println(time.Now().UnixNano()/1e6-T, "ms")
fmt.Println("controlSum:", controlSum)
}
哈希函数本身已经很快,但是在每次迭代中解引用数组是导致它变慢的原因:&buf[0]
被替换为 start := &buf[0]
,然后在每次迭代中使用 start
。
英文:
After some careful research I found out why my code was slow, and improved it so it is now faster than the C version in my tests:
package main
import (
"fmt"
"time"
"unsafe"
)
func meiyan(key *byte, count int) uint32 {
type un unsafe.Pointer
type p32 *uint32
type p16 *uint16
type p8 *byte
var h uint32 = 0x811c9dc5;
for ;count >= 8; {
a := *p32(un(key)) << 5
b := *p32(un(key)) >> 27
c := *p32(un(uintptr(un(key)) + 4))
h = (h ^ ((a | b) ^ c)) * 0xad3e7
count -= 8
key = p8(un(uintptr(un(key)) + 8))
}
if (count & 4) != 0 {
h = (h ^ uint32(*p16(un(key)))) * 0xad3e7
key = p8(un(uintptr(un(key)) + 2))
h = (h ^ uint32(*p16(un(key)))) * 0xad3e7
key = p8(un(uintptr(un(key)) + 2))
}
if (count & 2) != 0 {
h = (h ^ uint32(*p16(un(key)))) * 0xad3e7
key = p8(un(uintptr(un(key)) + 2))
}
if (count & 1) != 0 {
h = h ^ uint32(*key)
h = h * 0xad3e7
}
return h ^ (h >> 16);
}
func main() {
T := time.Now().UnixNano()/1e6
buf := []byte("ABCDEFGHABCDEFGH")
var controlSum uint64 = 0
start := &buf[0]
size := len(buf)
for x := 123; x < 1e8; x++ {
controlSum += uint64(meiyan(start, size))
}
fmt.Println(time.Now().UnixNano()/1e6 - T, "ms")
fmt.Println("controlSum:", controlSum)
}
The hash function itself was already fast, but dereferencing the array on each iteration is what made it slow: &buf[0]
was replaced with start := &buf[0]
and then use start
on each iteration.
答案2
得分: 1
NATS的实现看起来很令人印象深刻!在我的机器上,对于长度为30个字节的数据,每秒操作次数为157175656.56,每个操作的纳秒数为6.36!你可以看一下。也许会得到一些灵感。
英文:
The implementation from NATS looks impressive! On my machine, for a data of length 30 (bytes) op/sec 157175656.56 and nano-sec/op 6.36! Take a look at it. You might find some ideas.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论