Golang中的地图访问瓶颈

huangapple go评论104阅读模式
英文:

Map access bottleneck in Golang

问题

我正在使用Golang实现朴素贝叶斯分类算法,针对一个包含超过30000个可能标签的数据集。我已经构建了模型,并且正在进行分类阶段。我正在处理1000条记录的分类,这需要长达5分钟的时间。我使用pprof功能对代码进行了性能分析,下面是前10个结果:

总共:28896个样本
   16408  56.8%  56.8%    24129  83.5% runtime.mapaccess1_faststr
    4977  17.2%  74.0%     4977  17.2% runtime.aeshashbody
    2552   8.8%  82.8%     2552   8.8% runtime.memeqbody
    1468   5.1%  87.9%    28112  97.3% main.(*Classifier).calcProbs
     861   3.0%  90.9%      861   3.0% math.Log
     435   1.5%  92.4%      435   1.5% runtime.markspan
     267   0.9%  93.3%      302   1.0% MHeap_AllocLocked
     187   0.6%  94.0%      187   0.6% runtime.aeshashstr
     183   0.6%  94.6%     1137   3.9% runtime.mallocgc
     127   0.4%  95.0%      988   3.4% math.log10

令人惊讶的是,地图访问似乎是瓶颈。有人遇到过这种情况吗?还有哪些其他的键值数据结构可以避免这个瓶颈?所有的地图访问都在下面给出的代码片段中完成:

func (nb *Classifier) calcProbs(data string) *BoundedPriorityQueue{
    probs := &BoundedPriorityQueue{} 
    heap.Init(probs)

    terms := strings.Split(data, " ")
    for class, prob := range nb.classProb{
        condProb := prob
        clsProbs := nb.model[class]
        for _, term := range terms{
            termProb := clsProbs[term]
            if termProb != 0{
                condProb += math.Log10(termProb)
            }else{
                condProb += -6 //math.Log10(0.000001)
            }
        }
       entry := &Item{
            value: class,
			priority: condProb,
		}
        heap.Push(probs,entry)
    }
    return probs
}

这里的地图是nb.classProb,类型为map[string]float64,而nb.model是一个嵌套的地图,类型为

map[string]map[string]float64
英文:

I am using Golang to implement naive bayesian classification for a dataset with over 30000 possible tags. I have built the model and I am in the classification phase. I am working on classifying 1000 records and this is taking up to 5 minutes. I have profiled the code with pprof functionality; the top10 are shown below:

Total: 28896 samples
   16408  56.8%  56.8%    24129  83.5% runtime.mapaccess1_faststr
    4977  17.2%  74.0%     4977  17.2% runtime.aeshashbody
    2552   8.8%  82.8%     2552   8.8% runtime.memeqbody
    1468   5.1%  87.9%    28112  97.3% main.(*Classifier).calcProbs
     861   3.0%  90.9%      861   3.0% math.Log
     435   1.5%  92.4%      435   1.5% runtime.markspan
     267   0.9%  93.3%      302   1.0% MHeap_AllocLocked
     187   0.6%  94.0%      187   0.6% runtime.aeshashstr
     183   0.6%  94.6%     1137   3.9% runtime.mallocgc
     127   0.4%  95.0%      988   3.4% math.log10

Surprisingly the map access seems to be the bottleneck. Has anyone experienced this. What other key, value datastructure can be used to avoid this bottleneck? All the map access is done in the following piece of code given below:

func (nb *Classifier) calcProbs(data string) *BoundedPriorityQueue{
    probs := &BoundedPriorityQueue{} 
    heap.Init(probs)

    terms := strings.Split(data, " ")
    for class, prob := range nb.classProb{
        condProb := prob
        clsProbs := nb.model[class]
        for _, term := range terms{
            termProb := clsProbs[term]
            if termProb != 0{
                condProb += math.Log10(termProb)
            }else{
                condProb += -6 //math.Log10(0.000001)
            }
        }
       entry := &Item{
            value: class,
			priority: condProb,
		}
        heap.Push(probs,entry)
    }
    return probs
}

The maps are nb.classProb which is map[string]float64 while the nb.model is a nested map of type

map[string]map[string]float64

答案1

得分: 3

除了@tomwilde所说的之外,另一种可能加快算法速度的方法是字符串内部化。也就是说,如果您事先知道键的域,可以完全避免使用映射。我写了一个小包,可以为您执行字符串内部化。

英文:

In addition to what @tomwilde said, another approach that may speed up your algorithm is string interning. Namely, you can avoid using a map entirely if you know the domain of keys ahead of time. I wrote a small package that will do string interning for you.

答案2

得分: 2

是的,这段代码中的地图访问将成为瓶颈:它是两个嵌套循环中最重要的操作。

从你提供的代码中无法确定,但我猜你可能有一个有限数量的类别。你可以给它们编号,并像这样存储逐项类别概率:

map[string][NumClasses]float64

(即:对于每个项,存储一个类别概率的数组[或者可能已经预先计算了它们的对数],NumClasses是你拥有的不同类别的数量)。

然后,先迭代项,再迭代类别。昂贵的地图查找将在外部循环中完成,而内部循环将迭代一个数组。

这将减少地图查找的次数,减少了 NumClasses 倍。如果你的数据非常稀疏,可能需要更多的内存。

下一个优化是使用多个 goroutine 进行计算,假设你有多个 CPU 核心可用。

英文:

Yes, the map access will be the bottleneck in this code: it's the most significant operation inside the two nested loops.

It's not possible to tell for sure from the code that you've included, but I expect you've got a limited number of classes. What you might do, is number them, and store the term-wise class probabilities like this:

map[string][NumClasses]float64

(ie: for each term, store an array of class-wise probabilities [or perhaps their logs already precomputed], and NumClasses is the number of different classes you have).

Then, iterate over terms first, and classes inside. The expensive map lookup will be done in the outer loop, and the inner loop will be iteration over an array.

This'll reduce the number of map lookups by a factor of NumClasses. This may need more memory if your data is extremely sparse.

The next optimisation is to use multiple goroutines to do the calculations, assuming you've more than one CPU core available.

huangapple
  • 本文由 发表于 2014年1月11日 04:47:00
  • 转载请务必保留本文链接:https://go.coder-hub.com/21054116.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定