英文:
Count distinct values in array - performance tips
问题
我正在遇到一些优化Go语言中的映射问题。
我想要在一个字符串数组中生成一个频率表(计算不同出现次数)。我的代码对于小数组来说效果不错,但是当我开始处理包含很多不同值的10万个以上的结构时,性能就不够好了。
目前,我的方法是生成一个包含不同值的数组,比较值并增加计数器变量(映射到字符串)。
counter := make(map[string]int)
for _, distinct := range distinctStrArray {
for _, row := range StrArray {
if row == distinct {
counter[distinct]++
}
}
}
我尝试了另一种方法,先对输入数组进行排序(以最小化对映射的更改次数)。这样会稍微快一些。
count := 0
for _, distinct := range distinctStrArray {
for _, row := range StrArray {
if row == distinct {
count++
}
}
counter[distinct] += count
count = 0
}
你有什么建议可以优化这个简单的*count(distinct)*类型的问题吗?我对任何建议都持开放态度。
谢谢!
英文:
I'm having some issues optimizing a go map.
I want to generate a frequency table (count distinct occurrences) in an array of strings. My code holds nicely for small arrays, but as I start working with 100k+ structures -with many distinct values- it just isn't performant enough.
Right now, my approach is to generate an array with the distinct values, compare values and increasing the counter variable (mapped to the string).
counter := make( map[string]int )
for _, distinct := range distinctStrArray{
for _, row := range StrArray{
if (row == distinct){
counter[distinct]++
}
}
}
I've tried another approach, where with the input array previously sorted (to minimize the number of changes to the map). This is a bit faster.
count:=0
for _, distinct := range distinctStrArray{
for _, row := range StrArray{
if (row == distinct){
count++
}
}
counter[distinct] += count
count= 0
}
Do you have any suggestion of what I can do to optimize a simple count(distinct) type problem...? I'm open to anything.
thanks!
答案1
得分: 9
没有更多的上下文,我会放弃独立的不同值数组 - 生成它需要时间,并且使用它需要嵌套循环。假设第二个数组没有其他用途,我会使用类似以下的代码:
counter := make(map[string]int)
for _, row := range StrArray {
counter[row]++
}
如果你需要不带计数的不同字符串列表以供其他目的,你可以在之后轻松获取它:
distinctStrings := make([]string, len(counter))
i := 0
for k := range counter {
distinctStrings[i] = k
i++
}
遍历不同字符串数组的时间复杂度为O(n),而通过键访问映射的时间复杂度为O(log(n))。这将将整体时间复杂度从O(n^2)降低到O(n*log(n)),对于较大的数据集来说应该是一个显著的改进。但是,像任何优化一样:测试、测量、分析、优化。
英文:
Without more context, I would dump the separate array of distinct values - generating it takes time, and using it necessitates the nested loop. Assuming there's no other purpose to the second array, I'd use something like:
counter := make( map[string]int )
for _, row := range StrArray {
counter[row]++
}
If you need the list of distinct strings without the counts for some separate purpose, you can easily get it afterward:
distinctStrings := make([]string, len(counter))
i := 0
for k := range counter {
distinctStrings[i] = k
i++
}
Iterating the array of distinct strings is O(n), while map access by key is O(log(n)). That takes your overall from O(n^2) to O(n*log(n)), which should be a significant improvement with larger datasets. But, as with any optimization: test, measure, analyze, optimize.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论