在数组中计算不同值的数量 – 性能提示

huangapple go评论148阅读模式
英文:

Count distinct values in array - performance tips

问题

我正在遇到一些优化Go语言中的映射问题。
我想要在一个字符串数组中生成一个频率表(计算不同出现次数)。我的代码对于小数组来说效果不错,但是当我开始处理包含很多不同值的10万个以上的结构时,性能就不够好了。

目前,我的方法是生成一个包含不同值的数组,比较值并增加计数器变量(映射到字符串)。

counter := make(map[string]int)
for _, distinct := range distinctStrArray {
	for _, row := range StrArray {
		if row == distinct {
			counter[distinct]++
		}
	}
}

我尝试了另一种方法,先对输入数组进行排序(以最小化对映射的更改次数)。这样会稍微快一些。

count := 0
for _, distinct := range distinctStrArray {
	for _, row := range StrArray {
		if row == distinct {
			count++
		}
	}
	counter[distinct] += count
	count = 0
}

你有什么建议可以优化这个简单的*count(distinct)*类型的问题吗?我对任何建议都持开放态度。
谢谢!

英文:

I'm having some issues optimizing a go map.
I want to generate a frequency table (count distinct occurrences) in an array of strings. My code holds nicely for small arrays, but as I start working with 100k+ structures -with many distinct values- it just isn't performant enough.

Right now, my approach is to generate an array with the distinct values, compare values and increasing the counter variable (mapped to the string).

  	counter := make( map[string]int )	 
	for _, distinct := range distinctStrArray{
		for _, row := range StrArray{
			if (row == distinct){
				counter[distinct]++
			}  
		} 
	}

I've tried another approach, where with the input array previously sorted (to minimize the number of changes to the map). This is a bit faster.

	count:=0
	for _, distinct := range distinctStrArray{
		for _, row := range StrArray{
			if (row == distinct){
				count++
			}  
		} 
	counter[distinct] += count
	count= 0
	} 

Do you have any suggestion of what I can do to optimize a simple count(distinct) type problem...? I'm open to anything.
thanks!

答案1

得分: 9

没有更多的上下文,我会放弃独立的不同值数组 - 生成它需要时间,并且使用它需要嵌套循环。假设第二个数组没有其他用途,我会使用类似以下的代码:

counter := make(map[string]int)
for _, row := range StrArray {
    counter[row]++
}

如果你需要不带计数的不同字符串列表以供其他目的,你可以在之后轻松获取它:

distinctStrings := make([]string, len(counter))
i := 0
for k := range counter {
    distinctStrings[i] = k
    i++
}

遍历不同字符串数组的时间复杂度为O(n),而通过键访问映射的时间复杂度为O(log(n))。这将将整体时间复杂度从O(n^2)降低到O(n*log(n)),对于较大的数据集来说应该是一个显著的改进。但是,像任何优化一样:测试、测量、分析、优化。

英文:

Without more context, I would dump the separate array of distinct values - generating it takes time, and using it necessitates the nested loop. Assuming there's no other purpose to the second array, I'd use something like:

counter := make( map[string]int )    
for _, row := range StrArray {
    counter[row]++
} 

If you need the list of distinct strings without the counts for some separate purpose, you can easily get it afterward:

distinctStrings := make([]string, len(counter))
i := 0
for k := range counter {
    distinctStrings[i] = k
    i++
}

Iterating the array of distinct strings is O(n), while map access by key is O(log(n)). That takes your overall from O(n^2) to O(n*log(n)), which should be a significant improvement with larger datasets. But, as with any optimization: test, measure, analyze, optimize.

huangapple
  • 本文由 发表于 2017年6月8日 00:14:20
  • 转载请务必保留本文链接:https://go.coder-hub.com/44417913.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定