Go: extensive memory usage when reusing map keys
Question
As part of my Go tutorial, I'm writing a simple program that counts words across multiple files. I have a few goroutines for processing files and creating a map[string]int telling how many occurrences of each word have been found. Each map is then sent to a reducing routine, which aggregates the values into a single map. Sounds pretty straightforward and looks like a perfect (map-reduce) task for Go!
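For context, a minimal, self-contained sketch of the pipeline shape just described might look as follows (the mapWords helper, the sample documents, and the wiring in main are illustrative assumptions, not the question's actual code; the real reducer is shown further below):

package main

import (
    "fmt"
    "strings"
    "sync"
)

// mapWords counts the words in a single document.
func mapWords(doc string) map[string]int {
    counts := make(map[string]int)
    for _, w := range strings.Fields(doc) {
        counts[w]++
    }
    return counts
}

// reduce merges per-document counts into one map and reports the number
// of unique words (the same role reduceWords plays further below).
func reduce(input chan map[string]int, output chan int) {
    total := make(map[string]int)
    for wordMap := range input {
        for w, c := range wordMap {
            total[w] += c
        }
    }
    output <- len(total)
}

func main() {
    docs := []string{"a b a", "b c", "c c d"}

    input := make(chan map[string]int)
    output := make(chan int)
    go reduce(input, output)

    var wg sync.WaitGroup
    for _, doc := range docs {
        wg.Add(1)
        go func(d string) {
            defer wg.Done()
            input <- mapWords(d)
        }(doc)
    }

    // Close the input channel once every mapper has sent its map,
    // so the reducer's range loop can terminate.
    go func() {
        wg.Wait()
        close(input)
    }()

    fmt.Println(<-output) // 4 unique words: a, b, c, d
}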
I have around 10k documents with 1.6 million unique words. What I found is that memory usage grows quickly and constantly while the code runs, and I run out of memory about halfway through processing (12GB box, 7GB free). So yes, it uses gigabytes of memory for this small data set!
Trying to figure out where the problem lies, I found that the reducer collecting and aggregating the data is to blame. Here is the code:
func reduceWords(input chan map[string]int, output chan int) {
    total := make(map[string]int)
    for wordMap := range input {
        for w, c := range wordMap {
            total[w] += c
        }
    }
    output <- len(total)
}
If I remove the map from the sample above, memory stays within reasonable limits (a few hundred megabytes). What I found, though, is that taking a copy of each string also solves the problem, i.e. the following sample doesn't eat up my memory:
func reduceWords(input chan map[string]int, output chan int) {
    total := make(map[string]int)
    for wordMap := range input {
        for w, c := range wordMap {
            copyW := make([]byte, len(w)) // <-- make a copy here!
            copy(copyW, w)
            total[string(copyW)] += c
        }
    }
    output <- len(total)
}
Is it possible that the wordMap instances are not being freed after each iteration when I use the values directly? (As a C++ programmer, I have limited intuition when it comes to GC.) Is this the desired behaviour? Am I doing something wrong? Should I be disappointed with Go, or rather with myself?
Thanks!
Answer 1

Score: 2
What does your code look like that turns files into strings? I would look for a problem there. If you are converting large blocks (whole files maybe?) to strings, and then slicing those into words, then you are pinning the entire block if you save any one word. Try keeping the blocks as []byte, slicing those into words, and then converting words to the string type individually.
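To make that concrete, a small sketch of the suggested approach could look like this (the countWords helper and the sample input are assumptions for illustration, not code from the question): the document stays a []byte, and each word is converted to string individually, so every map key gets its own small allocation instead of being a substring that shares, and therefore pins, the whole file's backing array.

package main

import (
    "bytes"
    "fmt"
)

// countWords keeps the document as []byte and only converts the
// individual words to string when they become map keys.
func countWords(data []byte) map[string]int {
    counts := make(map[string]int)
    // bytes.Fields splits on whitespace; the word slices still point
    // into data at this stage.
    for _, w := range bytes.Fields(data) {
        // string(w) copies just the word's bytes, so the map key no
        // longer references the file buffer, and data can be collected
        // once it is no longer needed.
        counts[string(w)]++
    }
    return counts
}

func main() {
    doc := []byte("the quick brown fox jumps over the lazy dog the")
    fmt.Println(countWords(doc)) // map[brown:1 dog:1 fox:1 jumps:1 lazy:1 over:1 quick:1 the:3]
}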