Go: Excessive memory usage, memory leak
Question
I am very, very memory careful, as I have to write programs that cope with massive datasets. Currently my application quickly reaches 32GB of memory, starts swapping, and then gets killed by the system.

I do not understand how this can be, since all variables are collectable (created in functions and quickly released) except TokensStruct and TokensCount in the Trainer struct. TokensCount is just a uint. TokensStruct is a 1,000,000-row slice of [5]uint32 and string, so that means 20 bytes + string per record, which we could call a maximum of 50 bytes per record. 50 * 1,000,000 = 50MB of memory required. So this script should not use much more than 50MB + overhead + temporary collectable variables in the functions (maybe another 50MB max). The maximum potential size of TokensStruct is 5,000,000, as that is the size of dictionary, but even then it would be only 250MB of memory. dictionary is a map and apparently uses around 600MB of memory, as that is what the app starts at, but this is not an issue because dictionary is only loaded once and never written to again.

Instead it uses 32GB of memory and then dies. By the speed at which it does this, I expect it would happily get to 1TB of memory if it could. The memory appears to increase linearly with the size of the files being loaded, meaning that it apparently never clears any memory at all. Everything that enters the app is allocated more memory, and memory is never freed.

I tried calling runtime.GC() in case the garbage collector wasn't running often enough, but this made no difference.

Since the memory usage increases linearly, this would imply that there is a memory leak in getTokens() or LoadZip(). I don't know how this could be, since they are both functions that do only one task and then return. Or it could be that the tokens variable in Start() is the cause of the leak. Basically, it looks like every file that is loaded and parsed is never released from memory, as that is the only way the memory could fill up linearly and keep rising to 32GB++.

Absolute nightmare! What's wrong with Go? Any way to fix this?
package main

import (
	"bytes"
	"code.google.com/p/go.text/transform"
	"code.google.com/p/go.text/unicode/norm"
	"compress/zlib"
	"fmt"
	"github.com/AlasdairF/BinSearch"
	"io/ioutil"
	"os"
	"regexp"
	"runtime"
	"strings"
	"unicode"
	"unicode/utf8"
)
type TokensStruct struct {
	binsearch.Key_string
	Value [][5]uint32
}

type Trainer struct {
	Tokens      TokensStruct
	TokensCount uint
}

func checkErr(err error) {
	if err == nil {
		return
	}
	fmt.Println(`Some Error:`, err)
	panic(err)
}
// Local helper function for normalization of UTF8 strings.
func isMn(r rune) bool {
	return unicode.Is(unicode.Mn, r) // Mn: nonspacing marks
}

// This map is used by removeAccentsBytesDashes to transliterate characters that have no decomposed accent form.
var transliterations = map[rune]string{'Æ': "E", 'Ð': "D", 'Ł': "L", 'Ø': "OE", 'Þ': "Th", 'ß': "ss", 'æ': "e", 'ð': "d", 'ł': "l", 'ø': "oe", 'þ': "th", 'Œ': "OE", 'œ': "oe"}

// removeAccentsBytesDashes converts accented UTF8 characters into their non-accented equivalents, from a []byte.
func removeAccentsBytesDashes(b []byte) ([]byte, error) {
	mnBuf := make([]byte, len(b))
	t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)
	n, _, err := t.Transform(mnBuf, b, true)
	if err != nil {
		return nil, err
	}
	mnBuf = mnBuf[:n]
	tlBuf := bytes.NewBuffer(make([]byte, 0, len(mnBuf)*2))
	for i, w := 0, 0; i < len(mnBuf); i += w {
		r, width := utf8.DecodeRune(mnBuf[i:])
		if r == '-' {
			tlBuf.WriteByte(' ')
		} else {
			if d, ok := transliterations[r]; ok {
				tlBuf.WriteString(d)
			} else {
				tlBuf.WriteRune(r)
			}
		}
		w = width
	}
	return tlBuf.Bytes(), nil
}
func LoadZip(filename string) ([]byte, error) {
	// Open file for reading
	fi, err := os.Open(filename)
	if err != nil {
		return nil, err
	}
	defer fi.Close()
	// Attach zlib reader
	fz, err := zlib.NewReader(fi)
	if err != nil {
		return nil, err
	}
	defer fz.Close()
	// Pull
	data, err := ioutil.ReadAll(fz)
	if err != nil {
		return nil, err
	}
	return norm.NFC.Bytes(data), nil // return normalized
}

func getTokens(pibn string) []string {
	var data []byte
	var err error
	data, err = LoadZip(`/storedir/` + pibn + `/text.zip`)
	checkErr(err)
	data, err = removeAccentsBytesDashes(data)
	checkErr(err)
	data = bytes.ToLower(data)
	data = reg2.ReplaceAll(data, []byte("$2")) // remove contractions
	data = reg.ReplaceAllLiteral(data, nil)
	tokens := strings.Fields(string(data))
	return tokens
}
func (t *Trainer) Start() {
	data, err := ioutil.ReadFile(`list.txt`)
	checkErr(err)
	pibns := bytes.Fields(data)
	for i, pibn := range pibns {
		tokens := getTokens(string(pibn))
		t.addTokens(tokens)
		if i%100 == 0 {
			runtime.GC() // I added this just to try to stop the memory craziness, but it makes no difference
		}
	}
}

func (t *Trainer) addTokens(tokens []string) {
	for _, tok := range tokens {
		if _, ok := dictionary[tok]; ok {
			if indx, ok2 := t.Tokens.Find(tok); ok2 {
				ar := t.Tokens.Value[indx]
				ar[0]++
				t.Tokens.Value[indx] = ar
				t.TokensCount++
			} else {
				t.Tokens.AddKeyAt(tok, indx)
				t.Tokens.Value = append(t.Tokens.Value, [5]uint32{0, 0, 0, 0, 0})
				copy(t.Tokens.Value[indx+1:], t.Tokens.Value[indx:])
				t.Tokens.Value[indx] = [5]uint32{1, 0, 0, 0, 0}
				t.TokensCount++
			}
		}
	}
	return
}
func LoadDictionary() {
	dictionary = make(map[string]bool)
	data, err := ioutil.ReadFile(`dictionary`)
	checkErr(err)
	words := bytes.Fields(data)
	for _, word := range words {
		strword := string(word)
		dictionary[strword] = false
	}
}

var reg = regexp.MustCompile(`[^a-z0-9\s]`)
var reg2 = regexp.MustCompile(`\b(c|l|all|dall|dell|nell|sull|coll|pell|gl|agl|dagl|degl|negl|sugl|un|m|t|s|v|d|qu|n|j)'([a-z])`) // contractions
var dictionary map[string]bool

func main() {
	trainer := new(Trainer)
	LoadDictionary()
	trainer.Start()
}
Answer 1
Score: 2
If you're tokenizing from a large string, make sure you avoid memory pinning. From the comments above, it sounds like the tokens are substrings of a large string.
You may need to add a little extra in your getTokens() function so it guarantees the tokens aren't pinning memory.
func getTokens(...) {
	// near the end of the function: copy each token so it no longer shares
	// the backing array of the whole file's text
	for i, t := range tokens {
		tokens[i] = string([]byte(t))
	}
}
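As a rough illustration of the pinning (a hypothetical fragment, not taken from the program above): strings.Fields returns substrings that share the backing array of the input string, so holding on to even one short token keeps the whole text alive until that token is copied.

big := strings.Repeat("word ", 400000) // roughly 2MB of text
tok := strings.Fields(big)[0]          // "word" still references big's 2MB backing array
tok = string([]byte(tok))              // explicit copy: tok no longer pins big's backing array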
By the way, reading the whole file into memory using ioutil.ReadFile all at once looks dubious. Are you sure you can't use bufio.Scanner?
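For the question's Start(), that could look roughly like the following sketch (untested, reusing checkErr, getTokens and addTokens from the question's code):

// Stream identifiers out of list.txt instead of reading the whole file at once.
f, err := os.Open(`list.txt`)
checkErr(err)
defer f.Close()
scanner := bufio.NewScanner(f)
scanner.Split(bufio.ScanWords) // same whitespace splitting as bytes.Fields
for scanner.Scan() {
	t.addTokens(getTokens(scanner.Text()))
}
checkErr(scanner.Err())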
I'm looking at the code more closely... if you are truly concerned about memory, take advantage of io.Reader. You should try to avoid sucking in the content of a whole file at once. Use io.Reader and the transform "along the grain". The way you're using it now is against the grain of its intent. The whole point of the transform package you're using is to construct flexible Readers that can stream through data.
For example, here's a simplification of what you're doing:
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"unicode/utf8"

	"code.google.com/p/go.text/transform"
)

type AccentsTransformer map[rune]string

func (a AccentsTransformer) Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error) {
	for nSrc < len(src) {
		// If we're at the edge, note this and return.
		if !atEOF && !utf8.FullRune(src[nSrc:]) {
			err = transform.ErrShortSrc
			return
		}
		r, width := utf8.DecodeRune(src[nSrc:])
		if r == utf8.RuneError && width == 1 {
			err = fmt.Errorf("Decoding error")
			return
		}
		if d, ok := a[r]; ok {
			if nDst+len(d) > len(dst) {
				err = transform.ErrShortDst
				return
			}
			copy(dst[nDst:], d)
			nSrc += width
			nDst += len(d)
			continue
		}
		if nDst+width > len(dst) {
			err = transform.ErrShortDst
			return
		}
		copy(dst[nDst:], src[nSrc:nSrc+width])
		nDst += width
		nSrc += width
	}
	return
}

func main() {
	transliterations := AccentsTransformer{'Æ': "E", 'Ø': "OE"}
	testString := "cØØl beÆns"
	b := transform.NewReader(bytes.NewBufferString(testString), transliterations)
	scanner := bufio.NewScanner(b)
	scanner.Split(bufio.ScanWords)
	for scanner.Scan() {
		fmt.Println("token:", scanner.Text())
	}
}
It becomes really easy then to chain transformers together. So, for example, if we wanted to remove all hyphens from the input stream, it's just a matter of using transform.Chain appropriately:
func main() {
	transliterations := AccentsTransformer{'Æ': "E", 'Ø': "OE"}
	removeHyphens := transform.RemoveFunc(func(r rune) bool {
		return r == '-'
	})
	allTransforms := transform.Chain(transliterations, removeHyphens)
	testString := "cØØl beÆns - the next generation"
	b := transform.NewReader(bytes.NewBufferString(testString), allTransforms)
	scanner := bufio.NewScanner(b)
	scanner.Split(bufio.ScanWords)
	for scanner.Scan() {
		fmt.Println("token:", scanner.Text())
	}
}
I have not exhaustively tested the code above, so please don't just copy-and-paste it without sufficient tests. I just cooked it up fast. But this kind of approach --- avoiding whole-file reading --- will scale better because it will read the file in chunks.
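Applied to the question's own pipeline, the same streaming idea means wrapping the zlib reader in transform.NewReader and scanning words straight off it, so the decompressed text never sits in memory in one piece. A rough, untested sketch; getTokensStreaming is only an illustrative name, and the lower-casing and regexp steps from getTokens would still have to be applied per token or expressed as further transformers:

// Decompress -> transform -> split into words, all as one stream.
func getTokensStreaming(pibn string, chain transform.Transformer) ([]string, error) {
	fi, err := os.Open(`/storedir/` + pibn + `/text.zip`)
	if err != nil {
		return nil, err
	}
	defer fi.Close()
	fz, err := zlib.NewReader(fi)
	if err != nil {
		return nil, err
	}
	defer fz.Close()
	scanner := bufio.NewScanner(transform.NewReader(fz, chain))
	scanner.Split(bufio.ScanWords)
	var tokens []string
	for scanner.Scan() {
		tokens = append(tokens, scanner.Text()) // Text() copies, so nothing pins the read buffer
	}
	return tokens, scanner.Err()
}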
Answer 2
Score: 0
1. How large are "list.txt" and "dictionary"? If they are very large, it is no wonder the memory usage is large. After pibns := bytes.Fields(data), how big is len(pibns)?

2. Turn on GC debugging (run GODEBUG="gctrace=1" ./yourprogram) to see whether any garbage collection is happening at all.

3. Do some profiling, something like this:
// needs "fmt", "log", "os", "runtime", "runtime/pprof" and "time" imported
func lookupMem() {
	timestamp := time.Now().Unix()
	if f, err := os.Create(fmt.Sprintf("mem_prof.%d", timestamp)); err != nil {
		log.Printf("record memory profile failed: %v", err)
	} else {
		runtime.GC()
		pprof.WriteHeapProfile(f)
		f.Close()
	}
	if f, err := os.Create(fmt.Sprintf("heap_prof.%d", timestamp)); err != nil {
		log.Printf("heap profile failed: %v", err)
	} else {
		p := pprof.Lookup("heap")
		p.WriteTo(f, 2)
		f.Close()
	}
}

func (t *Trainer) Start() {
	.......
	if i%1000 == 0 {
		// if len(pibns) is not very large, record some meminfo
		lookupMem()
	}
	.......
}
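If the gctrace output shows collections running but the heap never shrinking, the memory is still reachable (for example, pinned substrings) rather than garbage the collector is failing to collect. The written profiles can then be inspected with go tool pprof (for example, go tool pprof ./yourprogram mem_prof.<timestamp>) to see which call sites account for the live allocations.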