Releasing memory from large objects
Question
I came across something that I don't understand. Hope you guys can help!
Resources:
- https://medium.com/@chaewonkong/solving-memory-leak-issues-in-go-http-clients-ba0b04574a83
- https://www.golinuxcloud.com/golang-garbage-collector/
I read in several articles the suggestion that we can make the job of the GC easier by setting large slices and maps (I guess this applies to all reference types) to `nil` after we no longer need them. Here is one of the examples I read:
```go
func ProcessResponse(resp *http.Response) error {
    data, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        return err
    }
    // Process data here
    data = nil // Release memory
    return nil
}
```
It is my understanding that when the function `ProcessResponse` finishes, the `data` variable will go out of scope and basically no longer exist. The GC will then verify there is no reference to the `[]byte` slice (the one that `data` pointed to) and will clear the memory.

How does setting `data` to `nil` improve garbage collection?
Thanks!
Answer 1
Score: 3
As others have pointed out already: setting `data = nil` right before returning doesn't change anything in terms of GC. The Go compiler will apply optimisations, and Go's garbage collector works in distinct phases. In the simplest of terms (with many omissions and over-simplifications): setting `data = nil` and removing all references to the underlying slice is not going to trigger an atomic-style release of the memory that is no longer referenced. Once the slice is no longer referenced, it'll be marked as such, and the associated memory won't be released until the next sweep.

Garbage collection is a hard problem, in no small part due to the fact that it's not the sort of problem that has an optimal solution that will produce the best results for all use-cases. Over the years, the Go runtime has evolved quite a lot, with significant work being done precisely on the runtime garbage collector. The result is that there are very few situations where a simple `someVar = nil` will make even a small difference, let alone a noticeable one.
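If you want to check this for yourself, a quick benchmark along these lines should do it. This is my own minimal sketch (the sizes and function names are made up for illustration, not taken from the articles above); save it as a `_test.go` file and run it with `go test -bench . -benchmem`, and you should see no meaningful difference between the two variants:

```go
package main

import "testing"

// Both variants allocate a large slice, do some work with it, and return.
// The only difference is the data = nil assignment before returning.

func processWithNil() int {
    data := make([]byte, 1<<20) // stand-in for ioutil.ReadAll(resp.Body)
    sum := 0
    for _, b := range data {
        sum += int(b)
    }
    data = nil // "release memory"
    _ = data   // silence linters about the dead assignment
    return sum
}

func processWithoutNil() int {
    data := make([]byte, 1<<20)
    sum := 0
    for _, b := range data {
        sum += int(b)
    }
    return sum
}

func BenchmarkWithNil(b *testing.B) {
    for i := 0; i < b.N; i++ {
        processWithNil()
    }
}

func BenchmarkWithoutNil(b *testing.B) {
    for i := 0; i < b.N; i++ {
        processWithoutNil()
    }
}
```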
If you are looking for some simple rule-of-thumb type tips that can impact the runtime overhead associated with garbage collection (or runtime memory management in general), I do know of one that seems to be vaguely covered by this sentence in your question:
> suggestion that we can make the job of the GC easier by setting large slices and maps
This is something that can produce noticeable results when profiling code. Say you're reading a large chunk of data that you need to process, or you're having to perform some other type of batch operation and return a slice; it's not uncommon to see people write things like this:
```go
func processStuff(input []someTypes) []resultTypes {
    data := []resultTypes{}
    for _, in := range input {
        data = append(data, processT(in))
    }
    return data
}
```
This can be optimised quite easily by changing the code to this:
```go
func processStuff(input []someTypes) []resultTypes {
    data := make([]resultTypes, 0, len(input)) // set cap
    for _, in := range input {
        data = append(data, processT(in))
    }
    return data
}
```
What happens in the first implementation is that you create a slice with `len` and `cap` of 0. The first time `append` is called, you're exceeding the current capacity of the slice, which will cause the runtime to allocate memory. As explained here, the new capacity is calculated rather simplistically: the memory is allocated and the data is copied over:

```go
t := make([]byte, len(s), (cap(s)+1)*2)
copy(t, s)
```
Essentially, each time you call `append` when the slice you're appending to is full (i.e. `len == cap`), you'll allocate a new slice that can hold `(len + 1) * 2` elements. Knowing that, and that in the first example `data` starts out with `len` and `cap` of 0, let's see what that means:
- 1st iteration: append creates a slice with cap (0+1)*2; data now has len 1, cap 2
- 2nd iteration: append adds to data, which now has len 2, cap 2
- 3rd iteration: append allocates a new slice with cap (2+1)*2, copies the 2 elements from data to it and adds the third; data is now reassigned to a slice with len 3, cap 6
- 4th-6th iterations: data grows to len 6, cap 6
- 7th iteration: same as the 3rd iteration, although cap is now (6+1)*2; everything is copied over and data is reassigned a slice with len 7, cap 14
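To watch this happen yourself, here's a tiny self-contained sketch (my own illustration, not part of the original answer) that prints a line every time `append` has to allocate a new backing array. Note the exact growth factors depend on which Go version you run it with, since the runtime's growth policy has changed over the years:

```go
package main

import "fmt"

func main() {
    var data []int
    prevCap := cap(data)
    for i := 0; i < 15; i++ {
        data = append(data, i)
        if cap(data) != prevCap { // a new backing array was allocated
            fmt.Printf("append #%d: len=%d cap=%d (reallocated)\n", i+1, len(data), cap(data))
            prevCap = cap(data)
        }
    }
}
```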
If the data structures in your slice are on the larger side (i.e. many nested structures, lots of indirection, etc...) then this frequent re-allocating and copying can become quite expensive. If your code contains lots of these kinds of loops, it will begin to show up in pprof (you'll start seeing a lot of time being spent in `runtime.mallocgc`). Moreover, if you're processing 15 input values, your data slice will end up looking like this:
```
dataSlice {
    len: 15
    cap: 30
    data underlying_array[30]
}
```
Meaning you'll have allocated memory for 30 values when you only needed 15, and you'll have allocated that memory in 4 increasingly large chunks, copying the data on each reallocation.
By contrast, the second implementation will allocate a data slice that looks like this before the loop:
```
data {
    len: 0
    cap: 15
    data underlying_array[15]
}
```
It's allocated in one go, so no re-allocations or copying are needed, and the slice that is returned will take up half the space in memory. In that sense, we allocate a larger slab of memory up front to cut down on the number of incremental allocation and copy calls required later on, which will, overall, cut down on runtime costs.
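As a rough way to quantify that difference, here is a minimal benchmark sketch of my own (the `resultType` and `processT` stand-ins are made up for illustration) comparing the two `processStuff` shapes above; run with `go test -bench . -benchmem`, the pre-sized version should show a single allocation per call where the other shows many:

```go
package main

import "testing"

type resultType struct{ v int }

// processT is a trivial stand-in for whatever per-element work you do.
func processT(in int) resultType { return resultType{v: in * 2} }

func withoutCap(input []int) []resultType {
    data := []resultType{} // len 0, cap 0: grows (and copies) repeatedly
    for _, in := range input {
        data = append(data, processT(in))
    }
    return data
}

func withCap(input []int) []resultType {
    data := make([]resultType, 0, len(input)) // single up-front allocation
    for _, in := range input {
        data = append(data, processT(in))
    }
    return data
}

var input = make([]int, 10000)

func BenchmarkWithoutCap(b *testing.B) {
    for i := 0; i < b.N; i++ {
        withoutCap(input)
    }
}

func BenchmarkWithCap(b *testing.B) {
    for i := 0; i < b.N; i++ {
        withCap(input)
    }
}
```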
What if I don't know how much memory I need?
That's a fair question. This example is not always going to apply. In this case we knew how many elements we'd need, and we could allocate memory accordingly. Sometimes, that's just not how the world works. If you don't know how much data you'll end up needing, then you can:
- Make an educated guess: GC is difficult, and unlike you, the compiler and Go runtime lack the fuzzy logic people have to come up with a realistic, reasonable guesstimate. Sometimes it'll be as simple as: "Well, I'm getting data from that data source, where we only ever store the last N elements, so worst case scenario, I'll be handling N elements", sometimes it's a bit more fuzzy, for example: you're processing a CSV containing a SKU, product name, and stock count. You know the length of the SKU, you can assume stock count will be an integer between 1 and 5 digits long, and a product name will on average be 2-3 words long. English words have an average length of 6 characters, so you can have a rough idea of how many bytes make up a CSV line: say SKU == 10 characters, 80 bytes, product description 2.5 * 6 * 8 = 120 bytes, and ~4 bytes for the stock count + 2 commas and a line break, makes for an average expected line length of 207 bytes, let's call it 200 to err on the side of caution. Stat the input file, divide its size in bytes by 200 and you should have a serviceable, slightly conservative estimate of the number of lines (see the sketch after this list). Add some logging at the end of that code comparing the cap to the estimate, and you can tweak your prediction calculation accordingly.
- Profile your code. It happens from time to time that you'll find yourself working on a new feature, or an entirely new project, and you don't have historical data to fall back on for a guesstimate. In that case, you can simply guess, run some test scenarios, or spin up a test environment that feeds production data to your version of the code, and profile it. When you're in the situation where you're actively profiling memory usage/runtime costs for just one or two slices/maps, I must stress that this is optimisation. You should only be spending time on this if it is a bottleneck or noticeable issue (e.g. overall profiling is impeded by runtime memory allocation). In the vast, vast majority of cases, this level of optimisation would fall firmly under the umbrella of micro-optimisation. Adhere to the 80-20 principle.
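Here is a minimal sketch of what the educated-guess approach from the first bullet could look like in code. The 200-bytes-per-line figure is carried over from the back-of-the-envelope estimate above, and the file name is a made-up placeholder, not anything prescribed by the answer:

```go
package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
)

const estimatedBytesPerLine = 200 // conservative guess from the estimate above

func main() {
    f, err := os.Open("products.csv") // hypothetical input file
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    info, err := f.Stat()
    if err != nil {
        log.Fatal(err)
    }
    estimatedLines := int(info.Size()/estimatedBytesPerLine) + 1

    // Pre-size the slice using the estimate instead of starting from cap 0.
    lines := make([]string, 0, estimatedLines)
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        lines = append(lines, scanner.Text())
    }
    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }

    // Log how good the guess was so the estimate can be tuned over time.
    fmt.Printf("estimated %d lines, got %d (final cap %d)\n", estimatedLines, len(lines), cap(lines))
}
```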
Recap
No, setting a simple slice variable to nil won't make much of a difference in 99% of cases. When creating and appending to maps/slices, what is more likely to make a difference is to cut back on extraneous allocations by using `make()` and specifying a sensible `cap` value. Another thing that can make a difference is using pointer types/receivers, although that's an even more complex topic to delve into. For now, I'll just say that I've been working on a code base that has to operate on numbers far beyond the range of your typical `uint64`, and we unfortunately have to be able to use decimals in a way that is more precise than `float64` will allow. We've solved the `uint64` issue by using something like holiman/uint256, which uses pointer receivers, and tackled the decimal problem with shopspring/decimal, which uses value receivers and copies everything. After spending a lot of time optimising the code, we've reached the point where the performance impact of the constant copying of values when using decimals has become an issue. Look at how these packages implement simple operations like addition and try to work out which operation is more costly:
```go
// original
a, b := 1, 2
a += b

// uint256 version
a, b := uint256.NewInt(1), uint256.NewInt(2)
a.Add(a, b)

// decimal version
a, b := decimal.NewFromInt(1), decimal.NewFromInt(2)
a = a.Add(b)
```
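If you wanted to put numbers on that comparison, a micro-benchmark sketch along these lines could work. This is my own illustration; it assumes the `uint256.NewInt`/`Add` and `decimal.NewFromInt`/`Add` APIs of those packages, so double-check against the versions you actually use:

```go
package main

import (
    "testing"

    "github.com/holiman/uint256"
    "github.com/shopspring/decimal"
)

// Pointer receivers: Add writes the result into the receiver, so no value copies.
func BenchmarkUint256Add(b *testing.B) {
    x, y := uint256.NewInt(1), uint256.NewInt(2)
    for i := 0; i < b.N; i++ {
        x.Add(x, y)
    }
}

// Value receivers: every Add returns (and copies) a fresh Decimal value.
func BenchmarkDecimalAdd(b *testing.B) {
    x, y := decimal.NewFromInt(1), decimal.NewFromInt(2)
    for i := 0; i < b.N; i++ {
        x = x.Add(y)
    }
}
```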
These are just a couple of things that, in my recent work, I've spent time on optimising, but the single most important thing to take away from this is:
> Premature optimisation is the root of all evil
When you're working on more complex problems/code, then getting to a point where you're looking into allocation cycles for slices or maps as potential bottlenecks and optimisations takes a lot of effort. You can, and arguably should, take measures to avoid being too wasteful (e.g. setting a slice cap if you know what the eventual length of said slice will be), but you shouldn't waste too much time hand-crafting every line until the memory footprint of that code is as small as it possibly can be. The cost will be: code that is more fragile/harder to maintain and read, potentially deteriorated overall performance (seriously, you can trust the Go runtime to do a decent job), lots of blood, sweat, and tears, and a steep decrease in productivity.