Why does Go's MADV_FREE sometimes lead to OOM (out of memory)?

Question

We run our services on Kubernetes, built with Go 1.12. In production, one project kept running out of memory until its container was OOM-killed. After some research online we traced this to Go's use of MADV_FREE; once we switched to MADV_DONTNEED, the problem was solved.

What I read online says that MADV_FREE means the system releases the memory only when it is under memory pressure. But memory allocation happens all the time, and our other services run in the same environment. Why do they not hit OOM as well?

Answer 1

Score: 0

Well, I doubt this question is a good fit for SO, as it is unlikely to get a short, on-point answer. Still, let me have a go at it.

The first thing to consider is that the in-kernel OOM killer, which is engaged when the kernel detects a memory shortage, merely locates the process with the highest memory consumption¹ and brings it down.
(I hope you're talking about the in-kernel OOM killer and not some k8s-specific service or something you've developed in-house.)

Next, consider the release notes for Go 1.12, which switched to using MADV_FREE:

> On Linux, the runtime now uses MADV_FREE to release unused memory. This is more efficient but _may result in higher reported RSS_.
> The kernel will reclaim the unused data when it is needed.
> To revert to the Go 1.11 behavior (MADV_DONTNEED), set the environment variable GODEBUG=madvdontneed=1.

(Emphasis mine.)
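
When debugging a container it can help to log at startup whether this setting actually reached the process. The following is a minimal sketch, assuming GODEBUG is a comma-separated list of key=value pairs; the helper name madvDontNeedSet is made up for illustration and is not part of any Go API.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// madvDontNeedSet reports whether a GODEBUG-style string (comma-separated
// key=value pairs) contains the entry madvdontneed=1.
// Hypothetical helper, written only for this illustration.
func madvDontNeedSet(godebug string) bool {
	for _, kv := range strings.Split(godebug, ",") {
		if strings.TrimSpace(kv) == "madvdontneed=1" {
			return true
		}
	}
	return false
}

func main() {
	// Assumption: the runtime picks this setting up from the environment
	// when the process starts, so it has to be set outside the program
	// (for example in the container spec), not from within it.
	fmt.Println("madvdontneed=1 set:", madvDontNeedSet(os.Getenv("GODEBUG")))
}
```

This only inspects the environment variable; it does not tell you which madvise flag the runtime actually used.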

What this means is the following. Suppose a program compiled with Go 1.12 runs for some period of time under some standard load, and then the same program runs for the same period under the same load but with GODEBUG=madvdontneed=1 set. The _apparent_ RSS consumption, as seen from the outside, will be higher in the first case than in the second.
To reiterate, the number of memory pages the Go memory manager actually marks with madvise(2) will be roughly the same in both runs, but because pages freed in these two ways are handled with different semantics, the RSS readings will differ. It is not the actual memory usage that differs, only the RSS readings.

This obviously makes a process that returns memory to the OS using MADV_FREE more likely to be picked by the OOM killer.
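
One way to see how much of the reported RSS is memory that was freed lazily but not yet reclaimed is to compare the Rss and LazyFree fields in /proc/<pid>/smaps. The sketch below assumes a Linux kernel recent enough to expose a LazyFree field there (4.12 or later, as far as I know) and that LazyFree pages are still counted in Rss; the helper name sumField is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// sumField adds up every "<name>: <n> kB" line in smaps-formatted text
// and returns the total in kB. Lines that do not match are skipped.
func sumField(smaps, name string) int64 {
	var total int64
	for _, line := range strings.Split(smaps, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 2 || fields[0] != name+":" {
			continue
		}
		if n, err := strconv.ParseInt(fields[1], 10, 64); err == nil {
			total += n
		}
	}
	return total
}

func main() {
	// /proc/self/smaps exists only on Linux; elsewhere this reports the error.
	data, err := os.ReadFile("/proc/self/smaps")
	if err != nil {
		fmt.Println("cannot read smaps:", err)
		return
	}
	text := string(data)
	rss, lazy := sumField(text, "Rss"), sumField(text, "LazyFree")
	fmt.Printf("Rss: %d kB, LazyFree: %d kB, Rss minus LazyFree: %d kB\n", rss, lazy, rss-lazy)
}
```

If LazyFree is large, much of the reported RSS is memory the kernel is free to take back under pressure.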

Having said that, I'd advise you to look at your problem from a different angle.
Measuring the memory consumption of a program written in Go and built by a "stock" Go implementation is not exactly useless, but it is only useful for catching fairly _obvious_ things, such as steady memory growth over multiple GC cycles, which may indicate a memory leak.
To assess the real memory usage pattern, you have to use the metrics provided by the Go runtime of the running program.
I have tried to detail the reasons for this here: https://stackoverflow.com/a/73633428/720999.
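
As a rough illustration of what reading the runtime's own metrics looks like, here is a minimal sketch using runtime.ReadMemStats from the standard library. The helper retainedIdle is a made-up name; it computes the idle heap the runtime still holds and has not released to the OS.

```go
package main

import (
	"fmt"
	"runtime"
)

// retainedIdle returns the idle heap bytes that have not been released to
// the OS (HeapIdle minus HeapReleased), clamped at zero.
// Hypothetical helper, written only for this illustration.
func retainedIdle(heapIdle, heapReleased uint64) uint64 {
	if heapReleased > heapIdle {
		return 0
	}
	return heapIdle - heapReleased
}

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)

	fmt.Printf("HeapInuse:    %d bytes\n", m.HeapInuse)
	fmt.Printf("HeapIdle:     %d bytes\n", m.HeapIdle)
	fmt.Printf("HeapReleased: %d bytes\n", m.HeapReleased)
	fmt.Printf("idle, not yet released: %d bytes\n", retainedIdle(m.HeapIdle, m.HeapReleased))
	fmt.Printf("Sys (total obtained from the OS): %d bytes, completed GC cycles: %d\n", m.Sys, m.NumGC)
}
```

With MADV_FREE, memory counted in HeapReleased may still show up in RSS until the kernel reclaims it, which is why these numbers and the RSS reading can disagree.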

So I would say you'd better concentrate on fixing the OOM killer settings or something along those lines (note that it is possible to make a particular process immune to the OOM killer).
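
On Linux, the per-process knob for this is /proc/<pid>/oom_score_adj, which takes values from -1000 to 1000, where -1000 exempts the process from the OOM killer. The sketch below only reads and validates the current value for the running process; parseOOMScoreAdj is a hypothetical helper. On Kubernetes the kubelet typically assigns this value itself, so treat this as a way to inspect the setting rather than a recommendation to change it.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseOOMScoreAdj parses the contents of an oom_score_adj file and checks
// the value against the kernel's documented range of -1000..1000.
// Hypothetical helper, written only for this illustration.
func parseOOMScoreAdj(s string) (int, error) {
	v, err := strconv.Atoi(strings.TrimSpace(s))
	if err != nil {
		return 0, err
	}
	if v < -1000 || v > 1000 {
		return 0, fmt.Errorf("oom_score_adj %d out of range", v)
	}
	return v, nil
}

func main() {
	// This file exists only on Linux; elsewhere this reports the error.
	data, err := os.ReadFile("/proc/self/oom_score_adj")
	if err != nil {
		fmt.Println("cannot read oom_score_adj:", err)
		return
	}
	v, err := parseOOMScoreAdj(string(data))
	if err != nil {
		fmt.Println("unexpected contents:", err)
		return
	}
	fmt.Println("oom_score_adj:", v, "exempt from the OOM killer:", v == -1000)
}
```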

Also note that you're using a dirt-old, unsupported version of Go,
and that in Go 1.16 the madvise behavior was reverted, so the runtime uses MADV_DONTNEED again.
Maybe it's a good time to upgrade.


¹ It's actually more complex than that: the OOM killer uses a set of heuristics to find "resource hogs", and memory consumption is just one of the metrics it considers.

huangapple
  • Posted on 2022-11-08 17:10:52
  • When reposting, please keep the link to this article: https://go.coder-hub.com/74358183.html