Memory usage discrepancy between pprof and ps


Question

I have been trying to profile the heap usage of a CLI tool built with cobra.
The pprof tool shows the following:

Flat	Flat%	Sum%	Cum	Cum%	Name	Inlined?
1.58GB	49.98%	49.98%	1.58GB	49.98%	os.ReadFile	
1.58GB	49.98%	99.95%	1.58GB	50.02%	github.com/bytedance/sonic.(*frozenConfig).Unmarshal	
0		0.00%	99.95%	3.16GB	100.00%	runtime.main	
0		0.00%	99.95%	3.16GB	100.00%	main.main	
0		0.00%	99.95%	3.16GB	100.00%	github.com/spf13/cobra.(*Command).execute	
0		0.00%	99.95%	3.16GB	100.00%	github.com/spf13/cobra.(*Command).ExecuteC	
0		0.00%	99.95%	3.16GB	100.00%	github.com/spf13/cobra.(*Command).Execute	(inline)
0		0.00%	99.95%	3.16GB	100.00%	github.com/mirantis/broker/misc.ParseUcpNodesInspect	
0		0.00%	99.95%	3.16GB	100.00%	github.com/mirantis/broker/cmd.glob..func3	
0		0.00%	99.95%	3.16GB	100.00%	github.com/mirantis/broker/cmd.getInfos	
0		0.00%	99.95%	3.16GB	100.00%	github.com/mirantis/broker/cmd.Execute	
0		0.00%	99.95%	1.58GB	50.02%	github.com/bytedance/sonic.Unmarshal

But ps shows that, at the end, it consumes almost 6752.23 MB (RSS).

Also, I am putting defer profile.Start(profile.MemProfileHeap).Stop() in the last function that gets executed. Putting the profiler in func main doesn't show anything, so I traced through the functions and found the considerable memory usage in the last one.
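For context, the func main variant (the one that showed nothing) looked roughly like this; a simplified sketch, assuming the profiler is github.com/pkg/profile, with the command name and the actual command logic as placeholders:

```go
package main

import (
	"github.com/pkg/profile"
	"github.com/spf13/cobra"
)

func main() {
	// The heap profile is written when Stop runs. MemProfileHeap samples
	// memory that is still live when the profile is collected; allocations
	// already reclaimed by the GC will not appear in it.
	defer profile.Start(profile.MemProfileHeap).Stop()

	root := &cobra.Command{
		Use: "broker", // hypothetical command name
		RunE: func(cmd *cobra.Command, args []string) error {
			// ... the real work (getInfos, ParseUcpNodesInspect, ...) ...
			return nil
		},
	}
	_ = root.Execute()
}
```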


My question is: how do I find the missing ~3 GB of memory?

Answer 1

Score: 13

There are multiple problems (with your question):

  1. ps (and top etc) show multiple memory readings. The only one of interest here is typically called RES or RSS. You don't say which one you looked at.
    Basically, looking at the reading typically named VIRT is not interesting.

  2. As Volker said, pprof does not measure memory consumption, it measures (in the mode you have run it) memory allocation rate—in the sense of "how much", not "how frequently".

    To understand what it means, consider how pprof works.
    During profiling, a timer ticks, and on each tick the profiler takes a sort of snapshot of your running program: it scans the stacks of all live goroutines and attributes live objects on the heap to the variables contained in the stack frames of those stacks; each stack frame belongs to an active function.

    This means that, if your process calls, say, os.ReadFile—which, by its contract, allocates a slice of bytes long enough to contain the whole contents of the file to be read—100 times to read a 1 GiB file each time, and the profiler's timer manages to pinpoint each of these 100 calls (it can miss some of them, as it is sampling), then os.ReadFile will be attributed with having allocated 100 GiB.
    But if your program is not written in such a way that it holds on to each of the slices returned by these calls, but rather does something with them and throws them away after processing, the slices from past calls will likely already have been collected by the GC by the time the newer ones are allocated (see the sketch right after this list).

  3. While not required by the spec, the two "standard" contemporary implementations of Go—the one originally dubbed "gc", which most people think is the implementation, and the GCC frontend—feature a garbage collector which runs in parallel with the flow of your own process. The moments at which it actually collects the garbage produced by your process are governed by a set of complicated heuristics (start here if interested) which try to balance between spending CPU time on GC and spending RAM on not doing it. For short-lived processes this means the GC might not kick in even a single time, so your process will end with all the generated garbage still floating around, and all that memory will be reclaimed by the OS in the usual way when the process ends.

  4. When the GC collects garbage, the freed memory is not returned to the OS immediately. Instead, two-staged process is involved:

    • First, the freed regions are returned to the memory manager which is a part of the Go runtime powering your running program.
      This is a sensible thing to do because in a typical program memory churn is usually high enough that freed memory will likely be allocated again soon.

    • Second, memory pages that stay free long enough are marked to let the OS know it can use them for its own needs.

    Basically it means that even after the GC frees some memory, you won't see this from outside the running Go process, as this memory is first returned to the process' own pool.

  5. Different versions of Go (again, I mean the "gc" implementation) implemented different policies about returning the freed pages to the OS: first they were marked by madvise(2) as MADV_FREE, then as MADV_DONTNEED and then again as MADV_FREE.
    If you happen to use a version of Go whose runtime marks freed memory as MADV_DONTNEED, the readings of RSS will be even less sensible because the memory marked that way still counts against the process' RSS even though the OS was hinted it can reclaim that memory when needed.
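To make point 2 above concrete, here is a hypothetical sketch (the file path and size are invented) in which the allocation profile would blame os.ReadFile for ~100 GiB even though the live heap never exceeds about one file's worth:

```go
package main

import (
	"log"
	"os"
)

// process stands in for whatever work is done per file; it keeps no
// reference to b, so each slice becomes garbage as soon as it returns.
func process(b []byte) int { return len(b) }

func main() {
	total := 0
	for i := 0; i < 100; i++ {
		// Each call allocates a fresh slice as large as the whole file
		// (say, 1 GiB). The allocation profile attributes every one of
		// these to os.ReadFile: ~100 GiB allocated in total.
		data, err := os.ReadFile("/tmp/one-gib-blob") // hypothetical path
		if err != nil {
			log.Fatal(err)
		}
		// data is dropped at the end of the iteration, so the GC can
		// reclaim the previous slice before (or while) the next one is
		// allocated: the live heap stays around 1 GiB, not 100.
		total += process(data)
	}
	log.Println("processed bytes:", total)
}
```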

To recap.
This topic is complex enough, and you appear to be drawing certain conclusions too fast.


An update.
I've decided to expand on memory management a bit because I feel like certain bits and pieces may be missing from the big picture of this stuff in your head, and because of this you might find the comments to your question to be moot and dismissive.

The advice to not measure the memory consumption of programs written in Go using ps, top and friends is rooted in the fact that the memory management implemented in the runtime environments powering programs written in contemporary high-level programming languages is quite far removed from the down-to-the-metal memory management implemented in OS kernels and the hardware they run on.

Let's consider Linux to have concrete tangible examples.
You certainly can ask the kernel directly to allocate a memory for you: mmap(2) is a syscall which does that.
If you call it with MAP_PRIVATE (and usually also with MAP_ANONYMOUS), the kernel will make sure the page table of your process has one or more new entries for as many pages of memory to contain the contiguous region of as many bytes as you have requested, and return the address of the first page in the sequence.
At this point you might think that the RSS of your process has grown by that number of bytes, but it has not: the memory was "reserved" but not actually allocated. For a memory page to really get allocated, the process has to "touch" any byte within the page—by reading or writing it: this generates a so-called "page fault" on the CPU, and the in-kernel handler will ask the hardware to actually allocate a real "hardware" memory page. Only after that will the page actually count against the process' RSS.
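A Linux-only sketch of this reserve-then-touch behaviour, using Go's syscall package (the 1 GiB size and the 4 KiB page-size assumption are illustrative):

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	const size = 1 << 30 // request 1 GiB from the kernel

	mem, err := syscall.Mmap(-1, 0, size,
		syscall.PROT_READ|syscall.PROT_WRITE,
		syscall.MAP_PRIVATE|syscall.MAP_ANON)
	if err != nil {
		panic(err)
	}
	defer syscall.Munmap(mem)

	// The region is merely reserved: no page faults have happened yet, so
	// RSS has barely moved. Check with: grep VmRSS /proc/<pid>/status
	fmt.Printf("mapped 1 GiB; pid=%d; press Enter to touch the pages\n", os.Getpid())
	fmt.Scanln()

	// Write one byte per 4 KiB page: each write faults in a real page,
	// and only now do the pages count against RSS.
	for i := 0; i < size; i += 4096 {
		mem[i] = 1
	}
	fmt.Println("touched every page; VmRSS should now be ~1 GiB larger")
	fmt.Scanln()
}
```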

OK, that's fun, but you can probably see a problem: it's not too convenient to operate on whole pages (which can be of different sizes on different systems; typically 4 KiB on systems of the x86 lineage): when you program in a high-level language, you don't think about memory on such a low level; instead, you expect the running program to somehow materialize "objects" (I do not mean OOP here; just pieces of memory containing values of some language- or user-defined types) as you need them.
These objects may be of any size, most of the time way smaller than a single memory page, and, what is more important, most of the time you do not even think about how much space these objects are consuming when allocated.
Even when programming in a language like C, which these days is considered to be quite low-level, you're usually accustomed to using memory management functions in the malloc(3) family provided by the standard C library, which allow you to allocate regions of memory of arbitrary size.

A way to solve this sort of problem is to have a higher-level memory manager on top of what the kernel can do for your program, and the fact is, every single general-purpose program written in a high-level language (even C and C++!) is using one: for interpreted languages (such as Perl, Tcl, Python, POSIX shell etc) it is provided by the interpreter; for byte-compiled languages such as Java, it is provided by the process which executes that code (such as the JRE for Java); for languages which compile down to machine (CPU) code—such as the "stock" implementation of Go—it is provided by the "runtime" code included in the resulting executable image file, or linked into the program dynamically when it is loaded into memory for execution.
Such memory managers are usually quite complicated as they have to deal with many complex problems such as memory fragmentation, and they usually have to avoid talking to the kernel as much as possible because syscalls are slow.
The latter requirement naturally means process-level memory managers try to cache the memory they have once taken from the kernel, and are reluctant to release it back.

All this means that, say, in a typical active Go program you might have crazy memory churn—hordes of small objects being allocated and deallocated all the time—which has next to no effect on the value of RSS monitored "from the outside" of the process: all this churn is handled by the in-process memory manager and, as in the case of the stock Go implementation, the GC, which is naturally tightly integrated with the MM.

Because of that, to have a useful, actionable idea about what is happening in a long-running production-grade Go program, such a program usually provides a set of continuously updated metrics (delivering, collecting and monitoring them is called telemetry). For Go programs, the part of the program tasked with producing these metrics can either make periodic calls to runtime.ReadMemStats and runtime/debug.ReadGCStats or directly use what the runtime/metrics package has to offer. Looking at such metrics in a monitoring system such as Zabbix, Grafana etc is quite instructive: you can literally see how the amount of free memory available to the in-process MM increases after each GC cycle while the RSS stays roughly the same.
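A minimal sketch of such in-process telemetry using runtime.ReadMemStats; a real program would ship these numbers to its monitoring system rather than log them:

```go
package main

import (
	"log"
	"runtime"
	"time"
)

// logMemStats periodically samples the runtime's own view of the heap.
// HeapAlloc is live plus not-yet-collected heap memory; HeapSys is what the
// runtime has taken from the OS; HeapReleased is what it has hinted back.
func logMemStats(interval time.Duration) {
	var m runtime.MemStats
	for range time.Tick(interval) {
		runtime.ReadMemStats(&m)
		log.Printf("heapAlloc=%d MiB heapSys=%d MiB heapReleased=%d MiB numGC=%d",
			m.HeapAlloc>>20, m.HeapSys>>20, m.HeapReleased>>20, m.NumGC)
	}
}

func main() {
	go logMemStats(10 * time.Second)
	// ... the rest of the program ...
	select {}
}
```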

Also note that you might consider running your Go program with various GC-related debugging settings in a special environment variable GODEBUG described here: basically, you make the Go runtime powering your running program emit detailed information on how the GC is working (also see this).

Hope this makes you curious enough to explore these matters further.

You might find this to be a good introduction to the memory management implemented by the Go runtime, in connection with the kernel and the hardware; recommended reading.

References:
  • RSS (Resident Set Size): https://en.wikipedia.org/wiki/Resident_set_size
  • madvise(2): https://manpages.debian.org/2/madvise
