Memory usage discrepancy between pprof and ps



Flat	Flat%	Sum%	Cum	Cum%	Name	Inlined?
1.58GB	49.98%	49.98%	1.58GB	49.98%	os.ReadFile	
1.58GB	49.98%	99.95%	1.58GB	50.02%	github.com/bytedance/sonic.(*frozenConfig).Unmarshal	
0		0.00%	99.95%	3.16GB	100.00%	runtime.main	
0		0.00%	99.95%	3.16GB	100.00%	main.main	
0		0.00%	99.95%	3.16GB	100.00%	github.com/spf13/cobra.(*Command).execute	
0		0.00%	99.95%	3.16GB	100.00%	github.com/spf13/cobra.(*Command).ExecuteC	
0		0.00%	99.95%	3.16GB	100.00%	github.com/spf13/cobra.(*Command).Execute	(inline)
0		0.00%	99.95%	3.16GB	100.00%	github.com/mirantis/broker/misc.ParseUcpNodesInspect	
0		0.00%	99.95%	3.16GB	100.00%	github.com/mirantis/broker/cmd.glob..func3	
0		0.00%	99.95%	3.16GB	100.00%	github.com/mirantis/broker/cmd.getInfos	
0		0.00%	99.95%	3.16GB	100.00%	github.com/mirantis/broker/cmd.Execute	
0		0.00%	99.95%	1.58GB	50.02%	github.com/bytedance/sonic.Unmarshal



I have been trying to profile the heap usage of a CLI tool built with cobra.
pprof reports the figures shown in the table above.

But ps shows that, by the end, the process consumes almost 6752.23 MB (RSS).

Also, I am putting defer profile.Start(profile.MemProfileHeap).Stop() in the last function that gets executed. Putting the profiler in func main doesn't show anything. So I traced through the functions and found considerable memory usage in the last one.


My question is: how do I find the missing ~3 GB of memory?


Score: 13





There are multiple problems (with your question):

  1. ps (and top etc) show multiple memory readings. The only one of interest is typically called RES or RSS. You don't tell which one it was.
    Basically, looking at the reading typically named VIRT is not interesting.

  2. As Volker said, pprof does not measure the memory consumption of the process as the OS sees it. A heap profile records sampled allocations on the Go heap and attributes them to the call stacks which made them; depending on the view, you see either everything allocated since the program started (alloc_space) or what was still live as of a recently completed GC cycle (inuse_space).

    To understand what it means, consider how the heap profiler works.
    It does not run on a timer: the runtime samples allocations, recording on average one allocation per runtime.MemProfileRate bytes allocated (512 KiB by default) together with the stack of the goroutine which made it, and the recorded samples are scaled up to estimate the totals.

    This means that, if your process calls, say, os.ReadFile (which, by its contract, allocates a slice of bytes long enough to contain the whole contents of the file to be read) 100 times to read a 1 GiB file each time, the allocation view will attribute roughly 100 GiB to os.ReadFile.
    But if your program is not written in such a way that it holds each of the slices returned by these calls, but rather does something with those slices and throws them away after processing, the slices from the past calls will likely be already collected by the GC by the time the newer ones are allocated, so the memory actually in use is far smaller than that.
    In neither view does the profile account for memory the runtime has obtained from the OS but is not currently using for live objects, nor for goroutine stacks and the runtime's own bookkeeping.

  3. While not required by the spec, the two "standard" contemporary implementations of Go (the one originally dubbed "gc", which most people think is the implementation, and the GCC frontend) feature a garbage collector which runs in parallel with the flow of your own process; the moments it actually collects the garbage produced by your process are governed by a set of complicated heuristics (start here if interested) which try to balance between spending CPU time for GC and spending RAM for not doing it, and it means that for short-lived processes the GC might not kick in even a single time, meaning your process will end with all the generated garbage still floating, and all that memory will be reclaimed by the OS in the usual way when the process ends.

  4. When the GC collects garbage, the freed memory is not returned to the OS immediately. Instead, a two-stage process is involved:

    • First, the freed regions are returned to the memory manager which is a part of the Go runtime powering your running program.
      This is a sensible thing because in a typical program memory churn is usually high enough that freed memory will likely be quickly allocated back again.

    • Second, memory pages staying free long enough are marked to let the OS know it can use them for its own needs.

    Basically it means that even after the GC frees some memory, you won't see this outside the running Go process as this memory is first returned to the process' own pool.

  5. Different versions of Go (again, I mean the "gc" implementation) implemented different policies about returning the freed pages to the OS on Linux: up to Go 1.11 they were marked by madvise(2) as MADV_DONTNEED, Go 1.12 switched to MADV_FREE, and Go 1.16 went back to MADV_DONTNEED by default.
    If you happen to use a version of Go whose runtime marks freed memory as MADV_FREE, the readings of RSS will be even less sensible because the memory marked that way still counts against the process' RSS until the kernel actually reclaims it, even though the OS was hinted it can do so when needed.
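
The difference between "how much was allocated" and "how much is held right now" from points 2 to 4 can be observed from inside the process. Below is a minimal sketch, not taken from the question: the helper churn and the variable sink are made up for illustration, and the buffer count and size are arbitrary. It allocates and drops buffers in a loop and then reads runtime.MemStats.

```go
package main

import (
	"fmt"
	"runtime"
)

// sink keeps the most recent buffer reachable so that the allocation
// really happens on the heap. (Hypothetical name, for illustration only.)
var sink []byte

// churn allocates n buffers of size bytes each, dropping every buffer as
// soon as the next one is allocated, and returns the runtime's memory
// statistics after a final forced GC cycle.
func churn(n, size int) runtime.MemStats {
	for i := 0; i < n; i++ {
		sink = make([]byte, size)
		sink[0] = byte(i) // use the buffer
	}
	sink = nil
	runtime.GC()

	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m
}

func main() {
	const mib = 1 << 20
	m := churn(200, mib) // 200 buffers of 1 MiB each

	fmt.Printf("TotalAlloc   = %d MiB (cumulative, only grows)\n", m.TotalAlloc/mib)
	fmt.Printf("HeapAlloc    = %d MiB (objects currently on the heap)\n", m.HeapAlloc/mib)
	fmt.Printf("HeapIdle     = %d MiB (free spans held by the runtime)\n", m.HeapIdle/mib)
	fmt.Printf("HeapReleased = %d MiB (idle memory returned to the OS)\n", m.HeapReleased/mib)
	fmt.Printf("Sys          = %d MiB (obtained from the OS in total)\n", m.Sys/mib)
}
```

TotalAlloc should come out at 200 MiB or more while HeapAlloc stays small; the exact values of the other fields depend on the Go version and the platform. Running the same program with GODEBUG=gctrace=1 set in the environment additionally makes the runtime print one line per GC cycle.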

To recap.
This topic is complex enough and you appear to be drawing certain conclusions too fast.

An update.
I've decided to expand on memory management a bit because I feel like certain bits and pieces may be missing from the big picture of this stuff in your head, and because of this you might find the comments to your question to be moot and dismissive.

The reasoning for the advice to not measure memory consumption of programs written in Go using ps, top and friends is rooted in the fact that the memory management implemented in the runtime environments powering programs written in contemporary high-level programming languages is quite far removed from the down-to-the-metal memory management implemented in the OS kernels and the hardware they run on.

Let's consider Linux to have concrete tangible examples.
You certainly can ask the kernel directly to allocate memory for you: mmap(2) is a syscall which does that.
If you call it with MAP_PRIVATE (and usually also with MAP_ANONYMOUS), the kernel will reserve a contiguous range in the address space of your process, spanning as many whole pages as are needed to hold the number of bytes you have requested, and return the address of the first page in the sequence.
At this point you might think that the RSS of your process has grown by that number of bytes, but it has not: the memory was "reserved" but not actually allocated; for a memory page to really get allocated, the process has to "touch" any byte within the page, by reading it or writing it: this generates a so-called "page fault" on the CPU, and the in-kernel handler backs the page with a real physical memory page. Only after that does the page actually count against the process' RSS.
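
The "reserved but not yet allocated" behaviour can be checked directly. The following is a Linux-only sketch under a few assumptions: it reads RSS from /proc/self/statm, the helpers residentBytes and mapAndTouch are hypothetical names invented here, and the 64 MiB size is arbitrary. It writes to each page, since that is the unambiguous way to force a private page to be backed by physical memory.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"syscall"
)

// residentBytes returns the RSS of the current process. On Linux the second
// field of /proc/self/statm is the number of resident pages.
func residentBytes() (int, error) {
	data, err := os.ReadFile("/proc/self/statm")
	if err != nil {
		return 0, err
	}
	fields := strings.Fields(string(data))
	if len(fields) < 2 {
		return 0, fmt.Errorf("unexpected statm contents: %q", data)
	}
	pages, err := strconv.Atoi(fields[1])
	if err != nil {
		return 0, err
	}
	return pages * os.Getpagesize(), nil
}

// mapAndTouch maps size bytes of anonymous private memory and reports the
// RSS before the mapping, right after it, and after writing to every page.
func mapAndTouch(size int) (before, mapped, touched int, err error) {
	before, err = residentBytes()
	if err != nil {
		return 0, 0, 0, err
	}
	mem, err := syscall.Mmap(-1, 0, size,
		syscall.PROT_READ|syscall.PROT_WRITE,
		syscall.MAP_PRIVATE|syscall.MAP_ANON)
	if err != nil {
		return 0, 0, 0, err
	}
	defer syscall.Munmap(mem)

	mapped, err = residentBytes()
	if err != nil {
		return 0, 0, 0, err
	}
	pageSize := os.Getpagesize()
	for i := 0; i < len(mem); i += pageSize {
		mem[i] = 1 // the first write to a page triggers a page fault
	}
	touched, err = residentBytes()
	return before, mapped, touched, err
}

func main() {
	before, mapped, touched, err := mapAndTouch(64 << 20)
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	const mib = 1 << 20
	fmt.Printf("RSS before mmap:    %d MiB\n", before/mib)
	fmt.Printf("RSS after mmap:     %d MiB\n", mapped/mib)
	fmt.Printf("RSS after touching: %d MiB\n", touched/mib)
}
```

On a typical Linux system the second reading should be close to the first, and only the third should be larger by roughly the size of the mapping.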

OK, that's fun, but you probably can see a problem: it's not too convenient to operate with complete pages (which can be of different size on different systems; typically it's 4 KiB on systems of the x86 lineage): when you program in a high-level language, you don't think on such a low level about the memory; instead, you expect the running program to somehow materialize "objects" (I do not mean OOP here; just pieces of memory containing values of some language- or user-defined types) as you need them.
These objects may be of any size, most of the time way smaller than a single memory page, and, what is more important, most of the time you do not even think about how much space these objects are consuming when allocated.
Even when programming in a language like C, which these days is considered to be quite low-level, you're usually accustomed to using memory management functions in the malloc(3) family provided by the standard C library, which allow you to allocate regions of memory of arbitrary size.

A way to solve this sort of problem is to have a higher-level memory manager on top of what the kernel can do for your program, and the fact is, every single general-purpose program written in a high-level language (even C and C++!) is using one: for interpreted languages (such as Perl, Tcl, Python, POSIX shell etc) it is provided by the interpreter; for byte-compiled languages such as Java, it is provided by the process which executes that code (such as JRE for Java); for languages which compile down to machine (CPU) code, such as the "stock" implementation of Go, it is provided by the "runtime" code included into the resulting executable image file or linked into the program dynamically when it's being loaded into the memory for execution.
Such memory managers are usually quite complicated as they have to deal with many complex problems such as memory fragmentation, and they usually have to avoid talking to the kernel as much as possible because syscalls are slow.
The latter requirement naturally means process-level memory managers try to cache the memory they have once taken from the kernel, and are reluctant to release it back.

All this means that, say, in a typical active Go program you might have crazy memory churn (hordes of small objects being allocated and deallocated all the time) which has next to no effect on the values of RSS monitored "from the outside" of the process: all this churn is handled by the in-process memory manager and, as in the case of the stock Go implementation, the GC which is naturally tightly integrated with the MM.
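
How much memory the in-process memory manager is holding back at a given moment can be estimated as HeapIdle minus HeapReleased in runtime.MemStats. The sketch below is an illustration only; freeAndCompare, retained and sink are hypothetical names and the 64 MiB buffer is arbitrary. It drops a large buffer, collects it, and then asks the runtime to hand idle memory back with debug.FreeOSMemory.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// sink forces the buffer onto the heap. (Hypothetical name.)
var sink []byte

// retained estimates how many bytes of idle heap the runtime is keeping
// for reuse instead of having returned them to the OS.
func retained(m *runtime.MemStats) uint64 {
	return m.HeapIdle - m.HeapReleased
}

// freeAndCompare allocates and drops a buffer of the given size, collects
// it, and returns memory statistics taken before and after asking the
// runtime to return as much idle memory to the OS as it can.
func freeAndCompare(size int) (before, after runtime.MemStats) {
	sink = make([]byte, size)
	for i := range sink {
		sink[i] = 1 // touch every byte so the pages are really backed
	}
	sink = nil
	runtime.GC()
	runtime.ReadMemStats(&before)

	debug.FreeOSMemory() // forces a GC, then releases idle memory
	runtime.ReadMemStats(&after)
	return before, after
}

func main() {
	const mib = 1 << 20
	before, after := freeAndCompare(64 * mib)
	fmt.Printf("after GC:           retained %d MiB, released %d MiB\n",
		retained(&before)/mib, before.HeapReleased/mib)
	fmt.Printf("after FreeOSMemory: retained %d MiB, released %d MiB\n",
		retained(&after)/mib, after.HeapReleased/mib)
}
```

The numbers vary between Go versions because the background scavenger may already have released part of the memory by the time of the first reading. Calling debug.FreeOSMemory is rarely a good idea in production code; it is used here only to make the second stage visible.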

Because of that, to have a useful, actionable idea about what is happening in a long-running production-grade Go program, such a program usually provides a set of continuously updated metrics (delivering, collecting and monitoring them is called telemetry). For Go programs, the part of the program tasked with producing these metrics can either make periodic calls to runtime.ReadMemStats and runtime/debug.ReadGCStats or directly use what the runtime/metrics package has to offer. Looking at such metrics in a monitoring system such as Zabbix, Grafana etc is quite instructive: you can literally see how the amount of free memory available to the in-process MM increases after each GC cycle while the RSS stays roughly the same.
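
As a rough sketch of the runtime/metrics route (available since Go 1.16), the program below reads a few memory-related metrics once; a real program would do this periodically and export the values to its monitoring system. The helper readMemoryMetrics and the selection of metrics are choices made for this illustration only.

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// metricNames lists the runtime metrics sampled by this sketch.
var metricNames = []string{
	"/memory/classes/total:bytes",         // everything mapped by the runtime
	"/memory/classes/heap/objects:bytes",  // memory occupied by heap objects
	"/memory/classes/heap/free:bytes",     // free heap memory kept for reuse
	"/memory/classes/heap/released:bytes", // free heap memory returned to the OS
	"/gc/cycles/total:gc-cycles",          // completed GC cycles
}

// readMemoryMetrics samples the metrics above and returns them keyed by
// name. Metrics that are not uint64-valued are skipped.
func readMemoryMetrics() map[string]uint64 {
	samples := make([]metrics.Sample, len(metricNames))
	for i, name := range metricNames {
		samples[i].Name = name
	}
	metrics.Read(samples)

	out := make(map[string]uint64, len(samples))
	for _, s := range samples {
		if s.Value.Kind() == metrics.KindUint64 {
			out[s.Name] = s.Value.Uint64()
		}
	}
	return out
}

func main() {
	values := readMemoryMetrics()
	for _, name := range metricNames {
		fmt.Printf("%-40s %d\n", name, values[name])
	}
}
```

Plotting heap/free against heap/released over time shows the two-stage return of memory described earlier: the first grows right after a GC cycle, the second only later.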

Also note that you might consider running your Go program with various GC-related debugging settings in a special environment variable GODEBUG described here: basically, you make the Go runtime powering your running program emit detailed information on how the GC is working (also see this).

Hope this will make you curious enough to explore these matters further.

You might find this to be a good introduction on memory management implemented by the Go runtime, in connection with the kernel and the hardware; recommended read.

rss: https://en.wikipedia.org/wiki/Resident_set_size "Resident Set Size"
madvise: https://manpages.debian.org/2/madvise "madvise(2)"

