golang GC profiling: runtime.mallocgc appears to be the top one; is moving to sync.Pool the solution?


Question


I have an application written in Go that does message processing. It needs to pick up messages from the network (UDP) at a rate of 20K/second (potentially more), and each message can be up to the UDP packet's maximum length (64KB minus header size). The program needs to decode each incoming packet, encode it into another format, and send it to another network.
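The receive/decode/forward path looks roughly like the sketch below. This is an illustration only, not my actual code: the struct name aPrivateStruct matches the pool example further down, but the port, worker count, channel size, and the decode/encode/forward helpers are placeholders.

    package main

    import "net"

    // aPrivateStruct stands in for the decoded message type.
    type aPrivateStruct struct{ /* decoded fields */ }

    // placeholder stages for decode / re-encode / send-to-other-network
    func decode(b []byte, rf *aPrivateStruct) {}
    func encode(rf *aPrivateStruct) []byte    { return nil }
    func forward(b []byte)                    {}

    func main() {
        conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 9999}) // port is illustrative
        if err != nil {
            panic(err)
        }

        packets := make(chan []byte, 1024) // hand-off between pipeline stages

        // a set of decode/encode workers drains the channel
        for i := 0; i < 8; i++ {
            go func() {
                for p := range packets {
                    rf := new(aPrivateStruct) // per-message allocation: this is what shows up under runtime.mallocgc
                    decode(p, rf)
                    forward(encode(rf))
                }
            }()
        }

        buf := make([]byte, 65507) // maximum UDP payload size
        for {
            n, _, err := conn.ReadFromUDP(buf)
            if err != nil {
                continue
            }
            pkt := make([]byte, n)
            copy(pkt, buf[:n])
            packets <- pkt
        }
    }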

Right now, on a machine with 24 cores and 64GB of RAM, it runs OK but occasionally loses some packets. The code already follows a pipeline pattern using multiple goroutines and channels, and it takes about 10% of the whole machine's CPU load, so it should have the headroom in CPU or RAM to handle all 20K/s messages without losing any. I then started profiling. In the CPU profile, runtime.mallocgc, i.e. the garbage collector runtime, appears at the top. I suspect this GC could be the culprit: it hangs for a few milliseconds (or some microseconds) and some packets are lost. Some best-practice advice says switching to sync.Pool might help, but my switch to the pool seemingly results in more CPU contention and loses even more packets, more often:

    (pprof) top20 -cum (sync|runtime)
    245.99s of 458.81s total (53.61%)
    Dropped 487 nodes (cum <= 22.94s)
    Showing top 20 nodes out of 22 (cum >= 30.46s)
          flat  flat%   sum%        cum   cum%
             0     0%     0%    440.88s 96.09%  runtime.goexit
         1.91s  0.42%  1.75%    244.87s 53.37%  sync.(*Pool).Get
        64.42s 14.04% 15.79%    221.57s 48.29%  sync.(*Pool).getSlow
        94.29s 20.55% 36.56%    125.53s 27.36%  sync.(*Mutex).Lock
         1.62s  0.35% 36.91%     72.85s 15.88%  runtime.systemstack
        22.43s  4.89% 41.80%     60.81s 13.25%  runtime.mallocgc
        22.88s  4.99% 46.79%     51.75s 11.28%  runtime.scanobject
         1.78s  0.39% 47.17%     49.15s 10.71%  runtime.newobject
        26.72s  5.82% 53.00%     39.09s  8.52%  sync.(*Mutex).Unlock
         0.76s  0.17% 53.16%     33.74s  7.35%  runtime.gcDrain
             0     0% 53.16%     33.70s  7.35%  runtime.gcBgMarkWorker
             0     0% 53.16%     33.69s  7.34%  runtime.gcBgMarkWorker.func2

The use of the pool is standard:

    // create this one globally at program init
    var rfpool = &sync.Pool{New: func() interface{} { return new(aPrivateStruct) }}

    // get
    rf := rfpool.Get().(*aPrivateStruct)

    // put after done processing this message
    rfpool.Put(rf)

Am I doing something wrong? Or, what other ways are there to tune the GC so that it uses less CPU? The Go version is 1.8.
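As far as I know, the standard knobs here are GOGC (settable at runtime via debug.SetGCPercent) to make collections less frequent at the cost of more memory, and GODEBUG=gctrace=1 to print how long each collection takes. A minimal sketch of the in-code variant follows; the value 400 is just an example to experiment with, not a recommendation:

    package main

    import "runtime/debug"

    func init() {
        // Equivalent to running with GOGC=400: let the heap grow to 4x the
        // live set before the next collection; trades RAM for fewer GC cycles.
        debug.SetGCPercent(400) // example value only
    }

    func main() {
        // ... rest of the program
    }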

The listing below shows that a lot of the lock contention happens in sync.(*Pool).getSlow (source: pool.go at golang.org):

    (pprof) list sync.*.getSlow
    Total: 7.65mins
    ROUTINE ======================== sync.(*Pool).getSlow in /opt/go1.8/src/sync/pool.go
      1.07mins   3.69mins (flat, cum) 48.29% of Total
             .          .    144:        x = p.New()
             .          .    145:    }
             .          .    146:    return x
             .          .    147:}
             .          .    148:
          80ms       80ms    149:func (p *Pool) getSlow() (x interface{}) {
             .          .    150:    // See the comment in pin regarding ordering of the loads.
          30ms       30ms    151:    size := atomic.LoadUintptr(&p.localSize) // load-acquire
         180ms      180ms    152:    local := p.local                         // load-consume
             .          .    153:    // Try to steal one element from other procs.
          30ms      130ms    154:    pid := runtime_procPin()
          20ms       20ms    155:    runtime_procUnpin()
         730ms      730ms    156:    for i := 0; i < int(size); i++ {
        51.55s     51.55s    157:        l := indexLocal(local, (pid+i+1)%int(size))
         580ms   2.01mins    158:        l.Lock()
        10.65s     10.65s    159:        last := len(l.shared) - 1
          40ms       40ms    160:        if last >= 0 {
             .          .    161:            x = l.shared[last]
             .          .    162:            l.shared = l.shared[:last]
             .       10ms    163:            l.Unlock()
             .          .    164:            break
             .          .    165:        }
         490ms     37.59s    166:        l.Unlock()
             .          .    167:    }
          40ms       40ms    168:    return x
             .          .    169:}
             .          .    170:
             .          .    171:// pin pins the current goroutine to P, disables preemption and returns poolLocal pool for the P.
             .          .    172:// Caller must call runtime_procUnpin() when done with the pool.
             .          .    173:func (p *Pool) pin() *poolLocal {

Answer 1

Score: 2


sync.Pool runs slowly under a high concurrency load. Try to allocate all the structures once during startup and reuse them many times. For example, you may create several goroutines (workers) at start time, instead of starting a new goroutine for each request. I recommend reading this article: https://software.intel.com/en-us/blogs/2014/05/10/debugging-performance-issues-in-go-programs .
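A minimal sketch of that worker pattern, as I read the suggestion (aPrivateStruct is the struct from the question; the worker count and the decode/send helpers are placeholders, not real APIs):

    package workers

    type aPrivateStruct struct{ /* fields of the decoded message */ }

    // placeholder helpers standing in for the real decode / encode / send logic
    func decodeInto(p []byte, rf *aPrivateStruct) {}
    func encodeAndSend(rf *aPrivateStruct)        {}

    // Start launches a fixed set of goroutines once at startup. Each worker
    // owns one pre-allocated struct and reuses it for every message, so there
    // is no per-message allocation and no shared pool to contend on.
    func Start(packets <-chan []byte, numWorkers int) {
        for i := 0; i < numWorkers; i++ {
            go func() {
                rf := new(aPrivateStruct) // allocated once per worker
                for p := range packets {
                    *rf = aPrivateStruct{} // reset before reuse
                    decodeInto(p, rf)
                    encodeAndSend(rf)
                }
            }()
        }
    }

Because each struct stays owned by a single goroutine, there is no mutex on the hot path, which is exactly the contention that showed up in getSlow.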

Answer 2

Score: 1


From https://golang.org/pkg/sync/#Pool :

> a free list maintained as part of a short-lived object is not a
> suitable use for a Pool, since the overhead does not amortize well in
> that scenario. It is more efficient to have such objects implement
> their own free list

  1. You may try to set the GOGC value to greater than 100.

https://dave.cheney.net/2015/11/29/a-whirlwind-tour-of-gos-runtime-environment-variables

  2. Or, implement your own free list (a sketch follows below).

http://golang-jp.org/doc/effective_go.html#leaky_buffer
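Here is a sketch of a channel-based free list in the spirit of the "leaky buffer" example linked above. This is my own illustration, assuming it is acceptable to let surplus objects fall through to the GC when the list is full; aPrivateStruct is the struct from the question:

    package freelist

    type aPrivateStruct struct{ /* fields of the decoded message */ }

    // freeList is a buffered channel used as a fixed-size free list;
    // its capacity bounds how many idle objects are kept alive.
    var freeList = make(chan *aPrivateStruct, 1024)

    // get reuses an object from the free list if one is available,
    // otherwise it allocates a new one.
    func get() *aPrivateStruct {
        select {
        case rf := <-freeList:
            return rf
        default:
            return new(aPrivateStruct)
        }
    }

    // put returns an object to the free list; if the list is already full,
    // the object is dropped and left to the garbage collector
    // (the "leaky" part of the leaky buffer).
    func put(rf *aPrivateStruct) {
        *rf = aPrivateStruct{} // reset before reuse
        select {
        case freeList <- rf:
        default:
        }
    }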

Answer 3

Score: 0


Go 1.13 (Q4 2019) might change that: see CL 166961.

The original issue was issue 22950: "sync: avoid clearing the full Pool on every GC"

> where I find it surprising that there are around 1000 allocations again every cycle. This seems to indicate that the Pool is clearing its entire contents upon every GC.
> A peek at the implementation seems to indicate that this is so.

Result:

> ## sync: smooth out Pool behavior over GC with a victim cache

> Currently, every Pool is cleared completely at the start of each GC.
> This is a problem for heavy users of Pool because it causes an allocation spike immediately after Pools are cleared, which impacts both throughput and latency.

> This CL fixes this by introducing a victim cache mechanism.

> Instead of clearing Pools, the victim cache is dropped and the primary cache is moved to the victim cache.

> As a result, in steady-state, there are (roughly) no new allocations, but if Pool usage drops, objects will still be collected within two GCs (as opposed to one).

> This victim cache approach also improves Pool's impact on GC dynamics.
> The current approach causes all objects in Pools to be short lived. However, if an application is in steady state and is just going to repopulate its Pools, then these objects impact the live heap size as if they were long lived.
> Since Pooled objects count as short lived when computing the GC trigger and goal, but act as long lived objects in the live heap, this causes GC to trigger too frequently.
> If Pooled objects are a non-trivial portion of an application's heap, this increases the CPU overhead of GC. The victim cache lets Pooled objects affect the GC trigger and goal as long-lived objects.

> This has no impact on Get/Put performance, but substantially reduces the impact to the Pool user when a GC happens.
> PoolExpensiveNew demonstrates this in the substantial reduction in the rate at which the "New" function is called.
