go – 使用cgo时,`__GI___pthread_mutex_unlock` 占用了大部分的执行时间。

huangapple go评论97阅读模式
英文:

go - `__GI___pthread_mutex_unlock` takes most of the execution time when using cgo

问题

我正在使用cgo从Go调用C函数。在C函数内部,有一个回调到Go函数。换句话说,我在进行Go -> C -> Go的调用。

在运行pprof之后,我注意到__GI___pthread_mutex_unlock占用了一半的执行时间。据我所知,cgo有一定的开销,特别是从C回调到Go。但是奇怪的是,cgo花费了一半的执行时间来进行一些锁定操作。我的代码有什么问题吗?

这是要翻译的内容。

英文:

I'm using cgo to call a C function from Go. Inside the C function there is a callback to a Go function. In other way, I'm calling Go -> C -> Go.

After running pprof, I noticed that the __GI___pthread_mutex_unlock took half of the execution time. AFAIK, cgo has an overhead, especially calling back from C to Go. But it's weird that cgo takes half of the execution time to do some locking. Is there something wrong in my code?

main.go

package main

/*
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

extern void go_get_value(uint8_t* key32, uint8_t *value32);

void get_value(uint8_t* key32, uint8_t *value32) {
	uint8_t value[32];
	go_get_value(key32, value);
	memcpy(value32, value, 32);
}
*/
import "C"
import (
	"fmt"
	"runtime"
	"sync"
	"time"
	"unsafe"

	"github.com/pkg/profile"

	_ "github.com/ianlancetaylor/cgosymbolizer"
)

func getValue(key [32]byte) []byte {
	key32 := (*C.uint8_t)(C.CBytes(key[:]))
	value32 := (*C.uint8_t)(C.malloc(32))
	C.get_value(key32, value32)
	ret := C.GoBytes(unsafe.Pointer(value32), 32)
	C.free(unsafe.Pointer(key32))
	C.free(unsafe.Pointer(value32))
	return ret
}

func main() {
	defer profile.Start().Stop()

	numWorkers := runtime.NumCPU()
	fmt.Printf("numWorkers = %v\n", numWorkers)
	numTasks := 10_000_000
	tasks := make(chan struct{}, numTasks)
	for i := 0; i < numTasks; i++ {
		tasks <- struct{}{}
	}
	close(tasks)
	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			for range tasks {
				value := getValue([32]byte{})
				_ = value
			}
			wg.Done()
		}()
	}
	wg.Wait()
	fmt.Printf("took %vms\n", time.Since(start).Milliseconds())
}

callback.go

package main

/*
#include <stdint.h>

extern void go_get_value(uint8_t* key32, uint8_t *value32);
*/
import "C"
import (
	"unsafe"
)

func copyToCbytes(src []byte, dst *C.uint8_t) {
	n := len(src)
	for i := 0; i < n; i++ {
		*(*C.uint8_t)(unsafe.Pointer(uintptr(unsafe.Pointer(dst)) + uintptr(i))) = (C.uint8_t)(src[i])
	}
}

//export go_get_value
func go_get_value(key32 *C.uint8_t, value32 *C.uint8_t) {
	key := C.GoBytes(unsafe.Pointer(key32), 32)
	_ = key
	value := make([]byte, 32)
	copyToCbytes(value, value32)
}

Running enviroment:

lscpu

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          32
On-line CPU(s) list:             0-31
Thread(s) per core:              2
Core(s) per socket:              16
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           79
Model name:                      Intel(R) Xeon(R) CPU @ 2.20GHz
Stepping:                        0
CPU MHz:                         2200.152
BogoMIPS:                        4400.30
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       512 KiB
L1i cache:                       512 KiB
L2 cache:                        4 MiB
L3 cache:                        55 MiB
NUMA node0 CPU(s):               0-31
...
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtop
                                 ology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowpre
                                 fetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities

Golang version

go version go1.20.5 linux/amd64

Here is the pprof result:
go – 使用cgo时,`__GI___pthread_mutex_unlock` 占用了大部分的执行时间。

EDIT: added running environment

答案1

得分: 2

尽管我无法使用你上面的程序进行复现:

文件:回调
构建ID:e295a7c26f8d6b18641985f09c9fa3872b3ae569
类型:CPU
时间:2023年6月20日下午12:41(+07)
持续时间:3.08秒,总样本数=19.49秒(632.73%)
进入交互模式(键入“help”获取命令,“o”获取选项)
(pprof) top
显示占总时间19490ms的节点11520ms,占比59.11%
删除了120个节点(累计<=97.45ms)
显示88个节点中的前10个
      flat  flat%   sum%        cum   cum%
    2320ms 11.90% 11.90%     2320ms 11.90%  __GI___lll_lock_wake
    2170ms 11.13% 23.04%     2170ms 11.13%  futex_wait(内联)
    1750ms  8.98% 32.02%     1750ms  8.98%  runtime.procyield
    1010ms  5.18% 37.20%     1240ms  6.36%  runtime.casgstatus
     950ms  4.87% 42.07%     1060ms  5.44%  runtime.reentersyscall
     870ms  4.46% 46.54%     3480ms 17.86%  lll_mutex_lock_optimized(内联)
     740ms  3.80% 50.33%     1900ms  9.75%  runtime.mallocgc
     610ms  3.13% 53.46%     4090ms 20.99%  ___pthread_mutex_lock
     570ms  2.92% 56.39%      810ms  4.16%  runtime.exitsyscallfast
     530ms  2.72% 59.11%     2220ms 11.39%  runtime.lock2

但是,每个回调函数都有一个全局互斥锁,因此如果进行并行回调,性能会受到影响。

英文:

Though I could not reproduce it with your program above:

 &#177; go1.20.5 tool pprof /tmp/profile3378726905/cpu.pprof
File: callback
Build ID: e295a7c26f8d6b18641985f09c9fa3872b3ae569
Type: cpu
Time: Jun 20, 2023 at 12:41pm (+07)
Duration: 3.08s, Total samples = 19.49s (632.73%)
Entering interactive mode (type &quot;help&quot; for commands, &quot;o&quot; for options)
(pprof) top
Showing nodes accounting for 11520ms, 59.11% of 19490ms total
Dropped 120 nodes (cum &lt;= 97.45ms)
Showing top 10 nodes out of 88
      flat  flat%   sum%        cum   cum%
    2320ms 11.90% 11.90%     2320ms 11.90%  __GI___lll_lock_wake
    2170ms 11.13% 23.04%     2170ms 11.13%  futex_wait (inline)
    1750ms  8.98% 32.02%     1750ms  8.98%  runtime.procyield
    1010ms  5.18% 37.20%     1240ms  6.36%  runtime.casgstatus
     950ms  4.87% 42.07%     1060ms  5.44%  runtime.reentersyscall
     870ms  4.46% 46.54%     3480ms 17.86%  lll_mutex_lock_optimized (inline)
     740ms  3.80% 50.33%     1900ms  9.75%  runtime.mallocgc
     610ms  3.13% 53.46%     4090ms 20.99%  ___pthread_mutex_lock
     570ms  2.92% 56.39%      810ms  4.16%  runtime.exitsyscallfast
     530ms  2.72% 59.11%     2220ms 11.39%  runtime.lock2

But there's a global mutex for every callbacks, so that would kill the performance if you do parallel callbacks.

huangapple
  • 本文由 发表于 2023年6月20日 03:00:16
  • 转载请务必保留本文链接:https://go.coder-hub.com/76509418.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定