go - `__GI___pthread_mutex_unlock` takes most of the execution time when using cgo
Question
I'm using cgo to call a C function from Go. Inside the C function there is a callback to a Go function. In other words, I'm calling Go -> C -> Go.
After running pprof, I noticed that __GI___pthread_mutex_unlock took half of the execution time. AFAIK, cgo has some overhead, especially when calling back from C to Go. But it's weird that cgo spends half of the execution time on locking. Is there something wrong with my code?
main.go
package main

/*
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

extern void go_get_value(uint8_t* key32, uint8_t *value32);

void get_value(uint8_t* key32, uint8_t *value32) {
	uint8_t value[32];
	go_get_value(key32, value);
	memcpy(value32, value, 32);
}
*/
import "C"

import (
	"fmt"
	"runtime"
	"sync"
	"time"
	"unsafe"

	"github.com/pkg/profile"

	_ "github.com/ianlancetaylor/cgosymbolizer"
)

func getValue(key [32]byte) []byte {
	key32 := (*C.uint8_t)(C.CBytes(key[:]))
	value32 := (*C.uint8_t)(C.malloc(32))
	C.get_value(key32, value32)
	ret := C.GoBytes(unsafe.Pointer(value32), 32)
	C.free(unsafe.Pointer(key32))
	C.free(unsafe.Pointer(value32))
	return ret
}

func main() {
	defer profile.Start().Stop()

	numWorkers := runtime.NumCPU()
	fmt.Printf("numWorkers = %v\n", numWorkers)

	numTasks := 10_000_000
	tasks := make(chan struct{}, numTasks)
	for i := 0; i < numTasks; i++ {
		tasks <- struct{}{}
	}
	close(tasks)

	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			for range tasks {
				value := getValue([32]byte{})
				_ = value
			}
			wg.Done()
		}()
	}
	wg.Wait()
	fmt.Printf("took %vms\n", time.Since(start).Milliseconds())
}
callback.go
package main

/*
#include <stdint.h>

extern void go_get_value(uint8_t* key32, uint8_t *value32);
*/
import "C"

import (
	"unsafe"
)

func copyToCbytes(src []byte, dst *C.uint8_t) {
	n := len(src)
	for i := 0; i < n; i++ {
		*(*C.uint8_t)(unsafe.Pointer(uintptr(unsafe.Pointer(dst)) + uintptr(i))) = (C.uint8_t)(src[i])
	}
}

//export go_get_value
func go_get_value(key32 *C.uint8_t, value32 *C.uint8_t) {
	key := C.GoBytes(unsafe.Pointer(key32), 32)
	_ = key

	value := make([]byte, 32)
	copyToCbytes(value, value32)
}
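A side note on the copy helper above: on Go 1.17 and later (the environment below uses go1.20.5), unsafe.Slice lets the same copy be written without per-byte uintptr arithmetic. This is only a sketch of an equivalent helper, not part of the original program:

func copyToCbytesSlice(src []byte, dst *C.uint8_t) {
	// View the C buffer as a []byte of the same length and use the builtin copy.
	copy(unsafe.Slice((*byte)(unsafe.Pointer(dst)), len(src)), src)
}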
Running environment:
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU @ 2.20GHz
Stepping: 0
CPU MHz: 2200.152
BogoMIPS: 4400.30
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 512 KiB
L1i cache: 512 KiB
L2 cache: 4 MiB
L3 cache: 55 MiB
NUMA node0 CPU(s): 0-31
...
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities
Golang version
go version go1.20.5 linux/amd64
EDIT: added running environment
Answer 1
Score: 2
Though I could not reproduce it with your program above:
± go1.20.5 tool pprof /tmp/profile3378726905/cpu.pprof
File: callback
Build ID: e295a7c26f8d6b18641985f09c9fa3872b3ae569
Type: cpu
Time: Jun 20, 2023 at 12:41pm (+07)
Duration: 3.08s, Total samples = 19.49s (632.73%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 11520ms, 59.11% of 19490ms total
Dropped 120 nodes (cum <= 97.45ms)
Showing top 10 nodes out of 88
flat flat% sum% cum cum%
2320ms 11.90% 11.90% 2320ms 11.90% __GI___lll_lock_wake
2170ms 11.13% 23.04% 2170ms 11.13% futex_wait (inline)
1750ms 8.98% 32.02% 1750ms 8.98% runtime.procyield
1010ms 5.18% 37.20% 1240ms 6.36% runtime.casgstatus
950ms 4.87% 42.07% 1060ms 5.44% runtime.reentersyscall
870ms 4.46% 46.54% 3480ms 17.86% lll_mutex_lock_optimized (inline)
740ms 3.80% 50.33% 1900ms 9.75% runtime.mallocgc
610ms 3.13% 53.46% 4090ms 20.99% ___pthread_mutex_lock
570ms 2.92% 56.39% 810ms 4.16% runtime.exitsyscallfast
530ms 2.72% 59.11% 2220ms 11.39% runtime.lock2
But there's a global mutex for every callback, so that would kill the performance if you do parallel callbacks.
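As an illustration of where some of that lock traffic can come from besides the callback itself: every call to getValue above does a C.CBytes, a C.malloc and two C.frees, and glibc's allocator takes its own mutexes for those, which can contend across many worker threads. Below is a minimal sketch (getValueNoAlloc is a hypothetical name, not from the original post) that passes Go-owned buffers straight into C instead; the cgo pointer-passing rules allow this as long as the C side does not keep the pointers after get_value returns and the buffers contain no Go pointers:

func getValueNoAlloc(key [32]byte) []byte {
	value := make([]byte, 32)
	C.get_value(
		(*C.uint8_t)(unsafe.Pointer(&key[0])),   // read-only input, not retained by C
		(*C.uint8_t)(unsafe.Pointer(&value[0])), // C writes 32 bytes into this Go slice
	)
	return value
}

This removes the per-call C allocations from the hot path, but the per-callback locking in the runtime remains, so batching more work into each cgo call (or dropping the C -> Go callback entirely) is typically the bigger win.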