go - `__GI___pthread_mutex_unlock` takes most of the execution time when using cgo
Question
I'm using cgo to call a C function from Go. Inside the C function there is a callback to a Go function. In other words, I'm calling Go -> C -> Go.
After running pprof, I noticed that __GI___pthread_mutex_unlock took half of the execution time. AFAIK, cgo has some overhead, especially when calling back from C to Go. But it's weird that cgo spends half of the execution time on locking. Is there something wrong with my code?
main.go
package main

/*
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

extern void go_get_value(uint8_t* key32, uint8_t *value32);

void get_value(uint8_t* key32, uint8_t *value32) {
	uint8_t value[32];
	go_get_value(key32, value);
	memcpy(value32, value, 32);
}
*/
import "C"

import (
	"fmt"
	"runtime"
	"sync"
	"time"
	"unsafe"

	"github.com/pkg/profile"

	_ "github.com/ianlancetaylor/cgosymbolizer"
)

func getValue(key [32]byte) []byte {
	key32 := (*C.uint8_t)(C.CBytes(key[:]))
	value32 := (*C.uint8_t)(C.malloc(32))
	C.get_value(key32, value32)
	ret := C.GoBytes(unsafe.Pointer(value32), 32)
	C.free(unsafe.Pointer(key32))
	C.free(unsafe.Pointer(value32))
	return ret
}

func main() {
	defer profile.Start().Stop()

	numWorkers := runtime.NumCPU()
	fmt.Printf("numWorkers = %v\n", numWorkers)

	numTasks := 10_000_000
	tasks := make(chan struct{}, numTasks)
	for i := 0; i < numTasks; i++ {
		tasks <- struct{}{}
	}
	close(tasks)

	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			for range tasks {
				value := getValue([32]byte{})
				_ = value
			}
			wg.Done()
		}()
	}
	wg.Wait()
	fmt.Printf("took %vms\n", time.Since(start).Milliseconds())
}
callback.go
package main

/*
#include <stdint.h>

extern void go_get_value(uint8_t* key32, uint8_t *value32);
*/
import "C"

import (
	"unsafe"
)

func copyToCbytes(src []byte, dst *C.uint8_t) {
	n := len(src)
	for i := 0; i < n; i++ {
		*(*C.uint8_t)(unsafe.Pointer(uintptr(unsafe.Pointer(dst)) + uintptr(i))) = (C.uint8_t)(src[i])
	}
}

//export go_get_value
func go_get_value(key32 *C.uint8_t, value32 *C.uint8_t) {
	key := C.GoBytes(unsafe.Pointer(key32), 32)
	_ = key

	value := make([]byte, 32)
	copyToCbytes(value, value32)
}
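A side note on the copy helper above: on Go 1.17 and later (the environment below uses go1.20.5), unsafe.Slice lets the same copy be written without per-byte uintptr arithmetic. This is only a sketch of an equivalent helper, not part of the original program:

func copyToCbytesSlice(src []byte, dst *C.uint8_t) {
	// View the C buffer as a []byte of the same length and use the builtin copy.
	copy(unsafe.Slice((*byte)(unsafe.Pointer(dst)), len(src)), src)
}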
Running environment:
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU @ 2.20GHz
Stepping: 0
CPU MHz: 2200.152
BogoMIPS: 4400.30
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 512 KiB
L1i cache: 512 KiB
L2 cache: 4 MiB
L3 cache: 55 MiB
NUMA node0 CPU(s): 0-31
...
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities
Golang version
go version go1.20.5 linux/amd64
EDIT: added running environment
Answer 1
Score: 2
Though I could not reproduce it with your program above:
± go1.20.5 tool pprof /tmp/profile3378726905/cpu.pprof
File: callback
Build ID: e295a7c26f8d6b18641985f09c9fa3872b3ae569
Type: cpu
Time: Jun 20, 2023 at 12:41pm (+07)
Duration: 3.08s, Total samples = 19.49s (632.73%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 11520ms, 59.11% of 19490ms total
Dropped 120 nodes (cum <= 97.45ms)
Showing top 10 nodes out of 88
flat flat% sum% cum cum%
2320ms 11.90% 11.90% 2320ms 11.90% __GI___lll_lock_wake
2170ms 11.13% 23.04% 2170ms 11.13% futex_wait (inline)
1750ms 8.98% 32.02% 1750ms 8.98% runtime.procyield
1010ms 5.18% 37.20% 1240ms 6.36% runtime.casgstatus
950ms 4.87% 42.07% 1060ms 5.44% runtime.reentersyscall
870ms 4.46% 46.54% 3480ms 17.86% lll_mutex_lock_optimized (inline)
740ms 3.80% 50.33% 1900ms 9.75% runtime.mallocgc
610ms 3.13% 53.46% 4090ms 20.99% ___pthread_mutex_lock
570ms 2.92% 56.39% 810ms 4.16% runtime.exitsyscallfast
530ms 2.72% 59.11% 2220ms 11.39% runtime.lock2
But there's a global mutex for every callback, so that would kill the performance if you do parallel callbacks.
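As an illustration of where some of that lock traffic can come from besides the callback itself: every call to getValue above does a C.CBytes, a C.malloc and two C.frees, and glibc's allocator takes its own mutexes for those, which can contend across many worker threads. Below is a minimal sketch (getValueNoAlloc is a hypothetical name, not from the original post) that passes Go-owned buffers straight into C instead; the cgo pointer-passing rules allow this as long as the C side does not keep the pointers after get_value returns and the buffers contain no Go pointers:

func getValueNoAlloc(key [32]byte) []byte {
	value := make([]byte, 32)
	C.get_value(
		(*C.uint8_t)(unsafe.Pointer(&key[0])),   // read-only input, not retained by C
		(*C.uint8_t)(unsafe.Pointer(&value[0])), // C writes 32 bytes into this Go slice
	)
	return value
}

This removes the per-call C allocations from the hot path, but the per-callback locking in the runtime remains, so batching more work into each cgo call (or dropping the C -> Go callback entirely) is typically the bigger win.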