Atomic fence concurrency between Goroutines and C threads - what are the semantics?

Question

I'm wondering if it is possible to coordinate atomic operation concurrency between goroutines and C threads explicitly.

The use case here involves an audio processing library in C, which creates an OS thread, and periodically calls a user-supplied callback to retrieve audio data. This must happen in almost real-time, so I don't want to incur the overhead of cgo calls, stack swaps, and Go-land concurrency. A ring buffer can solve this problem in general, where one thread writes to the buffer, another reads, and synchronization is performed with memory fences.
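The ring buffer idea mentioned above can be sketched in plain Go as follows. This is only an illustration with hypothetical names (`ring`, `push`, `pop`, `transfer`), both sides are goroutines rather than a C thread, and it assumes that `sync/atomic` stores and loads order the surrounding plain memory accesses, which is exactly the property the question asks about.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// ring is a hypothetical single-producer/single-consumer byte ring buffer.
// Its size must be a power of two so that index masking works.
type ring struct {
	buf  []byte
	mask uint32
	head uint32 // next slot to read; advanced only by the consumer
	tail uint32 // next slot to write; advanced only by the producer
}

func newRing(size uint32) *ring {
	if size == 0 || size&(size-1) != 0 {
		panic("size must be a power of two")
	}
	return &ring{buf: make([]byte, size), mask: size - 1}
}

// push stores one byte and reports false when the buffer is full.
func (r *ring) push(b byte) bool {
	tail := atomic.LoadUint32(&r.tail)
	head := atomic.LoadUint32(&r.head)
	if tail-head == uint32(len(r.buf)) {
		return false
	}
	r.buf[tail&r.mask] = b
	// Publish the slot. ASSUMPTION: the atomic store orders the plain
	// write above before it becomes visible to the consumer.
	atomic.StoreUint32(&r.tail, tail+1)
	return true
}

// pop removes one byte and reports false when the buffer is empty.
func (r *ring) pop() (byte, bool) {
	head := atomic.LoadUint32(&r.head)
	tail := atomic.LoadUint32(&r.tail)
	if head == tail {
		return 0, false
	}
	b := r.buf[head&r.mask]
	// Release the slot back to the producer.
	atomic.StoreUint32(&r.head, head+1)
	return b, true
}

// transfer sends n bytes through a ring of the given size and returns
// the sum of the bytes seen by the consumer.
func transfer(n int, size uint32) int {
	r := newRing(size)
	done := make(chan int)
	go func() {
		sum := 0
		for got := 0; got < n; {
			if b, ok := r.pop(); ok {
				sum += int(b)
				got++
			} else {
				runtime.Gosched() // buffer empty, let the producer run
			}
		}
		done <- sum
	}()
	for i := 0; i < n; {
		if r.push(byte(i % 256)) {
			i++
		} else {
			runtime.Gosched() // buffer full, let the consumer run
		}
	}
	return <-done
}

func main() {
	fmt.Println(transfer(1000, 16))
}
```

The indices are 32-bit on purpose: 64-bit atomic operations have alignment requirements on some 32-bit platforms, which a sketch like this would otherwise have to handle.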

However, it appears that the memory semantics of atomic operations in Go are currently left completely undefined in the docs, and are therefore utterly useless for this purpose, and probably many others... (https://golang.org/pkg/sync/atomic/ unhelpfully just says "atomic"; see https://github.com/golang/go/issues/5045)

But - it has to work in some way, even if that's not documented. How?

PLEASE NOTE I am not asking about solutions to the problem I describe, however. I am not asking if ring buffers are the correct choice, or if I should "communicate by sharing" or whatever. I am asking after the currently implemented memory order semantics of atomic operations in Go (say, the latest release version - 1.16.5 for concreteness).

In particular, here is a sample program which sets up a similar situation to what occurs in my actual use case:

    package main

    /*
    #include <pthread.h>
    #include <malloc.h>

    typedef struct {
        int fence_0;
        char *data;
    } shared_data;

    shared_data *make_shared_data() {
        shared_data *sd = calloc(sizeof(shared_data), 1);
        sd->data = calloc(1024,1);
        sd->data[0] = 17;
        return sd;
    }

    void *get_shared_data_ptr(shared_data *sd) {
        return sd->data;
    }

    int read_data_in_pthread(shared_data *sd) {
        int l;
        __atomic_load(&sd->fence_0, &l, __ATOMIC_ACQUIRE);
        if (l < 2) return 0;
        return sd->data[0] + sd->data[1023];
    }

    */
    import "C"
    import (
        "fmt"
        "runtime"
        "reflect"
        "unsafe"
        "sync/atomic"
    )

    func main() {

        // Prevent thread/cache switching (to avoid asking a third, unimportant question and allow the below "naughty")
        runtime.LockOSThread()

        // Allocate a C-owned structure.
        csd := C.make_shared_data()

        // This is just an expedient for the sake of this example, I'm aware it's naughty/bad, etc.
        ptr := (*byte)(C.get_shared_data_ptr(csd))
        arrptr := &reflect.SliceHeader{Data: uintptr(unsafe.Pointer(ptr)), Len: 1024, Cap: 1024}
        arr := *(*[]byte)(unsafe.Pointer(arrptr))

        fmt.Printf("%d\n", arr[0])
        done := make(chan bool)

        // Repeatedly execute a reader function in a cgo thread which will output zero if first fence is not 2
        // and output the sum of the first and last data points if it is.
        go func(){
            var s uint8
            s = 0
            for s == 0 {
                s = uint8(C.read_data_in_pthread(csd))
            }
            fmt.Printf("finished: %d\n", s)
            done <- true
        }()

        go func(){
            atomic.StoreInt32((*int32)(&csd.fence_0), 1)
            for i := 0; i < 1024; i++ {
                arr[i] = 255
            }
            atomic.StoreInt32((*int32)(&csd.fence_0), 2)
        }()

        <-done
    }

The question is: (a) Can the output of this program ever be 17? (b) If not, must the output of this program always be 254, or might it be 255?

If the Go atomic stores work with a memory model similar to gcc's __ATOMIC_SEQ_CST, the memory fence is sequentially consistent, and we'll always see 254. This would seem to be a sensible default. But, is it necessarily true?

If not, my program will be non-portable and produce errors. So, I'd like to know for sure.

(Yes, I know the test case above is definitely entirely non-portable / only runs on GNU/Linux... the actual library in question is in fact portable.)
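The expectation stated in the question (always 254 if the atomics order the surrounding writes) can be illustrated with a Go-only analogue of the program. This is a sketch, not an answer to the cross-language question: the function name `publishAndRead` is hypothetical, the reader is a goroutine using `atomic.LoadInt32` instead of a C `__atomic_load`, and the code assumes that a `sync/atomic` store followed by a matching load publishes the earlier plain writes.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// publishAndRead is a hypothetical Go-only analogue of the program in the
// question: a writer fills a buffer between two atomic stores, and a reader
// waits for the second store before summing the first and last bytes.
func publishAndRead() uint8 {
	var fence int32
	data := make([]byte, 1024)
	data[0] = 17

	result := make(chan uint8)
	go func() {
		// Spin until the writer signals completion.
		for atomic.LoadInt32(&fence) < 2 {
			runtime.Gosched()
		}
		// ASSUMPTION: observing fence == 2 makes all 1024 writes visible.
		result <- uint8(int(data[0]) + int(data[1023]))
	}()

	atomic.StoreInt32(&fence, 1)
	for i := range data {
		data[i] = 255
	}
	atomic.StoreInt32(&fence, 2)

	return <-result
}

func main() {
	fmt.Println(publishAndRead())
}
```

Under that assumption both bytes are 255 by the time the reader proceeds, so the truncated sum is 254. Whether the same guarantee holds when the reader is a C thread is the open point of the question.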

Answer 1

Score: 2

There's a sort of impedance mismatch, as it were, between the Go memory model and the (multiple) memory models available in C and C++ (see cppreference.com on C memory order options, and note that C++ has a more nuanced view than C11 did, beginning in C++20). This can, at least in theory, make for some big headaches for implementors: calls in and out of C code, via cgo, might need to do heavy-duty CPU sync if, e.g., the Go system uses some sort of total or partial store order model and the C system uses a relaxed memory model.

In practice, each implementation will strive to use the same kinds of synchronizations for atomic-load-32 and atomic-store-32, for instance. But:

> The use case here involves an audio processing library in C, which creates an OS thread, and periodically calls a user-supplied callback to retrieve audio data. This must happen in almost real-time, so I don't want to incur the overhead of cgo calls, stack swaps, and Go-land concurrency. A ring buffer can solve this problem in general, where one thread writes to the buffer, another reads, and synchronization is performed with memory fences.
>
> [snip]
>
> But - it has to work in some way, even if that's not documented. How?

You're going to have to look at each implementation, one at a time, because the "how" could, at least potentially, be different each time. So find out what your systems use on their PowerPC implementations, find out what your systems use on their ARM implementations, and so on. You'll want to have your low-level Go routines be implementation-specific, chosen to work with your low-level C routines.

Answer 2

Score: 1

The language itself doesn't define any atomic operations. The sync/atomic package, however, does. The issue you link is prefixed "doc:", meaning that they're only debating how to improve the documentation surrounding atomic's interaction with the Go memory model. The package still works. The operations in it are atomic as described. Any known exceptions are listed in the "Bugs" section: https://golang.org/pkg/sync/atomic/#pkg-note-BUG
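What "atomic as described" means in practice can be shown with a small counter. This is a minimal sketch with a hypothetical function name (`countAtomically`); it demonstrates only that concurrent `atomic.AddInt32` calls do not lose updates, and says nothing about how those operations order other memory accesses.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countAtomically is a hypothetical helper: it starts the given number of
// goroutines, each of which increments a shared counter perWorker times
// using atomic.AddInt32, and returns the final value.
func countAtomically(workers, perWorker int) int32 {
	var counter int32
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				atomic.AddInt32(&counter, 1) // indivisible read-modify-write
			}
		}()
	}
	wg.Wait()
	return atomic.LoadInt32(&counter)
}

func main() {
	fmt.Println(countAtomically(8, 1000))
}
```

With a plain `counter++` in place of `atomic.AddInt32`, the increments could interleave and the total could come out lower than `workers * perWorker`.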

huangapple
  • Posted on June 12, 2021, 01:12:39
  • Please keep this link when reposting: https://go.coder-hub.com/67941030.html