Do C++ memory barriers affect only the code in a function?

Question

From what I understand, the barrier prevents the read/write operations before it from being reordered by the compiler and the CPU past the operations that come after the barrier.

However, does this only apply to the function it's in? What if that function gets inlined?
Or does it just cause some CPU buffers to be flushed?

Answer 1

Score: 3

TL;DR: yes, the barrier affects the whole thread/program, regardless of any function calls.

I feel like you might be mixing up two things.

Let's have two threads each execute some sequence of read and write instructions, somehow interleaved.

Then, for the same address A, a value X, and the instructions write(A, 'X'); y = read(A), there are basically two cases according to the C++ memory model:

  • a) If both instructions execute on the same thread, the read is guaranteed to return 'X', i.e. y=='X'.
  • b) If the instructions happen in different threads, there are no guarantees; it is undefined behaviour unless the accesses are synchronized explicitly through some synchronization primitives.

In other words, how the compiler generated the sequence of instructions is kind of irrelevant to you - it either just works or you should not be doing it.
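
To make the two cases concrete, here is a minimal sketch (the variable data and the thread t are hypothetical names; case b) deliberately contains a data race purely to illustrate the undefined behaviour):

#include <thread>

int data = 0;            // plain, non-atomic variable

int main() {
    // a) same thread: the read is guaranteed to observe the preceding write
    data = 1;
    int y = data;        // y == 1, always

    // b) different threads, no synchronization: a concurrent write and read
    //    of a non-atomic variable is a data race, i.e. undefined behaviour
    std::thread t([]{ data = 2; });
    int z = data;        // unsynchronized read racing with the write in t
    t.join();

    return y + z;
}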

The compiler can reorder both C++ statements and the corresponding CPU instructions as it sees fit, as long as the observable result is the same as sequential execution per the C++ rules for evaluating expressions and statements. In other words, as long as you cannot observe the difference, the compiler can do almost anything it wants.
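
As a small illustration of that as-if freedom (hypothetical names): a single thread cannot tell in which order two independent stores actually happen, so the compiler and CPU may swap them.

int g1 = 0;
int g2 = 0;

void f() {
    g1 = 1;   // the compiler/CPU may perform this store...
    g2 = 2;   // ...after this one; within this thread the result is identical
}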

Of course the compiler can never reorder across what it cannot see into, because that code might have well-defined observable side effects. Therefore virtual calls, calls across TUs without -flto, and calls into shared libraries are not reordered. But relying on this for observability across threads is still undefined behaviour.
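
A sketch of why such a call pins the surrounding code in place, assuming opaque() is defined in another translation unit and no LTO is used (the names are made up for illustration):

void opaque();   // defined in another TU; the optimizer cannot see its body

int g = 0;

void f() {
    g = 1;       // cannot be moved past the call: opaque() might read or write g
    opaque();
    g = 2;       // cannot be moved before the call for the same reason
}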

All of that happens inside the C++ abstract machine model; none of it gives you any guarantees about what sort of CPU instructions are actually executed.

Furthermore, C++ explicitly makes no promises about how the sequence of CPU instructions is observable from any other thread, or from the outside world for that matter, unless explicitly synchronized. If the compiler observes that writing to some memory location is redundant because the thread/program itself cannot tell the difference, it does not have to write anything at all. For example:

int* ptr = ...
*ptr = 42;
int x = *ptr;
// can be replaced with the following, and thus no memory is written to at all:
int x = 42;

You are not saying "write 42 to memory"; you are saying the program must behave as if you had written it to memory, and unless ptr is synchronized across threads, the compiler will not care about other threads' accesses to ptr at all.
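
By contrast, here is a minimal sketch of the same situation using a synchronization primitive (assuming the location is made a std::atomic<int>; the names writer/reader are illustrative). The store is now part of the program's defined cross-thread behaviour, and a concurrent load is not a data race:

#include <atomic>

std::atomic<int> shared{0};

void writer() {
    shared.store(42, std::memory_order_release); // a real, observable write
}

int reader() {
    // well-defined even if it runs concurrently with writer():
    // returns either 0 or 42, never anything else
    return shared.load(std::memory_order_acquire);
}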

Going on, the C++ memory model operates by default on a per-thread basis, with only a specific set of primitives (atomics, locks, barriers...) that may be accessed from multiple threads. Only for them is the access synchronized, therefore it is only for them that the visibility of CPU read/write instructions plays any role at all, and it is only around them that the visibility of the effects of all other instructions is defined.

The details are on cppreference, but the idea is that any access to a shared primitive can be used to constrain the observability of the executed CPU instructions across multiple threads.

Operations on a shared primitive can force C++ to constrain the reordering of the generated CPU instructions to match the rules of C++ evaluation order.

For example, for the following shared variables

#include <atomic>

int x = 0;
std::atomic_bool a{false};

and two functions called in parallel and executed in the commented order

void thread1(){
    x = 5; // 1
    a.store(true, std::memory_order_seq_cst); // 2
}
void thread2(){
    a.load(std::memory_order_seq_cst); // 3
    int y = x; // 4
}

then y==5.

  • Step 2 - guarantees that any read of a that executes later will observe x = 5. This means the compiler is prevented from exchanging steps 1 and 2 - at the very least it acts as a compiler barrier.
  • Step 3 - ensures that all the writes/reads that happened before step 2 are actually visible to thread2 - CPU cache synchronization or whatever else is necessary. It also prevents step 4 from being reordered before step 3.

Just be careful: the memory model does not constrain execution order, only the visibility of whichever execution order happens. For the former, you need locks or explicit (non-memory) barriers. If step 3 happens to run before step 2, then step 4 is still undefined behaviour.
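
For example, one way to actually enforce that order is to spin until the flag is observed (a minimal sketch reusing the variables above; thread2_waiting is a hypothetical variant of thread2). Step 4 then only runs after step 2 has become visible, so reading x is no longer racy:

void thread2_waiting(){
    while (!a.load(std::memory_order_seq_cst)) { /* spin until step 2 is visible */ }
    int y = x; // guaranteed to read 5; the load synchronizes-with the store in thread1
}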

huangapple
  • Posted on 2023-03-07 04:01:40
  • Please keep this link when reposting: https://go.coder-hub.com/75655325.html