How does the memory controller guarantee memory ordering of atomics when propagating cachelines?


Question

How can the memory controller ensure the ordering of two cache lines so that the memory ordering implied by the atomic operation on x is respected? That is a complex question. One possible answer is that the memory controller always propagates cache lines in the order in which they were updated in L1: if CL 2 is updated after CL 1, CL 1 is propagated to main memory before CL 2, which satisfies the ordering constraint. The exact mechanism, however, differs between processors and architectures and is handled by the hardware and cache design.

I'm currently taking a deep look at std::atomics and the C++ memory model. What really helped my mental model is the concept of the store and load buffers of the CPU, which are basically FIFO queues for data that has to be written to or read from the L1 cache, and which are present on Intel architectures at least. I understand that atomic operations are basically instructions to the CPU which prevent the wrapped type from tearing AND prevent write or read instructions from being reordered across the barrier, either at compile time or at run time. To illustrate the gap in my mental model I quickly came up with this example:

Demo

#include <atomic>
#include <iostream>
#include <thread>

int a;
int b;
int c;
std::atomic<int> x;
int e = 0;

auto thread1() {
    while(1) {
        a = 3;
        b = 5;
        c = 1;
        x.store(10, std::memory_order::release);
        e++;
        std::cout << "stored!" << std::endl;
    }
}

auto thread2() {
    while(1) {
        x.load(std::memory_order::acquire);
        std::cout << b << std::endl;
    }
}


int main() {
    auto t1 = std::thread(&thread1);
    auto t2 = std::thread(&thread2);
    // Join so the std::thread destructors don't call std::terminate when
    // main returns (the loops above never exit anyway).
    t1.join();
    t2.join();
}

Here, one thread writes to the global variables a, b and c, the atomic variable x and the normal variable e (read before increment), while the other thread reads from the atomic variable x and the normal variable b. Assume for the next part that both threads actually run on different CPU cores. Also keep in mind that this simple example completely ignores contention and synchronisation and only serves as a static example.

Now this is my mental model of the process:

[Diagram: the store buffer feeding data into the L1 cache, which then propagates through the L2 and L3 caches towards main memory]

As you can see, the store buffer feeds data in an ordered manner into the L1 cache. The data then propagates through the L2 and L3 caches to main memory. Nobody knows when it will arrive there though, but it will arrive in full cachelines of 64 Kb (on most architectures). Let's assume now that the global variables a, b and c happen to be placed on different cachelines than x and e. Which sparked my question: How will the memory controller know to propagate the two cachelines such that the memory ordering implied by the atomic operation on x is respected?

What I mean is that if cacheline 1) happens to arrive in main memory before CL 2), everything is fine: the newly written values of a, b and c are visible to the other thread before the store of x. But what if the reverse happens? If cacheline 2) propagates first, the write to x and possibly e will be visible BEFORE the writes to a, b and c, which would result in an invalid memory ordering. This must be prevented somehow. I figured out a few possible solutions:

  1. The memory controller always propagates cachelines in the same order in which they were updated in L1. As CL 2) is updated after CL 1), it will push 1) to main memory before 2) and the constraints are satisfied.
  2. The memory controller somehow "knows" about the ordering relationships of the cachelines and basically keeps a mental note of which CL to propagate first through the caches.

There might be other solutions I can't think of right now, but I think understanding this puzzle piece would help me complete my mental model to an acceptable level of detail. Also, please correct me if my understanding is somehow flawed.

Answer 1

Score: 5


> Nobody knows when it will arrive there though

Inner caches participate in the cache-coherency protocol. AFAIK, all modern CPUs use some variation of MESI. (The Wikipedia article describes it in terms of processors snooping a shared bus, but actual CPUs use a "directory", e.g. Intel CPUs with an inclusive L3 cache use L3 tags to keep track of which core might have a modified copy of a cache line. Skylake-Xeon and later have a separate coherency directory and snoop filter since their L3 cache is NINE (not inclusive, not exclusive).)

Before a store can commit from the store buffer to a cache-line, this core has to get exclusive ownership of it. If it wasn't already in Modified or Exclusive state, it needs to do a Read For Ownership which invalidates any copies in other cores, and wait for a response to acknowledge that it has ownership. Only then can it commit the store, making it globally visible at that point. (A share request or RFO from another core could come in and see the value.)
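
To make the RFO idea concrete, here is a minimal toy sketch of that commit rule (my own illustration under simplifying assumptions, not how any real memory controller is built; names like CacheLineState, request_for_ownership and commit_store are invented for this example):

// Toy model of MESI states for one cache line in a single core's L1.
enum class CacheLineState { Modified, Exclusive, Shared, Invalid };

struct CacheLine {
    CacheLineState state = CacheLineState::Invalid;
    int data = 0;
};

// Hypothetical "Read For Ownership": in real hardware this is a coherency
// message that invalidates all other copies; here it is just a stand-in.
void request_for_ownership(CacheLine& line) {
    // ... send invalidations / directory lookup, wait for acknowledgements ...
    line.state = CacheLineState::Exclusive;
}

// A store can only commit from the store buffer once the line is owned.
void commit_store(CacheLine& line, int value) {
    if (line.state == CacheLineState::Shared || line.state == CacheLineState::Invalid)
        request_for_ownership(line);   // stall the store until ownership arrives
    line.data = value;                 // commit: globally visible via coherency from here on
    line.state = CacheLineState::Modified;
}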

With a coherent cache, memory reordering is only local (within each CPU core, ordering of its loads and stores to coherent cache). e.g. the store buffer delays stores, and out-of-order exec (or just in-order with a hit-under-miss cache) does loads early and possibly out-of-order.

The interconnect between cores may have to take some care to not introduce new memory reordering, but mostly it Just Works by maintaining coherency. In practice the hardware does maintain ordering.


So memory barrier instructions just have to make later memory operations wait for some earlier things to complete, e.g. for the store buffer to drain if it's a full barrier like x86 mfence. Or not if it's just an acq_rel fence (https://preshing.com/20130922/acquire-and-release-fences/).

Note that atomic operations like store(val, release) only have one-way ordering: they can reorder with later stores and loads, but not earlier ones. (https://preshing.com/20120913/acquire-and-release-semantics/). Fences are different from operations, and stronger, needing to be 2-way.
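
As a hedged illustration of that difference (plain standard C++; payload and ready are just example variables), compare tagging a single store with release against using a standalone release fence before a relaxed store:

#include <atomic>

int payload = 0;
std::atomic<bool> ready{false};

// Release *operation*: a one-way barrier attached to this particular store.
// The write to payload can't move after it, but later accesses may move before it.
void publish_with_release_store() {
    payload = 42;
    ready.store(true, std::memory_order_release);
}

// Release *fence* + relaxed store: the fence orders all earlier reads and writes
// against all later stores, not just against one tagged store.
void publish_with_release_fence() {
    payload = 42;
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
}

// Matching consumer: an acquire load pairs with either publisher.
int consume() {
    if (ready.load(std::memory_order_acquire))
        return payload;   // guaranteed to read 42 if ready was observed as true
    return -1;
}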


x86's strongly-ordered memory model has to preserve the illusion of program order (plus a store buffer with store-forwarding which can lead to StoreLoad reordering), but weakly-ordered ISAs are allowed to create LoadLoad, StoreStore, and even LoadStore reordering. (https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/) In that case the load and store buffers aren't FIFOs.
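
For illustration, here is the classic store-buffer (Dekker-style) litmus test, sketched with relaxed atomics (my own example, not from the linked articles): the outcome r1 == 0 && r2 == 0 is permitted, which is exactly StoreLoad reordering, and making all four accesses seq_cst is what forbids it:

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> X{0}, Y{0};
int r1, r2;

void writer_then_reader_1() {
    X.store(1, std::memory_order_relaxed);
    r1 = Y.load(std::memory_order_relaxed);   // may effectively run before the store is visible
}

void writer_then_reader_2() {
    Y.store(1, std::memory_order_relaxed);
    r2 = X.load(std::memory_order_relaxed);
}

int main() {
    std::thread a(writer_then_reader_1), b(writer_then_reader_2);
    a.join();
    b.join();
    // r1 == 0 && r2 == 0 is a legal result: each load can execute while the
    // other core's store is still sitting in its store buffer.
    std::printf("r1=%d r2=%d\n", r1, r2);
}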

Even x86 CPUs do speculatively load out of order, and check later that the cache line is still owned once the load was architecturally allowed to happen, probably at retirement. (This is why x86 has machine_clears.memory_ordering pipeline nukes when another core is modifying data we're loading.)


Shared cache is the backstop for coherency (L3 in this case); data doesn't have to go all the way to DRAM. Even on a multi-socket Xeon, or a big Zen with multiple L3 clusters, they can communicate with each other and pass cache lines to one another without a slow round-trip to DRAM as well.

In general, the C++ memory model as defined by the standard is a lot weaker than anything you can explain in terms of simple cache-coherent hardware. e.g. IRIW reordering is only possible in real life on a few machines, such as POWER, with store-forwarding between logical cores on the same physical core. See https://stackoverflow.com/questions/27807118/will-two-atomic-writes-to-different-locations-in-different-threads-always-be-see
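
A sketch of the IRIW (independent reads of independent writes) litmus test that the linked question is about (example code of mine, not taken from that answer): two writers store to independent variables, and with only acquire/release the two readers are allowed to observe the writes in opposite orders; making everything seq_cst forbids that:

#include <atomic>
#include <thread>

std::atomic<int> X{0}, Y{0};
int r1, r2, r3, r4;

void writer_x() { X.store(1, std::memory_order_release); }
void writer_y() { Y.store(1, std::memory_order_release); }

void reader_xy() {
    r1 = X.load(std::memory_order_acquire);
    r2 = Y.load(std::memory_order_acquire);
}

void reader_yx() {
    r3 = Y.load(std::memory_order_acquire);
    r4 = X.load(std::memory_order_acquire);
}

int main() {
    std::thread a(writer_x), b(writer_y), c(reader_xy), d(reader_yx);
    a.join(); b.join(); c.join(); d.join();
    // r1==1, r2==0, r3==1, r4==0 is allowed with acquire/release: the readers
    // disagree about which independent write happened first (IRIW reordering).
}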

Another good example is that only atomic RMWs continue a release-sequence, so an acquire load only establishes a happens-before relationship with the release-store whose value it loads, not with earlier release stores in the modification order. But in a machine where stores can only become visible to other cores via commit to a coherent cache, pure stores do also have to wait for ownership of the cache line, so in terms of cache coherency are similar to RMWs. So the coherent-cache model can't explain such effects allowed on paper by the ISO C++ standard.
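
To make that concrete, here is a small hedged sketch (variable and function names invented) of how an intervening plain store from another thread breaks a release sequence while an atomic RMW continues it:

#include <atomic>

int data = 0;
std::atomic<int> flag{0};

void thread_a() {                                   // the releasing writer
    data = 1;
    flag.store(1, std::memory_order_release);       // head of a release sequence
}

void thread_b_rmw() {                               // variant 1
    // An atomic RMW, even relaxed, continues thread A's release sequence.
    flag.fetch_add(1, std::memory_order_relaxed);
}

void thread_b_store() {                             // variant 2
    // A plain store from another thread does NOT continue the release sequence.
    flag.store(2, std::memory_order_relaxed);
}

void thread_c() {                                   // the acquiring reader
    if (flag.load(std::memory_order_acquire) == 2) {
        // Variant 1: the value 2 was produced by an RMW in A's release sequence,
        // so this load synchronizes-with A and reading data (== 1) is safe.
        // Variant 2: on paper there is no synchronizes-with edge to A, so reading
        // data here is a data race, even though on cache-coherent hardware you
        // would almost certainly still observe 1.
        int d = data;
        (void)d;
    }
}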

In early drafts of the C++ standard, it was going to guarantee that even pure stores continue a release-sequence, but this got changed. Perhaps because of some real hardware where that's not the case, like maybe PowerPC, or at least some ISAs that don't guarantee something on paper.

Don't fall into the trap of thinking that something might be guaranteed in the ISO standard's formalism just because you can't think of a hardware mechanism that can explain a reordering that violates it. (But thinking in hardware terms is useful in the other direction: if you know how C++ atomics compile to asm for AArch64 for example, and can think of a way that a reordering is possible on AArch64, it must not be guaranteed by ISO C++. It's exceedingly unlikely that there's a bug in the mappings.)

On most modern CPUs, a cache line is 64 bytes (64 B), not 64 kilobits (64 Kb = 8KB).
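
For example (a sketch assuming the usual 64-byte line; the struct name is invented), this is how you would deliberately place two hot variables on separate cache lines to avoid false sharing:

#include <atomic>

// 64 assumes a typical 64-byte cache line; where available, C++17's
// std::hardware_destructive_interference_size (from <new>) expresses the same idea.
struct alignas(64) PaddedCounter {
    std::atomic<int> value{0};
    // the alignment pads the struct out to a full cache line
};

// Each counter now owns its own line, so two cores incrementing them
// don't ping-pong ownership of the same 64-byte line between L1 caches.
PaddedCounter counter_a, counter_b;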


