What is the right std::atomic memory order for dynamic scheduling?

Question

I am wondering what the correct memory order is for the classic "atomic counter dynamic scheduling" idiom. That is:

  1. use fetch-add to get the index i of the next element to process
  2. if i is past the end of the array, terminate
  3. process element i thread-safely, since no other thread can have i
  4. go to 1.

For example:

#include <atomic>

std::atomic_int counter = 0;

void foo(int *data, int size) {
    // we could also write counter++
    for (int i; (i = counter.fetch_add(1, std::memory_order::seq_cst)) < size;) {
        data[i] *= 2;
    }
}
// driver code
#include <thread>
#include <numeric>
#include <cassert>
#include <iterator>  // for std::size

int main() {
    int data[1'000'000];
    std::iota(std::begin(data), std::end(data), 0);

    std::thread other{foo, data, std::size(data)};
    foo(data, std::size(data));
    other.join();

    for (int i = 0; i < std::size(data); ++i) {
        assert(data[i] == i * 2);
    }
}

This code works, and it should be safe, because processing an element cannot be reordered before or after getting the next index, and all fetch-adds are observed in a consistent total order by all threads. These requirements seem overly strict to me, and I believe we could use a more relaxed ordering.

I believe std::memory_order_relaxed and std::memory_order::acquire are too relaxed, because all threads initially observe counter = 0;, and we have to ensure that data[0] *= 2 is not moved before the first fetch_add. Both of these memory orders would allow that.

The answer has to be one of:

  • std::memory_order::seq_cst
  • std::memory_order::acq_rel
  • std::memory_order::release

Which one is the correct order in this situation?

Answer 1

Score: 5


relaxed is sufficient.

Every counter.fetch_add(1, relaxed) will return a different value, and every value from 0 upwards will be returned once. Atomicity alone guarantees this since all operations are on the same atomic object.

There's no guarantee about which thread will get which i value, and the operations on data[i] aren't ordered, but that's fine because exactly one thread will access each data[i] until thread.join() creates synchronization between the writer and reader.
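
Applied to the question's code, a minimal sketch of this claim (only the memory order changes):

#include <atomic>

std::atomic_int counter = 0;

void foo(int *data, int size) {
    // relaxed is enough here: atomicity alone hands out each i exactly once,
    // and the caller's thread.join() publishes the data[] writes to the reader
    for (int i; (i = counter.fetch_add(1, std::memory_order_relaxed)) < size;) {
        data[i] *= 2;
    }
}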


The relevant wording in the C++ standard is the part saying an atomic RMW must see the "latest value" and that a consistent modification-order exists for each atomic object separately. And that within a single thread, operations on the same atomic object respect the sequenced-before program order. (i.e. one fetch_add can't "reorder" with another fetch_add.)

In hardware, the relevant behaviours are atomicity of the RMW, and cache coherency which guarantees that all other copies of the cache line must be invalidated before an atomic RMW can do its load+add+store. (That's what makes atomic RMWs on the same address serializable.)


If not relying on .join() for synchronization with the reader

Otherwise you'd need a separate atomic<bool> done flag, or release RMWs if some other thread was just looking at counter. The waiting thread would have to wait for counter == size + nthreads to be sure that every thread has done a release operation sequenced after its last data[i] *= 2. (And these form a release sequence, so the reader would synchronize-with all the threads.)

Each thread will stop doing increments after it sees a false fetch_add() < size. The first thread (in the modification order of counter) to see that happen will have loaded i=size and stored i=size+1. The second thread to leave the loop will have loaded size+1 and stored size+2. So with nthreads=1 or 2, counter==size+nthreads after they've done their final store as part of the RMW.

So a load that sees counter == size+nthreads can be sure that all threads have done a fetch_add after their last data[i] *= 2. If those were release fetch_adds and this was an acquire load, the stores to the data[0..size-1] objects are guaranteed to have happened-before this load, so later code in this thread can see the modified values.

(You can only check that all threads are finished; before that it's not guaranteed that data[0..counter-1] are finished writing or anything like that. Even with seq_cst increments, you could have a thread claim an i value but stall or get scheduled out for a long time before accessing that data[i]. So any array element could still be being modified.)
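
A sketch of that scheme, under the stated assumptions (worker, wait_for_all, and nthreads are illustrative names, not from the question; a real program would use something less wasteful than a spin loop):

#include <atomic>

std::atomic_int counter = 0;

void worker(int *data, int size) {
    // release on the RMW: each thread's final fetch_add (the one that fails
    // the < size test) is sequenced after its last data[i] *= 2
    for (int i; (i = counter.fetch_add(1, std::memory_order_release)) < size;) {
        data[i] *= 2;
    }
}

void wait_for_all(int size, int nthreads) {
    // each thread does exactly one fetch_add that returns >= size before
    // stopping, so counter's final value is size + nthreads
    while (counter.load(std::memory_order_acquire) != size + nthreads) {
        // spin; the release RMWs form a release sequence, so this acquire
        // load synchronizes-with every worker once it sees size + nthreads
    }
    // all stores to data[0..size-1] happened-before this point
}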


Could a compiler batch this for you into fetch_add(128, relaxed)

... and loop over the i values? Hypothetically yes, but practically no. See https://stackoverflow.com/questions/45960387/why-dont-compilers-merge-redundant-stdatomic-writes - compilers don't optimize atomics at all, except for reordering non-atomic operations across them when allowed.

Hypothetically, a compiler could even statically decide to compile it so it always runs with one thread doing all the increments and the other thread doing none (except the final fetch_add to exit the loop), e.g. by doing a single += 1'000'000 first and then looping over those i values. This isn't a correctness problem, and it would be pretty insane for a compiler to actually do; it's not something you need to rule out.

Even if you used seq_cst, a sufficiently smart Deathstation 9000 compiler could analyze the whole program and see that nothing synchronizes-with the values of counter, so it could still make the same transformation. If anything observed the final value of counter, it would also have to make sure counter ends up as size + nthreads, so it couldn't just do fetch_add(1'000'000) in both threads.


You want to manually use batches larger than 1

A batch size like 4096 bytes or ints could make sense. Claiming 4096 elements to work on and looping over i + 0..4095 allows that inner loop to use SIMD, and have plenty of work in-flight before the next slow atomic RMW that has to wait for that counter cache-line to bounce between threads.

Having only one thread access a cache line containing a range of data[] values avoids bouncing those cache lines around. So one 64-byte cache line is the minimum batch size you might want to use if data[] is aligned; otherwise you need larger batches so splits only happen at the ends of batches. But you want larger batches anyway, since 64 bytes is only a couple or a few SIMD loads/stores.

If data[] is page-aligned, then page-sized blocks mean only one thread needs a dTLB miss for that page, and hardware prefetching on modern x86 CPUs stops at page boundaries (since contiguous virtual pages might not be physically contiguous.) So it's a natural place to end a batch, with HW prefetch not creating contention for the thread working on the next chunk.

Especially on x86 where every atomic RMW is effectively seq_cst and a full memory barrier in asm, you want to make your batch size plenty large to amortize those costs. As large as possible without leaving a lot of leftover work at the end if one thread stalls or something.
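
A minimal sketch of such manual batching (foo_batched and BATCH = 4096 are illustrative choices, not from the question):

#include <atomic>
#include <algorithm>

std::atomic_int counter = 0;

void foo_batched(int *data, int size) {
    constexpr int BATCH = 4096;  // tune to multiples of cache lines / pages
    for (int start; (start = counter.fetch_add(BATCH, std::memory_order_relaxed)) < size;) {
        int end = std::min(start + BATCH, size);
        // contiguous chunk owned by one thread: the inner loop can vectorize,
        // and there's only one slow atomic RMW per BATCH elements
        for (int i = start; i < end; ++i) {
            data[i] *= 2;
        }
    }
}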

Answer 2

Score: 3


relaxed is ok, because regardless of memory order, each separate atomic variable is always consistent across all threads.

Memory orders are only necessary when you have more than one variable (atomic or not) shared across threads, because with relaxed, threads can disagree about how operations on different variables are interleaved, and stronger orders fix that.
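
An illustrative litmus test of that disagreement, using two hypothetical relaxed variables x and flag (not from the question):

#include <atomic>

std::atomic_int x = 0, flag = 0;

void producer() {
    x.store(42, std::memory_order_relaxed);
    flag.store(1, std::memory_order_relaxed);  // may become visible before the store to x
}

void consumer() {
    if (flag.load(std::memory_order_relaxed) == 1) {
        int v = x.load(std::memory_order_relaxed);  // may still read 0!
        // a release store to flag plus an acquire load of flag would guarantee v == 42
        (void)v;
    }
}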

Answer 3

Score: 0


I think you're confusing this with a CAS (compare-and-swap) operation here.

fetch_add is not a CAS; it's defined as a read-modify-write (RMW) atomic operation. It's not equivalent to atomic_var.compare_exchange(i, i+1, relaxed). The whole operation is atomic and always executes in full.
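
A sketch of the distinction (function names are illustrative): a single compare_exchange attempt can fail if another thread got there first, so emulating fetch_add with CAS needs a retry loop, while fetch_add itself always succeeds as one atomic RMW.

#include <atomic>

std::atomic_int counter = 0;

// one atomic read-modify-write; never fails
int claim_with_fetch_add() {
    return counter.fetch_add(1, std::memory_order_relaxed);
}

// a lone compare_exchange is weaker: it fails when counter changed underneath
// us, so a loop is needed (expected is updated to the current value on failure)
int claim_with_cas_loop() {
    int expected = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(expected, expected + 1,
                                          std::memory_order_relaxed)) {
        // retry with the refreshed expected value
    }
    return expected;
}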

The memory order parameter specifies the ordering relative to other atomic operations happening concurrently or sequentially; weaker orders give the compiler (and hardware) more freedom to reorder the surrounding operations.

When using an atomic increment operation, even in relaxed mode, the order of execution will ensure that the n-th fetch_add of the atomic variable returns n-1. If it didn't, the code would break (and probably the universe too, but that's a minor inconvenience).

Put differently, you'd never observe, for example, i = 0 in Thread A and then i = 2 in Thread B without some thread observing i = 1 in between. One thread will observe i = 1 before any thread observes i = 2.

What's unknown is which thread will execute the n-th fetch on this variable. When you weaken the memory order, you allow the compiler to reorder the surrounding operations in whatever way it's permitted to.

Since you aren't touching another shared variable here, even the relaxed order will work. If you were (say, changing an atomic variable tmp after fetching counter), then ordering would matter: without restricting it, the order in which the changed values of tmp and counter become visible might not be the same in another thread.

In that case, Thread B might see the previous tmp value even if code in Thread A changed tmp before counter.

If you set seq_cst on fetch_add, then this behavior can't happen.

Posted by huangapple on 2023-06-15 21:03:13.
Please keep this link when reposting: https://go.coder-hub.com/76482776.html