Sequentially consistent fence

I have kernel code and userspace code that synchronize on atomic variables. The kernel and userspace code may be running on the same logical core or different logical cores. Let's assume the architecture is x86_64.

Here is an initial implementation to get our feet wet:

Kernel (C)                        Userspace (C++)
---------------------------       -----------------------------------
Store A (smp_store_release)       Store B (std::memory_order_release)
Load B (smp_load_acquire)         Load A (std::memory_order_acquire)
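
For concreteness, the userspace column might look roughly like the sketch below, assuming A and B are std::atomic<int> objects living in memory shared with the kernel (how that sharing is set up is omitted):

    #include <atomic>

    std::atomic<int> A{0}, B{0};  // assumed to live in memory shared with the kernel

    void userspace_initial() {
        B.store(1, std::memory_order_release);      // Store B
        int a = A.load(std::memory_order_acquire);  // Load A -- nothing here stops this
        (void)a;                                    // load from reordering before the store
    }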

I require that, from the perspective of each thread, its own load happens after its own store. So for example, from userspace's perspective, the load of A must happen after the store to B.

Furthermore, I similarly require that each thread observes the other thread performing its load after its store. So for example, from the kernel's perspective, it must observe that userspace stores to B before it loads A.

Clearly, the code above is insufficient to meet these two requirements, so for the sake of this question I rewrite it as follows:

Kernel (C)                        Userspace (C++)
---------------------------       -----------------------------------
Store A (smp_store_release)       Store B (std::memory_order_release)
cpuid                             std::atomic_thread_fence(std::memory_order_seq_cst)
Load B (smp_load_acquire)         Load A (std::memory_order_acquire)
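
The userspace column of this rewritten version, as a sketch under the same assumptions as above:

    #include <atomic>

    extern std::atomic<int> A, B;  // the same shared variables as in the previous sketch

    void userspace_rewritten() {
        B.store(1, std::memory_order_release);                // Store B
        std::atomic_thread_fence(std::memory_order_seq_cst);  // full barrier between the two
        int a = A.load(std::memory_order_acquire);            // Load A
        (void)a;
    }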

According to the Intel manual, cpuid is a serializing operation.
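
For reference, the "asm compiler-level memory barrier" mentioned in the questions below presumably means a GNU C/C++ asm statement along these lines (a hypothetical wrapper, not code from the original post); the "memory" clobber is the compiler-level part, and the output constraints cover the registers CPUID overwrites:

    // Hypothetical helper: issue CPUID (leaf 0, chosen arbitrarily) as a
    // serializing instruction combined with a compiler barrier.
    static inline void cpuid_barrier(void) {
        unsigned int eax = 0, ebx, ecx, edx;
        asm volatile("cpuid"
                     : "+a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)  // CPUID writes EAX..EDX
                     :                                             // no other inputs
                     : "memory");                                  // compiler-level barrier
    }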

Here are my questions:

  1. If I issue cpuid with the asm compiler-level memory barrier, does this have the same behavior as a sequentially consistent fence?
  2. Now let's say I issue cpuid without the asm compiler-level memory barrier. Furthermore, let's say that the store to A is standard kernel code while the load of B is done by a BPF program. Does cpuid have the same behavior as a sequentially consistent fence in this case? My impression is that it does, because (1) cpuid provides hardware serialization and (2) compiler reordering is impossible since the kernel is compiled separately from the BPF program.
  3. The C++ standard requires that synchronization occur between threads on the same address. It seems that issuing an mfence (or another type of fence) is sufficient to achieve hardware serialization, and mfence does not take a memory address as an argument. Thus, does the standard impose this requirement solely to prevent compiler reordering?

Answer 1 (score: 2)

Reordering and serialization (ordering) only apply within the current thread, to the global visibility of its accesses to coherent shared cache. See https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/. Barriers alone don't synchronize with another thread. Sufficient barriers can ensure that the order of operations is some interleaving of program order (e.g. if all atomic ops use seq_cst), but not that some load in another thread will actually see some store in this thread; that load might already have happened, and you can't rewind time.


On x86, to do a store that can't reorder with any later loads (or anything else in any direction), use xchg. It has an implicit lock prefix that makes it a full barrier.

To get that from modern compilers, use B.store(value, std::memory_order_seq_cst), where seq_cst is the default anyway. (Older GCC used to compile that to a mov-store + mfence, which is equivalent: https://stackoverflow.com/questions/49107683/why-does-a-stdatomic-store-with-sequential-consistency-use-xchg)

To document in your source that you don't want the later load to reorder with that store, make it seq_cst as well, like A.load(std::memory_order_seq_cst).

ISO C++ guarantees that seq_cst operations are part of a global total order that's consistent with source order, so seq_cst operations don't allow StoreLoad reordering with each other.

That's actually important on some other ISAs, notably AArch64, where a seq_cst store can reorder with later loads that aren't seq_cst. Only stlr / ldar have the special interaction that prevents StoreLoad reordering. An acquire load using ldapr doesn't have to wait for older stlr stores to commit to L1d cache.

x86 doesn't (yet?) have hardware support for doing anything weaker than a full barrier after or as part of the store but still strong enough for seq_cst, unlike AArch64. But there's no reason to write the source with weaker operations like acquire and rely on asm details to give you the StoreLoad ordering you need: You can still just use B.store(val) ; A.load() with both using the default seq_cst (or make it explicit). That will compile safely and cheaply on x86 and everywhere else. (x86 seq_cst loads are just plain loads because the cost of avoiding StoreLoad reordering is already done in every seq_cst store. https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html)
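
Putting that together, here is a minimal self-contained sketch of the recommended pattern. Both sides are written as C++ threads purely for illustration (in the real setup one side is kernel C and A/B live in shared memory); the function names are made up:

    #include <atomic>
    #include <thread>

    std::atomic<int> A{0}, B{0};

    // Each side stores its own flag, then loads the other one. Because both
    // operations are seq_cst, the store can't be reordered after the load
    // (no StoreLoad reordering), so at least one side is guaranteed to see
    // the other side's store.
    void kernel_side() {
        A.store(1);        // seq_cst by default; typically compiles to xchg on x86
        int b = B.load();  // seq_cst by default; just a plain load on x86
        (void)b;
    }

    void userspace_side() {
        B.store(1);
        int a = A.load();
        (void)a;
    }

    int main() {
        std::thread t1(kernel_side), t2(userspace_side);
        t1.join();
        t2.join();
    }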

A seq_cst store is also a release operation, and a seq_cst load is also an acquire operation, in case you're worried about using seq_cst instead of std::memory_order_release.

If your kernel code doesn't support a seq_cst store operation, put a seq_cst fence / full barrier between its store and load. The same thing would work with std::atomic in C++; it's just more efficient to let the compiler include the barrier as part of the SC store, instead of needing a separate mfence or a dummy lock add byte [rsp], 0 to implement a separate std::atomic_thread_fence(std::memory_order_seq_cst).
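
A sketch of the two C++ spellings this paragraph contrasts (on the kernel side, the first shape would typically be smp_store_release / a full barrier such as smp_mb() / smp_load_acquire):

    #include <atomic>

    extern std::atomic<int> A, B;  // same shared variables as in the earlier sketches

    // Option 1: separate full barrier between a release store and an acquire load.
    // On x86 the fence typically becomes mfence or a dummy locked RMW.
    void store_fence_load() {
        B.store(1, std::memory_order_release);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        int a = A.load(std::memory_order_acquire);
        (void)a;
    }

    // Option 2 (usually cheaper): let the seq_cst store carry the barrier itself.
    // On x86 the store becomes xchg and the load stays a plain load.
    void seq_cst_store_load() {
        B.store(1, std::memory_order_seq_cst);
        int a = A.load(std::memory_order_seq_cst);
        (void)a;
    }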


I have no idea why you think cpuid would be a good idea here since you only have data stores/loads, not cross-modifying code or anything tricky. Perhaps you're thinking that "serialize" implies affecting other cores? It doesn't. But to answer your questions:

  1. Yes, cpuid is at least as strong a barrier as mfence, stronger in some ways (re: code fetch for cross-modifying code IIRC). When you say "asm compiler-level memory barrier", you're talking about asm("cpuid" ::: "memory", "eax", "ebx", "ecx", "edx"), a GNU C asm statement with a "memory" clobber? (And register clobbers on the regs CPUID writes, unlike the new serialize instruction in Sapphire Rapids IIRC, which probably also avoids causing a VM exit the way CPUID always does in VMs.)

  2. Yes, if there's no chance for compile-time reordering between the ops you care about, asm("mfence" ::: ); is sufficient, but not a good idea.

  3. Serializing (ordering) the global visibility of memory operations on the current core or thread doesn't sync with another thread on its own. To sync-with another thread and create a happens-before relationship, you need a load in this thread to see a store done by the other thread. Specifically an acquire (or stronger) load seeing a release (or stronger) store, or equivalent barriers.

    See https://preshing.com/20120913/acquire-and-release-semantics/ re: writing a buffer then data_ready.store(true, std::memory_order_release), so that a reader that reads data_ready as true can safely read the buffer and see the earlier non-atomic stores done by the writer (see the sketch after this list).

    Just running a barrier instruction to order your own loads+stores after a data_ready.load() doesn't help anything if the writer is still in the middle of writing the buffer and hasn't stored to data_ready yet.

    Memory fences (a.k.a. barriers) aren't like a pthread_barrier_wait() synchronization operation, where every thread reaching it waits until all threads have reached it. That's a totally different concept.
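
As a minimal sketch of that data_ready pattern (buffer size and values are arbitrary, and a real reader would back off instead of spinning):

    #include <atomic>
    #include <cstdio>
    #include <thread>

    static int buffer[16];                       // ordinary, non-atomic data
    static std::atomic<bool> data_ready{false};

    void writer() {
        for (int i = 0; i < 16; ++i)
            buffer[i] = i * i;                   // plain stores to the buffer
        data_ready.store(true, std::memory_order_release);  // publish the buffer
    }

    void reader() {
        while (!data_ready.load(std::memory_order_acquire))
            ;                                    // spin until the writer has published
        // The acquire load that returned true synchronizes-with the release store,
        // so all of the writer's earlier plain stores to buffer are now visible.
        std::printf("%d\n", buffer[15]);
    }

    int main() {
        std::thread w(writer), r(reader);
        w.join();
        r.join();
    }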
