Why does an atomic store on a variable that crosses cache-line boundaries compile to a normal MOV store instruction?

Question

Let's look at the code:

#include <stdint.h>
#pragma pack (push,1)
typedef struct test_s
{
    uint64_t a1;
    uint64_t a2;
    uint64_t a3;
    uint64_t a4;
    uint64_t a5;
    uint64_t a6;
    uint64_t a7;
    uint8_t b1;
    uint64_t a8;
}test;

int main()
{
    test t;
    __atomic_store_n(&(t.a8), 1, __ATOMIC_RELAXED);
}

Since we have a packed structure, a8 is not naturally aligned and should also be split across a 64-byte cache-line boundary, but the assembly generated by GCC 12.2 is:

main:
        push    rbp
        mov     rbp, rsp
        mov     eax, 1
        mov     QWORD PTR [rbp-23], rax
        mov     eax, 0
        pop     rbp
        ret

Why does it translate to a simple MOV? Isn't the MOV non-atomic in that case?
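
For reference, the split can be confirmed with offsetof; this is a small illustrative check, separate from the code that produced the asm above:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#pragma pack (push,1)
typedef struct test_s
{
    uint64_t a1, a2, a3, a4, a5, a6, a7;
    uint8_t b1;
    uint64_t a8;
} test;
#pragma pack (pop)

int main()
{
    size_t off = offsetof(test, a8);                /* 57 with this packing */
    printf("a8 occupies bytes [%zu, %zu)\n", off, off + sizeof(uint64_t));
    /* [57, 65) crosses offset 64, so the 8-byte store is split across two
       cache lines whenever the struct starts on a 64-byte boundary. */
    return 0;
}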

Addition:
The same code on clang 16 calls an atomic function and translates to:

main:                                   # @main
        push    rbp
        mov     rbp, rsp
        sub     rsp, 80
        lea     rdi, [rbp - 72]
        add     rdi, 57
        mov     qword ptr [rbp - 80], 1
        mov     rsi, qword ptr [rbp - 80]
        xor     edx, edx
        call    __atomic_store_8@PLT
        xor     eax, eax
        add     rsp, 80
        pop     rbp
        ret

Answer 1

Score: 5

Correct, the store isn't atomic in that case; misaligned atomic operations aren't supported in GNU C.

You created a misaligned uint64_t and took its address. That's not safe in general. Packed structs only work reliably when you access their misaligned members through the struct directly. You can also create crashes with misaligned-pointer undefined behavior, e.g. with a packed struct { char a; int arr[1024]; } and then passing the pointer as a plain int* to a function that might auto-vectorize.
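
For example, a sketch of that hazard (the struct and function names here are made up for illustration):

#include <stdint.h>

struct __attribute__((packed)) bad
{
    char a;
    int arr[1024];   /* every element is misaligned by 1 byte */
};

/* A callee like this assumes its int* argument has alignof(int); at -O3 the
   compiler may auto-vectorize it using that alignment assumption and fault. */
long sum_ints(const int *p, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += p[i];
    return s;
}

long use(struct bad *b)
{
    /* UB: b->arr decays to a plain int* that hides the misalignment. */
    return sum_ints(b->arr, 1024);
}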

If you use __atomic_store_n on variables that aren't sufficiently aligned, it's undefined behavior AFAIK. I don't think it supports typedef __attribute__((aligned(1), may_alias)) int *unaligned_int; producing different asm.

GCC's __atomic builtins don't have a way to query the required alignment like we can with alignas(std::atomic_ref<uint64_t>::required_alignment) uint64_t foo;

There is bool __atomic_is_lock_free (size_t size, void *ptr) which takes a pointer arg to check the alignment (0 for typical / default alignment for the type), but it returns 1 for size=8 even with a guaranteed-cache-line-split object like the a8 member of _Alignas(64) test global_t;. (Without known alignment for the start of the struct, a8 in a pointed-to object might happen to be fully within one cache line, which is sufficient on Intel but not AMD for atomicity guarantees.)
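
A minimal sketch of that check, assuming the packed test struct from the question and a deliberately cache-line-aligned instance:

#include <stdint.h>
#include <stdio.h>

#pragma pack (push,1)
typedef struct test_s
{
    uint64_t a1, a2, a3, a4, a5, a6, a7;
    uint8_t b1;
    uint64_t a8;      /* offset 57, so it occupies bytes 57..64 of the struct */
} test;
#pragma pack (pop)

_Alignas(64) test global_t;   /* the struct starts exactly on a cache-line boundary */

int main()
{
    /* Prints 1 (lock-free) even though &global_t.a8 is guaranteed to span two
       cache lines, so this query can't be used to detect the problem. */
    printf("%d\n", __atomic_is_lock_free(sizeof(uint64_t), &global_t.a8));
    return 0;
}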

I think you're supposed to assume that for any lock-free atomic, it needs alignas(sizeof(T)), i.e. natural alignment, otherwise you can't safely use __atomic builtins on it. This isn't explicitly documented in https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html but perhaps is somewhere else.
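
One defensive pattern (a sketch, not something the GCC docs prescribe) is to assert natural alignment at compile time; with the packed layout from the question both checks fail to compile, which is exactly the early diagnosis you want:

#include <stddef.h>
#include <stdint.h>

#pragma pack (push,1)
typedef struct test_s
{
    uint64_t a1, a2, a3, a4, a5, a6, a7;
    uint8_t b1;
    uint64_t a8;
} test;
#pragma pack (pop)

/* Both the member offset and the struct's own alignment must cooperate for a8
   to be naturally aligned in every instance; here offsetof(test, a8) == 57 and
   _Alignof(test) == 1, so both assertions fire. */
_Static_assert(offsetof(test, a8) % sizeof(uint64_t) == 0,
               "a8 must be naturally aligned for lock-free __atomic access");
_Static_assert(_Alignof(test) >= _Alignof(uint64_t),
               "test must be at least as aligned as its atomic member");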

See also https://stackoverflow.com/questions/61996108/atomic-ref-when-external-underlying-type-is-not-aligned-as-requested re: implementation design considerations for that case, whether to check alignment and make things slow, or whether to let the user shoot themselves in the foot like you're doing, by making the access non-atomic.

GCC could detect this and warn, which would be good, but I wouldn't expect them to add compiler back-end support for x86's ability to do misaligned atomic accesses (with a lock prefix for an RMW instruction, or xchg) at the cost of extremely bad performance which locks the bus and thus slows down other cores. That's a disaster on modern many-core servers, so nobody wants that; the right fix is to fix your code.

Most other ISAs can't do misaligned atomic operations at all.

Semi-related: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146#c4 - even in non-packed structs, GCC under-aligned C11 _Atomic members for a long time, e.g. keeping the default alignof(uint64_t)==4 on some 32-bit ISAs like x86 -m32, not promoting to the necessary alignas(sizeof(T)). _Atomic uint64_t a8 doesn't change GCC's code-gen, even with a direct dereference for a load, and clang refused to compile it.

Interesting clang output

As you note, it warns, unlike GCC. With __attribute__((packed)) on the struct rather than #pragma pack, we also get warnings for taking the address at all. (Godbolt)

<source>:41:30: warning: taking address of packed member 'a8' of class or structure 'test_s' may result in an unaligned pointer value [-Waddress-of-packed-member]
    return __atomic_load_n(&(t->a8), __ATOMIC_RELAXED);
                             ^~~~~
<source>:41:12: warning: misaligned atomic operation may incur significant performance penalty; the expected alignment (8 bytes) exceeds the actual alignment (1 bytes) [-Watomic-alignment]
    return __atomic_load_n(&(t->a8), __ATOMIC_RELAXED);

The __atomic_store_8 library function clang calls will actually give atomicity on x86-64; it ignores the memory_order parameter in RDX and assumes __ATOMIC_SEQ_CST - the implementation is just xchg [rdi],rsi / ret.
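
Roughly what that amounts to, sketched in C rather than the actual libatomic source: an unconditional exchange, which GCC compiles to xchg and which therefore acts as a seq_cst store whatever order was requested:

#include <stdint.h>

/* Sketch only; the real __atomic_store_8 lives in libatomic and receives its
   memory-order argument in RDX, which it ignores. */
void store_8_like_libatomic(volatile uint64_t *p, uint64_t val, int order)
{
    (void)order;                                    /* ignored, as described above */
    __atomic_exchange_n(p, val, __ATOMIC_SEQ_CST);  /* compiles to xchg [rdi], rsi */
}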

But __atomic_load_8 won't: its implementation is mov rax, [rdi] / ret (because C++ atomic mappings to x86 asm put the cost of blocking StoreLoad reordering between seq_cst ops onto stores, leaving SC loads the same as acquire.) So clang isn't gaining anything by choosing not to inline __atomic_load_n for a known-misaligned 8-byte load.

OTOH it doesn't hurt, and a custom implementation of libatomic could do something about it, e.g. with lock cmpxchg, or whatever else if you're running in some emulator or other weird environment.

Interesting that clang chooses not to inline based on misalignment. But its warning only makes sense for atomic RMW ops on x86-64, where it is a performance penalty rather than lack of atomicity. Or SC stores, as long as libatomic implements that with xchg rather than mov + mfence.

