Why atomic store on variable that cross cache-line boundaries compiles to normal MOV store instruction?

Question


Let's look at this code:

#include <stdint.h>
#pragma pack (push,1)
typedef struct test_s
{
    uint64_t a1;
    uint64_t a2;
    uint64_t a3;
    uint64_t a4;
    uint64_t a5;
    uint64_t a6;
    uint64_t a7;
    uint8_t b1;
    uint64_t a8;
}test;

int main()
{
    test t;
    __atomic_store_n(&(t.a8), 1, __ATOMIC_RELAXED);
}

Since the structure is packed, a8 is not naturally aligned and can also be split across a 64-byte cache-line boundary, yet the assembly generated by GCC 12.2 is

main:
        push    rbp
        mov     rbp, rsp
        mov     eax, 1
        mov     QWORD PTR [rbp-23], rax
        mov     eax, 0
        pop     rbp
        ret

Why does it translate to a simple MOV? Isn't the MOV non-atomic in that case?
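
To confirm the layout, a quick static check (added here for reference, reusing the test typedef above) shows that a8 really does start at byte 57, so it spans a 64-byte boundary whenever the struct starts at a cache-line boundary:

#include <stddef.h>

/* a1..a7 occupy bytes 0..55 and b1 is byte 56, so a8 occupies bytes 57..64 */
_Static_assert(offsetof(test, a8) == 57, "a8 starts at offset 57");
_Static_assert(offsetof(test, a8) / 64 !=
               (offsetof(test, a8) + sizeof(uint64_t) - 1) / 64,
               "a8 spans two 64-byte lines when the struct is 64-byte aligned");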

Addition:
The same code on Clang 16 calls an atomic library function and translates to

main:                                   # @main
        push    rbp
        mov     rbp, rsp
        sub     rsp, 80
        lea     rdi, [rbp - 72]
        add     rdi, 57
        mov     qword ptr [rbp - 80], 1
        mov     rsi, qword ptr [rbp - 80]
        xor     edx, edx
        call    __atomic_store_8@PLT
        xor     eax, eax
        add     rsp, 80
        pop     rbp
        ret

Answer 1

Score: 5


Correct, the store isn't atomic in that case; misaligned atomic operations aren't supported in GNU C.

You created a misaligned uint64_t and took its address. That's not safe in general. Packed structs only work reliably when you access their misaligned members through the struct directly. You can also create crashes with misaligned-pointer undefined behaviour, e.g. with a packed struct { char a; int arr[1024]; } and then passing the pointer as a plain int* to a function that might auto-vectorize.
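
A minimal sketch of that failure mode (hypothetical function names; whether it actually crashes depends on the compiler emitting aligned SIMD loads):

#include <stddef.h>

struct __attribute__((packed)) bad { char a; int arr[1024]; };

/* May be auto-vectorized with aligned loads (e.g. movdqa), because a plain
   int* promises 4-byte alignment that the caller below can't deliver. */
long sum(const int *p, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += p[i];
    return total;
}

long use(struct bad *b)
{
    return sum(b->arr, 1024);   /* UB: b->arr is misaligned (offset 1) */
}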

If you use __atomic_store_n on variables that aren't sufficiently aligned, it's undefined behaviour AFAIK. I don't think it supports typedef __attribute__((aligned(1), may_alias)) int *unaligned_int; producing different asm.

GCC's __atomic builtins don't have a way to query the required alignment like we can with alignas(std::atomic_ref<uint64_t>::required_alignment) uint64_t foo;

There is bool __atomic_is_lock_free (size_t size, void *ptr) which takes a pointer arg to check the alignment (0 for typical / default alignment for the type), but it returns 1 for size=8 even with a guaranteed-cache-line-split object like the a8 member of _Alignas(64) test global_t;. (Without known alignment for the start of the struct, a8 in a pointed-to object might happen to be fully within one cache line, which is sufficient on Intel but not AMD for atomicity guarantees.)
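
This is easy to demonstrate (a sketch, reusing the packed struct from the question; per the behaviour just described, GCC prints 1 here despite the guaranteed line split):

#include <stdio.h>
#include <stdint.h>

#pragma pack (push,1)
typedef struct test_s
{
    uint64_t a1, a2, a3, a4, a5, a6, a7;
    uint8_t b1;
    uint64_t a8;
} test;
#pragma pack (pop)

_Alignas(64) test global_t;   /* a8 is guaranteed to straddle a cache line */

int main(void)
{
    /* prints 1 (lock-free) even though an 8-byte access to a8 is split,
       so the builtin's answer is not an atomicity guarantee here */
    printf("%d\n", __atomic_is_lock_free(sizeof global_t.a8, &global_t.a8));
}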

I think you're supposed to assume that for any lock-free atomic, it needs alignas(sizeof(T)), i.e. natural alignment; otherwise you can't safely use __atomic builtins on it. This isn't explicitly documented in https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html but perhaps is somewhere else.
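
Until then, a defensive pattern in GNU C (a sketch of my own, not something the builtins provide) is to assert natural alignment before touching the object atomically:

#include <assert.h>
#include <stdint.h>

/* Sketch: refuse to do the store unless the object has natural alignment,
   i.e. alignas(sizeof(T)), since __atomic builtins on under-aligned
   objects are undefined behaviour. */
static void store_relaxed_u64(uint64_t *p, uint64_t v)
{
    assert(((uintptr_t)p % sizeof *p) == 0);
    __atomic_store_n(p, v, __ATOMIC_RELAXED);
}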


See also https://stackoverflow.com/questions/61996108/atomic-ref-when-external-underlying-type-is-not-aligned-as-requested re: implementation design considerations for that case, whether to check alignment and make things slow, or whether to let the user shoot themselves in the foot like you're doing, by making the access non-atomic.

GCC could detect this and warn, which would be good, but I wouldn't expect them to add compiler back-end support for x86's ability to do misaligned atomic accesses (with a lock prefix for an RMW instruction, or xchg) at the cost of extremely bad performance which locks the bus and thus slows down other cores. That's a disaster on modern many-core servers, so nobody wants that, the right fix is to fix your code.
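
For the struct in the question, one such fix (a sketch; test_fixed is my renaming) is to move the odd byte to the end: every uint64_t then sits at a multiple of 8, the packing pragma is no longer needed for the member offsets, and only the total size grows (72 instead of 65 bytes, from tail padding):

#include <stdint.h>
#include <stddef.h>

typedef struct test_fixed_s
{
    uint64_t a1, a2, a3, a4, a5, a6, a7, a8;  /* offsets 0..56, all 8-aligned */
    uint8_t  b1;                              /* odd byte moved to the end */
} test_fixed;

_Static_assert(offsetof(test_fixed, a8) == 56, "a8 naturally aligned");

int main(void)
{
    test_fixed t;
    __atomic_store_n(&t.a8, 1, __ATOMIC_RELAXED);  /* plain MOV, and atomic */
}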

Most other ISAs can't do misaligned atomic operations at all.


Semi-related: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146#c4 - even in non-packed structs, GCC under-aligned C11 _Atomic members for a long time, e.g. keeping the default alignof(uint64_t)==4 on some 32-bit ISAs like x86 -m32, not promoting to the necessary alignas(sizeof(T)). _Atomic uint64_t a8 doesn't change GCC's code-gen, even with a direct dereference for a load, and clang refused to compile it.

Interesting clang output

As you note, it warns, unlike GCC. With __attribute__((packed)) on the struct rather than #pragma pack, we also get warnings for taking the address at all. (Godbolt)

<source>:41:30: warning: taking address of packed member 'a8' of class or structure 'test_s' may result in an unaligned pointer value [-Waddress-of-packed-member]
    return __atomic_load_n(&(t->a8), __ATOMIC_RELAXED);
                             ^~~~~
<source>:41:12: warning: misaligned atomic operation may incur significant performance penalty; the expected alignment (8 bytes) exceeds the actual alignment (1 bytes) [-Watomic-alignment]
    return __atomic_load_n(&(t->a8), __ATOMIC_RELAXED);

The __atomic_store_8 library function clang calls will actually give atomicity on x86-64; it ignores the memory_order parameter in RDX and assumes __ATOMIC_SEQ_CST - the implementation is just xchg [rdi],rsi / ret.

But __atomic_load_8 won't: its implementation is mov rax, [rdi] / ret (because C++ atomic mappings to x86 asm put the cost of blocking StoreLoad reordering between seq_cst ops onto stores, leaving SC loads the same as acquire.) So clang isn't gaining anything by choosing not to inline __atomic_load_n for a known-misaligned 8-byte load.
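
For contrast, on a naturally aligned uint64_t the usual x86-64 mappings look like this (a sketch; compile with gcc or clang at -O2 to see the asm):

#include <stdint.h>

_Alignas(8) uint64_t x;

void sc_store(uint64_t v)
{
    /* seq_cst store: compilers emit xchg (or mov + mfence), paying the
       StoreLoad-barrier cost on the store side */
    __atomic_store_n(&x, v, __ATOMIC_SEQ_CST);
}

uint64_t sc_load(void)
{
    /* seq_cst load: a plain mov, the same instruction as an acquire load */
    return __atomic_load_n(&x, __ATOMIC_SEQ_CST);
}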

OTOH it doesn't hurt, and a custom implementation of libatomic could do something about it, e.g. with lock cmpxchg, or whatever else if you're running in some emulator or other weird environment.
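
A sketch of what such a fallback could look like (hypothetical helper, x86-64 GNU C inline asm; lock cmpxchg stays atomic even on a cache-line-split operand, at great cost to the whole machine):

#include <stdint.h>

/* Hypothetical libatomic-style fallback: store 8 bytes atomically even if
   *p is misaligned, by looping on lock cmpxchg. */
static void store8_any_alignment(uint64_t *p, uint64_t val)
{
    uint64_t expected = *p;        /* non-atomic seed value for the CAS loop */
    for (;;) {
        uint64_t prev = expected;
        __asm__ volatile("lock cmpxchg %2, %0"
                         : "+m"(*p), "+a"(prev)
                         : "r"(val)
                         : "cc", "memory");
        if (prev == expected)      /* RAX unchanged: the exchange happened */
            return;
        expected = prev;           /* lost a race: retry with the value seen */
    }
}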

Interesting that clang chooses not to inline based on misalignment. But its warning only makes sense for atomic RMW ops on x86-64, where it is a performance penalty rather than lack of atomicity. Or SC stores, as long as libatomic implements that with xchg rather than mov + mfence.
