Why does an atomic store on a variable that crosses a cache-line boundary compile to a normal MOV store instruction?
Question
Let's look at the code:
#include <stdint.h>
#pragma pack (push,1)
typedef struct test_s
{
    uint64_t a1;
    uint64_t a2;
    uint64_t a3;
    uint64_t a4;
    uint64_t a5;
    uint64_t a6;
    uint64_t a7;
    uint8_t b1;
    uint64_t a8;
} test;

int main()
{
    test t;
    __atomic_store_n(&(t.a8), 1, __ATOMIC_RELAXED);
}
Since we have a packed structure, a8 is not naturally aligned and should also be split across a 64-byte cache-line boundary, but the assembly generated by GCC 12.2 is:
main:
        push    rbp
        mov     rbp, rsp
        mov     eax, 1
        mov     QWORD PTR [rbp-23], rax
        mov     eax, 0
        pop     rbp
        ret
Why does it translate to a simple MOV? Isn't the MOV non-atomic in that case?
Addition:
The same code on clang 16 calls an atomic library function and translates to:
main:                                   # @main
        push    rbp
        mov     rbp, rsp
        sub     rsp, 80
        lea     rdi, [rbp - 72]
        add     rdi, 57
        mov     qword ptr [rbp - 80], 1
        mov     rsi, qword ptr [rbp - 80]
        xor     edx, edx
        call    __atomic_store_8@PLT
        xor     eax, eax
        add     rsp, 80
        pop     rbp
        ret
Answer 1
Score: 5
Correct, the store isn't atomic in that case; misaligned atomic operations aren't supported in GNU C.

You created a misaligned uint64_t and took its address. That's not safe in general. Packed structs only work reliably when you access their misaligned members through the struct directly. You can also create crashes with misaligned-pointer undefined behavior, e.g. with a packed struct { char a; int arr[1024]; } and then passing the pointer as a plain int* to a function that might auto-vectorize.
If you use __atomic_store_n on variables that aren't sufficiently aligned, it's undefined behavior AFAIK. I don't think it supports typedef __attribute__((aligned(1), may_alias)) int *unaligned_int; by producing different asm.
GCC's __atomic builtins don't have a way to query the required alignment like we can with alignas(std::atomic_ref<uint64_t>::required_alignment) uint64_t foo;.
There is bool __atomic_is_lock_free (size_t size, void *ptr) which takes a pointer arg to check the alignment (0 for typical / default alignment for the type), but it returns 1 for size=8 even with a guaranteed-cache-line-split object like the a8 member of _Alignas(64) test global_t;. (Without known alignment for the start of the struct, a8 in a pointed-to object might happen to be fully within one cache line, which is sufficient on Intel but not AMD for atomicity guarantees.)
I think you're supposed to assume that any lock-free atomic needs alignas(sizeof(T)), i.e. natural alignment; otherwise you can't safely use __atomic builtins on it. This isn't explicitly documented in https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html but perhaps is somewhere else.
See also https://stackoverflow.com/questions/61996108/atomic-ref-when-external-underlying-type-is-not-aligned-as-requested re: implementation design considerations for that case, whether to check alignment and make things slow, or whether to let the user shoot themselves in the foot like you're doing, by making the access non-atomic.
GCC could detect this and warn, which would be good, but I wouldn't expect them to add compiler back-end support for x86's ability to do misaligned atomic accesses (with a lock prefix for an RMW instruction, or xchg) at the cost of extremely bad performance that locks the bus and thus slows down other cores. That's a disaster on modern many-core servers, so nobody wants it; the right fix is to fix your code.
Most other ISAs can't do misaligned atomic operations at all.
Semi-related: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146#c4 - even in non-packed structs, GCC under-aligned C11 _Atomic members for a long time, e.g. keeping the default alignof(uint64_t)==4 on some 32-bit ISAs like x86 -m32, not promoting to the necessary alignas(sizeof(T)). Making the member _Atomic uint64_t a8 doesn't change GCC's code-gen here, even with a direct dereference for a load, and clang refused to compile it.
Interesting clang output
As you note, it warns, unlike GCC. With __attribute__((packed)) on the struct rather than #pragma pack, we also get warnings for taking the address at all. (Godbolt)
<source>:41:30: warning: taking address of packed member 'a8' of class or structure 'test_s' may result in an unaligned pointer value [-Waddress-of-packed-member]
return __atomic_load_n(&(t->a8), __ATOMIC_RELAXED);
                          ^~~~~
<source>:41:12: warning: misaligned atomic operation may incur significant performance penalty; the expected alignment (8 bytes) exceeds the actual alignment (1 bytes) [-Watomic-alignment]
return __atomic_load_n(&(t->a8), __ATOMIC_RELAXED);
The __atomic_store_8 library function clang calls will actually give atomicity on x86-64; it ignores the memory_order parameter in RDX and assumes __ATOMIC_SEQ_CST - the implementation is just xchg [rdi],rsi / ret.
But __atomic_load_8 won't: its implementation is mov rax, [rdi] / ret (because C++ atomic mappings to x86 asm put the cost of blocking StoreLoad reordering between seq_cst ops onto stores, leaving SC loads the same as acquire). So clang isn't gaining anything by choosing not to inline __atomic_load_n for a known-misaligned 8-byte load.
OTOH it doesn't hurt, and a custom implementation of libatomic could do something about it, e.g. with lock cmpxchg, or whatever else if you're running in some emulator or other weird environment.
Interesting that clang chooses not to inline based on misalignment. But its warning only makes sense for atomic RMW ops on x86-64, where it is a performance penalty rather than lack of atomicity. Or SC stores, as long as libatomic implements that with xchg rather than mov + mfence.