Is there any performance difference in just reading an atomic variable compared to a normal variable?


Question

    int i = 0;
    if(i == 10)  {...}  // [1]

    std::atomic<int> ai{0};
    if(ai == 10) {...}  // [2]
    if(ai.load(std::memory_order_relaxed) == 10) {...}  // [3]

Is statement [1] any faster than statements [2] and [3] in a multithreaded environment? Assume that `ai` may or may not be written by another thread while [2] and [3] are executing.

Add-on: Provided that an accurate value of the underlying integer is not a necessity, which is the fastest way to read an atomic variable?

# Answer 1

**Score:** 10


It depends on the architecture, but in general loads are cheap; pairing one with a store that has strict memory ordering, however, can be expensive.

On x86_64, loads and stores of up to 64-bits are atomic on their own (but read-modify-write is decidedly _not_). 

As you have it, the default memory ordering in C++ is `std::memory_order_seq_cst`, which gives you sequential consistency, i.e. there's some single order in which all threads will see loads/stores occurring. To accomplish this on x86 (and indeed on all multi-core systems) requires a memory fence on stores, to ensure that loads occurring after the store read the new value.

_Reading_ in this case does _not_ require a memory fence on strongly-ordered x86, but writing does. [On most weakly-ordered ISAs, even seq_cst reading would require some barrier instructions][1], but not a *full* barrier. If we look at this code:

    #include <atomic>
    #include <stdlib.h>

    int main(int argc, const char* argv[]) {
        std::atomic<int> num;

        num = 12;
        if (num == 10) {
            return 0;
        }
        return 1;
    }

compiled with -O3:

       0x0000000000000560 <+0>:     sub    $0x18,%rsp
       0x0000000000000564 <+4>:     mov    %fs:0x28,%rax
       0x000000000000056d <+13>:    mov    %rax,0x8(%rsp)
       0x0000000000000572 <+18>:    xor    %eax,%eax
       0x0000000000000574 <+20>:    movl   $0xc,0x4(%rsp)
       0x000000000000057c <+28>:    mfence
       0x000000000000057f <+31>:    mov    0x4(%rsp),%eax
       0x0000000000000583 <+35>:    cmp    $0xa,%eax
       0x0000000000000586 <+38>:    setne  %al
       0x0000000000000589 <+41>:    mov    0x8(%rsp),%rdx
       0x000000000000058e <+46>:    xor    %fs:0x28,%rdx
       0x0000000000000597 <+55>:    jne    0x5a1 <main+65>
       0x0000000000000599 <+57>:    movzbl %al,%eax
       0x000000000000059c <+60>:    add    $0x18,%rsp
       0x00000000000005a0 <+64>:    retq

We can see that the _read_ from the atomic variable at +31 doesn't require anything special, but because we wrote to the atomic at +20, the compiler had to insert an `mfence` instruction afterwards, which ensures that this thread waits for its store to become visible before doing any later loads. This is _expensive_, stalling this core until the store buffer drains. (Out-of-order execution of later non-memory instructions is still possible on some x86 CPUs.)

If we instead use a weaker ordering (such as `std::memory_order_release`) on the write:

    #include <atomic>
    #include <stdlib.h>

    int main(int argc, const char* argv[]) {
        std::atomic<int> num;

        num.store(12, std::memory_order_release);
        if (num == 10) {
            return 0;
        }
        return 1;
    }

Then on x86 we don't need the fence:

       0x0000000000000560 <+0>:     sub    $0x18,%rsp
       0x0000000000000564 <+4>:     mov    %fs:0x28,%rax
       0x000000000000056d <+13>:    mov    %rax,0x8(%rsp)
       0x0000000000000572 <+18>:    xor    %eax,%eax
       0x0000000000000574 <+20>:    movl   $0xc,0x4(%rsp)
       0x000000000000057c <+28>:    mov    0x4(%rsp),%eax
       0x0000000000000580 <+32>:    cmp    $0xa,%eax
       0x0000000000000583 <+35>:    setne  %al
       0x0000000000000586 <+38>:    mov    0x8(%rsp),%rdx
       0x000000000000058b <+43>:    xor    %fs:0x28,%rdx
       0x0000000000000594 <+52>:    jne    0x59e <main+62>
       0x0000000000000596 <+54>:    movzbl %al,%eax
       0x0000000000000599 <+57>:    add    $0x18,%rsp
       0x000000000000059d <+61>:    retq

Note though, if we compile this same code for AArch64:

       0x0000000000400530 <+0>:     stp    x29, x30, [sp,#-32]!
       0x0000000000400534 <+4>:     adrp   x0, 0x411000
       0x0000000000400538 <+8>:     add    x0, x0, #0x30
       0x000000000040053c <+12>:    mov    x2, #0xc
       0x0000000000400540 <+16>:    mov    x29, sp
       0x0000000000400544 <+20>:    ldr    x1, [x0]
       0x0000000000400548 <+24>:    str    x1, [x29,#24]
       0x000000000040054c <+28>:    mov    x1, #0x0
       0x0000000000400550 <+32>:    add    x1, x29, #0x10
       0x0000000000400554 <+36>:    stlr   x2, [x1]
       0x0000000000400558 <+40>:    ldar   x2, [x1]
       0x000000000040055c <+44>:    ldr    x3, [x29,#24]
       0x0000000000400560 <+48>:    ldr    x1, [x0]
       0x0000000000400564 <+52>:    eor    x1, x3, x1
       0x0000000000400568 <+56>:    cbnz   x1, 0x40057c <main+76>
       0x000000000040056c <+60>:    cmp    x2, #0xa
       0x0000000000400570 <+64>:    cset   w0, ne
       0x0000000000400574 <+68>:    ldp    x29, x30, [sp],#32
       0x0000000000400578 <+72>:    ret

When we write to the variable at +36, we use a Store-Release instruction (stlr), and loading at +40 uses a Load-Acquire (ldar).  These each provide a partial memory fence (and together form a full fence).

You should only use atomics when you _have_ to reason about the ordering of accesses to the variable. To answer your add-on question: use `std::memory_order_relaxed` to read the atomic. It guarantees atomicity only, with no guarantees about synchronizing with writes from other threads.
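
To make the ordering point concrete, here is a minimal sketch (not from the original answer; `payload` and `ready` are hypothetical names) of the message-passing pattern that release/acquire orderings, and the `stlr`/`ldar` pair above, implement:

    #include <atomic>

    int payload = 0;                      // plain data, published via the flag below
    std::atomic<bool> ready{false};

    void producer() {
        payload = 42;                                  // 1. write the data
        ready.store(true, std::memory_order_release);  // 2. publish: earlier writes cannot move below this store
    }

    void consumer() {
        if (ready.load(std::memory_order_acquire)) {   // later reads cannot move above this load
            int x = payload;  // guaranteed to be 42 if the load above returned true
            (void)x;
        }
    }

Replacing the acquire load with a relaxed one would keep the flag read atomic, but would no longer guarantee that `payload` is visible; that is exactly the trade-off behind the add-on question.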


  [1]: https://preshing.com/20120930/weak-vs-strong-memory-models/




# Answer 2

**Score:** 2


The 3 cases presented have different semantics, so it may be pointless to reason about their relative performance, unless the value is never written after the threads have started.

**Case 1:**

    int i = 0;
    if(i == 10)  {...}  // may actually be optimized away since `i` is clearly 0 now

If `i` is accessed by more than one thread, and at least one of those accesses is a write, the behavior is undefined.

In the absence of synchronization, the compiler is free to assume that no other thread can modify `i`, and may reorder or optimize accesses to it. For example, it may load `i` into a register once and never re-read it from memory, or it may hoist writes out of a loop and only write once at the end.
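
As a minimal sketch of that hoisting hazard (hypothetical names; the exact transformation depends on the compiler and flags), compare a spin-wait on a plain flag with one on an atomic:

    #include <atomic>

    bool plain_ready = false;                // written by another thread: a data race, hence UB
    std::atomic<bool> atomic_ready{false};   // safe to poll from another thread

    void spin_plain() {
        // The compiler may legally read plain_ready once and turn this into
        // "if (!plain_ready) for (;;) {}", i.e. an infinite loop.
        while (!plain_ready) { }
    }

    void spin_atomic() {
        // Every iteration must actually re-load the value from memory.
        while (!atomic_ready.load(std::memory_order_relaxed)) { }
    }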

**Case 2:**

    std::atomic<int> ai{0};
    if(ai == 10) {...}  // [2]

By default, reads and writes to an atomic are done with `std::memory_order_seq_cst` (sequentially consistent) memory ordering. This means that not only are reads/writes to `ai` atomic, but they also become visible to other threads in a timely manner, and they order the reads/writes of other variables before/after them.

So reading/writing an atomic acts as a memory fence. This, however, is much slower, since (1) an SMP system must synchronize caches between processors, and (2) the compiler has much less freedom to optimize code around the atomic access.
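
If you want to see this cost on your own hardware, a rough microbenchmark along these lines can be used (a hypothetical sketch, and single-threaded, so it measures only the instruction overhead, not cross-core cache contention):

    #include <atomic>
    #include <chrono>
    #include <cstdio>

    // Time a workload in nanoseconds.
    template <typename F>
    long long time_ns(F f) {
        auto t0 = std::chrono::steady_clock::now();
        f();
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    }

    int main() {
        constexpr long N = 10000000;
        std::atomic<long> a{0};

        long long seq = time_ns([&] {
            for (long i = 0; i < N; ++i)
                a.store(i, std::memory_order_seq_cst);  // xchg or mov+mfence on x86
        });
        long long rel = time_ns([&] {
            for (long i = 0; i < N; ++i)
                a.store(i, std::memory_order_release);  // plain mov on x86
        });
        std::printf("seq_cst: %lld ns, release: %lld ns\n", seq, rel);
    }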

**Case 3:**

    std::atomic<int> ai{0};
    if(ai.load(std::memory_order_relaxed) == 10) {...}  // [3]

This mode guarantees only the atomicity of reads/writes to `ai`. The compiler is thus again free to reorder accesses to it, and only guarantees that writes become visible to other threads in a reasonable amount of time.

It's applicability is very limited, as it makes it very hard to reason about the order of events in a program. For example

    std::atomic<int> ai{0}, aj{0};

    // thread 1
    aj.store(1, std::memory_order_relaxed);
    ai.store(10, std::memory_order_relaxed);

    // thread 2
    if(ai.load(std::memory_order_relaxed) == 10) {
      aj.fetch_add(1, std::memory_order_relaxed);
      // is aj 1 or 2 now??? no way to tell.
    }

This mode is potentially (and often) slower than case 1 since the compiler must ensure each read/write actually goes out to cache/RAM, but is faster than case 2, since it's still possible to optimize other variables around it.
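
One legitimate use of this mode, sketched here with hypothetical names, is a statistics counter: each increment must be atomic, but no ordering with other data is required:

    #include <atomic>

    std::atomic<long> events_seen{0};   // hypothetical shared counter

    void worker() {
        // Atomic increment; implies no ordering with surrounding code.
        events_seen.fetch_add(1, std::memory_order_relaxed);
    }

    long report() {
        // Relaxed read: the value may be slightly stale, but is never torn.
        return events_seen.load(std::memory_order_relaxed);
    }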

For more details about atomics and memory ordering, see Herb Sutter's excellent "atomic<> Weapons" talk.

# Answer 3

**Score:** 1


> Regarding your comment on UB, will only the accuracy of the data be affected, or can it crash the system (a kind of UB)?

The usual consequence if you don't use `atomic<>` when you should for reads is something like the question "MCU programming - C++ O2 optimization breaks while loop".

e.g. a `while(!ready){}` loop turns into `if(!ready) infinite_loop();` by hoisting the load.

Just don't do it; instead, manually hoist the atomic load in the source if/when that's acceptable, like `int localtmp = shared_var.load(std::memory_order_relaxed);`.
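
A minimal sketch of that manual hoist (hypothetical names; it assumes a value that is stale for the duration of the loop is acceptable):

    #include <atomic>

    std::atomic<int> shared_var{0};   // written by another thread
    int results[1024];

    void process_batch(int n) {
        // One relaxed atomic load, hoisted by hand; the loop then works on a
        // plain local that the compiler may keep in a register. Updates to
        // shared_var during the loop are deliberately ignored.
        int localtmp = shared_var.load(std::memory_order_relaxed);
        for (int i = 0; i < n && i < 1024; ++i)
            results[i] = localtmp + i;
    }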
