
Why does this C++ code crash with an apparent memory ordering race condition?

Question

The reason this code occasionally crashes on x64 is a race condition caused by multithreaded access. After Thread 1 sets object->m_finished to true, the condition if (object->m_finished) in Thread 2 can become true and Thread 2 will try to delete object, even though Thread 1 may not yet have finished executing object->manipulate(). The object can therefore already be deleted while manipulate() is still running in Thread 1, leading to memory corruption.

To avoid this, you can use a mutex or std::atomic to make the reads and writes of m_finished thread-safe, which prevents the race condition. Problems like this are common in multithreaded environments: because the execution order of different threads is indeterminate, synchronization is needed to avoid data races.


Why does this kind of code crash (very occasionally!) on x64?

class Object
{
   public:
   void manipulate(); // reads and writes the object
   bool m_finished = false; // note this is a regular bool and NOT atomic
};

Thread 1

object->manipulate();
object->m_finished = true;
// no further access of object

Thread 2

if (object->m_finished)
   delete object;

In two separate applications I saw crashes due to this pattern (this is not the exact code, just an example of a common pattern).

The crash takes the form of heap corruption where it looks very much like the memory occupied by 'object' was modified after it was freed. It's as if Thread 1 did:

object->m_finished = true;
object->manipulate();

and the object got deleted between the setting of m_finished and manipulate().

In the first app, the fix was to put a mutex around both the setting and the testing of m_finished (it already had a mutex around the setting part). It involved an OS bug in aio_error and is described in another question I asked.

The second time I saw this, it was fixed by changing m_finished to a std::atomic.
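For illustration, the atomic version of this pattern can be sketched as follows (a sketch with hypothetical names; manipulate() stands in for the real work, and a release/acquire pair is one sufficient ordering choice):

```cpp
#include <atomic>
#include <thread>

class Object {
public:
    int m_data = 0;
    void manipulate() { m_data = 42; }    // stands in for the real work
    std::atomic<bool> m_finished{false};  // atomic instead of a plain bool
};

// Returns the value the deleting thread observed before freeing the object.
int run_once() {
    Object* object = new Object;
    int observed = 0;

    std::thread t1([&] {  // Thread 1
        object->manipulate();
        // release: every write in manipulate() is ordered before this store
        object->m_finished.store(true, std::memory_order_release);
    });

    std::thread t2([&] {  // Thread 2
        // acquire: once 'true' is seen, all of manipulate()'s writes are visible
        while (!object->m_finished.load(std::memory_order_acquire)) {}
        observed = object->m_data;
        delete object;  // safe: Thread 1 no longer touches the object
    });

    t1.join();
    t2.join();
    return observed;
}
```

With the release/acquire pair, the deleting thread cannot observe m_finished == true while manipulate()'s writes are still pending, so the delete cannot race with them.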

In both cases it was an extremely rare heisencrash that would disappear with any attempt to use valgrind, ASAN etc. My assumption is that it was due to the CPUs re-ordering the writes in object->manipulate() and delete object.

Note that I am 95% sure there was no compiler-level re-ordering going on, as the various methods involved were virtual functions etc.

I get that it's good to use atomics when doing inter-thread signalling (and I promise I will forever more). As far as I understand, x86 (Intel and AMD; we run AMD) does not allow store-store re-ordering. So if the deleting thread witnesses m_finished == true, all the preceding stores performed in object->manipulate() should be committed to memory, right?

I understand that in general the behaviour of this code is undefined. But specifically why, on x64, does it appear that a mutex or atomic for m_finished is required when the CPU specs guarantee ordering of stores and loads?

Answer 1

Score: 2


On modern processors there are multiple cores, and each core has its own L1 cache. When memory is written and then reread, this may happen well before the data passes down to main memory; that is the main point of caches.

When memory is shared between cores, a write by one core may not be seen by another core until later. This effectively may result in reads and writes between cores being reordered, which is exactly the problem you describe.

Therefore, when sharing memory between cores (threads), one needs to add memory barriers to ensure that the state of the two caches is the same. This is what std::atomic does.

ARM has a very relaxed approach, meaning that memory barriers are needed in all scenarios. x86/amd64 makes stronger guarantees but still has the <s>LOCK instruction prefix</s> MFENCE instruction to deal with some pathological cases like the one you observed.

EDIT
If you take a look at the spec mentioned below, it seems amd64 does a good job of enforcing proper ordering except in one case: loads may be reordered with older stores to a different location.

https://www.cs.cmu.edu/~410-f10/doc/Intel_Reordering_318147.pdf
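The one reordering the EDIT mentions can be sketched with the classic store-load litmus test (an illustrative sketch, not the asker's code): each thread stores to one variable and then loads the other. Without a full barrier, x86 permits both threads to read 0; an MFENCE-strength fence between the store and the load rules that outcome out.

```cpp
#include <atomic>
#include <thread>
#include <utility>

std::atomic<int> x{0}, y{0};

// One run of the litmus test, with a full barrier between store and load.
std::pair<int, int> litmus_once() {
    x.store(0);
    y.store(0);
    int r1 = -1, r2 = -1;

    std::thread t1([&] {
        x.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // full barrier
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // full barrier
        r2 = x.load(std::memory_order_relaxed);
    });
    t1.join();
    t2.join();

    // With the fences, (r1, r2) == (0, 0) is impossible; remove them and
    // x86 can (rarely) produce exactly that outcome.
    return {r1, r2};
}
```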

Answer 2

Score: 2


I think the most probable cause of the crash in your code is compiler optimization. Does your code crash in builds without optimization (e.g. with -O0)?

For example, let's look at the code below on godbolt.

#include <atomic>

class Object
{
public:
   std::atomic<bool> m_finished_atomic = false;
   bool m_finished_non_atomic = false;
};

void delete_without_sync(Object* ob){
    while (true){
        if (ob->m_finished_non_atomic){
            delete ob;
            break;
        }
    }
}

void delete_with_sync(Object* ob){
    while (true){
        if (ob->m_finished_atomic.load(std::memory_order::relaxed)){
            delete ob;
            break;
        }
    }
}

As you can see, both functions do the same thing, and std::memory_order::relaxed means nothing for an x86_64 CPU. But to the compiler, these functions are VERY different. Let's look at the assembly:

delete_without_sync(Object*):        # @delete_without_sync(Object*)
        jmp     operator delete(void*)@PLT                      # TAILCALL
delete_with_sync(Object*):           # @delete_with_sync(Object*)
.LBB1_1:                                # =>This Inner Loop Header: Depth=1
        movzx   eax, byte ptr [rdi]
        test    al, 1
        je      .LBB1_1
        jmp     operator delete(void*)@PLT                      # TAILCALL

As you can see, delete_without_sync was optimized by the compiler into an immediate delete of our struct without any waiting, while delete_with_sync repeatedly tests the variable and deletes the object only when it is true. So why?

From the compiler's point of view, in delete_without_sync the field m_finished_non_atomic cannot change: the compiler assumes it is not shared between threads, because there is no synchronization. So the compiler converts the code to something like this:

void delete_without_sync(Object* ob){
    if (ob->m_finished_non_atomic) {
       delete ob;
       return;
    }
    while (true) {}
}

Then it sees that there is an infinite loop without any side effects. Such loops are undefined behaviour, so they must be unreachable. Therefore the field m_finished_non_atomic must be true, and the compiler removes even the read of it.

And now we have a use-after-free because of the data race.

In delete_with_sync the compiler cannot assume that the variable doesn't change, so it generated the "honest" loop we asked for.

In a large project there can be a lot going on that is hard for a compiler to track. Maybe there was some loop waiting for the boolean flag to update, or some other undefined behaviour arose because the compiler assumed the variable could not change. These things are much harder for a person to track than for a compiler.

Another thing that could have happened is that the compiler simply moved object->m_finished = true before the manipulate() call, because it noticed that manipulate never updates or reads it.
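That kind of hoisting is exactly what a release store forbids: the compiler (and CPU) may not move the preceding writes past it. A minimal sketch of the writer side, with hypothetical names (manipulate() stands in for the real work):

```cpp
#include <atomic>

struct Object {
    int m_data = 0;
    std::atomic<bool> m_finished{false};
    void manipulate() { m_data = 42; }  // stands in for the real work
};

void writer(Object* object) {
    object->manipulate();
    // A release store cannot be reordered before the writes above it:
    // once another thread sees 'true', m_data == 42 is guaranteed visible.
    object->m_finished.store(true, std::memory_order_release);
}
```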

In conclusion: I think it is compiler optimizations that trigger this behaviour, not the CPU itself. This is why you should always use synchronization at some point, to prevent miscompilation. The C++ memory model is far more relaxed than actual CPUs precisely to make such optimizations available.

huangapple
  • Published 2023-06-09 06:17:12
  • Please keep this link when republishing: https://go.coder-hub.com/76436042.html