How to write atomic RMW-sequences on Cortex-M4 in C++

Question

The following example shows four ways to atomically increment a variable, a1 or a2 depending on the version (the same applies to other read-modify-write operations). The variable may be shared with an ISR.

The question concerns a Cortex-M4 (STM32G431). The compiler is g++ (see below).

Version 1:
As I understand it, entering an ISR issues a clrex automatically, so the first strex always fails if the sequence is interrupted. Is that correct? And does that mean the ISR itself does not need to use ldrex/strex? The implicit clrex acts as a sort of global memory clobber: would it be possible to limit the clobber to a2 in the ISR?

Version 2:
Do __disable_irq()/__enable_irq() contain a compile-time barrier? If so, are the explicit barriers unnecessary? Would it be better (for performance) to disable only the IRQ that could modify the variable a2?

Comparing Versions 1 and 2:
If no IRQ hits the sequence, both should use the same number of CPU cycles, but if an IRQ does occur, does Version 1 use more cycles?

Version 3:
This produces additional dmb barrier instructions. But as I understand it, these dmb instructions are not necessary on a single-core M4?

Version 4:
This does not generate the dmb instructions that Version 3 does. Should this be the preferred way on a single core?

#include <stm32g4xx.h>
#include <atomic>
#include <cstdint>

namespace {
    std::atomic_uint32_t a1;
    uint32_t a2;
}

int main() {
    while (true) {
        // 1: explicit exclusive load/store retry loop
        uint32_t val;
        do {
            val = __LDREXW(&a2);
            val += 1;
        } while (__STREXW(val, &a2) != 0U);   // retry if the exclusive store failed

        // 2: mask all interrupts around a plain increment
        __disable_irq();
        std::atomic_signal_fence(std::memory_order_seq_cst);    // compile-time barrier really necessary?
        ++a2;
        std::atomic_signal_fence(std::memory_order_seq_cst);    // compile-time barrier really necessary?
        __enable_irq();

        // 3: std::atomic with the default (seq_cst) ordering
        std::atomic_fetch_add(&a1, 1);

        // 4: relaxed std::atomic RMW between compile-time barriers
        std::atomic_signal_fence(std::memory_order_seq_cst);    // compile-time barrier
        std::atomic_fetch_add_explicit(&a1, 1, std::memory_order_relaxed);
        std::atomic_signal_fence(std::memory_order_seq_cst);
    }
}

Compile the above with
arm-none-eabi-g++ -I../../../STM32CubeG4/Drivers/CMSIS/Core/Include -I../../../STM32CubeG4/Drivers/CMSIS/Device/ST/STM32G4xx/Include -I../../../STM32CubeG4/Drivers/STM32G4xx_HAL_Driver/Inc -DSTM32G431xx -O3 -std=c++23 -fno-exceptions -fno-unwind-tables -fno-rtti -fno-threadsafe-statics -funsigned-char -funsigned-bitfields -fshort-enums -ffunction-sections -fdata-sections -fconcepts -ftemplate-depth=2048 -fstrict-aliasing -Wstrict-aliasing=1 -Wall -Wextra -I. -mthumb -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 -mfloat-abi=hard -fverbose-asm -Wa,-adhln -S -o test99.s test.cc

Answer 1

Score: 1

Let's go through history...

> Version 2: Do __disable_irq()/__enable_irq() contain a compile-time barrier? So, are the explicit barriers unnecessary? Would it be better (performance) to disable only the IRQ that could modify the variable a2?

On a single core, disabling interrupts works. However, it increases interrupt latency globally. This can be hard to test, because an interrupt has to arrive exactly on these lines. Missing an interrupt in a pacemaker would be bad, and you have created a latency that is hard to reproduce. Moreover, ALL interrupts pay the price, even those that never access the variable. For this one case it is fine, but you may end up with 80-90 such regions in the code, so it does not scale either. This was typically the only way to do things before the CPU designs of the early 2000s.
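If the interrupt-masking approach is used in many places, one common way to keep it manageable is an RAII guard that restores the previous mask state instead of unconditionally re-enabling interrupts. The following is only a sketch: `MockIrqPolicy`, `CriticalSection` and `guarded_increment` are hypothetical names, and the mock policy stands in for the hardware so the code can be compiled off-target. On the Cortex-M4, a policy would presumably wrap the CMSIS calls `__get_PRIMASK()`, `__disable_irq()` and `__set_PRIMASK()`; whether those also act as compiler barriers depends on your CMSIS header (look for a "memory" clobber in the inline assembly), so check before dropping explicit fences.

```cpp
#include <cstdint>

// Hypothetical host-side stand-in for the interrupt mask, so the sketch can be
// compiled and exercised off-target. A real Cortex-M policy would wrap the
// CMSIS intrinsics instead of touching this variable.
struct MockIrqPolicy {
    static inline std::uint32_t primask = 0;    // 0 = interrupts enabled
    static std::uint32_t save_and_disable() {
        const std::uint32_t previous = primask;
        primask = 1;                             // mask interrupts
        return previous;
    }
    static void restore(std::uint32_t previous) { primask = previous; }
};

// RAII guard: masks interrupts for the lifetime of the object and restores
// the previous mask state, so nested guards do not re-enable interrupts early.
template <typename IrqPolicy>
class CriticalSection {
public:
    CriticalSection() : saved_(IrqPolicy::save_and_disable()) {}
    ~CriticalSection() { IrqPolicy::restore(saved_); }
    CriticalSection(const CriticalSection&) = delete;
    CriticalSection& operator=(const CriticalSection&) = delete;
private:
    std::uint32_t saved_;
};

// Example use: a plain (non-atomic) counter protected by the guard.
inline std::uint32_t guarded_increment(std::uint32_t& counter) {
    CriticalSection<MockIrqPolicy> lock;
    return ++counter;
}
```

The guard does not remove the latency cost described above; it only makes each masked region short, visible and safe to nest.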

Versions 1, 3, and 4 are all variations on the same theme. Here only the code involved with the atomics sees any delay, which is more scalable. Uninvolved interrupts are not delayed.
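The retry loop of Version 1 also has a portable counterpart in `std::atomic`: a `compare_exchange_weak` loop. The sketch below uses a saturating increment as the example, since that is a read-modify-write that no single `fetch_*` call expresses. The function name `saturating_increment` and the choice of relaxed ordering are assumptions for illustration; relaxed is assumed to be enough when the only other party is an ISR on the same core. I would expect GCC for ARMv7-M to lower this to an ldrex/strex loop, but inspect the generated assembly rather than take that on trust.

```cpp
#include <atomic>
#include <cstdint>

// Saturating increment: add 1 unless the counter has already reached `limit`.
// Returns the value that was stored.
inline std::uint32_t saturating_increment(std::atomic<std::uint32_t>& counter,
                                          std::uint32_t limit) {
    std::uint32_t expected = counter.load(std::memory_order_relaxed);
    std::uint32_t desired;
    do {
        desired = (expected < limit) ? expected + 1 : limit;
        // On failure, compare_exchange_weak reloads `expected` with the
        // current value, and the loop recomputes `desired` from it.
    } while (!counter.compare_exchange_weak(expected, desired,
                                            std::memory_order_relaxed,
                                            std::memory_order_relaxed));
    return desired;
}
```

As with the ldrex/strex loop, an interruption between the load and the store only costs one more trip around the loop; nothing else in the system is delayed.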

Part of the issue is a1 versus a2. It may take multiple cycles to change a memory element. For instance, with a 16-bit bus you need to write the low half and then the high half. The C/C++ API is geared to handle this case, but on ARM CPUs an aligned 32-bit load or store is atomic (with respect to other accesses on the same core). This might not carry over to a 16-bit CPU.
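Whether a given type is handled without a lock can be checked instead of assumed. A minimal sketch, where `atomic_is_lock_free` is a hypothetical helper name: the `static_assert` fails the build on any target where a 32-bit atomic would need a lock, which matters because a lock-based atomic shared with an ISR could deadlock.

```cpp
#include <atomic>
#include <cstdint>

// Compile-time check: refuse to build if a 32-bit atomic is not lock-free
// on the target this code is compiled for.
static_assert(std::atomic<std::uint32_t>::is_always_lock_free,
              "32-bit atomics must be lock-free to be shared with an ISR");

// Hypothetical helper: report at run time whether std::atomic<T> is
// lock-free on the current target.
template <typename T>
bool atomic_is_lock_free() {
    std::atomic<T> probe{};
    return probe.is_lock_free();
}
```

The same check with a wider type, for example `std::uint64_t`, is a quick way to find out which types need a different strategy on a particular core.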

You may be right that Version 3 performs an unneeded dmb, but the code is concise, a future version of gcc/g++ could improve the code generation, and there may be Cortex-M4 parts with MPUs and/or cache structures where the barrier is relevant. Beyond the single core, also think about peripherals with DMA engines. If you have made the choice to use C++, std::atomic_fetch_add() is a gain.
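To make the difference between Versions 3 and 4 concrete, here are both wrapped as functions. This is a sketch with hypothetical function names. Both perform the same atomic increment and return the previous value; they differ only in the ordering they request. `std::atomic_signal_fence` constrains compiler reordering and emits no instruction, while the default seq_cst ordering is what led to the dmb instructions seen in the question. Treating an ISR like a signal handler on the same thread is an assumption here, not something the C++ standard states for bare-metal interrupts.

```cpp
#include <atomic>
#include <cstdint>

// Version 3 style: default sequentially consistent ordering.
inline std::uint32_t increment_seq_cst(std::atomic<std::uint32_t>& counter) {
    return std::atomic_fetch_add(&counter, std::uint32_t{1});
}

// Version 4 style: relaxed RMW between compiler-only fences.
inline std::uint32_t increment_signal_fenced(std::atomic<std::uint32_t>& counter) {
    std::atomic_signal_fence(std::memory_order_seq_cst);   // compiler barrier only
    const std::uint32_t previous = std::atomic_fetch_add_explicit(
        &counter, std::uint32_t{1}, std::memory_order_relaxed);
    std::atomic_signal_fence(std::memory_order_seq_cst);   // compiler barrier only
    return previous;
}
```

If the counter is ever shared with a DMA engine or a second core, only the first form keeps its guarantees, which is the portability concern raised above.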

The simple analysis you have done does not take more complex code and advanced optimization into account. The compiler can inline, duplicate code and perform code motion. All of these can create new situations that a simple test case of the generated assembly will not expose. The micro-optimizations of Versions 1 and 4 may pay off in this limited test. In real code, however, there are multiple accesses to manage, and the compiler (with modern register allocation, SSA and data-flow analysis) is very good at optimizing these. Looking at a limited access pattern does not take those other optimizations into account. Given this, I would stick to std::atomic_fetch_add(), as the tool makers have a better understanding of these things. The general caveat against premature optimization applies as well.

The other caveat is that your code might be ported to some other ARM CPU or to a completely different architecture. Do these micro-optimizations hold in those cases? Sorry, I don't know. And no one does, because you cannot foresee future CPU designs.

Posted by huangapple on August 8, 2023 at 21:24:26
Original link: https://go.coder-hub.com/76860001.html