February 6, 2023 20:45:56

Moving data into __uint24 with assembly

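Question

I originally had the following C code:
```C
volatile register uint16_t counter asm("r12");

__uint24 getCounter() {
  __uint24 res = counter;
  res = (res << 8) | TCNT0;
  return res;
}
```

This function runs in some hot places and is inlined, and I'm trying to cram a lot of stuff into an ATtiny13, so it came time to optimize it.

That function compiles to: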

getCounter:
        movw r24,r12
        ldi r26,0
        clr r22
        mov r23,r24
        mov r24,r25
        in r25,0x32
        or r22,r25
        ret

I came up with this assembly:

inline __uint24 getCounter() {
  //__uint24 res = counter;
  //res = (res << 8) | TCNT0;

  uint32_t result;
  asm(
    "in %A[result],0x32" "\n\t"
    "movw %C[result],%[counter]" "\n\t"
    "mov %B[result],%C[result]" "\n\t"
    "mov %C[result],%D[result]" "\n\t"
    : [result] "=r" (result)
    : [counter] "r" (counter)
    :
  );
  return (__uint24) result;
}

The reason for uint32_t is to "allocate" the fourth consecutive register and for the compiler to understand that it is clobbered (since I cannot write something like "%D[result]" in the clobber list).

Is my assembly correct? From my testing it seems like it is.
Is there a way to allow the compiler to optimize getCounter() better, so there's no need for confusing assembly?
Is there a better way to do this in assembly?

EDIT: The whole idea with the movw is to keep the read atomic, since the counter variable is incremented inside of an interrupt.


<details>
<summary>English:</summary>

I originally had the following C code:
```C
volatile register uint16_t counter asm("r12");

__uint24 getCounter() {
  __uint24 res = counter;
  res = (res << 8) | TCNT0;
  return res;
}
```

This function runs in some hot places and is inlined, and I'm trying to cram a lot of stuff into an ATtiny13, so it came time to optimize it.

That function compiles to:

getCounter:
        movw r24,r12
        ldi r26,0
        clr r22
        mov r23,r24
        mov r24,r25
        in r25,0x32
        or r22,r25
        ret

I came up with this assembly:

inline __uint24 getCounter() {
  //__uint24 res = counter;
  //res = (res << 8) | TCNT0;

  uint32_t result;
  asm(
    "in %A[result],0x32" "\n\t"
    "movw %C[result],%[counter]" "\n\t"
    "mov %B[result],%C[result]" "\n\t"
    "mov %C[result],%D[result]" "\n\t"
    : [result] "=r" (result)
    : [counter] "r" (counter)
    :
  );
  return (__uint24) result;
}

The reason for uint32_t is to "allocate" the fourth consecutive register and for the compiler to understand that it is clobbered (since I cannot write something like "%D[result]" in the clobber list).

Is my assembly correct? From my testing it seems like it is.
Is there a way to allow the compiler to optimize getCounter() better, so there's no need for confusing assembly?
Is there a better way to do this in assembly?

EDIT: The whole idea with the movw is to keep the read atomic, since the counter variable is incremented inside of an interrupt.

Answer 1

Score: 3

As it seems from my experiments in [GodBolt](https://godbolt.org/z/h3nT4or97), even with the `-O3` flag the avr-gcc optimizer is just not sophisticated enough. I doubt there are any other flags that can trick it into optimizing this specific code more (I tried some but none helped).

But there is an alternative way to write the same code using a `union`, and in that case the compiler optimizes the assembly better. Thus, there is no need to resort to inline assembly.

Original code analysis
----------------------
1. The `counter` variable is stored in the `r12` (LSB) and `r13` (MSB) registers.
2. `TCNT0` is read from I/O space address 0x32 (by the `in Rd, 0x32` instruction).
3. According to the [avr-gcc ABI](https://gcc.gnu.org/wiki/avr-gcc), the 24-bit value is returned in `r22(LSB):r23:r24(MSB)`.
4. So to summarize, we want the following transfer to occur:

r24 <-- r13
r23 <-- r12
r22 <-- TCNT0


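# Even newer solution (no inline assembly!)
Looking into the code, I guess you have some kind of timer interrupt incrementing `counter` when the timer reaches some upper threshold. If that's the case, the code has a deeper problem, even in the pure C version. The important part is that the **read of both `TCNT0` and `counter` should be atomic together as a single unit**! Otherwise, if the interrupt occurs between the `movw` and `in` instructions, your result will be inaccurate. Example of a scenario demonstrating the bug:

counter = 0x0010, TCNT0 = 0xff
MOVW copies 0x0010
Interrupt occurs => handler sets counter = 0x0011 and TCNT0 = 0
IN instruction reads TCNT0 = 0
result = 0x0010_00 (instead of expected 0x0010_ff)


There are two ways to solve this:
1. Wrap `CLI / SEI` around the two reads so that they execute together, with no interrupt possible in between.
2. Read `TCNT0` twice, before and after reading the counter. If the second read gives a smaller result, an interrupt occurred in between and we can't trust the values, so repeat the whole read.

Thus, a correct solution without the bug might look like this (add an inline specifier to the function as needed):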

__uint24 getCounter() {
  union
  {
    __uint24 result;

    struct {
      uint8_t lo;
      uint16_t hi;
    } parts;
  } u;

  __builtin_avr_cli();
  u.parts.hi = counter;
  u.parts.lo = TCNT0;
  __builtin_avr_sei();

  return u.result;
}

Producing:

getCounter:
cli
mov r23,r12
mov r24,r13
in r22,0x32
sei
ret


**Godbolt:** https://godbolt.org/z/YrWrT8sT4

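# Newer solution (less assembly, partial atomicity)
With the atomicity requirement added, we must use the `movw` instruction. Here is a version that minimizes the amount of inline assembly and uses as much C as possible:

__uint24 getCounter() {
  union
  {
    __uint24 result;

    struct {
      uint8_t lo;
      uint16_t hi;
    } parts;
  } u;

  uint16_t tmp;

  // Ensure the counter is read atomically with the movw instruction.
  // %A selects the low-byte register of the 16-bit tmp.
  asm("movw %A[tmp],%[counter] \n\t" : [tmp] "=r" (tmp) : [counter] "r" (counter));

  u.parts.hi = tmp;
  u.parts.lo = TCNT0;

  return u.result;
}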

**Godbolt:** https://godbolt.org/z/P9a9K6n76

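# Old solution (without atomicity)

Question author's assembly analysis
-----------------------------------
It looks correct and provides the right results. However, there are two things I can suggest to improve:
1. It has 3 `mov` instructions, taking 3 cycles to execute. gcc generated similar code because `movw` operates only on evenly aligned registers. But you can replace these with just 2 `mov` instructions, which also removes the need for the larger `uint32_t` variable.
2. I would avoid hardcoding the `TCNT0` address for better code portability.

Suggested assembly
------------------
So here is a slightly modified version of your code:

inline __uint24 getCounter() {
  __uint24 result;
  asm(
    "in %A[result], %[tcnt0]" "\n\t"
    "mov %B[result], %A[counter]" "\n\t"
    "mov %C[result], %B[counter]" "\n\t"
    : [result] "=r" (result)
    : [counter] "r" (counter)
    , [tcnt0] "I" (_SFR_IO_ADDR(TCNT0))
  );
  return result;
}

However, note a downside of this solution: we lose atomicity when reading the counter. If an interrupt occurs between the two `mov` instructions and `counter` is modified inside the interrupt, we might get incorrect results. But if `counter` is never modified by interrupts, I would prefer using the two separate `mov` instructions for performance benefits.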

<details>
<summary>英文:</summary>

As it seems from my experiments in [GodBolt](https://godbolt.org/z/h3nT4or97), even with the `-O3` flag the avr-gcc optimizer is just not sophisticated enough. I doubt there are any other flags that can trick it into optimizing this specific code more (I tried some but none helped).

But there is an alternative way to write the same code using a `union`, and in that case the compiler optimizes the assembly better. Thus, there is no need to resort to inline assembly.

Original code analysis
----------------------
1. The `counter` variable is stored in `r12` (LSB) and `r13` (MSB) registers.
2. `TCNT0` is read from I/O space address 0x32 (by `in Rd, 0x32` instruction). 
3. According to the [avr-gcc ABI](https://gcc.gnu.org/wiki/avr-gcc), the 24-bit value is returned in `r22(LSB):r23:r24(MSB)`.
4. So to summarize, we want the following transfer to occur:

r24 <-- r13
r23 <-- r12
r22 <-- TCNT0
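
The byte transfer listed above can be tried out on a desktop compiler. The following is only a sketch under stated assumptions: `__uint24` is avr-gcc specific, so a `uint32_t` stands in for it, and the helper names `pack_shift`, `pack_bytes` and `pack_self_check` are hypothetical. The union variant assumes a little-endian host, which matches avr-gcc.

```c
#include <assert.h>
#include <stdint.h>

/* Shift-based packing, as in the original C code: widen, shift, OR. */
static uint32_t pack_shift(uint16_t counter, uint8_t tcnt0)
{
    return ((uint32_t)counter << 8) | tcnt0;
}

/* Byte-placement packing that mirrors the register transfer.
   Assumption: little-endian byte order, as on avr-gcc. */
static uint32_t pack_bytes(uint16_t counter, uint8_t tcnt0)
{
    union {
        uint32_t value;
        uint8_t  byte[4];
    } u = { 0 };

    u.byte[0] = tcnt0;                      /* r22 <-- TCNT0 */
    u.byte[1] = (uint8_t)(counter & 0xff);  /* r23 <-- r12   */
    u.byte[2] = (uint8_t)(counter >> 8);    /* r24 <-- r13   */
    return u.value;
}

/* Usage example: both variants agree on the packed value. */
static void pack_self_check(void)
{
    assert(pack_shift(0x1234, 0x56) == 0x123456UL);
    assert(pack_bytes(0x1234, 0x56) == pack_shift(0x1234, 0x56));
}
```

The byte-placement form describes plain moves with no arithmetic, which is the shape of code the union-based solutions in this answer aim for.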


# Even newer solution (no inline assembly!)
Looking into the code, I guess you have some kind of timer interrupt incrementing `counter` when the timer reaches some upper threshold. If that's the case, the code has a deeper problem, even in the pure C version. The important part is that the **read of both `TCNT0` and `counter` should be atomic together as a single unit**! Otherwise, if the interrupt occurs between the `movw` and `in` instructions, your result will be inaccurate. Example of a scenario demonstrating the bug:

counter = 0x0010, TCNT0 = 0xff
MOVW copies 0x0010
Interrupt occurs => handler sets counter = 0x0011 and TCNT0 = 0
IN instruction reads TCNT0 = 0
result = 0x0010_00 (instead of expected 0x0010_ff)
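
The scenario above can be replayed on a host machine. This is a simulation sketch, not AVR code: `sim_counter`, `sim_tcnt0`, `sim_overflow_isr` and the other names are hypothetical stand-ins, and the "interrupt" is an ordinary function call injected between the two reads.

```c
#include <assert.h>
#include <stdint.h>

static uint16_t sim_counter;  /* stands in for `counter` in r12:r13 */
static uint8_t  sim_tcnt0;    /* stands in for TCNT0 */

/* Simulated overflow handler, following the scenario above. */
static void sim_overflow_isr(void)
{
    sim_counter++;
    sim_tcnt0 = 0;
}

/* Unprotected read; `interrupt_between` injects the handler
   between the two reads. */
static uint32_t read_unprotected(int interrupt_between)
{
    uint16_t hi = sim_counter;      /* MOVW copies counter */
    if (interrupt_between)
        sim_overflow_isr();
    uint8_t lo = sim_tcnt0;         /* IN reads TCNT0 */
    return ((uint32_t)hi << 8) | lo;
}

/* Sets up counter = 0x0010, TCNT0 = 0xff and performs one read. */
static uint32_t run_scenario(int interrupt_between)
{
    sim_counter = 0x0010;
    sim_tcnt0 = 0xff;
    return read_unprotected(interrupt_between);
}

/* Usage example: the interrupted read goes backwards in time. */
static void scenario_self_check(void)
{
    assert(run_scenario(0) == 0x0010ffUL);
    assert(run_scenario(1) == 0x001000UL);
}
```

The interrupted read returns 0x001000, which is smaller than the 0x0010ff an uninterrupted read would have produced a moment earlier, so timestamps can appear to run backwards.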


There are two ways to solve this:
1. Wrap `CLI / SEI` around the two reads so that they execute together, with no interrupt possible in between.
2. Read `TCNT0` twice, before and after reading the counter. If the second read gives a smaller result, an interrupt occurred in between and we can't trust the values, so repeat the whole read.

Thus, a correct solution without the bug might look like this (add an inline specifier to the function as needed):

__uint24 getCounter() {
  union
  {
    __uint24 result;

    struct {
      uint8_t lo;
      uint16_t hi;
    } parts;
  } u;

  __builtin_avr_cli();
  u.parts.hi = counter;
  u.parts.lo = TCNT0;
  __builtin_avr_sei();

  return u.result;
}

Producing:

getCounter:
cli
mov r23,r12
mov r24,r13
in r22,0x32
sei
ret


**Godbolt:** https://godbolt.org/z/YrWrT8sT4
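
The second approach from the list (reading `TCNT0` twice and retrying) is not shown in this answer. A minimal host-side sketch might look like the following. It is a simulation under assumptions: `sim_read_tcnt0` is a hypothetical stand-in for reading `TCNT0` that advances a simulated timer on every read and bumps the counter on wrap-around, as the overflow interrupt would. It also assumes the loop body runs much faster than one full timer period.

```c
#include <assert.h>
#include <stdint.h>

static uint16_t sim_counter = 0x0010;  /* stands in for `counter` */
static uint8_t  sim_timer   = 0xff;    /* stands in for TCNT0 */

/* Each simulated read returns the current tick and advances the timer;
   on wrap-around the "interrupt" increments the counter. */
static uint8_t sim_read_tcnt0(void)
{
    uint8_t now = sim_timer++;
    if (sim_timer == 0)
        sim_counter++;
    return now;
}

/* Read the timer before and after the counter; if the timer value
   went down, an overflow happened in between, so try again. */
static uint32_t read_with_retry(void)
{
    uint8_t before, after;
    uint16_t hi;

    do {
        before = sim_read_tcnt0();
        hi = sim_counter;
        after = sim_read_tcnt0();
    } while (after < before);

    return ((uint32_t)hi << 8) | after;
}

/* Usage example: the first attempt straddles the overflow and is
   discarded; the second attempt is consistent. */
static void retry_self_check(void)
{
    sim_counter = 0x0010;
    sim_timer = 0xff;
    assert(read_with_retry() == 0x001102UL);
}
```

Compared with `CLI / SEI`, this variant never disables interrupts, at the cost of a compare, a branch and an occasional second pass.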

# Newer solution (less assembly, partial atomicity)
With the atomicity requirement added, we must use the `movw` instruction. Here is a version that minimizes the amount of inline assembly and uses as much C as possible:

__uint24 getCounter() {
  union
  {
    __uint24 result;

    struct {
      uint8_t lo;
      uint16_t hi;
    } parts;
  } u;

  uint16_t tmp;

  // Ensure the counter is read atomically with the movw instruction.
  // %A selects the low-byte register of the 16-bit tmp.
  asm("movw %A[tmp],%[counter] \n\t" : [tmp] "=r" (tmp) : [counter] "r" (counter));

  u.parts.hi = tmp;
  u.parts.lo = TCNT0;

  return u.result;
}

**Godbolt:** https://godbolt.org/z/P9a9K6n76

# Old solution (without atomicity)

Question author's assembly analysis
-----------------------------------
It looks correct and provides the right results. However, there are two things I can suggest to improve:
1. It has 3 `mov` instructions, taking 3 cycles to execute. gcc generated similar code because `movw` operates only on evenly aligned registers. But you can replace these with just 2 `mov` instructions and it will also remove the need for the larger `uint32_t` variable.
2. I would avoid hardcoding `TCNT0` address for better code portability.

Suggested assembly
------------------
So here is a slightly modified version of your code:

inline __uint24 getCounter() {
  __uint24 result;
  asm(
    "in %A[result], %[tcnt0]" "\n\t"
    "mov %B[result], %A[counter]" "\n\t"
    "mov %C[result], %B[counter]" "\n\t"
    : [result] "=r" (result)
    : [counter] "r" (counter)
    , [tcnt0] "I" (_SFR_IO_ADDR(TCNT0))
  );
  return result;
}

However, note a downside of this solution: we lose atomicity when reading the counter. If an interrupt occurs between the two `mov` instructions and `counter` is modified inside the interrupt, we might get incorrect results. But if `counter` is never modified by interrupts, I would prefer using the two separate `mov` instructions for performance benefits.

**Godbolt:** https://godbolt.org/z/h3nT4or97
(I removed `inline` keywords to show the generated assembly)
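
To make the lost atomicity concrete, here is a host-side simulation sketch of a torn read. All names (`sim_counter`, `sim_isr`, `read_bytewise`, `read_wordwise`) are hypothetical, and the "interrupt" is a plain function call placed where a real one could strike.

```c
#include <assert.h>
#include <stdint.h>

static uint16_t sim_counter;  /* stands in for `counter` */

/* Simulated interrupt handler that increments the counter. */
static void sim_isr(void)
{
    sim_counter++;
}

/* Two single-byte copies (like two MOVs); the handler may run
   between them. */
static uint16_t read_bytewise(int interrupt_between)
{
    uint8_t lo = (uint8_t)(sim_counter & 0xff);
    if (interrupt_between)
        sim_isr();
    uint8_t hi = (uint8_t)(sim_counter >> 8);
    return (uint16_t)((hi << 8) | lo);
}

/* One 16-bit copy (like MOVW); the handler can only run before
   or after it. */
static uint16_t read_wordwise(int interrupt_after)
{
    uint16_t value = sim_counter;
    if (interrupt_after)
        sim_isr();
    return value;
}

/* Usage example: 0x00ff -> 0x0100 is the worst case for a torn read. */
static void torn_read_self_check(void)
{
    sim_counter = 0x00ff;
    assert(read_bytewise(1) == 0x01ff);  /* neither 0x00ff nor 0x0100 */

    sim_counter = 0x00ff;
    assert(read_wordwise(1) == 0x00ff);  /* consistent old value */
}
```

The byte-wise read yields 0x01ff, a value the counter never held, while the word-wise read returns one of the two valid values.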

</details>



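# Answer 2
**Score**: 3

You will read the value of counter in R13:R12, so you need two MOV's and one IN to read TCNT0. So a working version using inline assembly is: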

```c++
#include <avr/io.h>

register uint16_t counter asm("r12");

static inline __attribute__((__always_inline__))
__uint24 getCounter (void)
{
    __uint24 result;

    __asm ("mov %B0, %A1" "\n\t"
           "mov %C0, %B1"
           : "=r" (result)
           : "r" (counter), "0" (TCNT0));

    return result;
}
```

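Some notes on that solution:

Maximal inlining is achieved with static inline and always_inline.
TCNT0 is read in the C/C++ code, not in the assembly, so the compiler can choose the best instruction to read that SFR (IN or LDS depending on the architecture). It's also more convenient as there's no need for the _SFR_IO_ADDR gobbledegook from AVR-LibC.
GCC will allocate the register which reads TCNT0 to the same register as result. As the avr-gcc ABI is little endian, it will be allocated to the LSB of result. This is all fine with GCC inline assembly, even though TCNT0 and result have incompatible types.
Global register variables like counter can't be volatile, and GCC will warn: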
```
warning: optimization may eliminate reads and/or writes to register variables [-Wvolatile-register-var]
volatile register uint16_t counter asm("r12");
^~~~~~~~
```
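The reason is historical: the internal representation of REG doesn't even have a volatile property. So you might want to rethink your code. For example, a loop like while (counter != 0) ... might not do what you are expecting.
Using global register variables like counter comes with some caveats: for every module / compilation unit, the compiler must know that it must not allocate variables to some registers that would otherwise be freely available. Hence, you can include the declaration of counter in each and every module, including the ones that don't even use counter. Or better still, compile all modules with -ffixed-12 -ffixed-13. To reduce interference with the calling convention, better use R2 instead of R12. Notice that R12 might be used to pass parameters, and code from libc / libgcc might also use R12, because there's no way for these libs to know that R12 (or R2 for that matter) is forbidden.

An example that uses the code above and shows the generated assembly is to compile the following code with -Os -save-temps.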

void f (int, __int24);

int main (void)
{
    f (0, getCounter() /* in R22:R20 */);
}

The .s file will read:

main:
	in r20,0x32
/* #APP */
	mov r21, r12
	mov r22, r13
/* #NOAPP */
	ldi r25,0
	ldi r24,0
	rcall f
...

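Reading `counter` atomically

As mentioned in a comment, counter should be read atomically. Using movw takes 1 tick and is thus faster than a cli / sei sequence. A 24-bit variable is enough, though I am not sure whether one register less in register pressure would even make a difference. Anyway, here is a solution with movw. The SFR is read in the assembly, so the asm becomes volatile: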

static inline __attribute__((__always_inline__))
__uint24 getCounter (void)
{
    __uint24 result;
    __asm volatile ("movw %A0, %A1" "\n\t" // Atomic read of counter.
                    "mov  %C0, %B0" "\n\t"
                    "mov  %B0, %A0" "\n\t"
                    "in   %A0, %i2"
                    : "=r" (result)
                    : "r" (counter), "n" (&TCNT0));
    return result;
}

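Notice that the inline assembly operand print modifier i was introduced in v4.7, which is the same version that brought __uint24, so no head scratching about %i.

English: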

You will read the value of counter in R13:R12, so you need two MOV's and one IN to read TCNT0. So a working version using inline assembly is:

#include <avr/io.h>

register uint16_t counter asm("r12");

static inline __attribute__((__always_inline__))
__uint24 getCounter (void)
{
    __uint24 result;

    __asm ("mov %B0, %A1" "\n\t"
           "mov %C0, %B1"
           : "=r" (result)
           : "r" (counter), "0" (TCNT0));

    return result;
}

Some notes on that solution:

Maximal inlining is achieved with static inline and always_inline.
TCNT0 is read in the C/C++ code, not in the assembly, so the compiler can choose the best instruction to read that SFR (IN or LDS depending on the architecture). It's also more convenient as there's no need for the _SFR_IO_ADDR gobbledegook from AVR-LibC.
GCC will allocate the register which reads TCNT0 to the same register as result. As the avr-gcc ABI is little endian, it will be allocated to the LSB of result. This is all fine with GCC inline assembly, even though TCNT0 and result have incompatible types.
Global register variables like counter can't be volatile, and GCC will warn:
```
warning: optimization may eliminate reads and/or writes to register variables [-Wvolatile-register-var]
volatile register uint16_t counter asm("r12");
^~~~~~~~
```
The reason is historical: the internal representation of REG doesn't even have a volatile property. So you might want to rethink your code. For example, a loop like while (counter != 0) ... might not do what you are expecting.
Using global register variables like counter comes with some caveats: for every module / compilation unit, the compiler must know that it must not allocate variables to some registers that would otherwise be freely available. Hence, you can include the declaration of counter in each and every module, including the ones that don't even use counter. Or better still, compile all modules with -ffixed-12 -ffixed-13. To reduce interference with the calling convention, better use R2 instead of R12. Notice that R12 might be used to pass parameters, and code from libc / libgcc might also use R12, because there's no way for these libs to know that R12 (or R2 for that matter) is forbidden.

An example that uses the code above and shows generated assembly, is to compile the following code with -Os -save-temps.

void f (int, __int24);

int main (void)
{
    f (0, getCounter() /* in R22:R20 */);
}

*.s will read:

main:
	in r20,0x32
/* #APP */
	mov r21, r12
	mov r22, r13
/* #NOAPP */
	ldi r25,0
	ldi r24,0
	rcall f
...

Reading `counter` atomically

As mentioned in a comment, counter should be read atomically. Using movw takes 1 tick and is thus faster than a cli / sei sequence. A 24-bit variable is enough, though I am not sure whether one register less in register pressure would even make a difference. Anyway, here is a solution with movw. The SFR is read in the assembly, so the asm becomes volatile:

static inline __attribute__((__always_inline__))
__uint24 getCounter (void)
{
    __uint24 result;
    __asm volatile ("movw %A0, %A1" "\n\t" // Atomic read of counter.
                    "mov  %C0, %B0" "\n\t"
                    "mov  %B0, %A0" "\n\t"
                    "in   %A0, %i2"
                    : "=r" (result)
                    : "r" (counter), "n" (&TCNT0));
    return result;
}

Notice that inline assembly operand print modifier i was introduced in v4.7 which is the same version that brought __uint24; so no head scratching about %i.


</details>
