How can I elegantly take advantage of ARM instructions like REV and RBIT when writing C code?
Question
I am writing C code which may be compiled for the Arm Cortex-M3 microcontroller.
This microcontroller supports several useful instructions for efficiently manipulating bits in registers, including REV*, RBIT, SXT*.
When writing C code, how can I take advantage of these instructions if I need those specific functions? For example, how can I complete this code?
#define REVERSE_BIT_ORDER(x) { /* what to write here? */ }
I would like to do this without using inline assembler, so that the code is both portable and readable.
Added:
In part, I am asking how to express such a function in C elegantly. For example, it's easy to express bit shifting in C, because it's built into the language. Likewise, setting or clearing bits. But bit reversal is unknown in C, and so is very hard to express. For example, this is how I would reverse bits:
    unsigned int ReverseBits(unsigned int x)
    {
        unsigned int ret = 0;
        for (int i = 0; i < 32; i++)
        {
            ret <<= 1;
            if (x & (1u << i))   // 1u: shifting a signed 1 by 31 is undefined behaviour
                ret |= 1;
        }
        return ret;
    }
Would the compiler recognise this as bit reversal, and issue the correct instruction?
Answer 1
Score: 4
Reversing the bits of a 32-bit integer is a fairly exotic operation, which might be why you can't get the compiler to emit the instruction for it. I was, however, able to generate code that uses REV (reverse byte order), which is a far more common use case:
    #include <stdint.h>

    uint32_t endianize (uint32_t input)
    {
        return ((input >> 24) & 0x000000FF) |
               ((input >>  8) & 0x0000FF00) |
               ((input <<  8) & 0x00FF0000) |
               ((input << 24) & 0xFF000000);
    }
With gcc -O3 -mcpu=cortex-m3 -ffreestanding (for ARM32, vers 11.2.1 "none"):

    endianize:
        rev     r0, r0
        bx      lr
https://godbolt.org/z/odGqzjTGz
It works with clang armv7-a 15.0.0 too, as long as you use -mcpu=cortex-m3.
So this supports the idea of avoiding manual optimizations and letting the compiler worry about them.
Answer 2
Score: 1
It would be best to use the CMSIS intrinsics: __REV, __REV16, etc. The CMSIS header files contain much more than these.
You can get them from here:
https://github.com/ARM-software/CMSIS_5
The file you are looking for is cmsis_gcc.h (or a similar file if you use another compiler).
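As a rough sketch of how this could be kept portable, the CMSIS intrinsics can be wrapped behind small functions with a plain C fallback. The macro PROJECT_HAS_CMSIS and the header name "device.h" are placeholders invented for this example (they are not part of CMSIS); __RBIT and __REV are the CMSIS intrinsics themselves.

```c
#include <stdint.h>

/* PROJECT_HAS_CMSIS is a hypothetical macro for this sketch: define it in
   builds where a CMSIS device header is available. "device.h" is a
   placeholder for whatever header your vendor provides. */
#ifdef PROJECT_HAS_CMSIS
#include "device.h"
#endif

/* Reverse the bit order of a 32-bit value. */
static inline uint32_t reverse_bits32(uint32_t x)
{
#ifdef PROJECT_HAS_CMSIS
    return __RBIT(x);                       /* CMSIS intrinsic */
#else
    uint32_t result = 0;
    for (int i = 0; i < 32; i++) {
        result = (result << 1) | (x & 1u);  /* move the lowest bit of x into result */
        x >>= 1;
    }
    return result;
#endif
}

/* Reverse the byte order of a 32-bit value. */
static inline uint32_t reverse_bytes32(uint32_t x)
{
#ifdef PROJECT_HAS_CMSIS
    return __REV(x);                        /* CMSIS intrinsic */
#else
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
#endif
}
```

Callers then use reverse_bits32() and reverse_bytes32() everywhere, and only this one header needs to know whether CMSIS is present.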
Answer 3
Score: 1
@Lundin's answer shows a pure-C shift/mask bithack that clang recognizes and compiles to a single rev instruction. (Or presumably to x86 bswap if targeting x86, or equivalent instructions on other ISAs that have them.)
In portable ISO C, hoping for pattern-recognition is unfortunately the best you can do, because the standard hasn't added portable ways to expose CPU functionality; even C++ took until C++20 to add the <bit> header for things like std::popcount, and until C++23 for std::byteswap.
(Some fairly-portable C libraries / headers have byte-reversal, e.g. as part of networking there's ntohl (net-to-host), which is an endian-swap on little-endian machines. Or there's endian.h (provided by glibc and some other C libraries rather than by GCC itself), with htobe32 being host to big-endian 32-bit; see its man page. These are usually implemented with intrinsics that compile to a single instruction in good-quality implementations.
Of course, if you definitely want a byte swap regardless of host endianness, you could do htole32(be32toh(x)), because one of them is a no-op and the other is a byte-swap, since ARM is either big or little endian. It's still a byte-swap even if neither of them is a no-op, even on PDP or other mixed-endian machines, but there might be more efficient ways to do it.)
There are also some "collections of useful functions" headers with intrinsics for different compilers, with functions like byte swap. These can be of varying quality in terms of efficiency and maybe even correctness.
You can see that no, neither GCC nor clang optimizes your code to rbit for ARM or AArch64: https://godbolt.org/z/Y7noP61dE . Presumably looping over bits in the other direction isn't any better. Perhaps a bithack such as those in https://stackoverflow.com/questions/2602823/in-c-c-whats-the-simplest-way-to-reverse-the-order-of-bits-in-a-byte or https://stackoverflow.com/questions/746171/efficient-algorithm-for-bit-reversal-from-msb-lsb-to-lsb-msb-in-c would be recognized.
GCC and clang recognize the standard bithack for popcount, but I didn't check any of the answers on the bit-reverse questions.
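For reference, the usual swap-based bithack for a 32-bit bit reversal looks roughly like the sketch below. The function name is made up for this example, and whether any particular compiler turns it into rbit is something you'd have to check yourself, since it wasn't tested here.

```c
#include <stdint.h>

/* Reverse the bits of x by swapping progressively larger groups:
   single bits, bit pairs, nibbles, bytes, then halfwords. */
uint32_t reverse_bits_bithack(uint32_t x)
{
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);  /* adjacent bits */
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);  /* bit pairs */
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);  /* nibbles */
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);  /* bytes */
    x = (x >> 16) | (x << 16);                                /* halfwords */
    return x;
}
```

Even without pattern recognition this is branch-free and takes a fixed number of operations, so it is a reasonable fallback on targets that have no bit-reverse instruction.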
Some languages, notably Rust, do care more about making it possible to portably express what modern CPUs can do. foo.reverse_bits() (since Rust 1.37) and foo.swap_bytes() just work for any integer type on any ISA. For u32 specifically, see https://doc.rust-lang.org/std/primitive.u32.html#method.reverse_bits (u32 is Rust's equivalent of C uint32_t.)
Most mainstream C implementations have portable (across ISAs) builtins or (target-specific) intrinsics (like __REV() or __REV16()) for stuff like this.
The GNU dialect of C (GCC/clang/ICC and some others) includes __builtin_bswap32(input). See https://stackoverflow.com/questions/35133829/does-arm-gcc-have-a-builtin-function-for-the-assembly-rev-instruction for details. It's named after the x86 bswap instruction, but it's just a byte-reverse that GCC / clang compile to whatever instructions can do it efficiently on the target ISA.
There's also a __builtin_bswap16(uint16_t) for swapping the bytes of a 16-bit integer, like revsh except the C semantics don't include preserving the upper 16 bits of a 32-bit integer. (Because normally you don't care about that part.) See the GCC manual for the available GNU C builtins that aren't target-specific.
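A minimal sketch of wrapping these builtins, assuming a GNU C compatible compiler is detected via __GNUC__ and falling back to shifts and masks otherwise. The wrapper names byteswap32 and byteswap16 are invented for this example.

```c
#include <stdint.h>

/* Byte-reverse a 32-bit value. */
static inline uint32_t byteswap32(uint32_t x)
{
#if defined(__GNUC__)            /* GCC, clang and other GNU C compatible compilers */
    return __builtin_bswap32(x);
#else                            /* plain ISO C fallback */
    return ((x >> 24) & 0x000000FFu) |
           ((x >>  8) & 0x0000FF00u) |
           ((x <<  8) & 0x00FF0000u) |
           ((x << 24) & 0xFF000000u);
#endif
}

/* Byte-reverse a 16-bit value. */
static inline uint16_t byteswap16(uint16_t x)
{
#if defined(__GNUC__)
    return __builtin_bswap16(x);
#else
    return (uint16_t)((x >> 8) | (x << 8));
#endif
}
```

On Cortex-M3 you would expect byteswap32 to come out as a single rev, but it is worth confirming that in the generated assembly for your compiler version.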
There isn't a GNU C builtin or intrinsic for bitwise reverse that I could find in the manual or GCC arm-none-eabi 12.2 headers.
ARM documents an __rbit() intrinsic for their own compiler, but I think that's Keil's ARMCC, so there might not be any equivalent of that for GCC/clang.
@0___________ suggests https://github.com/ARM-software/CMSIS_5 for headers that define a function for that.
If worst comes to worst, GNU C inline asm is possible for GCC/clang, given appropriate #ifdefs. You might also want if (__builtin_constant_p(x)) to use a pure-C bit-reversal so constant-propagation can happen on compile-time constants, only using inline asm on runtime-variable values.
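    uint32_t output, input=...;
    #if defined(__aarch64__)
        // same instruction, but %w picks the 32-bit W register names;
        // plain %0 prints an X register, which would reverse all 64 bits
        asm("rbit %w0,%w1" : "=r"(output) : "r"(input));
    #elif defined(__arm__)
        asm("rbit %0,%1" : "=r"(output) : "r"(input));
    #else
        ...  // pure C fallback or something
    #endif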
Note that it doesn't need to be volatile because rbit is a pure function of the input operand. It's a good thing if GCC/clang are able to hoist this out of a loop. And it's a single asm instruction so we don't need an early-clobber.
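Putting those pieces together, a complete function might look something like the following sketch. The names rbit32 and rbit32_generic are invented here. It assumes any __arm__ target it is built for actually has rbit (ARMv6T2 or later, so Cortex-M3 is fine but Cortex-M0 is not), and on AArch64 it uses the %w operand modifier so the 32-bit form of the instruction is emitted.

```c
#include <stdint.h>

/* Pure C bit reversal: used on other targets and for compile-time constants. */
static inline uint32_t rbit32_generic(uint32_t x)
{
    uint32_t r = 0;
    for (unsigned i = 0; i < 32; i++) {
        if (x & (UINT32_C(1) << i))
            r |= UINT32_C(1) << (31 - i);   /* bit i moves to bit 31-i */
    }
    return r;
}

static inline uint32_t rbit32(uint32_t x)
{
#if defined(__GNUC__) && (defined(__arm__) || defined(__aarch64__))
    /* Only use asm for runtime values, so constants can still be folded. */
    if (!__builtin_constant_p(x)) {
        uint32_t out;
#if defined(__aarch64__)
        __asm__("rbit %w0, %w1" : "=r"(out) : "r"(x));  /* 32-bit W registers */
#else
        __asm__("rbit %0, %1" : "=r"(out) : "r"(x));
#endif
        return out;
    }
#endif
    return rbit32_generic(x);
}
```

The asm statement is deliberately not volatile, for the reason given above, so the compiler stays free to hoist it out of loops or remove it when the result is unused.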
This has the downside that the compiler can't fold a shift into it. For example, if you wanted to reverse the bits of a single byte, __rbit(x) >> 24 equals __rbit(x<<24), which could be done with rbit r0, r1, lsl #24. (I think).
With inline asm I don't think there's a way to tell the compiler that r1, lsl #24 is a valid expansion for the %1 input operand. Hmm, unless there's a machine-specific constraint for that? https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html - no, no mention of "shifted" or "flexible" source operand in the ARM section.
https://stackoverflow.com/questions/746171/efficient-algorithm-for-bit-reversal-from-msb-lsb-to-lsb-msb-in-c/64080281#64080281 shows an #ifdefed version with a working fallback that uses a bithack to reverse bits within a byte, then __builtin_bswap32 or MSVC _byteswap_ulong to reverse bytes.
Answer 4
Score: 1
Interestingly, ARM gcc seems to have improved its detection of byte order reversing recently. With version 11, it detects byte reversal whether it is done by bit shifting or by byte swapping through a pointer. In version 10 and earlier, however, the pointer method failed to produce the REV instruction.
    uint32_t endianize1 (uint32_t input)
    {
        return ((input >> 24) & 0x000000FF) |
               ((input >>  8) & 0x0000FF00) |
               ((input <<  8) & 0x00FF0000) |
               ((input << 24) & 0xFF000000);
    }

    uint32_t endianize2 (uint32_t input)
    {
        uint32_t output;
        uint8_t *in8  = (uint8_t*)&input;
        uint8_t *out8 = (uint8_t*)&output;
        out8[0] = in8[3];
        out8[1] = in8[2];
        out8[2] = in8[1];
        out8[3] = in8[0];
        return output;
    }
    endianize1:
        rev     r0, r0
        bx      lr
    endianize2:
        mov     r3, r0
        movs    r0, #0
        lsrs    r2, r3, #24
        bfi     r0, r2, #0, #8
        ubfx    r2, r3, #16, #8
        bfi     r0, r2, #8, #8
        ubfx    r2, r3, #8, #8
        bfi     r0, r2, #16, #8
        bfi     r0, r3, #24, #8
        bx      lr
https://godbolt.org/z/E3xGvG9qq
So, as we wait for optimisers to improve, there are certainly ways you can help the compiler understand your intent and take good advantage of the instruction set (without resorting to micro optimisations or inline assembler). But it's likely that this will involve a good understanding of the architecture by the programmer, and examination of the output assembler.
Take advantage of http://godbolt.org to help examine the compiler output, and see what produces the best output.