memcpy on AARCH64 yielding unaligned Data Abort Exception, ARM GNU Toolchain or newlibc Bug?
Question
I've been using the ARM GCC release aarch64-none-elf-gcc-11.2.1 for some time in a large bare-metal project that has successfully used libc functions (malloc/memcpy) many times without issue, with these options:
-L$AARCH64_GCC_PATH/aarch64-none-elf/lib -lc -lnosys -lg
I recently saw an exception due to an unaligned access during memcpy despite compiling with -mstrict-align.
After isolating the issue and creating a unit test, I believe I've found a bug. Please ignore the addresses in the objdump and the memcpy call; I just made them up for this test.
```c
// unit test
#include <stdlib.h>
#include <string.h>

volatile int bssTest;

void swap(int a, int b) {
    memcpy((void *)0x500, (void *)0x1000, 0xc);
}
```
0000000000060040 <memcpy>:
60040: f9800020 prfm pldl1keep, [x1]
60044: 8b020024 add x4, x1, x2
60048: 8b020005 add x5, x0, x2
6004c: f100405f cmp x2, #0x10
60050: 54000209 b.ls 60090 <memcpy+0x50> // b.plast
60054: f101805f cmp x2, #0x60
60058: 54000648 b.hi 60120 <memcpy+0xe0> // b.pmore
6005c: d1000449 sub x9, x2, #0x1
60060: a9401c26 ldp x6, x7, [x1]
60064: 37300469 tbnz w9, #6, 600f0 <memcpy+0xb0>
60068: a97f348c ldp x12, x13, [x4, #-16]
6006c: 362800a9 tbz w9, #5, 60080 <memcpy+0x40>
60070: a9412428 ldp x8, x9, [x1, #16]
60074: a97e2c8a ldp x10, x11, [x4, #-32]
60078: a9012408 stp x8, x9, [x0, #16]
6007c: a93e2caa stp x10, x11, [x5, #-32]
60080: a9001c06 stp x6, x7, [x0]
60084: a93f34ac stp x12, x13, [x5, #-16]
60088: d65f03c0 ret
6008c: d503201f nop
60090: f100205f cmp x2, #0x8
60094: 540000e3 b.cc 600b0 <memcpy+0x70> // b.lo, b.ul, b.last
60098: f9400026 ldr x6, [x1]
6009c: f85f8087 ldur x7, [x4, #-8]
600a0: f9000006 str x6, [x0]
600a4: f81f80a7 stur x7, [x5, #-8]
600a8: d65f03c0 ret
600ac: d503201f nop
600b0: 361000c2 tbz w2, #2, 600c8 <memcpy+0x88>
600b4: b9400026 ldr w6, [x1]
600b8: b85fc087 ldur w7, [x4, #-4]
600bc: b9000006 str w6, [x0]
600c0: b81fc0a7 stur w7, [x5, #-4]
600c4: d65f03c0 ret
600c8: b4000102 cbz x2, 600e8 <memcpy+0xa8>
600cc: d341fc49 lsr x9, x2, #1
600d0: 39400026 ldrb w6, [x1]
600d4: 385ff087 ldurb w7, [x4, #-1]
600d8: 38696828 ldrb w8, [x1, x9]
600dc: 39000006 strb w6, [x0]
600e0: 38296808 strb w8, [x0, x9]
600e4: 381ff0a7 sturb w7, [x5, #-1]
600e8: d65f03c0 ret
600ec: d503201f nop
600f0: a9412428 ldp x8, x9, [x1, #16]
600f4: a9422c2a ldp x10, x11, [x1, #32]
600f8: a943342c ldp x12, x13, [x1, #48]
600fc: a97e0881 ldp x1, x2, [x4, #-32]
60100: a97f0c84 ldp x4, x3, [x4, #-16]
60104: a9001c06 stp x6, x7, [x0]
60108: a9012408 stp x8, x9, [x0, #16]
6010c: a9022c0a stp x10, x11, [x0, #32]
60110: a903340c stp x12, x13, [x0, #48]
60114: a93e08a1 stp x1, x2, [x5, #-32]
60118: a93f0ca4 stp x4, x3, [x5, #-16]
6011c: d65f03c0 ret
60120: 92400c09 and x9, x0, #0xf
60124: 927cec03 and x3, x0, #0xfffffffffffffff0
60128: a940342c ldp x12, x13, [x1]
6012c: cb090021 sub x1, x1, x9
60130: 8b090042 add x2, x2, x9
60134: a9411c26 ldp x6, x7, [x1, #16]
60138: a900340c stp x12, x13, [x0]
6013c: a9422428 ldp x8, x9, [x1, #32]
60140: a9432c2a ldp x10, x11, [x1, #48]
60144: a9c4342c ldp x12, x13, [x1, #64]!
60148: f1024042 subs x2, x2, #0x90
6014c: 54000169 b.ls 60178 <memcpy+0x138> // b.plast
60150: a9011c66 stp x6, x7, [x3, #16]
60154: a9411c26 ldp x6, x7, [x1, #16]
60158: a9022468 stp x8, x9, [x3, #32]
6015c: a9422428 ldp x8, x9, [x1, #32]
60160: a9032c6a stp x10, x11, [x3, #48]
60164: a9432c2a ldp x10, x11, [x1, #48]
60168: a984346c stp x12, x13, [x3, #64]!
6016c: a9c4342c ldp x12, x13, [x1, #64]!
60170: f1010042 subs x2, x2, #0x40
60174: 54fffee8 b.hi 60150 <memcpy+0x110> // b.pmore
60178: a97c0881 ldp x1, x2, [x4, #-64]
6017c: a9011c66 stp x6, x7, [x3, #16]
60180: a97d1c86 ldp x6, x7, [x4, #-48]
60184: a9022468 stp x8, x9, [x3, #32]
60188: a97e2488 ldp x8, x9, [x4, #-32]
6018c: a9032c6a stp x10, x11, [x3, #48]
60190: a97f2c8a ldp x10, x11, [x4, #-16]
60194: a904346c stp x12, x13, [x3, #64]
60198: a93c08a1 stp x1, x2, [x5, #-64]
6019c: a93d1ca6 stp x6, x7, [x5, #-48]
601a0: a93e24a8 stp x8, x9, [x5, #-32]
601a4: a93f2caa stp x10, x11, [x5, #-16]
601a8: d65f03c0 ret
601ac: 00000000 udf #0
When performing a memcpy on device-type memory where size = 0x8 + 0x4·n (n any natural number), an exception is thrown even when care is taken to keep the src/dst pointers aligned. The instruction at 6009c in the objdump of memcpy above, `ldur x7, [x4, #-8]`, loads the last 8 bytes of the source buffer. For a size-0xc copy this is an LDUR of a 32-bit-aligned address ending in 0x4 into a 64-bit x register, which results in a Data Abort on device-type memory.
While I understand that care must be taken when using stdlib functions in a bare-metal application, due to the nature of our codebase it would be very difficult to ensure that every call to memcpy has a 64-bit-aligned size. Shouldn't newlib/the compiler ensure that memcpy uses 32-bit w registers for any 32-bit-aligned memcpy anyway? Especially with -mstrict-align?
What are my options for an immediate fix in the meantime? I suppose I could override the definition of memcpy, but what source should I base the replacement implementation on in that case?
Any help on this is appreciated, thanks.
Answer 1
Score: 3
Actually, I think the larger "bug" is in your expectations. You simply can't use `memcpy` or any other library function on device memory.
The default assumption of modern optimizing compilers and libraries is that they are operating on normal memory, whose access has no side effects and which is not being concurrently accessed by any other software or hardware (*). So unaligned access (which gcc and newlib assume by default is okay) is the least of your worries. It is totally fair game for `memcpy` to do its work with any combination of loads or stores whatsoever. Including:
- Three 4-byte accesses
- An 8-byte and a 4-byte access
- Twelve one-byte accesses
- Two overlapping eight-byte accesses
- A 16-byte load beyond the bounds of the source buffer, if it can prove that it will not cross a page boundary
- Multiple loads of the same address
- Multiple stores to the same address, of which any but the last could be the wrong values
Using `-mstrict-align` doesn't really help. First, as you already noticed, it only affects the code which you actually compile with it; it does nothing about library code that was already built. You would have to rebuild all of newlib with this option, and then audit all the assembly code in newlib separately. But it doesn't help with any of the other issues above, all of which are potentially disastrous for device memory. (And as amonakov noted, since `-mstrict-align` is rarely used, it can be prone to compiler bugs.)
With device memory, you need exact control over how many loads and stores are done, to which addresses, of which sizes, and in which order. There is only one mechanism in C/C++ to get that, namely `volatile`. So all accesses to device memory need to be done explicitly through `volatile` pointers, or using assembly.
If you need 32-bit accesses done, I think the only safe way to write your example code is:
```c
#include <stdint.h>

volatile uint32_t *dest = (volatile uint32_t *)0x500;
volatile uint32_t *src  = (volatile uint32_t *)0x1000;

for (int i = 0; i < 3; i++)
    dest[i] = src[i];
```
And if you do this for all device memory, then you can safely use compiled code and library functions on your normal memory, without needing `-mstrict-align` either. (Provided that you properly marked all normal memory as such in the page tables, and that the `SCTLR_ELx.A` bit is cleared.)
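If this pattern is needed in many places, it could be wrapped in a small helper. The name `device_memcpy32` is hypothetical; this is a minimal sketch assuming both regions are 32-bit aligned and the size is a multiple of 4:

```c
#include <stdint.h>
#include <stddef.h>

/* Copy 'bytes' (assumed to be a multiple of 4) between 32-bit-aligned
 * regions using only 32-bit accesses. The volatile qualifiers force the
 * compiler to emit exactly one 32-bit load and one 32-bit store per
 * word, in order, without merging, widening, or reordering them. */
static void device_memcpy32(volatile uint32_t *dst,
                            const volatile uint32_t *src,
                            size_t bytes)
{
    for (size_t i = 0; i < bytes / 4u; i++)
        dst[i] = src[i];
}
```

For the example above it would be called as `device_memcpy32((volatile uint32_t *)0x500, (const volatile uint32_t *)0x1000, 0xc);`.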
(*) The C/C++ data race rules do allow multiple readers to concurrently access the same memory. So you can assume that memory which you do not explicitly write will not be written at all. Beyond that, the compiler has nearly complete liberty to invent / discard / combine / reorder loads and stores in any fashion.