How can I copy a string from a source to a destination using inline assembly?

Question

I'm new to assembly, and I'm trying to copy a string from an input parameter, const char* source, into another string given by the parameter char* destination. I have to do it with x86 inline assembly. Here is my code:

Note: volatile marks that the variable or code region can change unexpectedly from an external source.

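void samestring(const char* start, char* end) {
	asm volatile (
		"mov %[src], %%rsi\n"      // load the source pointer from memory
		"mov %[dest], %%rdi\n"     // load the destination pointer from memory
		"xor %%al, %%al\n"         // zero AL (never used afterwards)
		"copy_loop:\n"
		"movb (%%rsi), %%dl\n"     // load one byte from the source
		"movb %%dl, (%%rdi)\n"     // store it to the destination
		"inc %%rsi\n"
		"inc %%rdi\n"
		"cmpb $0, %%dl\n"          // was it the terminating null byte?
		"jne copy_loop\n"          // if not, copy the next byte
		:
		: [src] "m" (start), [dest] "m" (end)
		: "memory", "%rsi", "%rdi", "%rax", "%rdx"
		);
}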

I found this code in a Reddit post about a similar problem. Since I'm new to assembly, I don't know whether this method is efficient or whether there are ways to improve it, so I would like assembly experts to tell me what I can and should change in the code above to make it less time-consuming.

Any help would be greatly appreciated.

Answer 1

Score: 2

That's hilariously inefficient, both in the way it gets operands into the asm statement and in the loop itself, which copies 1 byte at a time.

If you care about efficiency on x86-64, you should be using SSE2 to load and check 16 bytes at a time, like glibc's hand-written asm for strcpy (or AVX2 for 32 bytes): https://codebrowser.dev/glibc/glibc/sysdeps/x86_64/strcpy.S.html. Note that it has to reach an alignment boundary first, e.g. check that the pointer isn't in the last 16 bytes of a page and then do one unaligned vector, as with strlen.
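To make the 16-bytes-at-a-time idea concrete, here is a minimal sketch using SSE2 intrinsics in C rather than hand-written asm. The name strcpy_sse2_sketch is hypothetical, and this is not glibc's implementation: instead of the page-boundary check and unaligned first vector described above, it takes the simpler route of copying single bytes until the source pointer is 16-byte aligned. It assumes an x86 target with SSE2. The aligned loads may read a few bytes past the terminator inside the same 16-byte block, which cannot fault but is outside what ISO C guarantees, so tools such as AddressSanitizer may report it.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>     /* uintptr_t */

/* Hypothetical helper: a simplified sketch, not glibc's strcpy. */
static char *strcpy_sse2_sketch(char *dst, const char *src)
{
    char *ret = dst;

    /* Copy byte by byte until src is 16-byte aligned, so the aligned
       16-byte loads below can never cross into an unmapped page. */
    while (((uintptr_t)src & 15) != 0) {
        if ((*dst++ = *src++) == '\0')
            return ret;
    }

    const __m128i zero = _mm_setzero_si128();
    for (;;) {
        /* Load 16 source bytes and compare each one against 0. */
        __m128i chunk = _mm_load_si128((const __m128i *)src);
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
        if (mask != 0)
            break;                      /* the terminator is in this chunk */
        _mm_storeu_si128((__m128i *)dst, chunk);  /* dst may be unaligned */
        src += 16;
        dst += 16;
    }

    /* Finish the last partial chunk one byte at a time. */
    while ((*dst++ = *src++) != '\0')
        ;
    return ret;
}
```

Calling strcpy_sse2_sketch(buf, text) behaves like strcpy for a sufficiently large buf. A tuned version would use the bit position in mask to finish the tail without a byte loop.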

That is, unless you're optimizing for string lengths of maybe 0 to 5 bytes and don't care at all about performance for long strings. With AVX-512 masked stores (efficient on Intel, very slow on AMD Zen 4), vectors might be an efficient way to handle even short strings, with no risk of branch mispredictions based on different short lengths, since every string shorter than 32 bytes branches the same way.


Inline asm details

The code forces the compiler to store the pointers to memory (the "m" constraint) so that the asm template can reload them. Instead, it could ask for them in registers with "+S" (RSI) and "+D" (RDI), or better, let the compiler choose the registers with [src] "+r"(source) etc.

It also zeros AL inefficiently for no reason, and it has a false dependency on RDX because it loads with movb instead of movzbl (%[src]), %%edx (see "How to load a single byte from address in assembly").

test %dl, %dl is a more efficient way to set FLAGS than cmpb $0, %dl.

Other than that, the loop itself is naive but not too bad if you want to keep it simple as a beginner exercise and only copy 1 byte at a time.
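Putting those points together, a byte-at-a-time version might look like the sketch below. It assumes GNU C inline asm (GCC or Clang) on x86-64 with AT&T syntax. The function name copy_bytes_asm and the operand name tmp are hypothetical, not from the question. It also swaps the named label copy_loop for a numbered local label, since a named label can produce a duplicate-symbol error if the compiler emits the asm statement more than once (for example after inlining).

```c
/* Hypothetical helper: byte-at-a-time copy, including the terminator. */
static inline void copy_bytes_asm(const char *src, char *dst)
{
    unsigned int tmp;   /* scratch register picked by the compiler */

    __asm__ volatile (
        "1:\n\t"
        "movzbl (%[src]), %[tmp]\n\t"   /* zero-extending load: no false dependency */
        "movb   %b[tmp], (%[dst])\n\t"  /* store the low byte of tmp */
        "inc    %[src]\n\t"
        "inc    %[dst]\n\t"
        "test   %[tmp], %[tmp]\n\t"     /* sets ZF if the byte was 0 */
        "jnz    1b\n\t"                 /* loop until the terminator is copied */
        : [src] "+r" (src), [dst] "+r" (dst), [tmp] "=&q" (tmp)
        :
        : "memory", "cc"
    );
}
```

Both pointers are read-write "+r" operands, so the compiler passes them in registers of its choice and knows the asm modifies them. No register has to be named in the clobber list.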

huangapple
  • Posted on July 12, 2023 at 23:31:51
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76672242.html