Assembly handwritten function slower than GCC compiled function
Question
I decided to write a string-length function in assembly (using FASM).
My function takes a string, whether or not it is aligned to 8 bytes, and checks its alignment. If the string is aligned, the main loop starts right away. Otherwise, the first 8 characters are checked one by one, then the pointer is rounded up to the next 8-byte boundary and the main loop continues from there.
There is no "end of the memory page" problem: every load in the main loop is an aligned 8-byte load, and an aligned 8-byte load can never cross a page boundary.
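The page-boundary argument can be checked with a little arithmetic: an aligned 8-byte load starts at a multiple of 8, and the page size is itself a multiple of 8, so the last byte of the load always lies in the same page as the first. Below is a minimal sketch in C; the 4096-byte page size and the helper names `crosses_page` and `aligned_qword_loads_stay_in_page` are assumptions made for illustration, not part of the original code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u  /* assumed page size (typical for x86-64) */

/* Returns 1 if a load of `size` bytes starting at `addr` touches two pages. */
static int crosses_page(uintptr_t addr, size_t size)
{
    assert(size > 0);
    uintptr_t first_page = addr / PAGE_SIZE;
    uintptr_t last_page  = (addr + size - 1) / PAGE_SIZE;
    return first_page != last_page;
}

/* Returns 1 if no 8-byte-aligned qword load in the first two pages
   crosses a page boundary. */
static int aligned_qword_loads_stay_in_page(void)
{
    for (uintptr_t addr = 0; addr < 2 * PAGE_SIZE; addr += 8)
        if (crosses_page(addr, 8))
            return 0;
    return 1;
}
```

An unaligned qword load at offset 4090 would cross into the next page, while the aligned load at 4088 ends exactly on the last byte of the page.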
The problem is that I also implemented a C version and compiled it, so now I have two assembly listings: the one I wrote by hand and the one GCC generated from C. The C version is up to 1.5x faster than my handwritten assembly! In my code everything looks fine. I even aligned the jump targets to 16 bytes, and no nop is executed except for one outside the loop (the padding between .align8 and .loop), which should cost almost nothing.
I can't figure out why my pure assembly code is 1.5x slower than the GCC version.
My Assembly source-code :
align 16
slen:
mov r8, rcx
test cl, 7
jz .loop
xor eax, eax
cmp BYTE [rcx], al
je SHORT .ret
cmp BYTE [rcx+1], al
je SHORT .ret1
cmp BYTE [rcx+2], al
je SHORT .ret2
cmp BYTE [rcx+3], al
je SHORT .ret3
cmp BYTE [rcx+4], al
je SHORT .ret4
cmp BYTE [rcx+5], al
je SHORT .ret5
cmp BYTE [rcx+6], al
je SHORT .ret6
cmp BYTE [rcx+7], al
jne SHORT .align8
mov al, 7
ret
align 16
.ret: ret
align 16
.ret1: mov al, 1
ret
align 16
.ret2: mov al, 2
ret
align 16
.ret3: mov al, 3
ret
align 16
.ret4: mov al, 4
ret
align 16
.ret5: mov al, 5
ret
align 16
.ret6: mov al, 6
ret
align 16
.align8:
lea rcx, [rcx+7]
and rcx, (-8)
align 16
.loop: mov rax, QWORD [rcx]
test al, al
jz SHORT .end
test ah, ah
jz SHORT .end.1
test eax, 0x00ff0000
jz SHORT .end.2
test eax, 0xff000000
jz SHORT .end.3
shr rax, 32
test al, al
jz SHORT .end.4
test ah, ah
jz SHORT .end.5
test eax, 0x00ff0000
jz SHORT .end.6
test eax, 0xff000000
jz SHORT .end.7
add rcx, 8
jmp SHORT .loop
align 16
.end: mov rax, rcx
sub rax, r8
ret
align 16
.end.1:
lea rax, [rcx+1]
sub rax, r8
ret
.end.2:
lea rax, [rcx+2]
sub rax, r8
ret
.end.3:
lea rax, [rcx+3]
sub rax, r8
ret
.end.4:
lea rax, [rcx+4]
sub rax, r8
ret
.end.5:
lea rax, [rcx+5]
sub rax, r8
ret
.end.6:
lea rax, [rcx+6]
sub rax, r8
ret
.end.7:
lea rax, [rcx+7]
sub rax, r8
ret
The GCC version :
align 16
slen:
test cl, 7
je .L18
xor eax, eax
cmp BYTE [rcx], 0
je .L1
cmp BYTE [rcx+1], 0
mov eax, 1
je .L1
cmp BYTE [rcx+2], 0
mov eax, 2
je .L1
cmp BYTE [rcx+3], 0
mov eax, 3
je .L1
cmp BYTE [rcx+4], 0
mov eax, 4
je .L1
cmp BYTE [rcx+5], 0
mov eax, 5
je .L1
cmp BYTE [rcx+6], 0
mov eax, 6
je .L1
cmp BYTE [rcx+7], 0
mov eax, 7
je .L1
lea rax, [rcx+7]
and rax, -8
jmp .L47
align 16
.L18:
mov rax, rcx
jmp .L47
align 16
.L40:
test dh, dh
je .L49
test edx, 16711680
je .L50
test edx, 4278190080
je .L51
shr rdx, 32
test dl, dl
je .L52
test dh, dh
je .L53
test edx, 16711680
je .L54
test edx, 4278190080
je .L55
add rax, 8
.L47:
mov rdx, QWORD [rax]
test dl, dl
jne .L40
sub eax, ecx
.L1:
ret
align 16
.L49:
sub rax, rcx
add eax, 1
ret
align 16
.L50:
sub rax, rcx
add eax, 2
ret
align 16
.L51:
sub rax, rcx
add eax, 3
ret
align 16
.L52:
sub rax, rcx
add eax, 4
ret
align 16
.L53:
sub rax, rcx
add eax, 5
ret
align 16
.L54:
sub rax, rcx
add eax, 6
ret
align 16
.L55:
sub rax, rcx
add eax, 7
ret
My function test result :
string length => 336
loop execution times => 10000000
total execution time => 0.772015
GCC function test result :
string length => 336
loop execution times => 10000000
total execution time => 0.522015
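To put the two timings into per-call terms, the arithmetic can be written out as a small C sketch. The helper names `ns_per_call` and `slowdown` are made up for illustration; the inputs are the figures reported above.

```c
#include <assert.h>

/* Average time per call in nanoseconds. */
static double ns_per_call(double total_seconds, long calls)
{
    assert(calls > 0);
    return total_seconds * 1e9 / (double)calls;
}

/* How many times slower `slow_seconds` is than `fast_seconds`. */
static double slowdown(double slow_seconds, double fast_seconds)
{
    assert(fast_seconds > 0.0);
    return slow_seconds / fast_seconds;
}
```

With 10000000 calls, 0.772015 s works out to about 77.2 ns per call and 0.522015 s to about 52.2 ns per call, a ratio of roughly 1.48.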
What is the problem? Why is my function 1.5x slower when everything looks fine?
My test string is aligned to 8 bytes, so you can ignore the byte-by-byte prologue and the alignment step.
Is there a problem with how I aligned my labels, or does the problem come from somewhere else?
ABI -> x64 (Windows)
CPU (Test) => i7-7800X
My C test application source-code :
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>
unsigned int
slen_by_me(const char *);
unsigned int
slen_gcc(const char *);
int main() {
static const char *str="WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW";
LARGE_INTEGER frequency;
LARGE_INTEGER start;
LARGE_INTEGER end;
double interval;
unsigned int l = 0;
QueryPerformanceFrequency(&frequency);
QueryPerformanceCounter(&start);
for (int i = 0; i < 10000000; i++) {
l += slen_gcc(str);
}
QueryPerformanceCounter(&end);
interval = (double) (end.QuadPart - start.QuadPart) / frequency.QuadPart;
printf("%f\n%u\n", interval, l);
return 0;
}
The FASM source of my object file, which contains both slen functions and is linked with the C tester :
format MS64 COFF
public slen_gcc
public slen_by_me
section '.text' code readable executable align 64
align 16
slen_gcc:
test cl, 7
je .L18
xor eax, eax
cmp BYTE [rcx], 0
je .L1
cmp BYTE [rcx+1], 0
mov eax, 1
je .L1
cmp BYTE [rcx+2], 0
mov eax, 2
je .L1
cmp BYTE [rcx+3], 0
mov eax, 3
je .L1
cmp BYTE [rcx+4], 0
mov eax, 4
je .L1
cmp BYTE [rcx+5], 0
mov eax, 5
je .L1
cmp BYTE [rcx+6], 0
mov eax, 6
je .L1
cmp BYTE [rcx+7], 0
mov eax, 7
je .L1
lea rax, [rcx+7]
and rax, -8
jmp .L47
align 16
.L18:
mov rax, rcx
jmp .L47
align 16
.L40:
test dh, dh
je .L49
test edx, 16711680
je .L50
test edx, 4278190080
je .L51
shr rdx, 32
test dl, dl
je .L52
test dh, dh
je .L53
test edx, 16711680
je .L54
test edx, 4278190080
je .L55
add rax, 8
.L47:
mov rdx, QWORD [rax]
test dl, dl
jne .L40
sub eax, ecx
.L1:
ret
align 16
.L49:
sub rax, rcx
add eax, 1
ret
align 16
.L50:
sub rax, rcx
add eax, 2
ret
align 16
.L51:
sub rax, rcx
add eax, 3
ret
align 16
.L52:
sub rax, rcx
add eax, 4
ret
align 16
.L53:
sub rax, rcx
add eax, 5
ret
align 16
.L54:
sub rax, rcx
add eax, 6
ret
align 16
.L55:
sub rax, rcx
add eax, 7
ret
align 16
slen_by_me:
mov r8, rcx
test cl, 7
jz .loop
xor eax, eax
cmp BYTE [rcx], al
je SHORT .ret
cmp BYTE [rcx+1], al
je SHORT .ret1
cmp BYTE [rcx+2], al
je SHORT .ret2
cmp BYTE [rcx+3], al
je SHORT .ret3
cmp BYTE [rcx+4], al
je SHORT .ret4
cmp BYTE [rcx+5], al
je SHORT .ret5
cmp BYTE [rcx+6], al
je SHORT .ret6
cmp BYTE [rcx+7], al
jne SHORT .align8
mov al, 7
ret
align 16
.ret: ret
align 16
.ret1: mov al, 1
ret
align 16
.ret2: mov al, 2
ret
align 16
.ret3: mov al, 3
ret
align 16
.ret4: mov al, 4
ret
align 16
.ret5: mov al, 5
ret
align 16
.ret6: mov al, 6
ret
align 16
.align8:
lea rcx, [rcx+7]
and rcx, (-8)
align 16
.loop: mov rax, QWORD [rcx]
test al, al
jz SHORT .end
test ah, ah
jz SHORT .end.1
test eax, 0x00ff0000
jz SHORT .end.2
test eax, 0xff000000
jz SHORT .end.3
shr rax, 32
test al, al
jz SHORT .end.4
test ah, ah
jz SHORT .end.5
test eax, 0x00ff0000
jz SHORT .end.6
test eax, 0xff000000
jz SHORT .end.7
add rcx, 8
jmp SHORT .loop
align 16
.end: mov rax, rcx
sub rax, r8
ret
align 16
.end.1:
lea rax, [rcx+1]
sub rax, r8
ret
.end.2:
lea rax, [rcx+2]
sub rax, r8
ret
.end.3:
lea rax, [rcx+3]
sub rax, r8
ret
.end.4:
lea rax, [rcx+4]
sub rax, r8
ret
.end.5:
lea rax, [rcx+5]
sub rax, r8
ret
.end.6:
lea rax, [rcx+6]
sub rax, r8
ret
.end.7:
lea rax, [rcx+7]
sub rax, r8
ret
Also the C version of slen
int
slen(const char *str) {
const char *start=str;
if(((unsigned long long)str & 7) != 0) {
if(str[0] == 0x00)
return 0;
if(str[1] == 0x00)
return 1;
if(str[2] == 0x00)
return 2;
if(str[3] == 0x00)
return 3;
if(str[4] == 0x00)
return 4;
if(str[5] == 0x00)
return 5;
if(str[6] == 0x00)
return 6;
if(str[7] == 0x00)
return 7;
str=(const char *)(((unsigned long long)str + 7) & (-8));
}
do {
unsigned long long bytes=(*(unsigned long long*)(str));
if((unsigned char)bytes==0x00)
return (int)(str-start);
if((bytes & 0x0000ff00)==0)
return (int)(str-start+1);
if((bytes & 0x00ff0000)==0)
return (int)(str-start+2);
if((bytes & 0xff000000)==0)
return (int)(str-start+3);
bytes >>= 32;
if((unsigned char)bytes==0x00)
return (int)(str-start+4);
if((bytes & 0x0000ff00)==0)
return (int)(str-start+5);
if((bytes & 0x00ff0000)==0)
return (int)(str-start+6);
if((bytes & 0xff000000)==0)
return (int)(str-start+7);
str+=8;
} while (1);
}
Answer 1
Score: 3
Allow me to refer you to a function from my pure assembly library (coming soon).
Your question is about strlen, which is named "str_length" in my library and is developed for both the Microsoft x64 ABI and the System V AMD64 ABI.
I remember that a few years ago there was a C/C++ function for this kind of string-length calculation:
size_t my_strlen(const char *s) {
size_t len = 0;
for(;;) {
unsigned x = *(unsigned*)s;
if((x & 0xFF) == 0) return len;
if((x & 0xFF00) == 0) return len + 1;
if((x & 0xFF0000) == 0) return len + 2;
if((x & 0xFF000000) == 0) return len + 3;
s += 4, len += 4;
}
}
It was even called "FAST strlen", although it is really not that fast. So I decided to write my own "FAST strlen" in assembly.
On x86-64 you can load an 8-byte chunk into a 64-bit register, so why load only 4 bytes at a time, as 'size_t my_strlen(const char *s)' does?
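For comparison, here is a minimal C sketch of the same idea with 8-byte loads: walk byte by byte until the pointer is 8-byte aligned, then load one aligned qword per step. The name `strlen_qword` is hypothetical, and this is not the library function shown further down. It assumes that every byte up to the end of the aligned qword holding the terminator is readable, which strict C only guarantees when those bytes belong to the same object.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: byte-wise until 8-byte aligned, then one aligned 8-byte load per step.
   Assumes the bytes up to the end of the last aligned qword are readable. */
static size_t strlen_qword(const char *s)
{
    const char *p = s;

    /* Prologue: advance one byte at a time until p is 8-byte aligned. */
    while (((uintptr_t)p & 7) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    assert(((uintptr_t)p & 7) == 0);
    for (;;) {
        uint64_t w;
        memcpy(&w, p, sizeof w);   /* aliasing-safe 8-byte load */

        /* Inspect the bytes in memory order, so the result does not
           depend on the machine's endianness. */
        const unsigned char *b = (const unsigned char *)&w;
        for (size_t i = 0; i < 8; i++)
            if (b[i] == 0)
                return (size_t)(p - s) + i;
        p += 8;
    }
}
```

A compiler will usually turn the `memcpy` into a single 8-byte load; how it lays out the eight byte checks is up to the optimizer, which is exactly the part the handwritten versions in this thread try to control.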
JCC Erratum
About the 'JCC Erratum': there are still a great many Skylake CPUs in the world, and by world I mean data centers (check a Hetzner data center and you will find plenty of Skylake and older CPUs). Handling it is not optional; you MUST take care of it. But it is very important to handle it without adding a NOP or even prefixes, because doing so creates new problems on other CPUs. You can handle it by splitting off small new branches and moving some code into a fresh 32-byte chunk (but don't make it too heavy).
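As a rough aid for checking a listing by hand, the condition can be written as a small C helper: a jump is affected when its bytes cross a 32-byte boundary or when it ends exactly on one. The function name `jcc_erratum_affected` is made up for this sketch, and it assumes you already know the offset and encoded length of each jump (for a macro-fused cmp/test + jcc pair, the offset of the cmp/test and the combined length).

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if a jump occupying bytes [start, start + len) crosses a
   32-byte boundary or ends exactly on one, 0 otherwise.
   For a macro-fused cmp/test + jcc pair, pass the start of the
   cmp/test and the combined length of both instructions. */
static int jcc_erratum_affected(uint64_t start, unsigned len)
{
    assert(len > 0);
    uint64_t end = start + len;                      /* one past the last byte */
    int crosses = (start / 32) != ((end - 1) / 32);  /* spans two 32-byte blocks */
    int ends_on_boundary = (end % 32) == 0;          /* last byte is byte 31 */
    return crosses || ends_on_boundary;
}
```

For example, a 4-byte pair starting at offset 30 crosses the boundary at 32, and one starting at offset 28 ends on it; both count as affected, while a 6-byte jump at offset 0 does not.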
TAKE CARE OF LOOP TAIL JUMP
Another subject is the jump at the tail of the loop. Again, you MUST give your loop a tail and enter the loop with a jmp (unconditional jump) to that tail, because of branch prediction (read Agner Fog's documents on this subject (I love this guy for no reason xD)). Also, as Mr. Peter Cordes mentioned, if you look at how GCC lays out its jumps, you will find the solution for building the loop and its jumps.
TAKE CARE OF BRANCH ALIGNMENT
Yes, take care of branch alignment (16-byte boundaries), especially for the targets you jump to many times (and you, the asker, have already handled that well).
GCC REALLY ?! WHY NOT MACRO-FUSED ?
Well, you (the asker) did the right thing. You used a register for the cmp in the unaligned path, so you get the benefit of macro-fusion. But in the code generated by GCC you can see that cmp BYTE PTR [rcx], 0 is used. That removes the benefit of macro-fusion from the code (GCC's code, that is).
Of course, GCC did this to handle the padding, but it's really not acceptable.
An example of this situation in the uiCA tool:
0000000000000000 <.text>:
0: 80 39 00 cmp BYTE PTR [rcx],0x0
3: 0f 84 00 00 00 00 je 0x9
9: 38 01 cmp BYTE PTR [rcx],al
b: 0f 84 00 00 00 00 je 0x11
In the uiCA output, the second cmp/je pair is marked with the M flag, which stands for 'Macro-fused with previous instruction'.
> Macro Fusion is restricted to 16-bit and 32-bit mode only (including
> 32-bit compatibility sub-mode in x86-64). CMP and TEST can fuse when
> comparing:
>
> REG-REG. (e.g, CMP EAX,ECX; JZ label)
> REG-IMM. (e.g., CMP EAX,0x80; JZ label)
> REG-MEM. (e.g., CMP EAX,[ECX]; JZ label)
> MEM-REG. (e.g., CMP [EAX],ECX; JZ label)
>
> CMP and TEST can not be fused when comparing MEM-IMM (e.g. CMP
> [EAX],0x80; JZ label)
And finally, about the function and its performance test (according to your needs): here is the version for the Microsoft x64 ABI.
; libASM, independent standard libraries in Assembly (programming-language).
; For more information, please visit the libASM website (www.libasm.com).
; Copyright (C) 2023 Mr. Alireza Saeidipour. All rights reserved.
; Published by SOURCEBRING, under its international legal terms and conditions.
; For more information, please visit the SOURCEBRING website (www.sourcebring.com).
; “FAILURE GUARANTEES SUCCESS”
; — Alireza Saeidipour
align.function
str_length:
mov r8, rcx
test cl, 7
jz @f
xor eax, eax
cmp BYTE [rcx], al
je SHORT .len0
cmp BYTE [rcx+1], al
jne SHORT .unaligned_continue
mov al, 1
ret
align.branch32
.unaligned_continue:
cmp BYTE [rcx+2], al
je SHORT .len2
cmp BYTE [rcx+3], al
je SHORT .len3
cmp BYTE [rcx+4], al
je SHORT .len4
cmp BYTE [rcx+5], al
je SHORT .len5
cmp BYTE [rcx+6], al
je .len6
cmp BYTE [rcx+7], al
je .len7
lea r8, [rcx+7]
and r8, (-8)
jmp @f
align.branch
.len0: ret
align.branch
.len2: mov eax, 2
ret
align.branch
.len3: mov eax, 3
ret
align.branch
.len4: mov eax, 4
ret
align.branch
.len5: mov eax, 5
ret
align.branch
.len6: mov eax, 6
ret
align.branch
.len7: mov eax, 7
ret
align.branch
.return_add7:
lea rax, [r8+7]
sub rax, r9
ret
align.branch
@@: mov r9, rcx
mov ecx, 0x00ff0000
mov edx, 0xff000000
jmp SHORT @f
align.branch32
.loop: test eax, ecx
jz SHORT .return_add2
test eax, edx
jz SHORT .return_add3
shr rax, 32
test al, al
jz SHORT .return_add4
test ah, ah
jz SHORT .return_add5
test eax, ecx
jz SHORT .return_add6
test eax, edx
jz SHORT .return_add7
add r8, 8
@@: mov rax, QWORD [r8]
test al, al
jz SHORT .return
test ah, ah
jnz SHORT .loop
lea rax, [r8+1]
sub rax, r9
ret
align.branch
.return:
mov rax, r8
sub rax, r9
ret
align.branch
.return_add2:
lea rax, [r8+2]
sub rax, r9
ret
align.branch
.return_add3:
lea rax, [r8+3]
sub rax, r9
ret
align.branch
.return_add4:
lea rax, [r8+4]
sub rax, r9
ret
align.branch
.return_add5:
lea rax, [r8+5]
sub rax, r9
ret
align.branch
.return_add6:
lea rax, [r8+6]
sub rax, r9
ret
.size = $ - str_length
And the macros used in this source code:
macro align.function { align 32 }
macro align.branch { align 16 }
macro align.branch32 { align 32 }
This function is intended as a high-end (non-SIMD) solution. SIMD versions will be available in my library soon (9 functions exist for the string-length operation alone); the library will be released by the end of June 2023.
str_length
str_length_sse2
str_length_avx
str_length_avx2
str_length_avx512bw
str_length_long_sse2
str_length_long_avx
str_length_long_avx2
str_length_long_avx512bw
Test results (based on your parameters and your C test program):
string length => 336
loop execution times => 10000000
total execution time => 0.430173
Yes, even faster than the one generated by GCC (0.522015). You will get the same result for an unaligned string too.
Also, there is no 'JCC Erratum' problem in my code. Here is the hex dump of my function so you can check it:
49 89 c8 f6 c1 07 0f 84 c4 00 00 00 31 c0 38 01
74 3e 38 41 01 75 09 b0 01 c3 90 90 90 90 90 90
38 41 02 74 3b 38 41 03 74 46 38 41 04 74 51 38
41 05 74 5c 38 41 06 74 67 38 41 07 74 72 4c 8d
41 07 49 83 e0 f8 e9 85 00 00 00 90 90 90 90 90
c3 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
b8 02 00 00 00 c3 90 90 90 90 90 90 90 90 90 90
b8 03 00 00 00 c3 90 90 90 90 90 90 90 90 90 90
b8 04 00 00 00 c3 90 90 90 90 90 90 90 90 90 90
b8 05 00 00 00 c3 90 90 90 90 90 90 90 90 90 90
b8 06 00 00 00 c3 90 90 90 90 90 90 90 90 90 90
b8 07 00 00 00 c3 90 90 90 90 90 90 90 90 90 90
49 8d 40 07 4c 29 c8 c3 90 90 90 90 90 90 90 90
49 89 c9 b9 00 00 ff 00 ba 00 00 00 ff eb 21 90
85 c8 74 4c 85 d0 74 58 48 c1 e8 20 84 c0 74 60
84 e4 74 6c 85 c8 74 78 85 d0 74 c4 49 83 c0 08
49 8b 00 84 c0 74 19 84 e4 75 d5 49 8d 40 01 4c
29 c8 c3 90 90 90 90 90 90 90 90 90 90 90 90 90
4c 89 c0 4c 29 c8 c3 90 90 90 90 90 90 90 90 90
49 8d 40 02 4c 29 c8 c3 90 90 90 90 90 90 90 90
49 8d 40 03 4c 29 c8 c3 90 90 90 90 90 90 90 90
49 8d 40 04 4c 29 c8 c3 90 90 90 90 90 90 90 90
49 8d 40 05 4c 29 c8 c3 90 90 90 90 90 90 90 90
49 8d 40 06 4c 29 c8 c3
Warning: Note that FASM emits a run of single-byte NOPs for the 'align' directive (instead of one long NOP), so don't use this directive where execution can fall through into the padding, that is, where there is no jmp (or ret) right above it.
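The number of padding bytes is easy to work out from the current offset. Here is a small sketch in C (the helper name `align_padding` is made up for illustration) that reproduces the NOP runs visible in the hex dump above.

```c
#include <assert.h>
#include <stdint.h>

/* Number of padding bytes needed to bring `offset` up to the next
   multiple of `alignment`, which must be a power of two. */
static unsigned align_padding(uint64_t offset, unsigned alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    unsigned mask = alignment - 1;
    return (alignment - (unsigned)(offset & mask)) & mask;
}
```

In the dump, the ret at offset 0x19 is followed by padding starting at 0x1a; a 32-byte alignment there needs 6 bytes, which matches the six 0x90 bytes before the code at 0x20. Likewise, the jmp ending at 0x4a is followed by 5 NOPs to reach the 16-byte boundary at 0x50. Both runs sit behind a ret or jmp, so they are never executed.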
Warning: For the sake of old CPUs, keep your jump bodies short and use registers instead of immediates. And always handle the 'JCC Erratum' (you lose 1.3x performance because of it).
With best-regards.
Answer 2
Score: 2
I changed my code from:
.loop: mov rax, QWORD [rcx]
test al, al
jz SHORT .end
test ah, ah
jz SHORT .end.1
test eax, 0x00ff0000
jz SHORT .end.2
test eax, 0xff000000
jz SHORT .end.3
shr rax, 32
test al, al
jz SHORT .end.4
test ah, ah
jz SHORT .end.5
test eax, 0x00ff0000
jz SHORT .end.6
test eax, 0xff000000
jz SHORT .end.7
add rcx, 8
jmp SHORT .loop
To (first, we jump to the '.loop' label):
.loop.continue:
test ah, ah
jz SHORT .end1
test eax, 0x00ff0000
jz SHORT .end2
test eax, 0xff000000
jz SHORT .end3
shr rax, 32
test al, al
jz SHORT .end4
test ah, ah
jz SHORT .end5
test eax, 0x00ff0000
jz SHORT .end6
test eax, 0xff000000
jz .end7
lea rcx, [rcx+8]
.loop: mov rax, QWORD [rcx]
test al, al
jnz SHORT .loop.continue
mov rax, rcx
sub rax, rdx
ret
And even with the 'JCC Erratum' problem, I get an amazing result (0.532015).
There was something wrong with my loop. In the first version, we entered the loop at the top, loaded a QWORD, and searched it for 0x00; at the end of the loop, 8 was added to rcx (the string's memory address), and we had to jump back to the top of the loop with an unconditional jmp.
In the fixed version, we jump to the end of the loop and handle the first check there, then jump back up to handle the others. Doing this fixes the speed problem!
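The two loop shapes can be sketched in C with goto, purely to illustrate the control flow. It works byte by byte for brevity, the function names are hypothetical, and an optimizing compiler is free to lay out both versions however it likes.

```c
#include <assert.h>
#include <stddef.h>

/* Original shape: test at the top, unconditional jump at the bottom.
   Every iteration executes a conditional branch plus a jmp. */
static size_t len_top_tested(const char *s)
{
    assert(s != NULL);
    const char *p = s;
top:
    if (*p == '\0')
        goto done;
    p++;
    goto top;              /* extra unconditional jump on every iteration */
done:
    return (size_t)(p - s);
}

/* Rotated shape: enter at the bottom of the loop. The conditional
   back-edge is the only taken jump per iteration. */
static size_t len_rotated(const char *s)
{
    assert(s != NULL);
    const char *p = s;
    goto check;            /* one-time jump into the loop tail */
body:
    p++;
check:
    if (*p != '\0')
        goto body;         /* single conditional back-edge */
    return (size_t)(p - s);
}
```

Both functions return the same length; the rotated one mirrors the '.loop.continue' / '.loop' layout above, where the first check sits at the bottom and doubles as the loop branch.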
UPDATED
I just tried to make the loop body smaller (in size, like my first code), and the result was amazing:
strl:
push rdi
push rsi
mov rdi, rcx
mov rsi, rcx
mov ecx, 0x00ff0000
mov edx, 0xff000000
mov r8, 0x000000ff00000000
mov r9, 0x0000ff0000000000
mov r10, 0x00ff000000000000
mov r11, 0xff00000000000000
test dil, 7
jz @f
; handle unaligned
align 32
@@: mov rax, QWORD [rdi]
test al, al
jz SHORT .end
test ah, ah
jz SHORT .end1
test eax, ecx
jz SHORT .end2
test eax, edx
jz SHORT .end3
test rax, r8
jz SHORT .end4
test rax, r9
jz SHORT .end5
test rax, r10
jz SHORT .end6
test rax, r11
jz SHORT .end7
add rdi, 8
jmp @b
align 16
.end: mov rax, rdi
sub rax, rsi
pop rsi
pop rdi
ret
align 16
.end1: lea rax, [rdi+1]
sub rax, rsi
pop rsi
pop rdi
ret
align 16
.end2: lea rax, [rdi+2]
sub rax, rsi
pop rsi
pop rdi
ret
align 16
.end3: lea rax, [rdi+3]
sub rax, rsi
pop rsi
pop rdi
ret
align 16
.end4: lea rax, [rdi+4]
sub rax, rsi
pop rsi
pop rdi
ret
align 16
.end5: lea rax, [rdi+5]
sub rax, rsi
pop rsi
pop rdi
ret
align 16
.end6: lea rax, [rdi+6]
sub rax, rsi
pop rsi
pop rdi
ret
align 16
.end7: lea rax, [rdi+7]
sub rax, rsi
pop rsi
pop rdi
ret