Assembly handwritten function slower than GCC compiled function

Question

I decided to write a string-length function in assembly (using FASM).
My function takes a string, whether or not it is aligned to 8 bytes, and checks its alignment. If the string is aligned, the main loop starts right away. Otherwise, the first 8 characters are checked one by one, then the pointer is rounded up to the next 8-byte boundary and the main loop continues from there.
There is no "end of the memory page" problem: every 8-byte load in the main loop is 8-byte aligned, and an aligned 8-byte load can never cross a page boundary.
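
The page-boundary argument can be checked with a little address arithmetic. The following is a minimal sketch in C, not part of the original code: it assumes 4096-byte pages (the usual x86-64 page size), and the names `SKETCH_PAGE_SIZE`, `align_up8` and `crosses_page` are made up for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumption: 4 KiB pages */

/* Round an address up to the next multiple of 8
   (the same arithmetic as "lea rcx, [rcx+7]" followed by "and rcx, -8"). */
uint64_t align_up8(uint64_t addr) {
    return (addr + 7u) & ~(uint64_t)7u;
}

/* True if a load of `size` bytes starting at `addr` touches two pages. */
bool crosses_page(uint64_t addr, uint64_t size) {
    return (addr / SKETCH_PAGE_SIZE) != ((addr + size - 1u) / SKETCH_PAGE_SIZE);
}
```

For example, an 8-byte load at 0x1ff8 ends at 0x1fff and stays inside one page, while the same load at 0x1ff9 would spill into the next page. That is why the unaligned prologue has to go byte by byte and stop at the first terminator.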

The problem is that I also implemented a C version and compiled it, so now I have two pieces of assembly: the one I wrote by hand and the one GCC generated from the C code. The C version is up to 1.5x faster than my handwritten assembly! As far as I can tell my code is fine: I even aligned the jump targets to 16 bytes, and no NOP is ever executed (except the padding between .align8 and .loop, which is outside the loop and should not matter).
I can't figure out why my pure assembly code is 1.5x slower than the GCC version.

My assembly source code:

        align 16
    slen:
        mov r8, rcx                 ; r8 = start of the string
        test cl, 7                  ; already 8-byte aligned?
        jz .loop
        xor eax, eax                ; al = 0 for the byte compares below
        cmp BYTE [rcx], al
        je SHORT .ret
        cmp BYTE [rcx+1], al
        je SHORT .ret1
        cmp BYTE [rcx+2], al
        je SHORT .ret2
        cmp BYTE [rcx+3], al
        je SHORT .ret3
        cmp BYTE [rcx+4], al
        je SHORT .ret4
        cmp BYTE [rcx+5], al
        je SHORT .ret5
        cmp BYTE [rcx+6], al
        je SHORT .ret6
        cmp BYTE [rcx+7], al
        jne SHORT .align8
        mov al, 7
        ret
        align 16
    .ret:
        ret
        align 16
    .ret1:
        mov al, 1
        ret
        align 16
    .ret2:
        mov al, 2
        ret
        align 16
    .ret3:
        mov al, 3
        ret
        align 16
    .ret4:
        mov al, 4
        ret
        align 16
    .ret5:
        mov al, 5
        ret
        align 16
    .ret6:
        mov al, 6
        ret
        align 16
    .align8:
        lea rcx, [rcx+7]            ; round rcx up to the next 8-byte boundary
        and rcx, (-8)
        align 16
    .loop:
        mov rax, QWORD [rcx]        ; aligned 8-byte load
        test al, al
        jz SHORT .end
        test ah, ah
        jz SHORT .end.1
        test eax, 0x00ff0000
        jz SHORT .end.2
        test eax, 0xff000000
        jz SHORT .end.3
        shr rax, 32                 ; move the upper 4 bytes down
        test al, al
        jz SHORT .end.4
        test ah, ah
        jz SHORT .end.5
        test eax, 0x00ff0000
        jz SHORT .end.6
        test eax, 0xff000000
        jz SHORT .end.7
        add rcx, 8
        jmp SHORT .loop             ; unconditional jump back to the top
        align 16
    .end:
        mov rax, rcx
        sub rax, r8                 ; length = current address - start
        ret
        align 16
    .end.1:
        lea rax, [rcx+1]
        sub rax, r8
        ret
    .end.2:
        lea rax, [rcx+2]
        sub rax, r8
        ret
    .end.3:
        lea rax, [rcx+3]
        sub rax, r8
        ret
    .end.4:
        lea rax, [rcx+4]
        sub rax, r8
        ret
    .end.5:
        lea rax, [rcx+5]
        sub rax, r8
        ret
    .end.6:
        lea rax, [rcx+6]
        sub rax, r8
        ret
    .end.7:
        lea rax, [rcx+7]
        sub rax, r8
        ret

The GCC version:

        align 16
    slen:
        test cl, 7
        je .L18
        xor eax, eax
        cmp BYTE [rcx], 0
        je .L1
        cmp BYTE [rcx+1], 0
        mov eax, 1
        je .L1
        cmp BYTE [rcx+2], 0
        mov eax, 2
        je .L1
        cmp BYTE [rcx+3], 0
        mov eax, 3
        je .L1
        cmp BYTE [rcx+4], 0
        mov eax, 4
        je .L1
        cmp BYTE [rcx+5], 0
        mov eax, 5
        je .L1
        cmp BYTE [rcx+6], 0
        mov eax, 6
        je .L1
        cmp BYTE [rcx+7], 0
        mov eax, 7
        je .L1
        lea rax, [rcx+7]
        and rax, -8
        jmp .L47
        align 16
    .L18:
        mov rax, rcx
        jmp .L47
        align 16
    .L40:
        test dh, dh
        je .L49
        test edx, 16711680
        je .L50
        test edx, 4278190080
        je .L51
        shr rdx, 32
        test dl, dl
        je .L52
        test dh, dh
        je .L53
        test edx, 16711680
        je .L54
        test edx, 4278190080
        je .L55
        add rax, 8
    .L47:
        mov rdx, QWORD [rax]
        test dl, dl
        jne .L40
        sub eax, ecx
    .L1:
        ret
        align 16
    .L49:
        sub rax, rcx
        add eax, 1
        ret
        align 16
    .L50:
        sub rax, rcx
        add eax, 2
        ret
        align 16
    .L51:
        sub rax, rcx
        add eax, 3
        ret
        align 16
    .L52:
        sub rax, rcx
        add eax, 4
        ret
        align 16
    .L53:
        sub rax, rcx
        add eax, 5
        ret
        align 16
    .L54:
        sub rax, rcx
        add eax, 6
        ret
        align 16
    .L55:
        sub rax, rcx
        add eax, 7
        ret

My function's test result:

    string length => 336
    loop execution times => 10000000
    total execution time => 0.772015

GCC function's test result:

    string length => 336
    loop execution times => 10000000
    total execution time => 0.522015

What is the problem? Why is my function 1.5x slower when everything looks fine?
My test string is aligned to 8 bytes, so you can ignore the byte-by-byte prologue and the alignment step.

Is something wrong with my label alignment, or does the problem come from somewhere else?

ABI: x64 (Windows)

CPU (test machine): i7-7800X

My C test application source code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <windows.h>

    unsigned int
    slen_by_me(const char *);
    unsigned int
    slen_gcc(const char *);

    int main() {
        /* 42 x 8 = 336 'W' characters (adjacent literals are concatenated) */
        static const char *str =
            "WWWWWWWW" "WWWWWWWW" "WWWWWWWW" "WWWWWWWW" "WWWWWWWW" "WWWWWWWW" "WWWWWWWW"
            "WWWWWWWW" "WWWWWWWW" "WWWWWWWW" "WWWWWWWW" "WWWWWWWW" "WWWWWWWW" "WWWWWWWW"
            "WWWWWWWW" "WWWWWWWW" "WWWWWWWW" "WWWWWWWW" "WWWWWWWW" "WWWWWWWW" "WWWWWWWW"
            "WWWWWWWW" "WWWWWWWW" "WWWWWWWW" "WWWWWWWW" "WWWWWWWW" "WWWWWWWW" "WWWWWWWW"
            "WWWWWWWW" "WWWWWWWW" "WWWWWWWW" "WWWWWWWW" "WWWWWWWW" "WWWWWWWW" "WWWWWWWW"
            "WWWWWWWW" "WWWWWWWW" "WWWWWWWW" "WWWWWWWW" "WWWWWWWW" "WWWWWWWW" "WWWWWWWW";
        LARGE_INTEGER frequency;
        LARGE_INTEGER start;
        LARGE_INTEGER end;
        double interval;
        unsigned int l = 0;

        QueryPerformanceFrequency(&frequency);
        QueryPerformanceCounter(&start);
        for (int i = 0; i < 10000000; i++) {
            l += slen_gcc(str);    /* swap in slen_by_me to time the other function */
        }
        QueryPerformanceCounter(&end);
        /* elapsed time in seconds */
        interval = (double) (end.QuadPart - start.QuadPart) / frequency.QuadPart;
        printf("%f\n%u\n", interval, l);
        return 0;
    }

The FASM source that builds my object file (it contains both slen functions and is linked with the C tester):

    format MS64 COFF
    public slen_gcc
    public slen_by_me
    section '.text' code readable executable align 64
        align 16
    slen_gcc:
        test cl, 7
        je .L18
        xor eax, eax
        cmp BYTE [rcx], 0
        je .L1
        cmp BYTE [rcx+1], 0
        mov eax, 1
        je .L1
        cmp BYTE [rcx+2], 0
        mov eax, 2
        je .L1
        cmp BYTE [rcx+3], 0
        mov eax, 3
        je .L1
        cmp BYTE [rcx+4], 0
        mov eax, 4
        je .L1
        cmp BYTE [rcx+5], 0
        mov eax, 5
        je .L1
        cmp BYTE [rcx+6], 0
        mov eax, 6
        je .L1
        cmp BYTE [rcx+7], 0
        mov eax, 7
        je .L1
        lea rax, [rcx+7]
        and rax, -8
        jmp .L47
        align 16
    .L18:
        mov rax, rcx
        jmp .L47
        align 16
    .L40:
        test dh, dh
        je .L49
        test edx, 16711680
        je .L50
        test edx, 4278190080
        je .L51
        shr rdx, 32
        test dl, dl
        je .L52
        test dh, dh
        je .L53
        test edx, 16711680
        je .L54
        test edx, 4278190080
        je .L55
        add rax, 8
    .L47:
        mov rdx, QWORD [rax]
        test dl, dl
        jne .L40
        sub eax, ecx
    .L1:
        ret
        align 16
    .L49:
        sub rax, rcx
        add eax, 1
        ret
        align 16
    .L50:
        sub rax, rcx
        add eax, 2
        ret
        align 16
    .L51:
        sub rax, rcx
        add eax, 3
        ret
        align 16
    .L52:
        sub rax, rcx
        add eax, 4
        ret
        align 16
    .L53:
        sub rax, rcx
        add eax, 5
        ret
        align 16
    .L54:
        sub rax, rcx
        add eax, 6
        ret
        align 16
    .L55:
        sub rax, rcx
        add eax, 7
        ret
        align 16
    slen_by_me:
        mov r8, rcx
        test cl, 7
        jz .loop
        xor eax, eax
        cmp BYTE [rcx], al
        je SHORT .ret
        cmp BYTE [rcx+1], al
        je SHORT .ret1
        cmp BYTE [rcx+2], al
        je SHORT .ret2
        cmp BYTE [rcx+3], al
        je SHORT .ret3
        cmp BYTE [rcx+4], al
        je SHORT .ret4
        cmp BYTE [rcx+5], al
        je SHORT .ret5
        cmp BYTE [rcx+6], al
        je SHORT .ret6
        cmp BYTE [rcx+7], al
        jne SHORT .align8
        mov al, 7
        ret
        align 16
    .ret:
        ret
        align 16
    .ret1:
        mov al, 1
        ret
        align 16
    .ret2:
        mov al, 2
        ret
        align 16
    .ret3:
        mov al, 3
        ret
        align 16
    .ret4:
        mov al, 4
        ret
        align 16
    .ret5:
        mov al, 5
        ret
        align 16
    .ret6:
        mov al, 6
        ret
        align 16
    .align8:
        lea rcx, [rcx+7]
        and rcx, (-8)
        align 16
    .loop:
        mov rax, QWORD [rcx]
        test al, al
        jz SHORT .end
        test ah, ah
        jz SHORT .end.1
        test eax, 0x00ff0000
        jz SHORT .end.2
        test eax, 0xff000000
        jz SHORT .end.3
        shr rax, 32
        test al, al
        jz SHORT .end.4
        test ah, ah
        jz SHORT .end.5
        test eax, 0x00ff0000
        jz SHORT .end.6
        test eax, 0xff000000
        jz SHORT .end.7
        add rcx, 8
        jmp SHORT .loop
        align 16
    .end:
        mov rax, rcx
        sub rax, r8
        ret
        align 16
    .end.1:
        lea rax, [rcx+1]
        sub rax, r8
        ret
    .end.2:
        lea rax, [rcx+2]
        sub rax, r8
        ret
    .end.3:
        lea rax, [rcx+3]
        sub rax, r8
        ret
    .end.4:
        lea rax, [rcx+4]
        sub rax, r8
        ret
    .end.5:
        lea rax, [rcx+5]
        sub rax, r8
        ret
    .end.6:
        lea rax, [rcx+6]
        sub rax, r8
        ret
    .end.7:
        lea rax, [rcx+7]
        sub rax, r8
        ret

And the C version of slen:

    int
    slen(const char *str) {
        const char *start = str;
        if (((unsigned long long)str & 7) != 0) {
            /* unaligned: check the first 8 bytes one by one */
            if (str[0] == 0x00)
                return 0;
            if (str[1] == 0x00)
                return 1;
            if (str[2] == 0x00)
                return 2;
            if (str[3] == 0x00)
                return 3;
            if (str[4] == 0x00)
                return 4;
            if (str[5] == 0x00)
                return 5;
            if (str[6] == 0x00)
                return 6;
            if (str[7] == 0x00)
                return 7;
            /* round up to the next 8-byte boundary */
            str = (const char *)(((unsigned long long)str + 7) & (-8));
        }
        do {
            /* aligned 8-byte load; bytes after the terminator in this word are read too */
            unsigned long long bytes = (*(unsigned long long*)(str));
            if ((unsigned char)bytes == 0x00)
                return (int)(str - start);
            if ((bytes & 0x0000ff00) == 0)
                return (int)(str - start + 1);
            if ((bytes & 0x00ff0000) == 0)
                return (int)(str - start + 2);
            if ((bytes & 0xff000000) == 0)
                return (int)(str - start + 3);
            bytes >>= 32;
            if ((unsigned char)bytes == 0x00)
                return (int)(str - start + 4);
            if ((bytes & 0x0000ff00) == 0)
                return (int)(str - start + 5);
            if ((bytes & 0x00ff0000) == 0)
                return (int)(str - start + 6);
            if ((bytes & 0xff000000) == 0)
                return (int)(str - start + 7);
            str += 8;
        } while (1);
    }

Answer 1

Score: 3

Allow me to refer you to a function from my pure-assembly library (coming soon).
Your question is about strlen, which is named "str_length" in my library and is implemented for both the Microsoft x64 ABI and the System V AMD64 ABI.

I remember that, a few years ago, a C/C++ function of this kind was going around:

    size_t my_strlen(const char *s) {
        size_t len = 0;
        for(;;) {
            unsigned x = *(unsigned*)s;
            if((x & 0xFF) == 0) return len;
            if((x & 0xFF00) == 0) return len + 1;
            if((x & 0xFF0000) == 0) return len + 2;
            if((x & 0xFF000000) == 0) return len + 3;
            s += 4, len += 4;
        }
    }

It was even called "FAST strlen", although it is really not that fast. So I decided to write my own "FAST strlen" in assembly.

On x86-64 you can load an 8-byte chunk into a 64-bit register, so why load only 4 bytes at a time, as 'size_t my_strlen(const char *s)' does?
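
To make the 8-byte idea concrete in C, here is a minimal sketch. It is not the method used in this answer's library: instead of one branch per byte it uses a widely known bit trick to detect a zero byte in a 64-bit word. The names `zero_byte_mask` and `strlen_aligned8` are hypothetical, and the sketch assumes an 8-byte-aligned pointer, little-endian byte order (as on x86-64), and that reading the rest of the aligned word after the terminator is acceptable, just like the assembly versions do.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero iff some byte of v is 0x00; the lowest set bit marks the first one. */
uint64_t zero_byte_mask(uint64_t v) {
    return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}

/* Assumes s is 8-byte aligned and the machine is little-endian. */
size_t strlen_aligned8(const char *s) {
    const char *p = s;
    for (;;) {
        uint64_t word;
        memcpy(&word, p, sizeof word);      /* one 8-byte load */
        uint64_t mask = zero_byte_mask(word);
        if (mask != 0) {
            size_t i = 0;
            while ((mask & 0x80u) == 0) {   /* walk to the first marked byte */
                mask >>= 8;
                ++i;
            }
            return (size_t)(p - s) + i;
        }
        p += 8;
    }
}
```

The point of the design is that the common case (no terminator in the word) costs a single test and branch per 8 bytes, and the byte position is only worked out once, after the terminator has been found.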

JCC Erratum

About the 'JCC Erratum': there are still a great many Skylake CPUs in the world, and by world I mean data centers (look at Hetzner and you will find plenty of Skylake and older CPUs). Handling it is not optional; you MUST take care of it. But it is very important to handle it without adding NOPs or even prefixes, because doing that creates new problems on other CPUs. Instead, restructure the code: create small new branches and move some code into a fresh 32-byte chunk (but don't make it too heavy).
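
As a rough aid for checking a listing by hand, the condition can be written as a small predicate. This is a sketch based on Intel's published description of the erratum (a jump, including a macro-fused compare-and-jump pair, is affected when it crosses a 32-byte boundary or ends exactly on one); the function name `jcc_erratum_hit` is made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if an instruction (or macro-fused cmp/test + jcc pair) occupying
   the bytes [start, start + len) crosses a 32-byte boundary or ends
   exactly on one. `start` is the offset from a 32-byte-aligned base. */
bool jcc_erratum_hit(uint64_t start, uint64_t len) {
    uint64_t end = start + len;                 /* first byte after it */
    bool crosses = (start / 32u) != ((end - 1u) / 32u);
    bool ends_on_boundary = (end % 32u) == 0;
    return crosses || ends_on_boundary;
}
```

For instance, a 4-byte compare-and-jump pair at offset 0x0e occupies 0x0e to 0x11 and is fine, while the same pair at 0x1e would straddle the boundary at 0x20, and at 0x1c it would end exactly on it.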

TAKE CARE OF LOOP TAIL JUMP

Another subject is the jump at the tail of the loop. Again, you MUST give your loop a tail and enter the loop with an unconditional jmp to that tail, so that the loop body does not need an unconditional jmp back to the top on every iteration (this is about branch prediction; read Agner Fog's documents on the subject). Also, as Mr. Peter Cordes mentioned, if you look at how GCC lays out its jumps, you will find the solution for building loops.

TAKE CARE OF BRANCH ALIGNMENT

Yes, take care of branch alignment (16-byte boundaries), especially for the targets you jump to many times. You already handled this well.

GCC, REALLY?! WHY NOT MACRO-FUSED?

You (the asker) did the right thing here. You used a register for the compares in the unaligned path, so you get the benefit of macro-fusion. The code generated by GCC uses cmp BYTE PTR [rcx], 0 instead, and a compare of memory against an immediate cannot macro-fuse with the jump that follows it, so GCC's code loses that benefit.
Of course, GCC did this to handle the padding, but it's really not acceptable.

An example of this situation in the uiCA tool:

    0000000000000000 <.text>:
       0:  80 39 00             cmp    BYTE PTR [rcx],0x0
       3:  0f 84 00 00 00 00    je     0x9
       9:  38 01                cmp    BYTE PTR [rcx],al
       b:  0f 84 00 00 00 00    je     0x11

The second cmp/je pair gets the M flag, which stands for 'Macro-fused with previous instruction'.

> Macro Fusion is restricted to 16-bit and 32-bit mode only (including
> 32-bit compatibility sub-mode in x86-64). CMP and TEST can fuse when
> comparing:
>
> REG-REG. (e.g, CMP EAX,ECX; JZ label)
> REG-IMM. (e.g., CMP EAX,0x80; JZ label)
> REG-MEM. (e.g., CMP EAX,[ECX]; JZ label)
> MEM-REG. (e.g., CMP [EAX],ECX; JZ label)
>
> CMP and TEST can not be fused when comparing MEM-IMM (e.g. CMP
> [EAX],0x80; JZ label)

Finally, here is the function and its performance test, as you asked. This is the version for the Microsoft x64 ABI.

    ; libASM, independent standard libraries in Assembly (programming-language).
    ; For more information, please visit the libASM website (www.libasm.com).
    ; Copyright (C) 2023 Mr. Alireza Saeidipour. All rights reserved.
    ; Published by SOURCEBRING, under its international legal terms and conditions.
    ; For more information, please visit the SOURCEBRING website (www.sourcebring.com).
    ; FAILURE GUARANTEES SUCCESS
    ; Alireza Saeidipour
        align.function
    str_length:
        mov r8, rcx                 ; r8 = current (aligned) read pointer
        test cl, 7
        jz @f
        xor eax, eax
        cmp BYTE [rcx], al
        je SHORT .len0
        cmp BYTE [rcx+1], al
        jne SHORT .unaligned_continue
        mov al, 1
        ret
        align.branch32
    .unaligned_continue:
        cmp BYTE [rcx+2], al
        je SHORT .len2
        cmp BYTE [rcx+3], al
        je SHORT .len3
        cmp BYTE [rcx+4], al
        je SHORT .len4
        cmp BYTE [rcx+5], al
        je SHORT .len5
        cmp BYTE [rcx+6], al
        je .len6
        cmp BYTE [rcx+7], al
        je .len7
        lea r8, [rcx+7]             ; round up to the next 8-byte boundary
        and r8, (-8)
        jmp @f
        align.branch
    .len0:
        ret
        align.branch
    .len2:
        mov eax, 2
        ret
        align.branch
    .len3:
        mov eax, 3
        ret
        align.branch
    .len4:
        mov eax, 4
        ret
        align.branch
    .len5:
        mov eax, 5
        ret
        align.branch
    .len6:
        mov eax, 6
        ret
        align.branch
    .len7:
        mov eax, 7
        ret
        align.branch
    .return_add7:
        lea rax, [r8+7]
        sub rax, r9
        ret
        align.branch
    @@:
        mov r9, rcx                 ; r9 = start of the string
        mov ecx, 0x00ff0000         ; byte masks kept in registers
        mov edx, 0xff000000
        jmp SHORT @f                ; enter the loop at its tail
        align.branch32
    .loop:
        test eax, ecx
        jz SHORT .return_add2
        test eax, edx
        jz SHORT .return_add3
        shr rax, 32
        test al, al
        jz SHORT .return_add4
        test ah, ah
        jz SHORT .return_add5
        test eax, ecx
        jz SHORT .return_add6
        test eax, edx
        jz SHORT .return_add7
        add r8, 8
    @@:
        mov rax, QWORD [r8]         ; loop tail: load and first two byte checks
        test al, al
        jz SHORT .return
        test ah, ah
        jnz SHORT .loop
        lea rax, [r8+1]
        sub rax, r9
        ret
        align.branch
    .return:
        mov rax, r8
        sub rax, r9
        ret
        align.branch
    .return_add2:
        lea rax, [r8+2]
        sub rax, r9
        ret
        align.branch
    .return_add3:
        lea rax, [r8+3]
        sub rax, r9
        ret
        align.branch
    .return_add4:
        lea rax, [r8+4]
        sub rax, r9
        ret
        align.branch
    .return_add5:
        lea rax, [r8+5]
        sub rax, r9
        ret
        align.branch
    .return_add6:
        lea rax, [r8+6]
        sub rax, r9
        ret
    .size = $ - str_length

And the macros used in this source code:

    macro align.function { align 32 }
    macro align.branch { align 16 }
    macro align.branch32 { align 32 }

This function is meant as a high-end non-SIMD solution. You will find SIMD versions in my library soon (nine functions exist for the string-length operation alone); the library will be released by the end of June 2023.

  1. str_length
  2. str_length_sse2
  3. str_length_avx
  4. str_length_avx2
  5. str_length_avx512bw
  6. str_length_long_sse2
  7. str_length_long_avx
  8. str_length_long_avx2
  9. str_length_long_avx512bw

Test results (based on your parameters and your C test program):

    string length => 336
    loop execution times => 10000000
    total execution time => 0.430173

Yes, that is even faster than the code generated by GCC (0.522015). You will get the same result for an unaligned string too.
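
To put the three timings reported in this thread side by side, the arithmetic can be sketched as follows. The helper names `ns_per_call` and `speedup` are made up for this example, and the totals are assumed to be in seconds, as computed by the C tester.

```c
/* Average time per call in nanoseconds, given a total in seconds. */
double ns_per_call(double total_seconds, long calls) {
    return total_seconds * 1e9 / (double)calls;
}

/* How many times faster the `fast` run is than the `slow` run. */
double speedup(double slow_seconds, double fast_seconds) {
    return slow_seconds / fast_seconds;
}
```

With 10000000 calls, the totals 0.772015, 0.522015 and 0.430173 work out to roughly 77 ns, 52 ns and 43 ns per call on the 336-byte string. That is about 1.21x faster than the GCC code and about 1.79x faster than the original handwritten loop, assuming the runs are comparable.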

Also, there is no 'JCC Erratum' problem in my code. Here are the bytes of my function in hex so you can check:

    49 89 c8 f6 c1 07 0f 84 c4 00 00 00 31 c0 38 01
    74 3e 38 41 01 75 09 b0 01 c3 90 90 90 90 90 90
    38 41 02 74 3b 38 41 03 74 46 38 41 04 74 51 38
    41 05 74 5c 38 41 06 74 67 38 41 07 74 72 4c 8d
    41 07 49 83 e0 f8 e9 85 00 00 00 90 90 90 90 90
    c3 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
    b8 02 00 00 00 c3 90 90 90 90 90 90 90 90 90 90
    b8 03 00 00 00 c3 90 90 90 90 90 90 90 90 90 90
    b8 04 00 00 00 c3 90 90 90 90 90 90 90 90 90 90
    b8 05 00 00 00 c3 90 90 90 90 90 90 90 90 90 90
    b8 06 00 00 00 c3 90 90 90 90 90 90 90 90 90 90
    b8 07 00 00 00 c3 90 90 90 90 90 90 90 90 90 90
    49 8d 40 07 4c 29 c8 c3 90 90 90 90 90 90 90 90
    49 89 c9 b9 00 00 ff 00 ba 00 00 00 ff eb 21 90
    85 c8 74 4c 85 d0 74 58 48 c1 e8 20 84 c0 74 60
    84 e4 74 6c 85 c8 74 78 85 d0 74 c4 49 83 c0 08
    49 8b 00 84 c0 74 19 84 e4 75 d5 49 8d 40 01 4c
    29 c8 c3 90 90 90 90 90 90 90 90 90 90 90 90 90
    4c 89 c0 4c 29 c8 c3 90 90 90 90 90 90 90 90 90
    49 8d 40 02 4c 29 c8 c3 90 90 90 90 90 90 90 90
    49 8d 40 03 4c 29 c8 c3 90 90 90 90 90 90 90 90
    49 8d 40 04 4c 29 c8 c3 90 90 90 90 90 90 90 90
    49 8d 40 05 4c 29 c8 c3 90 90 90 90 90 90 90 90
    49 8d 40 06 4c 29 c8 c3

Warning: Please note that FASM fills the gap for the 'align' directive with many single-byte NOPs (instead of using long NOPs), so don't use this directive where execution can fall straight through into the padding, i.e. where there is no jmp (or ret) right above it.
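
How many padding bytes an 'align' has to emit is simple arithmetic. A minimal sketch in C, with the hypothetical helper name `align_padding`:

```c
#include <stdint.h>

/* Bytes of padding needed to move `offset` up to the next multiple of
   `alignment` (which must be a power of two); 0 if already aligned. */
uint64_t align_padding(uint64_t offset, uint64_t alignment) {
    return (alignment - (offset & (alignment - 1u))) & (alignment - 1u);
}
```

This matches the hex dump of str_length: the ret at offset 0x19 is followed by six 0x90 bytes up to the 32-byte boundary at 0x20, and the jmp that ends at offset 0x4a is followed by five 0x90 bytes up to 0x50. Those NOPs are harmless only because they sit right after a ret or jmp and are never executed.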

Warning: For the sake of older CPUs, keep the bodies between your jumps short and use registers instead of immediates. And always handle the 'JCC Erratum' (it can cost you about 1.3x in performance).

With best regards.

Answer 2

Score: 2

I changed my code from:

    .loop:
        mov rax, QWORD [rcx]
        test al, al
        jz SHORT .end
        test ah, ah
        jz SHORT .end.1
        test eax, 0x00ff0000
        jz SHORT .end.2
        test eax, 0xff000000
        jz SHORT .end.3
        shr rax, 32
        test al, al
        jz SHORT .end.4
        test ah, ah
        jz SHORT .end.5
        test eax, 0x00ff0000
        jz SHORT .end.6
        test eax, 0xff000000
        jz SHORT .end.7
        add rcx, 8
        jmp SHORT .loop

To (first, we jump to the '.loop' label):

    .loop.continue:
        test ah, ah
        jz SHORT .end1
        test eax, 0x00ff0000
        jz SHORT .end2
        test eax, 0xff000000
        jz SHORT .end3
        shr rax, 32
        test al, al
        jz SHORT .end4
        test ah, ah
        jz SHORT .end5
        test eax, 0x00ff0000
        jz SHORT .end6
        test eax, 0xff000000
        jz .end7
        lea rcx, [rcx+8]
    .loop:
        mov rax, QWORD [rcx]
        test al, al
        jnz SHORT .loop.continue
        mov rax, rcx
        sub rax, rdx
        ret

Even with the 'JCC Erratum' problem still present, I get an amazing result (0.532015).
There was something wrong with the shape of my loop. In the first version, each iteration loaded a QWORD at the top, searched it for 0x00, added 8 to rcx (the string address) at the bottom, and then had to take an unconditional jmp back to the top.

In the new version, we first jump to the bottom of the loop, where the load and the first check live, and then branch back up to handle the remaining bytes. A single conditional jnz now closes the loop, and this fixed the speed problem!
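
The shape of the rotated loop can be sketched in C with goto, mirroring the labels above. This only illustrates the control flow: a C compiler is free to lay the code out differently, the function name `slen_rotated` is made up, and the sketch assumes an 8-byte-aligned pointer and little-endian byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Rotated loop: enter at the load + first-byte check placed at the bottom. */
size_t slen_rotated(const char *s) {
    const char *p = s;
    uint64_t w = 0;
    goto check;                          /* first, jump to the ".loop" label */
body:                                    /* ".loop.continue" */
    for (size_t i = 1; i < 8; ++i) {
        if (((w >> (8 * i)) & 0xffu) == 0)
            return (size_t)(p - s) + i;  /* terminator in byte i */
    }
    p += 8;
check:                                   /* ".loop" */
    memcpy(&w, p, sizeof w);             /* aligned 8-byte load */
    if ((w & 0xffu) != 0)
        goto body;                       /* the only backward branch (jnz) */
    return (size_t)(p - s);
}
```

Compared with the original shape, the unconditional jump is paid once on entry instead of once per 8 bytes.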

UPDATED

I also tried making the loop body smaller in code size (starting from my first version), and the result was amazing:

    strl:
        push rdi                    ; rdi and rsi are callee-saved in the Windows x64 ABI
        push rsi
        mov rdi, rcx                ; rdi = current read pointer
        mov rsi, rcx                ; rsi = start of the string
        mov ecx, 0x00ff0000         ; one mask per byte, kept in registers
        mov edx, 0xff000000
        mov r8, 0x000000ff00000000
        mov r9, 0x0000ff0000000000
        mov r10, 0x00ff000000000000
        mov r11, 0xff00000000000000
        test dil, 7
        jz @f
        ; handle unaligned
        align 32
    @@:
        mov rax, QWORD [rdi]
        test al, al
        jz SHORT .end
        test ah, ah
        jz SHORT .end1
        test eax, ecx
        jz SHORT .end2
        test eax, edx
        jz SHORT .end3
        test rax, r8
        jz SHORT .end4
        test rax, r9
        jz SHORT .end5
        test rax, r10
        jz SHORT .end6
        test rax, r11
        jz SHORT .end7
        add rdi, 8
        jmp @b
        align 16
    .end:
        mov rax, rdi
        sub rax, rsi
        pop rsi
        pop rdi
        ret
        align 16
    .end1:
        lea rax, [rdi+1]
        sub rax, rsi
        pop rsi
        pop rdi
        ret
        align 16
    .end2:
        lea rax, [rdi+2]
        sub rax, rsi
        pop rsi
        pop rdi
        ret
        align 16
    .end3:
        lea rax, [rdi+3]
        sub rax, rsi
        pop rsi
        pop rdi
        ret
        align 16
    .end4:
        lea rax, [rdi+4]
        sub rax, rsi
        pop rsi
        pop rdi
        ret
        align 16
    .end5:
        lea rax, [rdi+5]
        sub rax, rsi
        pop rsi
        pop rdi
        ret
        align 16
    .end6:
        lea rax, [rdi+6]
        sub rax, rsi
        pop rsi
        pop rdi
        ret
        align 16
    .end7:
        lea rax, [rdi+7]
        sub rax, rsi
        pop rsi
        pop rdi
        ret
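
The mask-per-byte idea in this version can be sketched in C as follows. The function name `first_zero_byte` is made up, and the sketch assumes little-endian byte order, so byte 0 is the lowest 8 bits of the word.

```c
#include <stdint.h>

/* Index (0-7) of the first zero byte in a little-endian 64-bit word,
   or 8 if there is none. Each step tests the word against one byte mask,
   like the chain of test/jz pairs in the assembly. */
unsigned first_zero_byte(uint64_t w) {
    for (unsigned i = 0; i < 8; ++i) {
        uint64_t mask = (uint64_t)0xff << (8 * i);
        if ((w & mask) == 0)
            return i;
    }
    return 8;
}
```

In assembly the upper four masks have to live in registers, because the 64-bit form of test only accepts a sign-extended 32-bit immediate. Keeping them in r8 to r11 also removes the shr from the loop body.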

huangapple
  • Posted on May 24, 2023, 23:35:38
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76325226.html