x86 rep prefix with a count of zero: what happens?
Question
What happens for an initial count of zero for an x86 `rep` prefix?
Intel's manual says explicitly it’s a `while count != 0` loop with the test at the top, which is the sane expected behaviour.
But most of the many vague reports I’ve seen elsewhere suggest that there’s no initial test for zero, so it would be like a countdown with a test at the end, and so disaster if it’s `repeat {… count -= 1; } until count == 0;` or who knows.
Answer 1
Score: 4
Nothing happens with RCX=0; `rep` prefixes do check for zero first, like the pseudocode says. (Unlike the `loop` instruction, which is exactly like the bottom of a `do{}while(--ecx)`, or a `dec rcx` / `jnz` but without affecting FLAGS.)
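To make the difference concrete, here is a minimal software model in C of the two control-flow shapes. It is only a sketch of the semantics described above, not the real instructions; the function names are hypothetical, and the `loop` model uses a 16-bit CX counter (as in 16-bit address-size code) purely so the wrap-around case finishes quickly.

```c
#include <stddef.h>
#include <stdint.h>

/* Software model of `rep stosb`, NOT the real instruction: the count is
   tested before each iteration, so a count of zero stores nothing.
   Returns the number of iterations executed. */
size_t model_rep_stosb(uint8_t *dst, uint8_t value, size_t count)
{
    size_t iterations = 0;
    while (count != 0) {          /* test at the top */
        *dst++ = value;
        count--;
        iterations++;
    }
    return iterations;
}

/* Software model of a loop closed by `loop` with a 16-bit CX counter
   (an assumption made only to keep the demo fast): the body runs first,
   then the counter is decremented and tested at the bottom. */
uint32_t model_loop_cx(uint16_t cx)
{
    uint32_t iterations = 0;
    do {
        iterations++;             /* loop body */
    } while (--cx != 0);          /* test at the bottom */
    return iterations;
}
```

With a count of zero, `model_rep_stosb` returns 0 and leaves the buffer untouched, while `model_loop_cx(0)` wraps around and runs the body 65536 times.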
I think I've heard of this being used, rarely, as an idiom for a conditional load or store with `rep lodsw` or `rep stosw` with a count of 0 or 1, especially in the bad old days before `cmov`. (`cmov` is an unconditional load feeding an ALU select operation, so it needs a valid address, unlike `rep lods` with a count of zero.) This is not efficient, especially for `rep stos` on modern x86 with Fast Strings microcode (P6 and later), and especially without anything like Fast Short Rep-Movs (Ice Lake IIRC).
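As a rough illustration of that idiom, here is a sketch using GCC/Clang extended inline asm. The helper name `store_u16_if` is hypothetical, the asm path assumes x86-64 with the direction flag clear (as the usual calling conventions guarantee at function entry), and other targets get a plain C fallback. It is meant to show the idea, not to recommend it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store `value` to `*dst` only when `cond` is nonzero.
   With a count of zero, `rep stosw` touches no memory, so `dst` does not
   even have to be a valid address in that case. */
void store_u16_if(uint16_t *dst, uint16_t value, int cond)
{
    size_t count = cond ? 1 : 0;   /* 0 or 1 repetitions */
#if defined(__GNUC__) && defined(__x86_64__)
    /* Assumes DF = 0 so RDI would advance upward. */
    __asm__ volatile("rep stosw"
                     : "+D"(dst), "+c"(count)   /* RDI and RCX are updated */
                     : "a"(value)               /* AX holds the word to store */
                     : "memory");
#else
    /* Portable fallback with the same observable behaviour. */
    if (count != 0)
        *dst = value;
#endif
}
```

For example, `store_u16_if(&slot, 0x2222, 0)` leaves `slot` unchanged, and `store_u16_if(NULL, 0x3333, 0)` does not fault because no store is executed.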
The same applies for instructions that treat the prefixes as `repz` / `repnz` (`cmps`/`scas`) instead of unconditional `rep` (`lods`/`stos`/`movs`). Doing zero iterations means they leave FLAGS unmodified.
If you want to check FLAGS after a `repe/ne cmps/scas`, you need to make sure the count was non-zero, or that FLAGS was already set such that you'll branch in a useful way for zero-length buffers. (Perhaps from xor-zeroing a register that you're going to want later.)
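One way to pre-set FLAGS for the zero-length case can be sketched with GCC/Clang extended inline asm. The helper name `bytes_equal` is hypothetical; the asm path assumes x86-64 with the direction flag clear, and it uses `cmp` of a register with itself (rather than xor-zeroing) to set ZF=1 without clobbering anything. Other targets fall back to `memcmp`.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the first n bytes of a and b match,
   treating zero-length ranges as equal. */
int bytes_equal(const void *a, const void *b, size_t n)
{
#if defined(__GNUC__) && defined(__x86_64__)
    unsigned char equal;
    __asm__ volatile(
        "cmp %[n], %[n]\n\t"   /* ZF=1 up front, so n == 0 reads as "equal" */
        "repe cmpsb\n\t"       /* zero iterations leave FLAGS untouched */
        "sete %[eq]"           /* ZF reflects the last compare, or the cmp */
        : [eq] "=r"(equal), "+S"(a), "+D"(b), [n] "+c"(n)
        :
        : "cc", "memory");
    return equal;
#else
    /* Portable fallback with the same result. */
    return memcmp(a, b, n) == 0;
#endif
}
```

Without the leading `cmp`, a call with `n == 0` would report whatever ZF happened to hold before the asm statement.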
`rep movs` and `rep stos` have fast-strings microcode on CPUs since P6, but the startup overhead makes them rarely worth it, especially when sizes can be short and/or data might be misaligned. They're more useful in kernel code where you can't freely use XMM registers. Some recent CPUs like Ice Lake have fast-short-rep microcode that I think is supposed to reduce startup overhead for small counts.
`repe/ne scas/cmps` do not have fast-strings microcode on most CPUs, only on very recent CPUs like Sapphire Rapids and maybe Alder Lake P-cores. So they're quite slow, like one load per clock cycle (so 2 cycles per count for `cmpsb/w/d/q`) according to testing by https://agner.org/optimize/ and https://uops.info/.
- https://stackoverflow.com/questions/33902068/what-setup-does-rep-do
- https://stackoverflow.com/questions/55563598/why-is-this-code-using-strlen-heavily-6-5x-slower-with-gcc-optimizations-enabled - GCC `-O1` used to use `repne scasb` to inline `strlen`. This is a disaster for long strings.
- https://stackoverflow.com/questions/75309389/which-processors-support-fast-short-rep-cmpsb-and-scasb (very recent feature)
- https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy - even without ERMSB, `rep movs` will use no-RFO stores for large sizes, similar to NT stores but not bypassing the cache. Good general Q&A about memory bandwidth considerations.