x86 rep prefix with a count of zero: what happens?

Question

What happens for an initial count of zero for an x86 rep prefix?

Intel's manual says explicitly it’s a while count != 0 loop with the test at the top, which is the sane expected behaviour.

But most of the many vague reports I’ve seen elsewhere suggest that there’s no initial test for zero, so it would be like a countdown with the test at the end, and so a disaster if it’s repeat {… count -= 1; } until count == 0; or who knows.

Answer 1

Score: 4

Nothing happens with RCX=0; rep prefixes do check for zero first, as the pseudocode in Intel's manual says. (Unlike the loop instruction, which is exactly like the bottom of a do{}while(--ecx), or a dec rcx/jnz but without affecting FLAGS.)
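As a rough illustration of the difference, here is a minimal C sketch that models the two control-flow shapes. It is only a model, not real string instructions: the function names `model_rep_stosb` and `model_loop_insn` and the `cap` parameter are made up for this example, and the cap exists only so the wrap-around case terminates.

```c
#include <stdint.h>

/* Model of `rep stosb`: the count is tested BEFORE each iteration,
   so an initial count of 0 stores nothing at all.
   Returns the number of iterations performed. */
uint64_t model_rep_stosb(uint8_t *dst, uint8_t al, uint64_t rcx) {
    uint64_t iterations = 0;
    while (rcx != 0) {          /* test at the top */
        *dst++ = al;            /* one store per iteration */
        rcx--;
        iterations++;
    }
    return iterations;
}

/* Model of a loop body closed by the `loop` instruction: the body runs
   first, then the counter is decremented and tested at the bottom.
   An initial count of 0 wraps around to 2^64 - 1, so this model stops
   after `cap` iterations (hypothetical parameter) to stay finite. */
uint64_t model_loop_insn(uint64_t rcx, uint64_t cap) {
    uint64_t iterations = 0;
    do {
        iterations++;           /* stand-in for the loop body */
        if (iterations == cap) {
            break;              /* artificial cap, not part of the ISA */
        }
    } while (--rcx != 0);       /* test at the bottom */
    return iterations;
}
```

Calling `model_rep_stosb(buf, 0xFF, 0)` leaves `buf` untouched and returns 0, while `model_loop_insn(0, cap)` keeps going until it hits the cap, which is the "disaster" shape the question was worried about.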

I think I've heard of this occasionally being used as an idiom for a conditional load or store: rep lodsw or rep stosw with a count of 0 or 1, especially in the bad old days before cmov. (cmov is an unconditional load feeding an ALU select operation, so it needs a valid address, unlike rep lods with a count of zero.) This is not efficient on modern x86 with Fast Strings microcode (P6 and later), especially for rep stos, and especially without anything like Fast Short Rep-Movs (Ice Lake IIRC).
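To make the idiom concrete, here is a small C model of a conditional load built from a count of 0 or 1. This is a sketch of the semantics only; `model_rep_lodsw` and `conditional_load_u16` are hypothetical names for this example, not anything from a real library.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Model of `rep lodsw`: each iteration loads one 16-bit word into AX.
   With count == 0 the source pointer is never dereferenced and AX keeps
   its previous value. */
uint16_t model_rep_lodsw(const uint16_t *src, uint64_t count, uint16_t ax) {
    while (count != 0) {        /* test at the top, as with any rep prefix */
        ax = *src++;
        count--;
    }
    return ax;
}

/* Conditional load: the count is 1 when `cond` holds, else 0.
   When `cond` is false, `src` may be an invalid pointer, because no
   load happens (this is the contrast with cmov drawn above). */
uint16_t conditional_load_u16(const uint16_t *src, bool cond, uint16_t fallback) {
    return model_rep_lodsw(src, cond ? 1u : 0u, fallback);
}
```

For instance, `conditional_load_u16(NULL, false, 7)` returns 7 without touching memory, whereas a model of cmov would have to read through the pointer first and only then select.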

The same applies to instructions that treat the prefixes as repz / repnz (cmps/scas) instead of unconditional rep (lods/stos/movs). Doing zero iterations means they leave FLAGS unmodified.

If you want to check FLAGS after a repe/ne cmps/scas, you need to make sure the count was non-zero, or that FLAGS was already set such that you'll branch in a useful way for zero-length buffers. (Perhaps from xor-zeroing a register that you're going to want later.)
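The FLAGS pitfall can be sketched with a C model in which ZF is passed in and returned explicitly. This is an assumption-laden illustration: `model_repe_cmpsb` and `buffers_equal` are invented names, and only ZF is modelled, not the other flags.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Model of `repe cmpsb`. `zf_in` is the value ZF had before the
   instruction; the return value is ZF afterwards. With count == 0
   no compare runs, so ZF comes back unchanged. */
bool model_repe_cmpsb(const uint8_t *a, const uint8_t *b,
                      size_t count, bool zf_in) {
    bool zf = zf_in;
    while (count != 0) {
        zf = (*a++ == *b++);    /* each compare overwrites ZF */
        count--;
        if (!zf) {
            break;              /* repe stops at the first mismatch */
        }
    }
    return zf;
}

/* Seeding ZF = 1 beforehand (as xor-zeroing a register would do)
   makes zero-length buffers compare as equal. */
bool buffers_equal(const uint8_t *a, const uint8_t *b, size_t count) {
    return model_repe_cmpsb(a, b, count, true);
}
```

With a count of zero the result is simply whatever ZF was on entry, so a branch after the instruction depends entirely on earlier code unless the flag is seeded deliberately, as `buffers_equal` does.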


rep movs and rep stos have fast-strings microcode on CPUs since P6, but the startup overhead makes them rarely worth it, especially when sizes can be short and/or data might be misaligned. They're more useful in kernel code where you can't freely use XMM registers. Some recent CPUs like Ice Lake have fast-short-rep microcode that I think is supposed to reduce startup overhead for small counts.

repe/ne scas/cmps do not have fast-strings microcode on most CPUs, only on very recent CPUs like Sapphire Rapids and maybe Alder Lake P-cores. So they're quite slow, like one load per clock cycle (so 2 cycles per count for cmpsb/w/d/q) according to testing by https://agner.org/optimize/ and https://uops.info/.

huangapple
  • Posted on June 1, 2023, 13:37:11
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76378929.html