Question

What happens when the initial count is zero for an x86 rep prefix?

Intel's manual says explicitly that it's a while count != 0 loop with the test at the top, which is the sane, expected behaviour.

But most of the many vague reports I've seen elsewhere suggest that there's no initial test for zero, so it would behave like a countdown with the test at the end. That would mean disaster if it's really repeat { … count -= 1; } until count == 0; or who knows what.

Answer 1

Score: 4

Nothing happens with RCX=0; rep prefixes do check for zero first, like the pseudocode says. (Unlike the loop instruction, which is exactly like the bottom of a do{}while(--ecx), or like dec rcx/jnz but without affecting FLAGS.)
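The difference between the two loop shapes can be modelled in plain C. This is only a sketch of the semantics, not the real instructions: rep_stosb_model, loop_model and demo_rep_vs_loop are hypothetical names, and loop_model assumes a 16-bit counter (as loop uses CX with a 16-bit address size) so that the wrap-around case finishes quickly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of rep stosb: the count is tested BEFORE each iteration,
   so count == 0 stores nothing. Returns the number of stores done. */
static size_t rep_stosb_model(uint8_t *dst, uint8_t value, size_t count)
{
    size_t stores = 0;
    while (count != 0) {
        *dst++ = value;
        count--;
        stores++;
    }
    return stores;
}

/* Model of "body; loop label" with a 16-bit counter: the body runs once
   before the counter is decremented and tested, so a starting value of 0
   wraps to 0xFFFF and the body runs 65536 times in total. */
static uint32_t loop_model(uint16_t cx)
{
    uint32_t iterations = 0;
    do {
        iterations++;              /* stands in for the loop body */
    } while (--cx != 0);
    return iterations;
}

/* Small self-check of both models. */
static void demo_rep_vs_loop(void)
{
    uint8_t buf[4] = {1, 2, 3, 4};
    assert(rep_stosb_model(buf, 0xAA, 0) == 0);  /* nothing stored */
    assert(buf[0] == 1);
    assert(loop_model(0) == 65536u);             /* counter wrapped */
    assert(loop_model(3) == 3u);
}
```

With a zero count the rep model never touches the buffer, while the loop model runs the body before it ever looks at the counter.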

I think I've heard of this being used, though rarely, as an idiom for a conditional load or store: rep lodsw or rep stosw with a count of 0 or 1, especially in the bad old days before cmov. (cmov is an unconditional load feeding an ALU select operation, so it needs a valid address, unlike rep lods with a count of zero.) This is not efficient on modern x86, especially for rep stos with Fast Strings microcode (P6 and later), and especially without anything like Fast Short Rep-Movs (Ice Lake, IIRC).
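A minimal sketch of that conditional-store idiom, using GCC/Clang extended inline asm. The names store_byte_if and demo_store_byte_if are hypothetical. The sketch assumes the direction flag is clear (as the usual calling conventions require) and falls back to a plain C loop of the same shape on other compilers or architectures. It illustrates the zero-count behaviour only and is not a performance recommendation.

```c
#include <assert.h>
#include <stddef.h>

/* Store `value` to *dst only when `cond` is nonzero.
   With a count of 0, rep stosb performs no store at all. */
static void store_byte_if(unsigned char *dst, unsigned char value, int cond)
{
    size_t count = (cond != 0);    /* 0 or 1 */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* Assumes DF = 0. RDI/EDI = dst, RCX/ECX = count, AL = value. */
    __asm__ volatile("rep stosb"
                     : "+D"(dst), "+c"(count)
                     : "a"(value)
                     : "memory");
#else
    /* Portable fallback with the same top-tested shape. */
    while (count != 0) {
        *dst++ = value;
        count--;
    }
#endif
}

/* Small self-check. */
static void demo_store_byte_if(void)
{
    unsigned char x = 1;
    store_byte_if(&x, 9, 0);       /* count 0: x is left alone */
    assert(x == 1);
    store_byte_if(&x, 9, 1);       /* count 1: one byte stored */
    assert(x == 9);
}
```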

The same applies to instructions that treat the prefixes as repz / repnz (cmps/scas) instead of unconditional rep (lods/stos/movs). Doing zero iterations means they leave FLAGS unmodified.

If you want to check FLAGS after a repe/ne cmps/scas, you need to make sure the count was non-zero, or that FLAGS was already set such that you'll branch in a useful way for zero-length buffers. (Perhaps from xor-zeroing a register that you're going to want later.)
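Both ways of handling a zero-length buffer can be sketched with a C model of repe cmpsb, where a bool stands in for ZF. All names here (repe_cmpsb_model, equal_with_guard, equal_with_preset_flags, demo_zero_length) are hypothetical, and this models the semantics described above rather than executing the real instruction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Model of repe cmpsb. `zf` stands in for the ZF bit of FLAGS: it is
   only written when at least one comparison actually executes. */
static void repe_cmpsb_model(const uint8_t *a, const uint8_t *b,
                             size_t count, bool *zf)
{
    while (count != 0) {
        *zf = (*a++ == *b++);      /* cmpsb sets ZF from the compare */
        count--;
        if (!*zf)
            break;                 /* repe stops at the first mismatch */
    }
}

/* Option 1: branch around the string instruction for empty buffers. */
static bool equal_with_guard(const uint8_t *a, const uint8_t *b, size_t n)
{
    bool zf = false;               /* pretend stale FLAGS say "not equal" */
    if (n == 0)
        return true;
    repe_cmpsb_model(a, b, n, &zf);
    return zf;
}

/* Option 2: put FLAGS in a known state first. Xor-zeroing a register
   leaves ZF = 1, so zero iterations then reads as "equal". */
static bool equal_with_preset_flags(const uint8_t *a, const uint8_t *b,
                                    size_t n)
{
    bool zf = true;                /* models ZF = 1 left by xor reg,reg */
    repe_cmpsb_model(a, b, n, &zf);
    return zf;
}

/* Small self-check: a zero count leaves the modelled flag untouched. */
static void demo_zero_length(void)
{
    const uint8_t p[] = {1, 2, 3};
    const uint8_t q[] = {1, 2, 4};
    bool zf = false;
    repe_cmpsb_model(p, q, 0, &zf);
    assert(zf == false);           /* stale value survives */
    assert(equal_with_guard(p, q, 0));
    assert(equal_with_preset_flags(p, q, 0));
    assert(!equal_with_guard(p, q, 3));
}
```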

rep movs and rep stos have fast-strings microcode on CPUs since P6, but the startup overhead makes them rarely worth it, especially when sizes can be short and/or data might be misaligned. They're more useful in kernel code where you can't freely use XMM registers. Some recent CPUs like Ice Lake have fast-short-rep microcode that I think is supposed to reduce startup overhead for small counts.

repe/ne scas/cmps do not have fast-strings microcode on most CPUs, only on very recent CPUs like Sapphire Rapids and maybe Alder Lake P-cores. So they're quite slow, like one load per clock cycle (so 2 cycles per count for cmpsb/w/d/q) according to testing by https://agner.org/optimize/ and https://uops.info/.

https://stackoverflow.com/questions/33902068/what-setup-does-rep-do
https://stackoverflow.com/questions/55563598/why-is-this-code-using-strlen-heavily-6-5x-slower-with-gcc-optimizations-enabled - GCC -O1 used to use repne scasb to inline strlen. This is a disaster for long strings.
https://stackoverflow.com/questions/75309389/which-processors-support-fast-short-rep-cmpsb-and-scasb (very recent feature)
https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy - even without ERMSB, rep movs will use no-RFO stores for large sizes, similar to NT stores but not bypassing the cache. Good general Q&A about memory bandwidth considerations.
