2023年2月10日 06:20:21go评论94阅读模式

英文:

imul then mov vs mov then imul - any difference?

问题

以下是要翻译的内容：

如果我在clang 15中编译以下的C++程序：

int baz(int x) { return x * x; }

我会得到：

baz(int):
        mov     eax, edi
        imul    eax, edi
        ret

而在gcc 12.2中，我会得到：

baz(int):
        imul    edi, edi
        mov     eax, edi
        ret

（在GodBolt上查看）

这两个实现是否完全等效，仅仅是任意选择的问题？如果它们不等效，它们的差异如何体现，或者如何影响我的程序？我的意思是，从CPU状态的副作用、其他指令的延迟、内联期间的行为等方面来看。

英文:

If I compile the following C++ program:

int baz(int x) { return x * x; }

in clang 15, I get:

baz(int):
        mov     eax, edi
        imul    eax, edi
        ret

while gcc 12.2 gives me:

baz(int):
        imul    edi, edi
        mov     eax, edi
        ret

(See this on GodBolt)

Are these two implementations entirely equivalent, and merely a matter of arbitrary choice? If they're not equivalent, how can their difference manifest, or affect my program? I mean, in terms of CPU-state side-effects, latencies of other instructions, behavior during inlining etc.

答案1

得分: 5

Do mov then imul because it's better with mov-elimination, and not worse anywhere for any other reason.

This is true in general for mov/and, mov/sub, etc, as long as you don't have a use for the original value. If you do, then sometimes mov to make a copy and then modify the original to hide mov latency for CPUs without move elimination. (mov/add or small shift should normally be lea).

CPU with mov-elimination

mov then imul is strictly better; overwriting a mov reg,reg result lets Intel CPUs free some resources they use to track mov elimination. (Probably something like a reference count for extra references beyond the normal RAT.) This increases the likelihood of later mov-eliminations being successful. See https://stackoverflow.com/questions/75204302/how-do-move-elimination-slots-work-in-intel-cpu

All else essentially equal (as in this case), prefer to mov then overwrite its result, especially when that doesn't make things worse for CPUs without mov-elimination (like Ice Lake, thanks Intel.)

It doesn't have to be in the next instruction, just sometime soon, preferably not left indefinitely e.g. for a long-running loop. But even that isn't a disaster usually.

To measure this benefit, a microbenchmark would probably need to do a lot of mov instructions that don't overwrite their result, to run the CPU out of mov-elimination slots and have some of them need an execution unit. The microbenchmark would also need to be sensitive to the latency of those mov instructions, since most modern Intel CPUs have enough execution units to keep up with the issue/rename width in terms of throughput.

CPU without mov-elimination

mov reg,reg has 1 cycle latency. If you'd been doing x*y with two separate inputs, mov then imul makes that latency part of the input->output latency for one input but not the other. The other has an extra cycle to become ready before the imul would have to wait for it, if out-of-order exec would tend to have one input ready before the other.

(A compiler would typically have no way to guess which input was the result of a long dep chain vs. a mov-immediate when compiling a non-inline function, but a 50/50 chance of winning a cycle is better than having the mov always on the critical path after the imul.)

But with x*x without mov-elimination, the only difference is that we're writing both EDI and EAX, instead of writing EAX twice. I don't think that's significant in terms of using up physical-register-file (PRF) entries or freeing them sooner. Since most code-gen is trying to be good across multiple CPUs, favor mov then imul because some CPUs do have mov-elimination. It's essentially a tie for CPUs without, when you're squaring one variable.

Things that don't matter

On a CPU that does partial register renaming, writing a register might free up two physical-register-file (PRF) entries instead of just one. (While allocating a new PRF entry either way.) But just reading the full register would already insert a merging uop. Intel Sandybridge-family is the only x86-64 microarchitecture that does partial-register renaming and uses a PRF. Intel P6 family (Nehalem and earlier) keeps results right in the ROB, associated with the uop that produced them, until commit to a separate "retirement register file"; this is why it has register-read stalls when you read too many "cold" registers. Only Sandybridge itself (and possibly Ivy Bridge) rename low-8 registers like DIL and DL separate from full registers; on Haswell/Skylake and later only high-8 registers like DH get renamed separately.

Anyway, DIL might have been renamed separately from the full RDI. There is no DIH equivalent of DH or CH, since we're talking about EDI not EDX or ECX (the next two arg-passing registers), and gcc/clang very rarely generate code that writes high-8-bit registers. (https://stackoverflow.com/questions/41573502/why-doesnt-gcc-use-partial-registers)

But either mov/imul or imul/mov will merge DIL into RDI before EDI is read, whether it's written or not (by the same imul uop). Same for DH on Haswell and later if we had an arg in EDX.

英文:

Do mov then imul because it's better with mov-elimination, and not worse anywhere for any other reason.

CPU with mov-elimination

All else essentially equal (as in this case), prefer to mov then overwrite its result, especially when that doesn't make things worse for CPUs without mov-elimination (like Ice Lake, thanks Intel.)

It doesn't have to be in the next instruction, just sometime soon, preferably not left indefinitely e.g. for a long-running loop. But even that isn't a disaster usually.

CPU without mov-elimination

But with x*x without mov-elimination, the only difference is that we're writing both EDI and EAX, instead of writing EAX twice. I don't think that's significant in terms of using up physical-register-file (PRF) entries or freeing them sooner. Since most code-gen is trying to be good across multiple CPUs, favour mov then imul because some CPUs do have mov-elimination. It's essentially a tie for CPUs without, when you're squaring one variable.

Things that don't matter

Intel Sandybridge-family is the only x86-64 microarchitecture that does partial-register renaming and uses a PRF. Intel P6 family (Nehalem and earlier) keeps results right in the ROB, associated with the uop that produced them, until commit to a separate "retirement register file"; this is why it has register-read stalls when you read too many "cold" registers. Only Sandybridge itself (and possibly Ivy Bridge) rename low-8 registers like DIL and DL separate from full registers; on Haswell/Skylake and later only high-8 registers like DH get renamed separately.

But either mov/imul or imul/mov will merge DIL into RDI before EDI is read, whether it's written or not (by the same imul uop). Same for DH on Haswell and later if we had an arg in EDX.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

imul然后mov与mov然后imul – 有什么区别？

问题

答案1

CPU with mov-elimination

CPU without mov-elimination

Things that don't matter

CPU with mov-elimination

CPU without mov-elimination

Things that don't matter

寻找大型 switch-case 和 if-else 结构的最有效方法

在OpenCL中，__kernel和KERNEL_FQ之间的区别是什么？

Segmentation Fault 在神经网络程序中的 c++

Valgrind在树莓派上的多线程C++程序中未检测到内存泄漏。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

发表评论