imul然后mov与mov然后imul – 有什么区别?

huangapple go评论52阅读模式
英文:

imul then mov vs mov then imul - any difference?

问题

以下是要翻译的内容:

如果我在clang 15中编译以下的C++程序:

int baz(int x) { return x * x; }

我会得到:

baz(int):
        mov     eax, edi
        imul    eax, edi
        ret

而在gcc 12.2中,我会得到:

baz(int):
        imul    edi, edi
        mov     eax, edi
        ret

(在GodBolt上查看)

这两个实现是否完全等效,仅仅是任意选择的问题?如果它们不等效,它们的差异如何体现,或者如何影响我的程序?我的意思是,从CPU状态的副作用、其他指令的延迟、内联期间的行为等方面来看。

英文:

If I compile the following C++ program:

int baz(int x) { return x * x; }

in clang 15, I get:

baz(int):
        mov     eax, edi
        imul    eax, edi
        ret

while gcc 12.2 gives me:

baz(int):
        imul    edi, edi
        mov     eax, edi
        ret

(See this on GodBolt)

Are these two implementations entirely equivalent, and merely a matter of arbitrary choice? If they're not equivalent, how can their difference manifest, or affect my program? I mean, in terms of CPU-state side-effects, latencies of other instructions, behavior during inlining etc.

答案1

得分: 5

Do mov then imul because it's better with mov-elimination, and not worse anywhere for any other reason.

This is true in general for mov/and, mov/sub, etc, as long as you don't have a use for the original value. If you do, then sometimes mov to make a copy and then modify the original to hide mov latency for CPUs without move elimination. (mov/add or small shift should normally be lea).

CPU with mov-elimination

mov then imul is strictly better; overwriting a mov reg,reg result lets Intel CPUs free some resources they use to track mov elimination. (Probably something like a reference count for extra references beyond the normal RAT.) This increases the likelihood of later mov-eliminations being successful. See https://stackoverflow.com/questions/75204302/how-do-move-elimination-slots-work-in-intel-cpu

All else essentially equal (as in this case), prefer to mov then overwrite its result, especially when that doesn't make things worse for CPUs without mov-elimination (like Ice Lake, thanks Intel.)

It doesn't have to be in the next instruction, just sometime soon, preferably not left indefinitely e.g. for a long-running loop. But even that isn't a disaster usually.

To measure this benefit, a microbenchmark would probably need to do a lot of mov instructions that don't overwrite their result, to run the CPU out of mov-elimination slots and have some of them need an execution unit. The microbenchmark would also need to be sensitive to the latency of those mov instructions, since most modern Intel CPUs have enough execution units to keep up with the issue/rename width in terms of throughput.

CPU without mov-elimination

mov reg,reg has 1 cycle latency. If you'd been doing x*y with two separate inputs, mov then imul makes that latency part of the input->output latency for one input but not the other. The other has an extra cycle to become ready before the imul would have to wait for it, if out-of-order exec would tend to have one input ready before the other.

(A compiler would typically have no way to guess which input was the result of a long dep chain vs. a mov-immediate when compiling a non-inline function, but a 50/50 chance of winning a cycle is better than having the mov always on the critical path after the imul.)

But with x*x without mov-elimination, the only difference is that we're writing both EDI and EAX, instead of writing EAX twice. I don't think that's significant in terms of using up physical-register-file (PRF) entries or freeing them sooner. Since most code-gen is trying to be good across multiple CPUs, favor mov then imul because some CPUs do have mov-elimination. It's essentially a tie for CPUs without, when you're squaring one variable.

Things that don't matter

On a CPU that does partial register renaming, writing a register might free up two physical-register-file (PRF) entries instead of just one. (While allocating a new PRF entry either way.) But just reading the full register would already insert a merging uop. Intel Sandybridge-family is the only x86-64 microarchitecture that does partial-register renaming and uses a PRF. Intel P6 family (Nehalem and earlier) keeps results right in the ROB, associated with the uop that produced them, until commit to a separate "retirement register file"; this is why it has register-read stalls when you read too many "cold" registers. Only Sandybridge itself (and possibly Ivy Bridge) rename low-8 registers like DIL and DL separate from full registers; on Haswell/Skylake and later only high-8 registers like DH get renamed separately.

Anyway, DIL might have been renamed separately from the full RDI. There is no DIH equivalent of DH or CH, since we're talking about EDI not EDX or ECX (the next two arg-passing registers), and gcc/clang very rarely generate code that writes high-8-bit registers. (https://stackoverflow.com/questions/41573502/why-doesnt-gcc-use-partial-registers)

But either mov/imul or imul/mov will merge DIL into RDI before EDI is read, whether it's written or not (by the same imul uop). Same for DH on Haswell and later if we had an arg in EDX.

英文:

Do mov then imul because it's better with mov-elimination, and not worse anywhere for any other reason.

This is true in general for mov/and, mov/sub, etc, as long as you don't have a use for the original value. If you do, then sometimes mov to make a copy and then modify the original to hide mov latency for CPUs without move elimination. (mov/add or small shift should normally be lea).

CPU with mov-elimination

mov then imul is strictly better; overwriting a mov reg,reg result lets Intel CPUs free some resources they use to track mov elimination. (Probably something like a reference count for extra references beyond the normal RAT.) This increases the likelihood of later mov-eliminations being successful. See https://stackoverflow.com/questions/75204302/how-do-move-elimination-slots-work-in-intel-cpu

All else essentially equal (as in this case), prefer to mov then overwrite its result, especially when that doesn't make things worse for CPUs without mov-elimination (like Ice Lake, thanks Intel.)

It doesn't have to be in the next instruction, just sometime soon, preferably not left indefinitely e.g. for a long-running loop. But even that isn't a disaster usually.

To measure this benefit, a microbenchmark would probably need to do a lot of mov instructions that don't overwrite their result, to run the CPU out of mov-elimination slots and have some of them need an execution unit. The microbenchmark would also need to be sensitive to the latency of those mov instructions, since most modern Intel CPUs have enough execution units to keep up with the issue/rename width in terms of throughput.

CPU without mov-elimination

mov reg,reg has 1 cycle latency. If you'd been doing x*y with two separate inputs, mov then imul makes that latency part of the input->output latency for one input but not the other. The other has an extra cycle to become ready before the imul would have to wait for it, if out-of-order exec would tend to have one input ready before the other.

(A compiler would typically have no way to guess which input was the result of a long dep chain vs. a mov-immediate when compiling a non-inline function, but a 50/50 chance of winning a cycle is better than having the mov always on the critical path after the imul.)

But with x*x without mov-elimination, the only difference is that we're writing both EDI and EAX, instead of writing EAX twice. I don't think that's significant in terms of using up physical-register-file (PRF) entries or freeing them sooner. Since most code-gen is trying to be good across multiple CPUs, favour mov then imul because some CPUs do have mov-elimination. It's essentially a tie for CPUs without, when you're squaring one variable.


Things that don't matter

On a CPU that does partial register renaming, writing a register might free up two physical-register-file (PRF) entries instead of just one. (While allocating a new PRF entry either way.) But just reading the full register would already insert a merging uop.

Intel Sandybridge-family is the only x86-64 microarchitecture that does partial-register renaming and uses a PRF. Intel P6 family (Nehalem and earlier) keeps results right in the ROB, associated with the uop that produced them, until commit to a separate "retirement register file"; this is why it has register-read stalls when you read too many "cold" registers. Only Sandybridge itself (and possibly Ivy Bridge) rename low-8 registers like DIL and DL separate from full registers; on Haswell/Skylake and later only high-8 registers like DH get renamed separately.

Anyway, DIL might have been renamed separately from the full RDI. There is no DIH equivalent of DH or CH, since we're talking about EDI not EDX or ECX (the next two arg-passing registers), and gcc/clang very rarely generate code that writes high-8-bit registers. (https://stackoverflow.com/questions/41573502/why-doesnt-gcc-use-partial-registers)

But either mov/imul or imul/mov will merge DIL into RDI before EDI is read, whether it's written or not (by the same imul uop). Same for DH on Haswell and later if we had an arg in EDX.

huangapple
  • 本文由 发表于 2023年2月10日 06:20:21
  • 转载请务必保留本文链接:https://go.coder-hub.com/75405054.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定