imul then mov vs mov then imul - any difference?
Question
If I compile the following C++ program:
int baz(int x) { return x * x; }
in clang 15, I get:
baz(int):
mov eax, edi
imul eax, edi
ret
while gcc 12.2 gives me:
baz(int):
imul edi, edi
mov eax, edi
ret
(See this on GodBolt)
Are these two implementations entirely equivalent, and merely a matter of arbitrary choice? If they're not equivalent, how can their difference manifest, or affect my program? I mean, in terms of CPU-state side-effects, latencies of other instructions, behavior during inlining etc.
Answer 1
Score: 5
Do mov then imul, because it's better with mov-elimination and not worse anywhere for any other reason.
This is true in general for mov/and, mov/sub, etc., as long as you don't have a use for the original value. If you do, it's sometimes better to mov to make a copy and then modify the original, which hides the mov latency on CPUs without mov-elimination. (mov/add or a small shift should normally be lea.)
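As a rough illustration of those instruction pairs, here are a few one-line functions with the code GCC and Clang typically emit for x86-64 (System V calling convention) at -O2. The function names are made up for this sketch, and the assembly in the comments is what current compilers tend to produce, not something guaranteed.

```cpp
// Hypothetical helper names; the asm in the comments is typical -O2 output
// for x86-64 System V (first arg in EDI, second in ESI), not a guarantee.

// Non-destructive add: a single lea replaces mov + add.
constexpr int add2(int x, int y) { return x + y; }   // typically: lea eax, [rdi+rsi]

// Small shift (multiply by 4): lea with a scaled index replaces mov + shl.
constexpr int times4(int x) { return x * 4; }        // typically: lea eax, [rdi*4]

// No lea form exists for and/sub, so the copy comes first and is overwritten.
constexpr int and2(int x, int y) { return x & y; }   // typically: mov eax, edi ; and eax, esi
constexpr int sub2(int x, int y) { return x - y; }   // typically: mov eax, edi ; sub eax, esi
```

Checking the output on a compiler explorer is the reliable way to see what a particular compiler version does with these.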
CPU with mov-elimination
mov then imul is strictly better; overwriting a mov reg,reg result lets Intel CPUs free some resources they use to track mov elimination. (Probably something like a reference count for extra references beyond the normal RAT.) This increases the likelihood of later mov-eliminations being successful. See https://stackoverflow.com/questions/75204302/how-do-move-elimination-slots-work-in-intel-cpu
All else essentially equal (as in this case), prefer to mov then overwrite its result, especially when that doesn't make things worse for CPUs without mov-elimination (like Ice Lake, thanks Intel).
The overwrite doesn't have to happen in the very next instruction, just sometime soon; preferably the mov result isn't left live indefinitely, e.g. across a long-running loop. But even that usually isn't a disaster.
To measure this benefit, a microbenchmark would probably need to do a lot of mov instructions that don't overwrite their result, to run the CPU out of mov-elimination slots and have some of them need an execution unit. The microbenchmark would also need to be sensitive to the latency of those mov instructions, since most modern Intel CPUs have enough execution units to keep up with the issue/rename width in terms of throughput.
CPU without mov-elimination
mov reg,reg has 1 cycle latency. If you'd been doing x*y with two separate inputs, mov then imul makes that latency part of the input->output latency for one input but not the other. The other has an extra cycle to become ready before the imul would have to wait for it, if out-of-order exec would tend to have one input ready before the other.
(A compiler would typically have no way to guess which input was the result of a long dep chain vs. a mov-immediate when compiling a non-inline function, but a 50/50 chance of winning a cycle is better than having the mov always on the critical path after the imul.)
But with x*x without mov-elimination, the only difference is that we're writing both EDI and EAX, instead of writing EAX twice. I don't think that's significant in terms of using up physical-register-file (PRF) entries or freeing them sooner. Since most code-gen is trying to be good across multiple CPUs, favor mov then imul because some CPUs do have mov-elimination. It's essentially a tie for CPUs without, when you're squaring one variable.
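The latency argument in this section can be checked with a toy model. This is only a sketch: the function names are invented for illustration, and the latencies are assumptions (1 cycle for a non-eliminated mov reg,reg as stated above, and 3 cycles for a 32-bit imul, which is typical of recent Intel and AMD cores). It ignores everything except the dependency chain.

```cpp
// Toy critical-path model; all names and numbers here are assumptions.
constexpr int kMovLatency  = 1;  // mov reg,reg when it is NOT eliminated
constexpr int kImulLatency = 3;  // assumed typical latency of imul r32, r32

// Cycle at which the later of two values becomes available.
constexpr int later(int a, int b) { return a > b ? a : b; }

// mov eax, edi ; imul eax, esi
// Only x pays the mov latency before the multiply can start.
constexpr int mov_then_imul(int x_ready, int y_ready) {
    return later(x_ready + kMovLatency, y_ready) + kImulLatency;
}

// imul edi, esi ; mov eax, edi
// The mov always sits on the critical path after the multiply.
constexpr int imul_then_mov(int x_ready, int y_ready) {
    return later(x_ready, y_ready) + kImulLatency + kMovLatency;
}
```

With these assumed numbers, when y arrives late (x ready at cycle 0, y at cycle 5), mov_then_imul(0, 5) is 8 while imul_then_mov(0, 5) is 9. When x is the late input, or when both are ready together as in x*x where they are the same register, the two orders give the same result: mov_then_imul(0, 0) and imul_then_mov(0, 0) are both 4.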
Things that don't matter
On a CPU that does partial register renaming, writing a register might free up two physical-register-file (PRF) entries instead of just one. (While allocating a new PRF entry either way.) But just reading the full register would already insert a merging uop.
Intel Sandybridge-family is the only x86-64 microarchitecture that does partial-register renaming and uses a PRF. Intel P6 family (Nehalem and earlier) keeps results right in the ROB, associated with the uop that produced them, until commit to a separate "retirement register file"; this is why it has register-read stalls when you read too many "cold" registers. Only Sandybridge itself (and possibly Ivy Bridge) rename low-8 registers like DIL and DL separate from full registers; on Haswell/Skylake and later only high-8 registers like DH get renamed separately.
Anyway, DIL might have been renamed separately from the full RDI. There is no DIH equivalent of DH or CH, since we're talking about EDI not EDX or ECX (the next two arg-passing registers), and gcc/clang very rarely generate code that writes high-8-bit registers. (https://stackoverflow.com/questions/41573502/why-doesnt-gcc-use-partial-registers)
But either mov/imul or imul/mov will merge DIL into RDI before EDI is read, whether it's written or not (by the same imul uop). Same for DH on Haswell and later if we had an arg in EDX.