Question

On x86-64 (whatever the microarchitecture) and on ARM64 devices, how many clock cycles does a mispredicted conditional branch cost? And I suppose I should also ask what the figure is for a successfully predicted branch, taken or not taken. I can try to find this in Agner Fog's tables, but I'm equally interested in ARM.

Is there a reasonably easy way of getting this data out of the processor itself?

Answer 1

Score: 3

A mispredicted branch stalls only the front-end, not the entire pipeline, so its effect on overall performance depends on the surrounding code. If the code is bottlenecked purely on the front-end, losing 15 to 19 cycles of front-end throughput costs that many cycles of total time. Many other programs can partly hide the bubble, because the back-end still has other work in flight to execute.

See:

What exactly happens when a Skylake CPU mispredicts a branch? https://stackoverflow.com/questions/50984007/what-exactly-happens-when-a-skylake-cpu-mispredicts-a-branch
Avoid stalling pipeline by calculating conditional early: https://stackoverflow.com/questions/49932119/avoid-stalling-pipeline-by-calculating-conditional-early
What considerations go into predicting latency for operations on modern superscalar CPUs: https://stackoverflow.com/questions/51607391/what-considerations-go-into-predicting-latency-for-operations-on-modern-supersca (the general case: costs aren't one-dimensional, e.g. you can't add up the "cycles" cost of different instructions to get a total cost, because that's not how out-of-order exec works with different execution units for different instructions).

It's something you can microbenchmark, but constructing such a benchmark is somewhat tricky. https://www.7-cpu.com/ has numbers for many CPUs, e.g.:

Cortex A76: reported as a 14-cycle penalty.
Skylake: 16.5 cycles average (if mOp cache hit) or 19-20 cycles (if mOp cache miss). The uop cache effectively shortens the pipeline: there are fewer stages between the re-steer and having uops ready to issue from the front-end into the back-end.
Cortex A53: 7 cycles. Much shorter recovery time, as expected for a simpler in-order pipeline.

I suspect those numbers are from vendor manuals, unless 7-cpu has a standard benchmark they use.

Also, yes: Agner Fog attempted to microbenchmark this for many x86 CPUs, but precise numbers are hard to measure; he reports that measurements were pretty noisy on some CPUs. For example, for Haswell/Broadwell he writes in his microarch PDF:

> There may be a difference in branch misprediction penalty between the three sources of
> µops, but I have not been able to verify such a difference because the variance in the
> measurements is high. The measured misprediction penalty varies between 16 and 20 clock
> cycles in all three cases.
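
To make the microbenchmark idea concrete, here is a minimal sketch in C. It is not how 7-cpu or Agner Fog measure, just one plausible approach: time the same loop over an easily predicted pattern (0,1,0,1,...) and over random bits, then divide the difference by an assumed mispredict rate of about 0.5. All function names here (`fill_pattern`, `run_branches`, `ns_per_iteration`, `penalty_ns`) are made up for illustration, and the `volatile` counters are only an attempt to stop the compiler from replacing the branch with cmov/csel, so check the generated assembly.

```c
#define _POSIX_C_SOURCE 199309L /* for clock_gettime */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Small deterministic PRNG (xorshift32) so runs are repeatable. */
static uint32_t xorshift32(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fill buf with branch outcomes: random bits, or 0,1,0,1,... which
   predictors learn easily. Both patterns are about half taken, so the
   two runs do roughly the same amount of non-branch work. */
void fill_pattern(uint8_t *buf, size_t n, int random_bits) {
    uint32_t state = 0x9E3779B9u;
    for (size_t i = 0; i < n; i++)
        buf[i] = random_bits ? (uint8_t)((xorshift32(&state) >> 16) & 1u)
                             : (uint8_t)(i & 1u);
}

/* One data-dependent conditional branch per element. The volatile
   counters are meant to discourage if-conversion and vectorization;
   that is an assumption to verify in the compiler output. */
uint64_t run_branches(const uint8_t *buf, size_t n) {
    volatile uint64_t taken = 0, not_taken = 0;
    for (size_t i = 0; i < n; i++) {
        if (buf[i])
            taken = taken + 1;
        else
            not_taken = not_taken + 1;
    }
    return taken;
}

/* Average wall-clock nanoseconds per loop iteration over `reps` passes. */
double ns_per_iteration(const uint8_t *buf, size_t n, int reps) {
    struct timespec t0, t1;
    volatile uint64_t sink = 0; /* keeps the calls from being dropped */
    assert(n > 0 && reps > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        sink = sink + run_branches(buf, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9 +
                (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / ((double)n * (double)reps);
}

/* Extra nanoseconds per mispredict, given the assumed fraction of
   branches that mispredict in the random run (about 0.5 for random bits). */
double penalty_ns(double ns_random, double ns_predictable,
                  double mispredict_rate) {
    assert(mispredict_rate > 0.0);
    return (ns_random - ns_predictable) / mispredict_rate;
}
```

To use it, fill a large buffer (millions of entries, so the predictor cannot simply memorize the random sequence across passes), call `ns_per_iteration` once per pattern, and pass both results to `penalty_ns`. Multiplying the result by the core clock in GHz gives cycles, but only if the frequency is actually known and stable. The 0.5 rate is a guess; on Linux, `perf stat -e branch-misses` can report the real count. The result is an effective cost for this particular loop, not a pipeline constant, since out-of-order execution can hide part of the bubble.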
