How can I cause an indirect (function pointer) call to be correctly jump/branch predicted?

Question

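In the code example given, you cannot fully rely on the CPU correctly predicting the indirect call to bar, especially when the value of bar is highly dynamic and hard to predict at the call site of foo. Even with the "lots of computation" mentioned in the comment, the CPU may still not always predict accurately, because the outcome depends on the details of the execution pipeline and the branch prediction mechanism.

As the programmer, there are some strategies that may help the CPU, for example:

  1. Use a function pointer table: try storing the possible callbacks in a table of function pointers. The table's contents are likely to stay in cache, so the target can be loaded early, although this does not by itself make the target predictable.

  2. Write simple callbacks: where possible, keep the callbacks simple, so that the branches inside them are easier for the CPU to predict.

  3. Avoid complex dependencies: make sure the call to bar does not depend on a lot of complex state or conditions. This helps reduce mispredictions.

  4. Profile and optimize: use a profiling tool to examine the code, find the bottlenecks, and optimize them, so that the code relies less on CPU prediction.

These strategies may improve prediction accuracy, but with highly dynamic and unpredictable targets they still cannot guarantee 100% accuracy. When writing the code you need to weigh performance against maintainability and choose the approach that fits your situation.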
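As a rough illustration of the function pointer table strategy, here is a minimal sketch in C. The callback names (record_plain, record_doubled, record_negated), the table, and foo_indexed are all hypothetical and not part of the original question. Note that the call through the table is still an indirect call, so this shows the layout only and does not guarantee better prediction.

```c
#include <stddef.h>

typedef void (*callback_fn)(int);

/* Hypothetical callbacks, for illustration only. */
int last_seen = 0;
void record_plain(int x)   { last_seen = x; }
void record_doubled(int x) { last_seen = 2 * x; }
void record_negated(int x) { last_seen = -x; }

/* All possible targets live in one small, read-only table. */
const callback_fn callback_table[] = {
    record_plain, record_doubled, record_negated,
};
const size_t callback_count =
    sizeof callback_table / sizeof callback_table[0];

/* Selects a callback by index and calls it with 3.
   Returns 0 on success, -1 if the index is out of range.
   The call itself is still an indirect call. */
int foo_indexed(size_t which) {
    if (which >= callback_count)
        return -1;
    callback_table[which](3);
    return 0;
}
```

Calling foo_indexed(1) runs record_doubled(3); an out-of-range index is rejected rather than jumping through a bad pointer.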

Let's say I have a function that accepts a callback argument (examples given in C and Rust):

C:

void foo(void (*bar)(int)) {
    // lots of computation
    bar(3);
}

Rust:

fn foo(bar: fn(u32)) {
    // lots of computation
    bar(3)
}

Can I rely on the indirect call to bar being correctly predicted by the CPU? Assume that at the call site of foo, the value of bar is in fact highly dynamic and unpredictable. Because of the // lots of computation, my expectation is that the CPU has enough "advance warning" that speculative/out-of-order execution can work across the function pointer boundary.

Is this the case? If not, is there anything I can do as the programmer to help the CPU out?

Answer 1

Score: 4

You can't guarantee a correct prediction.

The best you can do is what you're already doing: have the branch condition (or target) ready early, to let out-of-order exec detect a misprediction early and recover before it costs much wasted work. (So any mis-speculated instructions hopefully haven't been executed yet, and the front-end can re-steer in time to avoid losing cycles.)
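To make "have the target ready early" concrete, here is a minimal sketch in C, assuming the callback is reached through a struct rather than passed directly. The names struct handler, on_done, store_value, heavy_work and foo_early_target are hypothetical. The pointer is loaded before the computation and does not depend on its result, so the target can be known long before the call is reached. A compiler is free to reorder the load, and an out-of-order CPU may run it early anyway, so treat this as an illustration of the data dependency rather than a guaranteed speedup.

```c
#include <stdint.h>

typedef void (*callback_fn)(int);

/* Hypothetical handler object: the callback sits behind a pointer. */
struct handler {
    callback_fn on_done;
};

int observed = 0;
void store_value(int x) { observed = x; }

/* Stand-in for "lots of computation". */
uint32_t heavy_work(uint32_t seed) {
    uint32_t acc = seed;
    for (int i = 0; i < 1000; ++i)
        acc = acc * 1664525u + 1013904223u;
    return acc;
}

uint32_t foo_early_target(const struct handler *h, uint32_t seed) {
    /* Load the target first; it does not depend on the work below. */
    callback_fn bar = h->on_done;
    uint32_t result = heavy_work(seed);
    bar(3);  /* target has been available since before the computation */
    return result;
}
```

The case to avoid is the opposite one, where the target is selected from the result of the computation, because then the real target is only known at the last moment.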

See https://stackoverflow.com/q/49932119

A mispredict will unavoidably cost cycles for the front-end, but a lot of code executes slower than 4 instructions per cycle (or 4 uops per cycle), so the front-end is able to get ahead of where the oldest uops are still executing, especially if there are bottlenecks like a long dependency chain without a lot of instruction-level parallelism.
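As a sketch of what "a long dependency chain without a lot of instruction-level parallelism" can look like, assume the computation in foo is a loop where each step needs the previous result. The names serial_chain, foo_chain and count_call are hypothetical. While the back-end works through the chain one step at a time, the front-end can run ahead, and because bar does not depend on the chain, the indirect call can be resolved before the chain finishes.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*callback_fn)(int);

int calls_made = 0;
void count_call(int x) { calls_made += x; }

/* Low ILP: every iteration needs the previous value of acc,
   so the multiply-adds cannot overlap with each other. */
uint32_t serial_chain(const uint32_t *data, size_t n) {
    uint32_t acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc = acc * 3u + data[i];
    return acc;
}

/* bar is an input, independent of the chain's result. */
uint32_t foo_chain(callback_fn bar, const uint32_t *data, size_t n) {
    uint32_t result = serial_chain(data, n);
    bar(3);
    return result;
}
```

With the input {1, 2, 3, 4} the chain computes 1, 5, 18, 58 in sequence, each value feeding the next.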


Modern branch predictors are quite good even with complex patterns, e.g. IT-TAGE uses the history of the past few branches to index a predictor for this one. This leads to decent performance even in the switch in an interpreter loop or something like that, unlike in older CPUs where a complex pattern for a single indirect branch was a big problem. (See https://stackoverflow.com/questions/58399395/how-does-branch-prediction-affect-performance-in-r for some links, especially Branch Prediction and the Performance of Interpreters - Don’t Trust Folklore (2015) by Rohou, Swamy, and Seznec.)
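For readers who haven't seen one, the "switch in an interpreter loop" looks roughly like the sketch below. The bytecode format and the names opcode and run_bytecode are made up for illustration. A compiler may lower a dense switch like this to a single indirect jump through a jump table, although with only a few cases it may emit compares instead. When it is a single indirect jump, every opcode dispatches through the same branch, which is the pattern that history-based predictors handle much better than older designs did.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bytecode for a tiny stack machine. */
enum opcode { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

/* Runs the program and returns the top of the stack,
   or -1 on malformed input (a simplification for this sketch). */
int32_t run_bytecode(const int32_t *code, size_t len) {
    int32_t stack[16];
    size_t sp = 0;
    size_t pc = 0;
    while (pc < len) {
        switch (code[pc++]) {  /* the dispatch branch */
        case OP_PUSH:
            if (pc >= len || sp >= 16) return -1;
            stack[sp++] = code[pc++];
            break;
        case OP_ADD:
            if (sp < 2) return -1;
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:
            if (sp < 2) return -1;
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        case OP_HALT:
            return sp ? stack[sp - 1] : 0;
        default:
            return -1;
        }
    }
    return sp ? stack[sp - 1] : 0;
}
```

The program PUSH 2, PUSH 3, ADD, PUSH 4, MUL, HALT evaluates (2 + 3) * 4; the next opcode depends on the program being interpreted, which is why the history of previous branches is a useful predictor input here.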

huangapple
  • Posted on August 5, 2023 at 01:23:12
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76838024.html