Fastest way for one core to signal another?

Question

On an Intel CPU, I want CPU core A to signal CPU core B when A has completed an event. There are a couple of ways to do this:

  1. A sends an interrupt to B.
  2. A writes a cache line (e.g., a bit flip) and B polls the cache line.

I want B to learn about the event with the least amount of overhead possible. Note that I am referring to overhead, not end-to-end latency. It's alright if B takes a while to learn about the event (e.g., periodic polling is fine), but B should waste as few cycles as possible detecting the event.

Option 1 above has too much overhead due to the interrupt handler. Option 2 is better, but I am still unhappy with the amount of time that B must wait for the cache line to transfer from A's L1 cache to its own L1 cache.

Is there some way A can directly push the cache line into B's L1 cache? It's fine if there is additional overhead for A in this case. I'm not sure if there is some trick I can try where A marks the page as uncacheable and B marks the page as write-back...

Alternatively, is there some other mechanism built into Intel processors that can help with this?

I assume this is less of an issue on AMD CPUs as they use the MOESI coherence protocol, so the "O" should presumably allow A to broadcast the cache line changes to B.

Answer 1

Score: 0

There's disappointingly little you can do about this on x86 without some very recent ISA extensions, like cldemote (Tremont or Alder Lake / Sapphire Rapids) or user-space IPI (inter-processor interrupts) in Sapphire Rapids, and maybe also Alder Lake. (See "Why didn't x86 implement direct core-to-core messaging assembly/cpu instructions?" for details on UIPI.)
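As a rough sketch of how cldemote might be used for this, core A could store to the flag's cache line and then hint that the line should move out of its private caches toward a shared level, so that B's later load may hit closer than A's L1d. The names Mailbox, demote_line, publish_seq and peek_seq are made up for illustration. The sketch assumes the _cldemote intrinsic from immintrin.h, which compilers only expose when building with -mcldemote; without that flag no hint is issued. The instruction is only a hint, so any benefit would have to be measured.

```cpp
#include <atomic>
#include <cstdint>
#if defined(__CLDEMOTE__)
#include <immintrin.h>  // _cldemote; assumed available when built with -mcldemote
#endif

// Hypothetical mailbox: one 64-byte-aligned line holding a sequence number.
struct alignas(64) Mailbox {
    std::atomic<std::uint64_t> seq{0};
};

// Hint that the line containing p should be demoted from this core's
// private caches to a more distant (shared) level. Hardware may ignore it.
inline void demote_line(void* p) {
#if defined(__CLDEMOTE__)
    _cldemote(p);
#else
    (void)p;  // built without -mcldemote: no hint is issued
#endif
}

// Core A: publish a new value with a release store, then demote the line.
inline void publish_seq(Mailbox& m, std::uint64_t value) {
    m.seq.store(value, std::memory_order_release);
    demote_line(&m);
}

// Core B: a single acquire load per check.
inline std::uint64_t peek_seq(const Mailbox& m) {
    return m.seq.load(std::memory_order_acquire);
}
```

The store and the hint are deliberately separate steps: correctness depends only on the release/acquire pair, and the demotion is a pure performance hint.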

Without any of those features, the choice between occasional polling (or monitor/mwait if the other core has nothing to do) vs. interrupt depends on how many times you expect to poll before you want to send a notification. (And how much potential throughput you'll lose due to any knock-on effects from the other thread not noticing the flag update soon, e.g. if that means larger buffers leading to more cache misses.)
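To make the "occasional polling" option concrete, here is a minimal sketch, not code from the original answer. The names EventFlag, signal_event and work_until_signaled are hypothetical, and the 64-byte alignment assumes the usual x86 cache-line size. B checks the flag once per batch of its own work, so while nothing has changed the check is a load that normally hits in B's L1d, and the cache miss is paid once, after A's store invalidates the line.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical flag type. It gets its own 64-byte line so that unrelated
// stores nearby do not cause extra coherence misses on the polling core.
struct alignas(64) EventFlag {
    std::atomic<bool> ready{false};
};

// Core A: publish the event with a single release store.
inline void signal_event(EventFlag& f) {
    f.ready.store(true, std::memory_order_release);
}

// Core B: poll once, then do `batch` units of its own work, and repeat.
// Returns the number of polls performed, including the one that saw the flag.
template <class Work>
std::uint64_t work_until_signaled(EventFlag& f, unsigned batch, Work&& do_work) {
    std::uint64_t polls = 0;
    for (;;) {
        ++polls;
        if (f.ready.load(std::memory_order_acquire)) {
            return polls;
        }
        for (unsigned i = 0; i < batch; ++i) {
            do_work();
        }
    }
}
```

In a real program signal_event would run on A's thread and work_until_signaled on B's. A larger batch means fewer polls but a longer delay before B notices, which is the trade-off described above.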

In user-space, other than shared memory or UIPI, the alternatives are OS-delivered inter-process communication mechanisms such as a signal, a pipe write, or eventfd; IIRC the Linux UIPI benchmarks compared it against various such mechanisms for latency and throughput.
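For comparison, an OS-delivered notification through eventfd could look roughly like the sketch below. It is Linux-specific, and the helper names notify_events and wait_for_events are made up for illustration. Each call is a system call, which is the overhead the shared-memory approaches try to avoid.

```cpp
#include <sys/eventfd.h>
#include <unistd.h>
#include <cstdint>

// Sender: add n to the eventfd counter, which wakes a reader blocked on it.
inline bool notify_events(int efd, std::uint64_t n = 1) {
    return write(efd, &n, sizeof n) == static_cast<ssize_t>(sizeof n);
}

// Receiver: a blocking read returns the accumulated count and resets the
// counter to zero (default, non-semaphore mode). Returns 0 on error.
inline std::uint64_t wait_for_events(int efd) {
    std::uint64_t count = 0;
    if (read(efd, &count, sizeof count) != static_cast<ssize_t>(sizeof count)) {
        return 0;
    }
    return count;
}
```

A descriptor created with eventfd(0, 0) can be shared between threads, or between processes via fork or descriptor passing. Several notifications that arrive before the receiver reads are coalesced into one count.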


AMD CPUs don't broadcast stores; that would swamp the interconnect with traffic and defeat the benefit of private L1d cache for lines that get written repeatedly between accesses from other cores (even if the broadcast were skipped for lines that weren't recently shared).

huangapple
  • Posted on 2023-02-18 12:23:48
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75491180.html