英文:
Why struct with padding fields works faster
问题
我刚刚发现了这个库,它提供了一个无锁环形队列,比通道的速度要快得多:https://github.com/textnode/gringo(尤其是在 GOMAXPROCS > 1 的情况下)。
但有趣的部分是用于管理队列状态的结构体:
type Gringo struct {
padding1 [8]uint64
lastCommittedIndex uint64
padding2 [8]uint64
nextFreeIndex uint64
padding3 [8]uint64
readerIndex uint64
padding4 [8]uint64
contents [queueSize]Payload
padding5 [8]uint64
}
如果我移除 "paddingX [8]uint64" 字段,它的速度会慢大约 20%。这是为什么呢?
同时,如果有人能解释一下为什么这个无锁算法比通道快得多,即使是带缓冲的通道,我会很感激。
英文:
I just found this library, that provides lock-free ring, that works way faster then channels: https://github.com/textnode/gringo (and it works really faster especially with GOMAXPROCS > 1 )
But interesting part is struct for managing queue state:
type Gringo struct {
padding1 [8]uint64
lastCommittedIndex uint64
padding2 [8]uint64
nextFreeIndex uint64
padding3 [8]uint64
readerIndex uint64
padding4 [8]uint64
contents [queueSize]Payload
padding5 [8]uint64
}
If i remove "paddingX [8]uint64" fields it works about 20% slower. How it can be?
Also appreciate if someone explained why this lock-free algorithm much faster then channels, even buffered?
答案1
得分: 13
填充通过将每个结构放置在自己的缓存行上来消除伪共享。如果两个变量共享一个缓存行,那么如果在另一个变量上进行了介入性写操作,对未修改变量的读取将与对修改变量的读取一样昂贵。
当一个变量在多个核心上被读取但未修改时,缓存行被核心共享。这使得读取操作非常廉价。在任何核心可以写入该缓存行的任何部分之前,它必须使其他核心上的缓存行失效。如果任何核心稍后从该缓存行读取,它将发现缓存行已失效,并且必须重新共享它。当一个变量频繁修改而另一个变量频繁读取时,这会导致痛苦的额外缓存一致性流量。
英文:
Padding eliminates false sharing by putting each structure on its own cache line. If two variables share a cache line, a read of an unmodified variable will be as expensive as a read of a modified variable if there's an intervening write to the other variable.
When a variable is read on multiple cores and not modified, the cache line is shared by the cores. This makes the reads very cheap. Before any core can write to any part of that cache line, it must invalidate the cache line on other cores. If any core later reads from that cache line, it will find the cache line invalidated and have to go back to sharing it. This makes painful extra cache coherency traffic when one variable is frequently modified and the other is frequently read.
答案2
得分: 4
它的工作速度更快,因为它不需要锁。这个是一个在Java中的实现(称为Disruptor),它工作得非常好,似乎是gringo的灵感来源。他们在这里解释了锁的成本以及如何提高吞吐量。
至于填充,该论文也暗示了一些原因。基本上是:处理器缓存。这篇论文对此进行了很好的解释。通过尽可能让处理器访问其一级缓存而不是经常访问内存或外部缓存,您可以获得巨大的性能提升。但这需要采取额外的预防措施,因为处理器将完全加载其缓存,并在每次需要时重新加载它(从内存或二级-三级缓存)。
在并发数据结构的情况下,正如@David Schwartz所说,伪共享将迫使处理器更频繁地重新加载其缓存,因为某些数据可能加载在内存行的其余部分,被修改,并迫使整个缓存重新加载。
英文:
It works faster because it does not require locks. This is an implementation in Java (called Disruptor) which works really well, and seems to be the inspiration for gringo. They explain the cost of locks and how you can increase throughput here.
As for the padding, the paper also hints at some of the reasons. Basically: processor caches. This paper explains it well. You can gain tremendous performance gain by making the processor access its Level 1 cache instead of going through memory or its outer caches as often as possible. But this requires to take extra precautions as the processor will fully load its cache, and reload it (from memory or level 2-3 caches) every time it is required.
In the case of concurrent data structure, as @David Schwartz said, the false sharing will force the processor to reload its cache much more often, as some data might be loaded in the rest of the memory line, be modified, and force the whole cache to be loaded again.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论