Question

When should we use `CacheLinePad` to avoid false sharing?
It's well known that padding a struct so that it occupies one or more cache lines exclusively can be good for performance.
But in what situations should we add padding like the following to improve performance?
Are there any rules of thumb here?
import "golang.org/x/sys/cpu"

var S struct {
	_ cpu.CacheLinePad
	A string
	_ cpu.CacheLinePad
}
Answer 1
Score: 4
I've never really liked the term "false sharing". I think it would be better called "inappropriate sharing" or "oversharing". 😀
> But in what situations should we add padding like the following to improve performance? Are there any rules of thumb here?
The rule is: measure first (benchmark). Then, if a lot of time is being spent somewhere, figure out why.
"False sharing" causes performance problems if and when the underlying software and hardware you're using insists on moving data around in slow ways, even though there are faster ways available. By contorting your own code, you can convince the software and/or hardware to use the faster ways.
Doing so often makes your own code less readable, or take more space, or has some other similar drawback. Be sure that the cost of this drawback is exceeded by the value of the increased speed. If your software runs at the same speed or slower, the cost of damaging your code for speed was not paid-for, so don't have done it.<sup>1</sup>
The usual case of "false sharing"—which is why I dislike the term—occurs when some data in some data structure could be shared well, used by multiple CPUs in multiple caches, except that some particular data-item write (store operation) happens such that one CPU invalidates all the other CPUs' caches, so that all the other CPUs must go back to main memory or re-copy the data from the writing CPU. The "insert padding" trick you describe helps if and when the writing CPU no longer affects the other CPUs' use of adjacent data items because those items, although adjacent in logical terms (e.g., in successive elements of an array or slice), no longer occupy a single cache line that becomes invalidated by the write.
Suppose, for instance, that we have a data structure in which there are three (or perhaps seven) eight-byte fields that every CPU in a many-CPU machine will read, and a final eight-byte field that one of those CPUs (but only one) might update. Suppose further that the cache line size on this machine is 32 (or perhaps 64) bytes, and that the CPUs themselves use something like the MESI or MOESI cache model. In this case, the one CPU that writes to the one eight-byte field immediately invalidates any shared copies that exist in all the other CPUs' caches.
If, however, that particular eight-byte field, that will be written by one CPU, is in its own cache line, or at least not in the shared cache line—e.g., is in a separate array—then the writing CPU does not invalidate any shared copies; these stay in the S (shared) state in all the CPUs.
If a compiler can move the read-only and read/write fields of some data structure(s) around, so that the shareable parts that will benefit, time-wise, from being shared, stay shareable, you will not need to tweak your own code. Go, like C and C++, puts some constraints on compilers that may prevent them from doing their own optimizations here, which means you might have to do it yourself.
But always measure first!
<sup>1</sup>This is similar to the rule for making money in the stock market: buy a stock if it's going to go up. If it did not go up, don't have bought it. But at least the computer version is actually achievable, since you can run both your original version, and your price-paid contorted version, and see if the price you paid was worth the gain.
Comments