Given the choice, should I store my static buffer inline or on the heap?
Question
I have written a circular buffer with static capacity, so it is backed by a fixed-size array. It is owned by a singleton class allocated on the heap. My question is whether the backing array in the circular buffer should be stored inline or allocated on the heap. The difference would be:
struct CircularBuffer
{
    std::array<uint64_t, N> buffer;
};
struct CircularBuffer
{
    uint64_t* buffer;
};
To be clear, allocation will happen once at startup, and everything will be on the heap, as the enclosing object will be heap-allocated. Also, each portion of the buffer will be accessed an equal number of times due to the nature of the data structure and its usage.
Theoretically, what would you consider in making this decision, if your goal is performance?
Of course I will make my final decision based on benchmarks. But as a first pass, the parameters I am considering are:
- N: Size of buffer
- R: What percentage of calls to the enclosing object result in reads/writes to the buffer
In practice, N = 1 MB, R = 100% (every call results in both a read and a write to the buffer), and I am running single-threaded on a high-end CPU.
Are there other parameters you would consider?
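For concreteness, below is a minimal sketch of the inline-array option; the member and function names (`exchange`, `head`) and the single-index wrap-around are illustrative placeholders, not the actual implementation in question.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

template <std::size_t N>
class CircularBuffer
{
public:
    // Each call reads the slot it is about to overwrite and then writes the new
    // value, matching the "every call results in both a read and a write" usage.
    std::uint64_t exchange(std::uint64_t value)
    {
        std::uint64_t old = buffer[head];
        buffer[head] = value;
        head = (head + 1) % N;
        return old;
    }

private:
    std::array<std::uint64_t, N> buffer{};  // storage lives inline in the object
    std::size_t head = 0;
};
```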
Answer 1
Score: 1
For pure performance reasons, there's no real theoretical rationale for which one will run faster, so you should run benchmarks on the specific architecture you target, as you said. In the end your buffer will be paged into the same RAM either way, or even into the same cache if your CPU has a lot of cache.
(See this question: Is accessing data in the heap faster than from the stack?)
If you want safety, a raw pointer (or, safer, a `std::vector` that you don't resize) could lead to more bounds checks at runtime, and therefore less performance.
I don't know what your application is, but to me this looks like premature optimization (maybe it isn't, but consider that it could be), as I don't expect much difference to come out of this. And usually cleaner code is the wiser choice when the performance difference doesn't matter that much.
As far as other considerations go, eating up 1 MB of the stack can be fine, but if you plan to use more of the stack elsewhere it can become a problem, since the stack is far more limited than the heap for large allocations (typically a few MB of stack).
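As a rough sketch of the `std::vector` alternative mentioned above (sized once at startup and never resized afterwards), with the bounds-checking trade-off noted in the comments; the names are illustrative:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

class CircularBuffer
{
public:
    explicit CircularBuffer(std::size_t n) : buffer(n) {}  // one allocation, up front

    void put(std::uint64_t value)
    {
        // operator[] does no bounds checking; buffer.at(head) would add a runtime
        // check (and throw on out-of-range) - the safety/performance trade-off
        // mentioned above.
        buffer[head] = value;
        head = (head + 1) % buffer.size();
    }

private:
    std::vector<std::uint64_t> buffer;  // sized once, never resized
    std::size_t head = 0;
};
```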
Answer 2
Score: 1
Making it part of the singleton means no extra indirection to access it, vs. if there was a pointer member in the singleton pointing to a separate heap allocation.
Since you have a heap-allocated singleton (instead of a `static` or global variable), there's already a level of indirection to reach the class object itself, with a pointer to it stored somewhere.
You could consider making the array a `static` member, which you could change later if you ever want to have multiple circular buffers in the same application instead of a singleton. On x86-64 in a non-PIE Linux executable, indexing an array in static storage can be done with `[disp32 + reg]`, which can be cheaper than a 2-register addressing mode (`[reg + offsetof(CircularBuffer, arr) + reg*8]`), at least on Intel CPUs where indexed addressing modes can un-laminate micro-fused uops, especially for AVX instructions. But that doesn't apply if you build a normal PIE executable or shared library. On other ISAs, generating the base address of a static array in a register takes extra instructions vs. just using the pointer to the class object that you're already going to need in a register.
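A sketch of what the `static` member variant could look like, assuming C++17 inline static data members and a 1 MiB buffer of `uint64_t` (131072 elements); the storage is then shared by all instances, which is fine for a singleton:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

class CircularBuffer
{
public:
    void put(std::uint64_t value)
    {
        buffer[head] = value;
        head = (head + 1) % buffer.size();
    }

private:
    static constexpr std::size_t N = 1u << 17;             // 131072 * 8 bytes = 1 MiB
    static inline std::array<std::uint64_t, N> buffer{};   // static storage, shared by all instances
    std::size_t head = 0;
};
```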
Putting the array in the object gives locality if you tend to access other members of the singleton at a similar time to accessing array elements, especially if the array is small. (dTLB locality if not cache-line locality, at least for elements near the start of the array.) Put the array member last so the other member variables are close to each other.
It's also less code (compiler-generated assembly) to get it allocated; just one allocation. This benefit also applies to using `static` storage; it's already reserved in the BSS.
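And a layout sketch of the "put the array member last" suggestion, with made-up bookkeeping members:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

struct CircularBuffer
{
    // Small, frequently touched bookkeeping members first, so they stay close
    // together near the start of the object.
    std::size_t head = 0;
    std::size_t tail = 0;

    // The large payload goes last, per the suggestion above.
    std::array<std::uint64_t, 131072> buffer{};  // 131072 * 8 bytes = 1 MiB
};
```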