How do segmented stacks work

huangapple go评论124阅读模式
英文:

How do segmented stacks work

问题

分段栈是如何工作的?这个问题也适用于Boost.Coroutine,所以我在这里也使用了C++标签。主要的疑问来自于这篇文章。看起来它们所做的是在栈的底部保留一些空间,并通过在那里分配的内存上注册某种信号处理程序来检查是否已经被破坏(可能是通过mmapmprotect实现的)。然后,当它们检测到空间不足时,它们通过分配更多的内存并从那里继续执行来继续执行。

关于此事有3个问题:

  1. 这个构造不是用户空间的东西吗?他们如何控制新栈的分配位置,程序编译成的指令如何知道这一点?

    push指令基本上只是将一个值添加到堆栈指针上,并将该值存储在堆栈上的寄存器中,那么push指令如何知道新栈的起始位置,相应地,pop指令如何知道何时将堆栈指针移回旧栈?

  2. 他们还说

    在我们获得一个新的栈段之后,我们通过重新尝试导致我们耗尽栈的函数来重新启动goroutine

    这是什么意思?他们重新启动整个goroutine吗?这不会导致非确定性行为吗?

  3. 他们如何检测程序是否超出了栈的范围?如果他们在底部保留了一个类似canary的内存区域,那么当用户程序创建一个足够大的数组以溢出它时会发生什么?这不会导致栈溢出和潜在的安全漏洞吗?

如果Go和Boost的实现方式不同,我很乐意了解它们如何处理这种情况。1: https://blog.cloudflare.com/how-stacks-are-handled-in-go/

英文:

How do segmented stacks work? This question also applies to Boost.Coroutine so I am using the C++ tag as well here. The main doubt comes from this article It looks like what they do is keep some space at the bottom of the stack and check if it has gotten corrupted by registering some sort of signal handler with the memory allocated there (perhaps via mmap and mprotect?) And then when they detect that they have run out of space they continue by allocating more memory and then continuing from there. 3 questions about this

  1. Isn't this construct a user space thing? How do they control where the new stack is allocated and how do the instructions the program is compiled down to get aware of that?

A push instruction is basically just adding a value to the stack pointer and then storing the value in a register on the stack, then how can the push instruction be aware of where the new stack starts and correspondingly how can the pop know when it has to move the stack pointer back to the old stack?

  1. They also say
    > After we've got a new stack segment, we restart the goroutine by retrying the function that caused us to run out of stack

what does this mean? Do they restart the entire goroutine? Won't this possibly cause non deterministic behavior?

  1. How do they detect that the program has overrun the stack? If they keep a canary-ish memory area at the bottom then what happens when the user program creates an array big enough that overflows that? Will that not cause a stack overflow and is a potential security vulnerability?

If the implementations are different for Go and Boost I would be happy to know how either of them deal with this situation 🙂

答案1

得分: 11

我将为您提供一种可能的实现的简要概述。

首先,假设大多数堆栈帧的大小都小于某个值。对于较大的堆栈帧,我们可以在入口处使用更长的指令序列,以确保有足够的堆栈空间。假设我们使用的架构具有4k页面,并且我们选择4k-1作为快速路径处理的最大堆栈帧大小。

堆栈在底部分配了一个单独的守护页。也就是说,一个未映射为可写的页面。在函数入口处,堆栈指针会减小堆栈帧大小,该大小小于一个页面的大小,然后程序会安排在新分配的堆栈帧的最低地址处写入一个值。如果到达堆栈的末尾,这个写操作将导致处理器异常,并最终转化为操作系统对用户程序的某种上调用,例如在UNIX系列操作系统中的信号。

信号处理程序(或等效程序)必须能够根据故障指令的地址和写入地址确定这是一个堆栈扩展故障。这可以通过检查指令是否位于函数的前导部分以及被写入的地址是否位于当前线程的堆栈的守护页中来确定。可以通过要求函数开头的一系列非常特定的指令模式或可能通过维护有关函数的元数据(可能使用回溯表)来识别指令是否位于前导部分。

此时,处理程序可以分配一个新的堆栈块,将堆栈指针设置为块的顶部,执行某些操作以处理解链堆栈块,然后再次调用发生故障的函数。这第二次调用是安全的,因为故障发生在编译器生成的函数前导部分,且在验证是否有足够的堆栈空间之前不允许产生任何副作用。(对于自动将返回地址推送到堆栈上的架构,代码还可能需要修复返回地址。如果返回地址在寄存器中,只需在进行第二次调用时保持在相同的寄存器中。)

处理解链的最简单方法可能是在新的扩展块上推送一个小的堆栈帧,该堆栈帧在返回时解链新的堆栈块并释放分配的内存。然后,它将处理器寄存器返回到调用导致需要扩展堆栈的状态。

这种设计的优点是函数入口序列非常简短且在非扩展情况下非常快速。缺点是在需要扩展堆栈的情况下,处理器会产生异常,这可能比函数调用的代价要高得多。

如果我理解正确,Go实际上并不使用守护页。相反,函数前导显式检查堆栈限制,如果新的堆栈帧不适合,则调用一个函数来扩展堆栈。

Go 1.3更改了其设计,不再使用堆栈块的链表。这是为了避免在某种调用模式中多次交叉越过扩展边界时的陷阱成本。它们从一个小堆栈开始,并使用类似的机制来检测扩展的需要。但是,当发生堆栈扩展故障时,整个堆栈将移动到一个更大的块中。这完全消除了解链的需要。

这里省略了许多细节。(例如,可能无法在信号处理程序本身中进行堆栈扩展。相反,处理程序可以安排将线程挂起并将其交给管理线程。处理信号可能还需要使用专用信号堆栈。)

这种情况下的另一种常见模式是运行时要求当前堆栈帧下方有一定数量的有效堆栈空间,用于信号处理程序或调用运行时中的特殊例程。Go就是这样工作的,堆栈限制测试保证了当前帧下方有一定数量的固定堆栈空间可用。可以在堆栈上调用普通的C函数,只要确保它们不会消耗超过固定堆栈保留量的空间。(理论上可以使用这种方法调用C库例程,尽管大多数这些例程没有正式规定它们可能使用多少堆栈。)

在堆栈帧中进行动态分配,例如alloca或堆栈分配的可变长度数组,会给实现增加一些复杂性。如果程序可以在前导部分计算出整个帧的最终大小,那么实现就相对简单。在程序运行时,帧大小的增加可能需要建模为新的调用,尽管使用Go的新架构允许在那里进行堆栈移动,因此可以将程序中的alloca点设置为所有状态都允许堆栈移动。

英文:

I'll give you a quick sketch of one possible implementation.

First, assume most stack frames are smaller than some size. For ones that are larger, we can use a longer instruction sequence at entry to make sure there is enough stack space. Let's assume we're on an architecture that that has 4k pages and we're choosing 4k - 1 as the maximum size stack frame handled by the fast path.

The stack is allocated with a single guard page at the bottom. That is, a page that is not mapped for write. At function entry, the stack pointer is decremented by the stack frame size, which is less than the size of a page, and then the program arranges to write a value at the lowest address in the newly allocated stack frame. If the end of the stack has been reached, this write will cause a processor exception and ultimately be turned into some sort of upcall from the OS to the user program -- e.g. a signal in UNIX family OSes.

The signal handler (or equivalent) has to be able to determine this is a stack extension fault from the address of the instruction that faulted and the address it was writing to. This is determinable as the instruction is in the prolog of a function and the address being written to is in the guard page of the stack for the current thread. The instruction being in the prolog can be recognized by requiring a very specific pattern of instructions at the start of functions, or possibly by maintaining metadata about functions. (Possibly using traceback tables.)

At this point the handler can allocate a new stack block, set the stack pointer to the top of the block, do something to handle unchaining the stack block, and then call the function that faulted again. This second call is safe because the fault is in the function prolog the compiler generated and no side effects are allowed before validating there is enough stack space. (The code may also need to fixup the return address for architectures that push it onto the stack automatically. If the return address is in a register, it just needs to be in the same register when the second call is made.)

Likely the easiest way to handle unchaining is to push a small stack frame onto the new extension block for a routine that when returned to unchains the new stack block and frees the allocated memory. It then returns the processor registers to the state they were in when the call was made that caused the stack to need to be extended.

The advantage of this design is that the function entry sequence is very few instructions and is very fast in the non-extending case. The disadvantage is that in the case where the stack does need to be extended, the processor incurs an exception, which may cost much much more than a function call.

Go doesn't actually use a guard page if I understand correctly. Rather the function prolog explicitly checks the stack limit and if the new stack frame won't fit it calls a function to extend the stack.

Go 1.3 changed its design to not use a linked list of stack blocks. This is to avoid the trap cost if the extension boundary is crossed in both directions many times in a certain calling pattern. They start with a small stack, and use a similar mechanism to detect the need for extension. But when a stack extension fault does occur, the entire stack is moved to a larger block. This removes the need for unchaining entirely.

There are quite a few details glossed over here. (E.g. one may not be able to do the stack extension in the signal handler itself. Rather the handler can arrange to have the thread suspended and hand it off to a manager thread. One likely has to use a dedicated signal stack to handle the signal as well.)

Another common pattern with this sort of thing is the runtime requiring there to be a certain amount of valid stack space below the current stack frame for either something like a signal handler or for calling special routines in the runtime. Go works this way and the stack limit test guarantees a certain fixed amount of stack space is available below the current frame. One can e.g. call plain C functions on the stack so long as one guarantees they do not consume more than the fixed stack reserve amount. (One can use this to call C library routines in theory, though most of these have no formal specification of how much stack they might use.)

Dynamic allocation in the stack frame, such as alloca or stack allocated variable length arrays, add some complexity to the implementation. If the routine can compute the entire final size of the frame in the prolog then it is fairly straightforward. Any increase in the frame size while the routine is running likely has to be modeled as a new call, though with Go's new architecture that allows moving the stack, it is possible the alloca point in the routine can be made such that all the state allows a stack move to happen there.

huangapple
  • 本文由 发表于 2017年9月16日 11:39:32
  • 转载请务必保留本文链接:https://go.coder-hub.com/46249794.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定