What are some obvious optimizations for a virtual machine implementing a functional language?
Question
I'm working on an intermediate language and a virtual machine to run a functional language with a couple of "problematic" properties:
- Lexical namespaces (closures)
- Dynamically growing call stack
- A slow integer type (bignums)
The intermediate language is stack based, with a simple hash-table for the current namespace. Just so you get an idea of what it looks like, here's the McCarthy91 function:
# McCarthy 91: M(n) = n - 10 if n > 100 else M(M(n + 11))
.sub M
args
sto n
rcl n
float 100
gt
.if
.sub
rcl n
float 10
sub
.end
.sub
rcl n
float 11
add
list 1
rcl M
call-fast
list 1
rcl M
tail
.end
call-fast
.end
The "big loop" is straightforward:
- fetch an instruction
- increment the instruction pointer (or program counter)
- evaluate the instruction
Along with sto, rcl and a whole lot more, there are three instructions for function calls:
- call copies the namespace (deep copy) and pushes the instruction pointer onto the call stack
- call-fast is the same, but only creates a shallow copy
- tail is basically a 'goto'
The implementation is really straightforward. To give you a better idea, here's just a random snippet from the middle of the "big loop" (updated, see below):
} else if inst == 2 /* STO */ {
    // store the top of stack into the local variable slot
    local[data] = stack[len(stack) - 1]
    // if the next instruction is RCL, leave the value on the stack
    // and skip that instruction instead of popping and re-pushing
    if code[ip + 1][0] != 3 {
        stack = stack[:len(stack) - 1]
    } else {
        ip++
    }
} else if inst == 3 /* RCL */ {
    // push a local variable onto the stack
    stack = append(stack, local[data])
} else if inst == 12 /* .END */ {
    // leave the block: drop its scope and return to the saved address
    outer = outer[:len(outer) - 1]
    ip = calls[len(calls) - 1]
    calls = calls[:len(calls) - 1]
} else if inst == 20 /* CALL */ {
    // push the return address and enter the function with a copy of the scope
    calls = append(calls, ip)
    cp := make(Local, len(local))
    copy(cp, local)
    outer = append(outer, &cp)
    x := stack[len(stack) - 1]
    stack = stack[:len(stack) - 1]
    ip = x.(int)
} else if inst == 21 /* TAIL */ {
    // tail call: jump to the function without pushing a return address
    x := stack[len(stack) - 1]
    stack = stack[:len(stack) - 1]
    ip = x.(int)
}
The problem is this: Calling McCarthy91 16 times with a value of -10000 takes, near as makes no difference, 3 seconds (after optimizing away the deep-copy, which adds nearly a second).
My question is: What are some common techniques for optimizing interpretation of this kind of language? Is there any low-hanging fruit?
I used slices for my lists (arguments, the various stacks, a slice of maps for the namespaces, ...), so I do this sort of thing all over the place: call_stack[:len(call_stack) - 1]. Right now, I really don't have a clue what pieces of code make this program slow. Any tips will be appreciated, though I'm primarily looking for general optimization strategies.
Aside:
I can reduce execution time quite a bit by circumventing my calling conventions. The list <n> instruction fetches n arguments off the stack and pushes a list of them back onto the stack, and the args instruction pops off that list and pushes each item back onto the stack. This is firstly to check that functions are called with the correct number of arguments, and secondly to be able to call functions with variable argument lists (i.e. (defun f x:xs)). Removing that, and also adding an instruction sto* <x> which replaces sto <x>; rcl <x>, I can get it down to 2 seconds. Still not brilliant, and I have to have this list/args business anyway.
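For illustration, the fused opcode amounts to one more arm in the big loop; a minimal sketch, with 22 as a made-up opcode number:

} else if inst == 22 /* STO*, hypothetical opcode number */ {
    // store the top of stack into the local slot, but leave the value on
    // the stack instead of popping it, saving one dispatch per sto/rcl pair
    local[data] = stack[len(stack) - 1]
}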
Another aside (this is a long question I know, sorry):
Profiling the program with pprof told me very little (I'm new to Go in case that's not obvious) :-). These are the top 3 items as reported by pprof:
16 6.1% 6.1% 16 6.1% sweep pkg/runtime/mgc0.c:745
9 3.4% 9.5% 9 3.4% fmt.(*fmt).fmt_qc pkg/fmt/format.go:323
4 1.5% 13.0% 4 1.5% fmt.(*fmt).integer pkg/fmt/format.go:248
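In case it's useful, here's roughly how the profile is captured; a minimal sketch, with runVM standing in for the interpreter's actual entry point:

package main

import (
    "os"
    "runtime/pprof"
)

func main() {
    f, err := os.Create("cpu.prof")
    if err != nil {
        panic(err)
    }
    // sample the CPU for the whole run; inspect afterwards with `go tool pprof`
    if err := pprof.StartCPUProfile(f); err != nil {
        panic(err)
    }
    defer pprof.StopCPUProfile()

    runVM() // hypothetical entry point of the interpreter
}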
These are the changes I've made so far:
- I removed the hash table. Instead I'm now passing just pointers to arrays, and I only efficiently copy the local scope when I have to.
- I replaced the instruction names with integer opcodes. Before, I wasted quite a bit of time comparing strings.
- The call-fast instruction is gone (the speedup wasn't measurable anymore after the other changes).
- Instead of having "int", "float" and "str" instructions, I just have one eval, and I evaluate the constants at compile time (compilation of the bytecode, that is). Then eval just pushes a reference to them.
- After changing the semantics of .if, I could get rid of these pseudo-functions. It's now .if, .else and .endif, with implicit gotos and block semantics similar to .sub.
After implementing the lexer, parser, and bytecode compiler, the speed went down a little bit, but not terribly so. Calculating MC(-10000) 16 times makes it evaluate 4.2 million bytecode instructions in 1.2 seconds. Here's a sample of the code it generates.
Answer 1 (Score: 9)
There are decades of research on things you can optimize:
- Implementing functional languages: a tutorial, Simon Peyton Jones and David Lester. Published by Prentice Hall, 1992.
- Practical Foundations for Programming Languages, Robert Harper
Answer 2 (Score: 5)
You should have efficient algorithmic representations for the various concepts of your interpreter. Doing deep copies on a hashtable looks like a terrible idea, but I see that you have already removed that.
(Yes, your stack-popping operations using array slices look suspect. You should make sure they really have the expected algorithmic complexity, or else use a dedicated data structure (... a stack). I'm generally wary of using all-purpose data structures such as Python lists or PHP hashtables for this, because they are not necessarily designed to handle this particular use case well, but it may be that slices do guarantee O(1) push and pop cost under all circumstances.)
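To make that concrete, a dedicated stack type is a thin wrapper over a slice; a minimal sketch, assuming the interface{}-valued stacks from the snippets in the question:

// Stack wraps a slice so that pushes and pops live in one place.
type Stack []interface{}

// Push appends a value; append gives amortized O(1) cost.
func (s *Stack) Push(v interface{}) {
    *s = append(*s, v)
}

// Pop removes and returns the top value; re-slicing is O(1).
func (s *Stack) Pop() interface{} {
    old := *s
    v := old[len(old)-1]
    old[len(old)-1] = nil // clear the slot so the GC can reclaim the value
    *s = old[:len(old)-1]
    return v
}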
The best way to handle environments, as long as they don't need to be reified, is to use numeric indices instead of variables: de Bruijn indices (0 for the variable bound last) or de Bruijn levels (0 for the variable bound first). This way you need only keep a dynamically resized array for the environment, and accessing it is very fast. If you have first-class closures you will also need to capture the environment, which will be more costly: you have to copy part of it into a dedicated structure, or use a non-mutable structure for the whole environment. Only experiment will tell, but my experience is that going for a fast mutable environment structure and paying a higher cost for closure construction is better than having an immutable structure with more bookkeeping all the time; of course you should do a usage analysis to capture only the necessary variables in your closures.
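A minimal sketch of what index-based environments could look like in Go; the frame layout and names here are illustrative, not taken from the question's VM:

// Env is one lexical frame: a flat slot array plus a link to the
// enclosing frame. Variables are compiled to (depth, index) pairs.
type Env struct {
    slots  []interface{} // values of this frame's variables, by slot number
    parent *Env          // enclosing lexical frame; nil at top level
}

// Lookup walks depth frames outward, then reads the slot directly:
// a few pointer hops and an array index instead of a hash lookup.
func (e *Env) Lookup(depth, index int) interface{} {
    for ; depth > 0; depth-- {
        e = e.parent
    }
    return e.slots[index]
}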
Finally, once you have rooted out the inefficiency sources related to your algorithmic choices, the hot areas will be:

- Garbage collection (definitely a hard topic; if you don't want to become a GC expert, you should seriously consider reusing an existing runtime). You may be using the GC of your host language (heap allocations in your interpreted language are translated into heap allocations in your implementation language, with the same lifetime); it's not clear from the code snippet you've shown. This strategy is excellent for getting something reasonably efficient in a simple way.

- Numerics implementation. There are all kinds of hacks for being efficient when the integers you manipulate are in fact small. Your best bet is to reuse the work of people that have invested tons of effort on this, so I strongly recommend reusing, for example, the GMP library. Then again, you may also reuse your host language's support for bignums if it has some; in your case, Go's math/big package. See the first sketch after this list.

- The low-level design of your interpreter loop. In a language with "simple bytecode" such as yours (each bytecode instruction is translated into a small number of native instructions, as opposed to complex bytecodes with high-level semantics such as the Parrot bytecode), the actual "looping and dispatching on bytecodes" code can be a bottleneck. There has been quite a lot of research on the best way to write such bytecode dispatch loops, to avoid the cascade of if/then/else (jump tables), benefit from the host processor's branch prediction, simplify the control flow, and so on. This is called threaded code, and there are a lot of (rather simple) different techniques: direct threading, indirect threading... If you want to look into some of the research, there is for example work by Anton Ertl, The Structure and Performance of Efficient Interpreters (2003), and later Context threading: A flexible and efficient dispatch technique for virtual machine interpreters. The benefits of those techniques tend to be fairly processor-sensitive, so you should test the various possibilities yourself; see the second sketch after this list.
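For the numerics point, the usual trick is a tagged representation with a machine-integer fast path; a minimal sketch on top of math/big, where the Num type and its overflow check are illustrative assumptions, not the question's implementation:

import "math/big"

// Num is a number that stays a machine integer until it overflows.
type Num struct {
    small int64    // valid while big == nil
    big   *big.Int // set once the value no longer fits in an int64
}

func (n Num) toBig() *big.Int {
    if n.big != nil {
        return n.big
    }
    return big.NewInt(n.small)
}

// Add takes the cheap int64 path when possible, falling back to big.Int.
func Add(a, b Num) Num {
    if a.big == nil && b.big == nil {
        sum := a.small + b.small
        // signed overflow: the operands share a sign but the sum flipped it
        if (a.small >= 0) == (b.small >= 0) && (sum >= 0) != (a.small >= 0) {
            return Num{big: new(big.Int).Add(big.NewInt(a.small), big.NewInt(b.small))}
        }
        return Num{small: sum}
    }
    return Num{big: new(big.Int).Add(a.toBig(), b.toBig())}
}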
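And for the dispatch point: Go has no computed goto, so the closest cheap approximations are a switch (which the compiler can compile to a jump table) or an opcode-indexed table of handler functions. A minimal sketch of the latter, with the Instr and VM shapes invented for the example:

// Instr is one decoded instruction: an opcode plus its operand.
type Instr struct {
    Op   int
    Data interface{}
}

// VM holds the interpreter state touched by every handler.
type VM struct {
    ip     int
    stack  []interface{}
    locals []interface{}
}

// handlers maps opcodes to their implementations: dispatch becomes one
// indexed load and an indirect call instead of an if/else cascade.
var handlers [256]func(vm *VM, data interface{})

func init() {
    handlers[3] = func(vm *VM, data interface{}) { // RCL: push a local
        vm.stack = append(vm.stack, vm.locals[data.(int)])
    }
    // ... one entry per opcode
}

func (vm *VM) run(code []Instr) {
    for vm.ip < len(code) {
        in := code[vm.ip]
        vm.ip++
        handlers[in.Op](vm, in.Data)
    }
}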
While the STG work is interesting (and Peyton-Jones's book on programming language implementation is excellent), it is somewhat oriented towards lazy evaluation. Regarding the design of efficient bytecode for strict functional languages, my reference is Xavier Leroy's 1990 work on the ZINC machine: The ZINC experiment: An Economical Implementation of the ML Language. It was ground-breaking work for the implementation of ML languages, and it is still in use in the implementation of the OCaml language: there are both a bytecode compiler and a native compiler, but the bytecode still uses a glorified ZINC machine.