IMPLEMENTATION OF THE FACTOR COMPILER

Compilation of Factor is a messy business, driven by heuristics and not
formal theory. The compiler is inherently limited -- some expressions
cannot be compiled by definition. The programmer must take care to
ensure that performance-critical sections of code are written such that
they can be compiled.

=== Introduction

==== The problem

The Factor interpreter introduces a lot of overhead:

- Execution of a quotation involves iteration down a linked list.
- Stack access is not as fast as local variable access, since Java
  bounds-checks all array accesses.
- At the lowest level, everything is expressed as Java reflection calls
  to the Factor and Java platform libraries. Java reflection is not as
  fast as statically-compiled Java calls.
- Since Factor is dynamically typed, intermediate values on the stack
  are all stored as java.lang.Object references, so type checks and
  possibly coercions must be done at each step of the computation.

==== The solution

The following optimizations naturally suggest themselves, and lead to
the implementation of the Factor compiler:

- Compiling Factor code down to Java platform bytecode.
- Using virtual machine local variables instead of an array stack to
  store intermediate values.
- Statically compiling in Java calls where the class, method and
  variable names are known ahead of time.
- Type inference and soft typing to eliminate unnecessary type checks.
  (At the time of writing, this is in progress and is not documented in
  this paper.)

=== Preliminaries: interpreter internals

A word object is essentially a property list. The one property we are
concerned with here is "def", which holds a FactorWordDefinition
object. The accessor word "worddef" pushes the "def" slot of a given
word name or word object:

  0] "+" worddef .
  #

Generally, the word definition is an opaque object; however, there are
various ways to deconstruct it, which will not be covered here (see the
worddef>list word if you are interested).
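The word/property-list structure just described can be modeled in a few
lines of Python. This is purely an illustrative sketch, not the actual
implementation (which is Java); the class and slot names here mirror the
description above but are otherwise invented.

```python
# Throwaway Python model of a word object as a property list, with a
# worddef-style accessor. Illustrative only; the real interpreter is
# written in Java, and these names are inventions for the example.
class Word:
    def __init__(self, name):
        self.name = name
        self.props = {}               # a word object is a property list

def worddef(word):
    """Return the contents of the word's "def" slot."""
    return word.props["def"]

plus = Word("+")
plus.props["def"] = "<definition of + goes here>"
print(worddef(plus))                  # prints the stored definition
```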
When a word object is executed, the eval() method of its definition is
invoked. The eval() method takes one parameter, the FactorInterpreter
instance. The interpreter instance provides access to the stacks, the
global namespace, vocabularies, and so on.

(In this article, we will use the terms "word" and "word definition"
somewhat interchangeably; this should not cause any confusion. If a
"word" is mentioned where one would expect a definition, simply assume
the "def" slot of the word is being accessed.)

The class FactorWordDefinition is abstract; a number of subclasses
exist:

- FactorCompoundDefinition: a standard colon definition consisting of
  a quotation; for example,

    : sq dup * ;

  is syntax for a compound definition named "sq" with quotation
  [ dup * ]. Of course, its eval() method simply pushes the quotation
  on the interpreter's callstack.

- FactorShuffleDefinition: a stack rearrangement word, whose syntax is
  described in detail in parser.txt. For example,

    ~<< swap a b -- b a >>~

  is syntax for a shuffle definition named "swap" that exchanges the
  top two values on the data stack.

- FactorPrimitiveDefinition: primitive word definitions are written in
  Java. Various concrete subclasses of this class in the
  factor.primitives package provide implementations of eval().

When a word definition is compiled, the compiler dynamically generates
a new class, creates an instance of it, and replaces the "def" slot of
the word in question with that instance. So the compiler's primary job
is to generate appropriate Java bytecode for the eval() method.

=== Preliminaries: the specimen

Consider the following (naive) implementation of the Fibonacci
sequence:

  : fib ( n -- nth fibonacci number )
      dup 1 <= [ drop 1 ] [ pred dup fib swap pred fib + ] ifte ;

A quick overview of the words used here:

- dup: a shuffle word that duplicates the top of the stack.
- <=: compare the top two numbers on the stack.
- drop: remove the top of the stack.
- pred: decrement the top of the stack by one. Indeed, it is defined
  simply as : pred 1 - ;.
- swap: exchange the top two stack elements.
- +: add the top two stack elements.
- ifte: execute one of two given quotations, depending on the
  condition on the stack.

=== Java reflection

The biggest performance improvement comes from the transformation of
Java reflection calls into static bytecode. Indeed, when the compiler
was first written, the only kind of word it could compile was simple
expressions that interfaced with Java and nothing else.

In the above definition of "fib", the three key words are <=, - and +
(note that - is not referenced directly, but rather is a factor of the
word pred). All three of these words are implemented as Java calls
into the Factor math library:

  : <= ( a b -- boolean )
      [ "java.lang.Number" "java.lang.Number" ]
      "factor.math.FactorMath" "lessEqual" jinvoke-static ;

  : - ( a b -- a-b )
      [ "java.lang.Number" "java.lang.Number" ]
      "factor.math.FactorMath" "subtract" jinvoke-static ;

  : + ( a b -- a+b )
      [ "java.lang.Number" "java.lang.Number" ]
      "factor.math.FactorMath" "add" jinvoke-static ;

During interpretation, the execution of one of these words involves a
lot of overhead. First, the argument list is transformed into a Java
Class[] array; then the Class object corresponding to the containing
class is looked up; then the appropriate Method object defined in this
class is looked up; finally, the method is invoked by passing it an
Object[] array consisting of arguments from the stack.

As one might guess, this is horribly inefficient. Indeed, look at the
time taken to compute the 25th Fibonacci number using pure
interpretation (of course, depending on your hardware, results might
vary):

  0] [ 25 fib ] time
  24538

One quickly notices that in fact, all the overhead from the reflection
API is unnecessary; the containing class, method name and argument
types are, after all, known ahead of time.
For instance, the word "<=" might be compiled into the following
pseudo-bytecode (the details are a bit more complex in reality; we'll
get to that later):

  MOVE datastack[top - 2] to JVM stack  // get operands in right order
  CHECKCAST java/lang/Number
  MOVE datastack[top - 1] to JVM stack
  CHECKCAST java/lang/Number
  DECREMENT datastack.top 2             // pop the operands
  INVOKESTATIC                          // invoke the method
      "factor/FactorMath" "lessEqual"
      "(Ljava/lang/Number;Ljava/lang/Number;)Ljava/lang/Number;"
  MOVE JVM stack top to datastack       // push return value

Notice that no dynamic class or method lookups are done, and no arrays
are constructed; in fact, a modern Java virtual machine with a native
code compiler should be able to transform an INVOKESTATIC into a
simple subroutine call.

So how much overhead is eliminated in practice? It is easy to find
out:

  5] [ + - <= ] [ compile ] each
  1] [ 25 fib ] time
  937

This is still quite slow -- however, we have already gained a 26x
speed improvement!

Words consisting entirely of literal parameters to Java primitives
such as jinvoke, jnew, jvar-get/set, or jvar-get/set-static are
compiled in a similar manner; there is nothing new there.

=== First attempt at compiling compound definitions

Now consider the problem of compiling a word that does not directly
call Java primitives, but instead calls other words, which have
already been compiled. For instance, consider the following word
(recall that (...) is a comment!):

  : mag2 ( x y -- sqrt[x*x+y*y] )
      swap dup * swap dup * + sqrt ;

Let's assume that 'swap', 'dup', '*' and '+' are defined as before,
and that 'sqrt' is an already-compiled word that calls into the math
library. Assume that the pseudo-bytecode INVOKEWORD invokes the "eval"
method of a FactorWordDefinition instance. (In reality, it is a bit
more complex:

  GETFIELD ... some field that stores a FactorWordDefinition instance ...
  ALOAD 0  // push interpreter parameter to eval() on the stack
  INVOKEVIRTUAL "factor/FactorWordDefinition" "eval"
      "(Lfactor/FactorInterpreter;)V"

However, the above takes up more space and adds no extra information
over the INVOKEWORD notation.)

Now we have the tools necessary to try compiling "mag2" as follows:

  INVOKEWORD swap
  INVOKEWORD dup
  INVOKEWORD *
  INVOKEWORD swap
  INVOKEWORD dup
  INVOKEWORD *
  INVOKEWORD +
  INVOKEWORD sqrt

In other words, the words still shuffle values back and forth on the
interpreter data stack as before; however, instead of the interpreter
iterating down a word thread, compiled bytecode invokes words
directly.

This might seem like the obvious approach; however, it turns out it
brings very little performance benefit over simply iterating down a
linked list representing a quotation! What we would like to do is
eliminate the use of the interpreter's stack for intermediate values
altogether, loading the inputs at the beginning and storing the
outputs at the end.

=== Avoiding the interpreter stack

The JVM is a stack machine; however, its semantics are so different
that a direct mapping of interpreter stack use to stack bytecode would
not be feasible:

- No arbitrary stack access is allowed in Java; only a few fixed stack
  bytecodes like POP, DUP and SWAP are provided.
- A Java method receives its input parameters in local variables, not
  on the JVM stack.

In fact, the second point suggests that it is a better idea to use JVM
*local variables* for temporary storage in compiled definitions. Since
no indirect addressing of locals is permitted, the stack positions
used in computations must be known ahead of time. This process is
known as "stack effect deduction", and is the key concept of the
Factor compiler.

=== Fundamental idea: eval/core split

Earlier, we showed pseudo-bytecode for the word <=; however, it was
noted that the reality is a bit more complicated. Recall that
FactorWordDefinition.eval() takes an interpreter instance.
It is the responsibility of this method to marshal and unmarshal
values on the interpreter stack before and after the word performs any
computation on them.

In actual fact, compiled word definitions have a second method named
core(). Instead of accessing the interpreter data stack directly, this
method takes its inputs from formal parameters passed to the method,
in the natural stack order. So, let's look at a possible disassembly
of the eval() and core() methods of the word <=:

  void eval(FactorInterpreter interp)
    ALOAD 0                               // push interpreter instance on JVM stack
    MOVE datastack[top - 2] to JVM stack  // get operands in right order
    CHECKCAST java/lang/Number
    MOVE datastack[top - 1] to JVM stack
    CHECKCAST java/lang/Number
    DECREMENT datastack.top 2             // pop the operands
    INVOKESTATIC                          // invoke the method
        ... compiled definition class name ... "core"
        "(Lfactor/FactorInterpreter;Ljava/lang/Object;Ljava/lang/Object;)
        Ljava/lang/Object;"
    MOVE JVM stack top to datastack       // push return value

  Object core(FactorInterpreter interp, Object x, Object y)
    ALOAD 1                               // push formal parameters x and y
    ALOAD 2
    INVOKESTATIC                          // invoke the actual method
        "factor/FactorMath" "lessEqual"
        "(Ljava/lang/Number;Ljava/lang/Number;)Ljava/lang/Number;"
    ARETURN                               // pass return value up to eval()

==== Using the JVM stack and locals for intermediates

At first glance it seems nothing was achieved with the eval/core
split, except an extra layer of overhead. However, the revelation here
is that compiled word definitions can call each other's core() methods
*directly*, passing parameters through JVM local variables, without
the interpreter data stack being involved!

Instead of pseudo-bytecode, from now on we will consider a very
abstract, high-level "register transfer language". The extra verbosity
of bytecode would only distract from the key ideas.
Tentatively, we would like to compile the word 'mag2' as follows:

  r0 * r0 -> r0
  r1 * r1 -> r1
  r0 + r1 -> r0
  sqrt r0 -> r0
  return r0

However, this looks very different from the original, RPN definition;
in particular, we have named values, and the stack operations are
gone! As it turns out, there is an automatic way to transform the
stack program 'mag2' into the register transfer program above (the
reverse is also possible, but will not be discussed here).

==== Stack effect deduction

Consider the following quotation:

  [ swap dup * swap dup * + sqrt ]

The transformation of the above stack code into register code consists
of two passes. (A one-pass approach is also possible; however, because
of the design of the assembler used by the compiler, an extra pass
would be required elsewhere if the transformation described here were
single-pass.)

The first pass simply determines the total number of input and output
parameters of the quotation (its "stack effect"). We proceed as
follows:

1. Create a 'simulated' data stack. It does not contain actual values,
   but rather markers. Set the input parameter count to zero.

2. Iterate through each element of the quotation, and act as follows:

   - If the element is a literal, allocate a simulated stack entry.

   - If the element is a word, ensure that the stack has at least as
     many items as the word's input parameter count. If the stack does
     not have enough items, increment the input parameter count by the
     difference between the word's expected input parameter count and
     the stack item count, and fill the stack with that many markers.
     Decrement the stack pointer by the word's input parameter count,
     then increment it by the word's output parameter count, filling
     the new entries with markers.

3. When the end of the quotation is reached, the output parameter
   count is the number of items on the simulated stack. The input
   parameter count is the value of the counter created in step 1.
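The first pass can be sketched as a short Python model. This is
illustrative only -- the actual compiler is written in Java -- and the
`effects` table of known word stack effects is an invention for the
example:

```python
# Illustrative Python model of pass 1, stack effect deduction; not the
# actual Java implementation. Known word stack effects are listed as
# (inputs, outputs) pairs in a hypothetical `effects` table; anything
# not in the table is treated as a literal here.
effects = {
    "swap": (2, 2), "dup": (1, 2),
    "*": (2, 1), "+": (2, 1), "sqrt": (1, 1),
}

def deduce_effect(quotation):
    stack = 0    # depth of the simulated stack (markers only)
    inputs = 0   # input parameter count, initially zero
    for element in quotation:
        if element in effects:
            consumed, produced = effects[element]
            if stack < consumed:
                # Not enough items: the missing ones become inputs.
                inputs += consumed - stack
                stack = consumed
            stack = stack - consumed + produced
        else:
            stack += 1  # a literal allocates one simulated stack entry
    return inputs, stack  # outputs = items left on the simulated stack

print(deduce_effect(["swap", "dup", "*", "swap", "dup", "*", "+", "sqrt"]))
# (2, 1): two inputs, one output
```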
Note that this algorithm is recursive -- to determine the stack effect
of a word, the stack effects of all its factors must be known. For
now, assume the stack effects of words that use the Java primitives
are "trivially" known.

A brief walkthrough of the above algorithm for the quotation
[ swap dup * swap dup * + sqrt ]:

  swap - the simulated stack is empty, but swap expects two
         parameters, so the input parameter count becomes 2. Two
         markers are pushed on the simulated stack:
         # #
  dup  - requires one parameter, which is already present. Another
         marker is pushed on the simulated stack:
         # # #
  *    - requires two parameters, and returns one, so the simulated
         stack is now:
         # #
  swap - requires and returns two parameters.
         # #
  dup  - requires one, returns two parameters.
         # # #
  *    - requires two, returns one parameter.
         # #
  +    - requires two, returns one parameter.
         #
  sqrt - requires one, returns one parameter.
         #

So the input parameter count is two, and the output parameter count is
one (since at the end of the quotation, the simulated data stack
contains one marker).

==== The dataflow algorithm

The second pass of the compiler algorithm relies on the stack effect
already being known. It consists of these steps:

1. Create a new simulated stack. For each input parameter, a new entry
   is allocated. This time, entries are not blank markers, but rather
   register numbers.

2. Iterate through each element of the quotation, and act as follows:

   - If the element is a literal, allocate a simulated stack entry.
     This time, allocation finds an unused register number by checking
     each stack entry.

   - If the element is a shuffle word, apply the shuffle to the
     simulated stack *and do not emit any code!*

   - If the element is another word, pop the appropriate number of
     register numbers from the simulated stack, and emit assembly code
     for invoking the word with parameters stored in these registers.
     Decrement the simulated stack pointer by the word's input
     parameter count.
     Increment the simulated stack pointer by the word's output
     parameter count, filling the new entries with newly-allocated
     register numbers. Emit assembly code for moving the return values
     of the word into the newly allocated registers.

Voila! The 'simulated stack' is a compile-time-only notion, and the
resulting emitted code does not explicitly reference any stacks at
all. In fact, applying this algorithm to the following quotation:

  [ swap dup * swap dup * + sqrt ]

yields the following output:

  r0 * r0 -> r0
  r1 * r1 -> r1
  r0 + r1 -> r0
  sqrt r0 -> r0
  return r0

==== Multiple return values

A minor implementation detail is multiple return values. Java does not
support them directly, but a Factor word can return any number of
values. This is implemented by temporarily using the interpreter data
stack to return multiple values. This is the only time the interpreter
data stack is used.

==== The call stack

Sometimes Factor code uses the call stack as an 'extra hand' for
temporary storage:

  dup >r + r> *

The dataflow algorithm can be trivially generalized with two simulated
stacks; there is nothing more to be said about this.

=== Questioning assumptions

The dataflow compilation algorithm gives us another nice performance
improvement. However, the algorithm assumes that the stack effect of
each word is known a priori, or can be deduced using the algorithm.
The algorithm falls down when faced with the following more
complicated expressions:

- Combinators calling the 'call' and 'ifte' primitives
- Recursive words

So, ironically, this algorithm is unsuitable for code where it would
help the most -- complex code with a lot of branching, and tight loops
and recursions.

=== Eliminating explicit 'call'

As described above, the dataflow algorithm breaks when it encounters
the 'call' primitive:

  [ 2 + ] 5 swap call

The 'call' primitive executes the quotation at the top of the stack.
So its stack effect depends on its input parameter!
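To make the limitation concrete, the dataflow pass can be modeled in
Python. This is an illustrative sketch, not the actual Java
implementation; the `effects` and `shuffles` tables and the emitted
text format are inventions for the example. Note that the model has no
case for 'call', so a quotation containing it fails -- exactly the
limitation just described:

```python
# Illustrative Python model of pass 2, the dataflow algorithm; not the
# actual Java implementation. There is deliberately no case for
# 'call': it would raise a KeyError, mirroring the compiler's
# limitation described in the text.
effects = {"*": (2, 1), "+": (2, 1), "sqrt": (1, 1)}
shuffles = {
    "swap": lambda s: s[:-2] + [s[-1], s[-2]],
    "dup":  lambda s: s + [s[-1]],
}

def compile_quotation(quotation, input_count):
    stack = ["r%d" % i for i in range(input_count)]  # r0 at the bottom
    code = []

    def fresh_register():
        # Allocation finds an unused register number by checking each
        # entry currently on the simulated stack.
        n = 0
        while "r%d" % n in stack:
            n += 1
        return "r%d" % n

    for element in quotation:
        if element in shuffles:
            # Shuffle words act on the simulated stack only: no code!
            stack[:] = shuffles[element](stack)
        else:
            consumed, produced = effects[element]
            args = stack[-consumed:]
            del stack[-consumed:]
            results = []
            for _ in range(produced):
                r = fresh_register()
                stack.append(r)
                results.append(r)
            if len(args) == 2:  # mimic the document's infix notation
                code.append("%s %s %s -> %s"
                            % (args[0], element, args[1], " ".join(results)))
            else:
                code.append("%s %s -> %s"
                            % (element, " ".join(args), " ".join(results)))
    code.append("return " + " ".join(stack))
    return code

for line in compile_quotation(
        ["swap", "dup", "*", "swap", "dup", "*", "+", "sqrt"], 2):
    print(line)
```

Running this on the 'mag2' quotation reproduces the register transfer
program shown earlier, from "r0 * r0 -> r0" through "return r0".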
The first problem we faced was the compilation of Java reflection
primitives. A critical observation was that all the information needed
to compile them efficiently was 'already there' in the source. Our
intuition tells us that in the above code, the occurrence of 'call'
*always* receives the parameter [ 2 + ]; so somehow, the quotation can
be transformed into the following, which we can already compile:

  [ 2 + ] 5 swap drop 2 +
                 ^^^^^^^^ "immediate instantiation" of 'call'

Or indeed, once the unused literal [ 2 + ] is factored out, simply:

  5 2 +

==== Generalizing the 'simulated stack'

It might seem surprising that such expressions can be easily compiled
once the 'simulated stack' is generalized so that it can hold literal
values! The only change that needs to be made is that in both passes,
when a literal is encountered, it is pushed directly on the simulated
stack. Also, when the primitive 'call' is encountered, its stack
effect is assumed to be the stack effect of the literal quotation at
the top of the simulated stack. (What if the top of the simulated
stack is a register number? Then the word cannot be compiled, since
the stack effect can potentially be arbitrary!)

Being able to compile 'call' whose parameters are literals from the
same word definition doesn't really add anything new. A real
breakthrough would be compiling "combinators": words that take
parameters that are themselves quotations. As it turns out,
combinators themselves are not compiled -- however, specific
*instances* of combinators in other word definitions are. For example,
we can rewrite our word 'mag2' as follows:

  : mag2 ( x y -- sqrt[x*x+y*y] ) [ sq ] 2apply + sqrt ;

where 2apply is defined as follows:

  : 2apply ( x y [ code ] -- ) 2dup 2>r nip call 2r> call ;

How can we compile this new, equivalent form of 'mag2'?

==== Inline words

Normally, when the dataflow algorithm encounters a word as an element
of a quotation, a call to that word's core() method is emitted.
However, if the word is compiled 'immediately', its definition is
substituted in. Assume for a second that in the new form of 'mag2',
the word '2apply' is compiled inline (ignoring the specifics of how
this decision is made). In other words, it is as if 'mag2' were
defined as follows:

  : mag2 ( x y -- sqrt[x*x+y*y] )
      [ sq ] 2dup 2>r nip call 2r> call + sqrt ;

However, we already have a way of compiling the above code; in fact,
it is compiled into the equivalent of:

  : mag2 ( x y -- sqrt[x*x+y*y] )
      [ sq ] 2dup 2>r nip drop sq 2r> drop sq + sqrt ;
                          ^^^^^^^     ^^^^^^^
                   immediate instantiation of 'call'

As an aside, recall that the stack words 2dup, 2>r, nip, drop, and 2r>
do not emit any code, and the 'drop' of the literal [ sq ] ensures
that it never makes it into the compiled definition. The end result is
that the register transfer code is identical to that of the earlier
definition of 'mag2', which did not involve 2apply:

  r0 * r0 -> r0
  r1 * r1 -> r1
  r0 + r1 -> r0
  sqrt r0 -> r0
  return r0

So, how is the decision made to compile a word inline or not? It is
quite simple. If the word has a deducible stack effect on the
simulated stack of the current compilation, but does *not* have a
deducible stack effect on an empty simulated stack, it is compiled
inline. For example, the following word has a deducible stack effect,
regardless of the values of any literals on the simulated stack:

  : sq ( x -- x^2 ) dup * ;

So the word 'sq' is always compiled normally. However, the '2apply'
word we saw earlier does not have a deducible stack effect unless
there is a literal quotation at the top of the simulated stack:

  : 2apply ( x y [ code ] -- ) 2dup 2>r nip call 2r> call ;

So it is compiled inline.

Sometimes it is desirable to have short non-combinator words inlined.
While this is not necessary (unlike combinators, which cannot be
compiled at all without inlining), it can increase performance,
especially if the word returns multiple values (without inlining, the
interpreter data stack would need to be used).
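The inlining decision can be sketched by extending the simulated-stack
idea so that it carries literal quotations. This is a purely
illustrative Python model, not the actual Java implementation; it
ignores the call stack (so '2apply' is represented by the simplified
stand-in [ nip call ]), and the names are inventions for the example:

```python
# Illustrative Python model of the inlining decision; not the actual
# Java implementation. Only quotation literals are modeled, and the
# call stack is ignored.
class Undeducible(Exception):
    pass

UNKNOWN = object()  # marker for a stack entry with an unknown value

effects = {"dup": (1, 2), "nip": (2, 1), "*": (2, 1)}

def deduce(quotation, stack):
    # Pass-1 style deduction, but the simulated stack carries literal
    # quotations; missing inputs are filled in as UNKNOWN markers.
    for element in quotation:
        if isinstance(element, list):
            stack.append(element)         # push literal quotations
        elif element == "call":
            top = stack.pop() if stack else UNKNOWN
            if top is UNKNOWN:
                raise Undeducible("'call' on an unknown value")
            deduce(top, stack)            # immediate instantiation
        else:
            consumed, produced = effects[element]
            while len(stack) < consumed:
                stack.insert(0, UNKNOWN)  # becomes an input parameter
            del stack[len(stack) - consumed:]
            stack.extend([UNKNOWN] * produced)

def compiled_inline(quotation):
    # A word whose effect is not deducible on an empty simulated stack
    # (here: it applies 'call' to an unknown value) is compiled inline.
    try:
        deduce(quotation, [])
        return False
    except Undeducible:
        return True

print(compiled_inline(["dup", "*"]))     # sq: False, compiled normally
print(compiled_inline(["nip", "call"]))  # 2apply stand-in: True, inlined
```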
To mark a word for inline compilation, use the word 'inline' like so:

  : sq ( x -- x^2 ) dup * ; inline

The word 'inline' sets the inline slot of the most recently defined
word object. (Indeed, to push a reference to the most recently defined
word object, use the word 'word'.)

=== Branching

The only branching primitive supported by Factor is 'ifte'. The syntax
is as follows:

  2 2 + 4 = ( condition that leaves boolean on the stack )
  [ ( code to execute if condition is true ) ]
  [ ( code to execute if condition is false ) ]
  ifte

Note that the different components might be spread between words, and
affected by stack operations in transit. Due to the dataflow algorithm
and inlining, all useful cases can be handled correctly.

==== Not all branching forms have a deducible stack effect

The first observation is that if the two branches leave the stack in
inconsistent states, then the stack positions used by subsequent code
will depend on the outcome of the branch. This practice is discouraged
anyway -- it leads to hard-to-understand code -- so it is not
supported by the compiler. If you must do it, the words will always
run in the interpreter. Attempting to compile or balance an expression
with such a branch raises an error:

  9] : bad-ifte 3 = [ 1 2 3 ] [ 2 2 + ] ifte ;
  10] word effect .
  break called.
  :r prints the callstack.
  :j prints the Java stack.
  :x returns to top level.
  :s returns to top level, retaining the data stack.
  :g continues execution (but expect another error).
  ERROR: Stack effect of [ 1 2 3 ]
  ( java.lang.Object -- java.lang.Object java.lang.Object java.lang.Object )
  is inconsistent with [ 2 2 + ]
  ( java.lang.Object -- java.lang.Object )
  Head is ( java.lang.Object -- )
  Recursive state: [ # # ]

==== Merging

Let's return to our register transfer language, and add a branching
notation:

- two-instruction sequence to branch to