542 lines
19 KiB
Plaintext
542 lines
19 KiB
Plaintext
FACTOR INTERNALS GUIDE
|
|
|
|
This guide focuses on portion of Factor that is written in C. The
|
|
sources can be found in the native/ subdirectory.
|
|
|
|
Factor defines a few new integer types; they will be mentioned without
|
|
further explanation:
|
|
|
|
- CELL: unsigned word-size field.
|
|
- FIXNUM: signed word-size field.
|
|
|
|
The 'CELLS' constant is defined as sizeof(CELL). The idea is to be able
|
|
to write expressions like 4*CELLS.
|
|
|
|
* The memory manager
|
|
|
|
Factor's memory manager is concentrated in the files memory.c and gc.c.
|
|
|
|
A guard page is allocated above the memory heap, and an out of memory
|
|
condition is raised if an attempt is made to write to the guard page.
|
|
Out of memory errors are fatal -- the garbage collector must be
|
|
triggered before this happends.
|
|
|
|
The alloc_guarded(CELL size) function allocates a block of memory with
|
|
guard pages above and below.
|
|
|
|
There are two memory zones of identical size; only one is in use at any
|
|
one time -- this is denoted as the 'active' zone. The other is the
|
|
'prior' zone. Each zone is identified with a ZONE struct stored in a
|
|
global variable named after the zone's role.
|
|
|
|
Memory is allocated in the zone using the allot(CELL a) function. This
|
|
function takes the size of a memory block to allocate. Memory blocks are
|
|
aligned on 8 byte boundaries, due to tagged representation of pointers.
|
|
If a non-8-byte-aligned block is passed in, allot() will round up the
|
|
size accordingly. Memory is allocated in a linear fashion, simply by
|
|
incrementing the 'here' field of the 'active' ZONE struct.
|
|
|
|
Note that allot() cannot be used in an arbitrary fashion, since doing so
|
|
will confuse the garbage collector. See the section on object
|
|
representation below.
|
|
|
|
* Tagged pointer representation
|
|
|
|
A ``tagged value'' is a CELL whose three least significant bits are a
|
|
``type tag''. The type tags are defined in types.h:
|
|
|
|
#define FIXNUM_TYPE 0
|
|
#define WORD_TYPE 1
|
|
#define CONS_TYPE 2
|
|
#define OBJECT_TYPE 3
|
|
#define RATIO_TYPE 4
|
|
#define COMPLEX_TYPE 5
|
|
#define HEADER_TYPE 6
|
|
#define GC_COLLECTED 7
|
|
|
|
If the tag is OBJECT_TYPE and the remaining bits of the cell are zero --
|
|
that is, if the tagged cell has integer value 3 -- it is taken to be the
|
|
canonical falsity value, and also the empty list. There is a convinient
|
|
macro:
|
|
|
|
#define F RETAG(0,OBJECT_TYPE)
|
|
|
|
If the tag is FIXNUM_TYPE, the most significant 29 bits of the cell are
|
|
taken to be a literal integer value. To decode the integer value, the
|
|
cell must be shifted to the right by three bits.
|
|
|
|
The role of the header tag is described below.
|
|
|
|
Any other tag signifies that the cell is a pointer to a 8-byte-aligned
|
|
block of memory; the address of the block is obtained by masking off the
|
|
least significant three bits of the tag.
|
|
|
|
Some macros for working with tagged cells are provided:
|
|
|
|
- TAG(cell) -- the tag of the cell
|
|
|
|
- UNTAG(cell) -- mask off the tag, turning it into a pointer
|
|
|
|
- RETAG(cell,tag) -- set the tag of a cell
|
|
|
|
- untag_fixnum_fast(cell) -- shift the cell to the right by three bits,
|
|
without actually checking that the tag is FIXNUM_TYPE. If it is not
|
|
FIXNUM_TYPE, a meaningless value is returned.
|
|
|
|
* Built-in data types
|
|
|
|
In Factor, all data types are defined as part of the runtime. There are
|
|
no user-defined types.
|
|
|
|
For some cells, the type of the object they point to is encoded entirely
|
|
in the tagged cell. This is true for the following types:
|
|
|
|
#define FIXNUM_TYPE 0
|
|
#define WORD_TYPE 1
|
|
#define CONS_TYPE 2
|
|
#define RATIO_TYPE 4
|
|
#define COMPLEX_TYPE 5
|
|
|
|
All other data types are pointed to by cells with a tag of OBJECT_TYPE,
|
|
and the first cell of the object must then be a tagged cell with tag
|
|
HEADER_TYPE, and the remaining bits must be one of the following values:
|
|
|
|
#define T_TYPE 7
|
|
#define ARRAY_TYPE 8
|
|
#define BIGNUM_TYPE 9
|
|
#define FLOAT_TYPE 10
|
|
#define VECTOR_TYPE 11
|
|
#define STRING_TYPE 12
|
|
#define SBUF_TYPE 13
|
|
#define PORT_TYPE 14
|
|
#define DLL_TYPE 15
|
|
#define ALIEN_TYPE 16
|
|
|
|
There are three fundamental functions for working with types:
|
|
|
|
- type_of(CELL tagged) -- return one of the above values. Never returns
|
|
OBJECT_TYPE or HEADER_TYPE.
|
|
|
|
- typep(CELL type, CELL tagged) -- return a boolean value. Behaves
|
|
identically to: type_of(tagged) == type.
|
|
|
|
- type_check(CELL type, CELL tagged) -- raise an error if the tagged
|
|
cell does not point to an object of the given type.
|
|
|
|
* Object representation
|
|
|
|
Object representation details can be found in types.[ch].
|
|
|
|
There is a fundamental principle guiding object representation in
|
|
Factor: When given a tagged cell, one must be able to determine the size
|
|
of the object it points to, and what other objects this object points
|
|
to.
|
|
|
|
There are two primary classes of objects:
|
|
|
|
- Conslikes. The type of a conslike is encoded in the tag. Conslikes are
|
|
represented 'naked' in the heap, with no header. These are exactly
|
|
objects of type CONS_TYPE, RATIO_TYPE, and COMPLEX_TYPE. Note that
|
|
WORD_TYPE also has its own tag, however words do have a header.
|
|
|
|
Since conslikes lack a header, the garbage collector and relocator
|
|
cannot distinglish between them while doing a linear scan of the heap.
|
|
This has an important consequence: the fields of conslikes must all be
|
|
tagged cells.
|
|
|
|
- Large objects. So called because the have a header and are larger than
|
|
conslikes. Unlike conslikes, an object with a header permits the garbage
|
|
collector and relocator to behave accordingly. For example, because
|
|
strings have a header, the relocator can skip over the characters of a
|
|
string, instead of treating them as tagged cells and crashing.
|
|
|
|
Tagged cells pointing to headed objects all have a tag OBJECT_TYPE,
|
|
except for pointers to words which have a tag WORD_TYPE.
|
|
|
|
* Conses and related types
|
|
|
|
Conslikes are found in the files cons.[ch], ratio.[ch] and complex.[ch].
|
|
|
|
Conslikes are one of the most important data types in Factor. They use
|
|
exactly two words of storage in the heap. This may seem surprising --
|
|
however, since all pointers to conslikes are tagged as being such, no
|
|
ambiguity results.
|
|
|
|
The most important is the CONS_TYPE.
|
|
|
|
Given a tagged cell pointing to a CONS_TYPE, one can call
|
|
untag_cons(CELL cell) to obtain a pointer to a F_CONS. F_CONS is a
|
|
struct defined as:
|
|
|
|
typedef struct {
|
|
CELL car;
|
|
CELL cdr;
|
|
} F_CONS;
|
|
|
|
Given an untagged pointer to an F_CONS, one can obtain a tagged cell by
|
|
calling tag_cons(F_CONS* cons).
|
|
|
|
There are corresponding taggers and untaggers for the other two
|
|
conslikes:
|
|
|
|
untag_ratio(), tag_ratio(), untag_complex(), tag_complex().
|
|
|
|
* Image format
|
|
|
|
Image loading and saving is performed by image.[ch] and relocate.[ch].
|
|
|
|
On startup Factor loads an image file. The image file format depends on
|
|
the CPU word size and byte order, since it is directly loaded into
|
|
memory.
|
|
|
|
The image begins with two constants that must exactly equal:
|
|
|
|
- CELL magic -- 0x0f0e0d0c
|
|
- CELL version -- 0
|
|
|
|
The next value is used to relocate the image:
|
|
|
|
- CELL relocation_base
|
|
|
|
The following two are tagged pointers for user space:
|
|
|
|
- CELL boot -- quotation to interpret on startup.
|
|
- CELL global -- global namespace.
|
|
|
|
The last value is the image size, for verification purposes:
|
|
|
|
- CELL size.
|
|
|
|
* Image relocation
|
|
|
|
After the image has been loaded into the heap, a relocation procedure is
|
|
performed. The idea behind relocation is as follows: in memory, the
|
|
image contains absolute addresses to other parts of the image. However,
|
|
it is desirable to be able to load the image into another offset of
|
|
memory; depending on absolute memory mapping is unportable and
|
|
troublesome.
|
|
|
|
When saving the image, Factor stores the start address of the heap into
|
|
the header field relocation_base.
|
|
|
|
When loading the image, each address must be manipulated like so:
|
|
|
|
void fixup(CELL* cell)
|
|
{
|
|
if(TAG(*cell) != FIXNUM_TYPE && *cell != F)
|
|
*cell += (active.base - relocation_base);
|
|
}
|
|
|
|
Where 'active.base' is the heap start offset of the current instance,
|
|
and 'relocation_base' is the value found in the image header.
|
|
|
|
Relocation relies on total knowledge of object structure; it does this
|
|
by inspecting type tags.
|
|
|
|
Also the unportable 'xt' field of F_WORD structs is reset during
|
|
relocation, by looking up the word's primitive number in a global table
|
|
of primitives.
|
|
|
|
* Garbage collection
|
|
|
|
The garbage collector is defined in gc.c.
|
|
|
|
Factor's garbage collector is a standard two-space copying collector,
|
|
using Cheney's algorithm. Much has been said about this algorithm in the
|
|
literature, and this information will not be duplicated here.
|
|
|
|
Garbage collection is manually triggered by calling
|
|
maybe_garbage_collection(). This function checks if free memory is below
|
|
a certain threshold, and if so, commences a garbage collection.
|
|
|
|
*Beware!* Any local variables in the C stack are not visible to the
|
|
garbage collector, and are not taken as roots. Therefore, the only place
|
|
it is safe to call maybe_garbage_collection() is from inside a primitive
|
|
that was called directly from run(), before any values are stored in
|
|
locals.
|
|
|
|
For example, consider the following is a safe call to
|
|
maybe_garbage_collection():
|
|
|
|
void primitive_cons(void)
|
|
{
|
|
CELL car, cdr;
|
|
maybe_garbage_collection();
|
|
cdr = dpop();
|
|
car = dpop();
|
|
dpush(cons(car,cdr));
|
|
}
|
|
|
|
The function primitive_cons() is only ever called from run() or
|
|
primitive_execute(). In both these cases, the C stack does not store any
|
|
heap-allocated values in local variables.
|
|
|
|
However, the following would not be safe, and would result in a runtime
|
|
crash sooner or later:
|
|
|
|
void primitive_cons(void)
|
|
{
|
|
CELL cdr = dpop();
|
|
CELL car = dpop();
|
|
maybe_garbage_collection();
|
|
dpush(cons(car,cdr));
|
|
}
|
|
|
|
The garbage collector would not update the car and cdr local variables
|
|
to point to their new locations; so after it returned, the primitive
|
|
might have allocated pushed a cons cell that refers to oldspace. A crash
|
|
would follow soon after.
|
|
|
|
* The stacks
|
|
|
|
The stacks are defined in run.h.
|
|
|
|
The runtime maintains exactly two stacks -- the data stack and the
|
|
return stack. (The name stack and catch stack are purely library
|
|
phenomena.) Both the data and return stacks are allocated using
|
|
alloc_guarded(), so stack over/underflow checks are done in hardware
|
|
with no runtime overhead.
|
|
|
|
Both stacks hold tagged cells, and grow down -- that is, pushing
|
|
increments the pointer.
|
|
|
|
The data and return stacks are scoped by two global variables each:
|
|
|
|
- CELL ds_bot/cs_bot -- a pointer to the bottom of the data/return
|
|
stack; this is the first value you can write to, right above the guard
|
|
page.
|
|
|
|
- CELL ds/cs -- a pointer to the value at the top of the data/return
|
|
stack.
|
|
|
|
A set of inline functions are provided for pushing and popping the
|
|
stacks:
|
|
|
|
- dpop()/cpop() -- pop a value off the data/return stack.
|
|
|
|
- dpush()/cpush() -- push a value on the data/return stack.
|
|
|
|
- dpeek() -- return the top value on the data stack without popping.
|
|
|
|
- drepl() -- replace the top of the stack with a given value.
|
|
|
|
You can acess other values using the get() and set() inline functions.
|
|
For example, to get the value on the data stack under the top value,
|
|
without popping:
|
|
|
|
get(ds - CELLS);
|
|
|
|
* The inner interpreter
|
|
|
|
The inner interpreter is defined in the run() function of run.c.
|
|
|
|
The inner interpreter is a loop. It does not call itself recursively;
|
|
rather, it pushes and pops the Factor return stack to maintain recursive
|
|
state for interpreted definitions.
|
|
|
|
The currently executing quotation is stored in the global variable 'cf'.
|
|
This is a tagged cell that must point to a CONS_TYPE; otherwise a type
|
|
error is raised. This quotation will be called the ``call frame''.
|
|
|
|
If the end of the call frame quotation is reached -- that is, if it
|
|
becomes identically equal to F -- the return stack is popped, and the
|
|
popped value becomes the new call frame.
|
|
|
|
Each step of the interpreter takes the car of the call frame, referred
|
|
to as ``next'', and advances execution state by setting the call frame
|
|
to its cdr.
|
|
|
|
The type tag of next is inspected. If it is anything but WORD_TYPE, it
|
|
is pushed on the stack using dpush(). Otherwise, the word is executed as
|
|
described below.
|
|
|
|
When a word is executed by the inner interpreter, first the global
|
|
variable 'executing' is set to an untagged pointer to the F_WORD struct.
|
|
|
|
Next, the interpreter calls the C function whose address is stored in
|
|
the 'xt' field of the word using the EXECUTE macro:
|
|
|
|
#define EXECUTE(w) ((XT)(w->xt))()
|
|
|
|
An 'xt' is just a function that takes no arguments:
|
|
|
|
typedef void (*XT)(void);
|
|
|
|
As soon as the function returns, the inner interpreter loop continues
|
|
iterating down the call frame, pushing literals, and executing words,
|
|
then finally it reaches the end of the call frame, pops the return
|
|
stack, and the whole cycle repeats again.
|
|
|
|
The run() function never returns. The only way to exit the inner
|
|
interpreter is by calling the exit* primitive, defined in the function
|
|
primitive_exit() in misc.c
|
|
|
|
* Primitives
|
|
|
|
Primitives are exported to user space via numbers -- each F_WORD struct
|
|
has a 'primitive' field. During relocation, the 'xt' field of the F_WORD
|
|
is set to the corresponding C function pointer by looking up the
|
|
primitive number in a table.
|
|
|
|
Six fundamental primitives are:
|
|
|
|
- #0: undefined() -- undefined words have this primitive/XT. It simply
|
|
raises an error.
|
|
|
|
- #1: docol() -- compound definitions have this primitive/XT. Recall
|
|
that the inner interpreter sets the 'executing' global variable to the
|
|
word object before calling its XT. The docol() function accesses the
|
|
quotation stored in the 'parameter' field of the executing word, and
|
|
calls it. What exactly is meant by 'calls' is described below.
|
|
|
|
- #2: dosym() -- symbol definitions have this primitive/XT. It simply
|
|
pushes the executing word on the data stack, thus making it behave like
|
|
a literal.
|
|
|
|
- #3: primitive_execute() -- defined in user space as 'execute' in the
|
|
'words' vocabulary. Pops a word off the datastack, stores it in the
|
|
'executing' global variable, and calls its XT.
|
|
|
|
- #4: primitive_call() -- defined in user space as 'call' in the
|
|
'kernel' vocabulary. Pops a cons cell off the datastack, and calls it.
|
|
What exactly is meant by 'calls' is described below.
|
|
|
|
- #5: primitive_ifte() -- defined in user space as 'ifte' in the
|
|
'kernel' vocabulary. Executes one of two quotations on the stack,
|
|
depending on a boolean value stored below.
|
|
|
|
A 'boolean value' is of course 'false' if it is identically equal to F,
|
|
and 'true' otherwise.
|
|
* Quotations
|
|
|
|
Notice that docol(), primitive_call() and primitive_ifte() all take a
|
|
quotation from some source, and 'call' it. They do this using the inline
|
|
function call():
|
|
|
|
INLINE void call(CELL quot)
|
|
{
|
|
/* tail call optimization */
|
|
if(callframe != F)
|
|
{
|
|
cpush(tag_word(executing));
|
|
cpush(callframe);
|
|
}
|
|
callframe = quot;
|
|
}
|
|
|
|
Indeed, this function does not actually call the quotation itself. It
|
|
merely changes inner interpreter state in such a way that the next
|
|
iteration of the interpreter loop will begin executing this quotation;
|
|
then when it is done, the previous quotation is popped off the return
|
|
stack.
|
|
|
|
Two final points should be said about call(). It also pushes the
|
|
currently executing word for the profiler's sake; the inner interpreter
|
|
itself does not use the word that was pushed on the return stack.
|
|
|
|
Finally, call() performs tail call optimization. If the current call
|
|
frame is F -- in other words, if the occurrence of docol(),
|
|
primitive_call() or primitive_ifte() was the last in a quotation -- the
|
|
call frame is not pushed on the return stack, since there is nothing to
|
|
return to. If F was pushed on the return stack, it would simply be
|
|
popped again at a later time.
|
|
|
|
* User environment
|
|
|
|
At startup, the last thing done by the C code is setting the call frame
|
|
to point to the boot quotation (as defined in the image header). Then,
|
|
it calls run() and execution of the boot quotation begins.
|
|
|
|
User space makes use of a 'user environment' to store state that does
|
|
not belong on the data stack. The user environment is roughly like
|
|
global variables, however their number is fixed.
|
|
|
|
The user environment consists of 16 numbered slots, each slot holding
|
|
one tagged cell. User environment slots are accessed using the getenv (
|
|
slot -- value ) and setenv ( value slot -- ) primitives. The slots have
|
|
the following assignment:
|
|
|
|
#define STDIN_ENV 0
|
|
#define STDOUT_ENV 1
|
|
#define STDERR_ENV 2
|
|
#define NAMESTACK_ENV 3
|
|
#define GLOBAL_ENV 4
|
|
#define BREAK_ENV 5
|
|
#define CATCHSTACK_ENV 6
|
|
#define CPU_ENV 7
|
|
#define BOOT_ENV 8
|
|
#define RUNQUEUE_ENV 9
|
|
#define ARGS_ENV 10
|
|
#define OS_ENV 11
|
|
#define RESERVED_ENV_1 12
|
|
#define RESERVED_ENV_2 13
|
|
#define RESERVED_ENV_3 14
|
|
#define RESERVED_ENV_4 15
|
|
|
|
STDIN_ENV and STDOUT_ENV are set by the I/O code to point to F_PORTs for
|
|
the appropriate system streams. STDERR_ENV is unused.
|
|
|
|
NAMESTACK_ENV is not used by the runtime; it is for user space use only.
|
|
GLOBAL_ENV is set by the runtime on startup to the same value as the
|
|
'global' field of the image header. User space makes use of
|
|
NAMESTACK_ENV and GLOBAL_ENV to implement named variables and nested
|
|
namespaces. Both values are initialized in 'init-namespaces' of the
|
|
'namespaces' vocabulary.
|
|
|
|
BREAK_ENV is a quotation, set by user space. This quotation is called
|
|
when the runtime raises an error. CATCHSTACK_ENV is not set or read by
|
|
the runtime, it is for user space use only. Both are initialized in
|
|
'init-errors' of the 'errors' vocabulary.
|
|
|
|
CPU_ENV is set by the runtime to one of the two strings "x86" or
|
|
"unknown". It is not intended to be written. The 'cpu' word of the
|
|
'kernel' vocabulary is defined for reading it.
|
|
|
|
BOOT_ENV is set by the runtime to the same value as the 'boot' field of
|
|
the image header -- that is, the boot quotation. If it is set to a
|
|
different value and then an image is saved, the new image will have this
|
|
boot quotation. This can be used to make 'turnkey' images that start a
|
|
custom app, instead of the listener loop.
|
|
|
|
RUNQUEUE_ENV is used exclusively by user space, for threading. See the
|
|
'threads' vocabulary.
|
|
|
|
ARGS_ENV is set by the runtime to a linked list of strings,
|
|
corresponding to command line arguments given to the Factor runtime
|
|
executable. The first of these arguments will always be an image name.
|
|
|
|
OS_ENV is set by the runtime to one of the two strings "unix" or
|
|
"win32". It is not intended to be written. The 'os' word of the 'kernel'
|
|
vocabulary is defined for reading it.
|
|
|
|
The remaining four user environment slots must not be read or written by
|
|
user space code.
|
|
|
|
* Error handling
|
|
|
|
Error handling goes on inside the error.c file, as well as run.c.
|
|
|
|
When the inner interpreter loop begins, setjmp is called to save the
|
|
current C stack in the global variable 'toplevel'.
|
|
|
|
When an error is raised inside the runtime, one of two things happen:
|
|
|
|
- BREAK_ENV is not yet set, so the interpreter prints an error message
|
|
and immediately exits. This is the case during bootstrapping.
|
|
|
|
- BREAK_ENV is set, in which case the error is stored in the
|
|
'thrown_error' global variable, and the error handler long-jumps to the
|
|
top level. Execution then resumes at the inner interpreter loop, which
|
|
then checks if there was a thrown error, and if so, it pushes the error
|
|
on the data stack, and calls the quotation stored in BREAK_ENV.
|
|
|
|
User space takes over from here. The BREAK_ENV is defined as follows in
|
|
the 'init-error-handler' word of debugger.factor:
|
|
|
|
[ dup save-error rethrow ] 5 setenv
|
|
|
|
The 'save-error' word snapshots the current stacks -- this is how :s,
|
|
:r, :n and :c work. 'rethrow' then pops a continuation from the catch
|
|
stack and calls it, which results in a certain 'catch' block being
|
|
executed.
|