GC & Allocator

Overview (most of this needs some updates)

The basic algorithm is adopted from LuaJit 3.0: New-Garbage-Collector.

Terms:

Mutator: The actual program being executed, as opposed to the garbage collector.
Object: Any chunk of memory allocated from the GC
Root: Roots are the references to objects that are reachable at the start of a collection cycle, including every reference in registers, on the stack and in static vars.

There are two kinds of GC phases a program might be in:

the mark phase
the sweep phase

The basic idea of a tracing GC is that

all roots are identified
all objects reachable from the roots are marked (mark phase)
all objects that have not been marked are freed (sweep phase)

Descriptions of the 2 and 3 color algorithms can also be found [here] (http://wiki.luajit.org/New-Garbage-Collector#Two-Color-Mark-&-Sweep). The 4 color algorithm takes this a step further, as explained [here] (http://wiki.luajit.org/New-Garbage-Collector#Quad-Color-Optimized-Incremental-Mark-&-Sweep).

Objects & Type Parameters

A precise GC needs to know where references are stored in objects. Type parameters can be either reference or value types, hence we need to store that info in instances of generic types, closures and have to pass it in calls to generic functions.

The proposed layout uses 3 bits for every type parameter, which allows to encode

0 - reference types
1 - reference to value types
2 - 8 bit value "
3 - 16 bit value "
4 - 32 bit value "
5 - 64 bit value " for now, and we have 2 more for future use.

object header:

[16 bits type tag][GC grey flag][GC value-type flag?][?][45 bits for 15 type parameters max]

this could be extended by as many 64 bit fields as required.

When marking an object, the GC knows about non-TP fields by looking up the info using the type-tag, and about TP-fields using the info stored in the object header.

All type parameters start as arguments to functions at some point. We want to avoid having to specialize every function/constructor call, we want to avoid trashing the cache by constantly looking up type-parameter info somewhere, so we'd ideally pass it on via the stack.

For generic classes, enums, closures and anons we have to store that info within the objects. Because we don't know whether an argument to a closure is actually a type-parameter or not, we have to pass that info for every argument, regardless whether an argument is a TP originally or not.

We do that by adding a 64bit field to the closures arguments, which limits us to 21 arguments (because we need 3 bits per parameter), for now. Other generic functions are limited to 21 arguments that are type parameters, for now, they are passed in the same way.

If we construct an instance of a type with TPs, we need to know the TP info. There are only three options where that can come from:

It's a concrete type, so we just know it.
We are inside a method of a class instance, and the type parameter in question is also a parameter to this, in that case we look up the info from this' object header.
We are in a generic function, in that case the necessary info is passed to us as function argument.

As a not-quite-accurate-but-good-enough-to-get-the-idea illustration see: https://gist.github.com/ousado/43d11fec19a1ed44bc09

As an example for calling a generic function/accessing type parameters in pseudo-C/Haxe code:

    class C<T> {

        var class_bits:UInt64;
        var t:T;

        function new<A,B,T>(new_bits:UInt64){
            class_bits = 7 & ( new_bits >> ((2-0)*3) );
        }

        function f<V,U>(f_bits:UInt64,a:U,s:String,b:V){
            
            var t_bits   = c_bits; // has type parameter in the correct position 0 (bits 0-2)
            
                // we calculate the bits argument as follows:
                // var MASK = ( 7 << (dest.pos*3))
                // if origin.pos 
                //    <  dest.pos: MASK & (origin << ((dest.pos - origin.pos)*3))
                //    == dest.pos: MASK & origin
                //    >  dest.pos: MASK & (origin >> ((origin.pos - dest.pos)*3))
                // so:
                
                // T has pos 0 in class_bits (origin) and pos 0 in call_bits (dest)
                call_bits  =  7 & class_bits;
                // U has pos 1 in f_bits (origin) and pos 2 in call_bits (dest)
                call_bits  |= (7 << (3*2)) & (f_bits  << ((2 - 1)*3));
                // V has pos 0 in f_bits (origin) and pos 1 in call_bits (dest)
                call_bits  |= (7 << (3*1)) & (f_bits << ((1 - 0) * 3));
                
                callMe(call_bits,t,v,u);
            
        }
        
        static function callMe<T,V,U>(bits:UInt64,t:T,v:V,u:U){
            
        }
        
    }

Known arguments are hardcoded. ( e.g. a 16-bit Int at position 3: (2 << (3*3)) == 1024 )

Possible optimizations include:

if there's just one parameter in the TP-info, avoid &-ing
if we're shifting the TP at origin.pos 0 to the last dest.pos, avoid &-ing
if we know that we don't need the TPs inside the function for constructing an instance, or calling another generic function that does, recursively, just pass 0, as that's the only reason for passing the TPs. (hum, actually we still might need it for boxed value types, but that's not passing a TP, technically, so yes) (hum 2, we also might need them to know whether we have to check for triggering the write barrier, so maybe actually no :) )

Stack map

To have a fully precise GC we need knowledge about references stored on the stack (including registers).

In a given stack frame, we create a struct that has enough fields to store all relevant references that can occur in that frame.

/* 
implicit StackMap<{v:Option<Int>,v2:R, vlast:Option<Int>}>                */
function f( ref1 : R, ref2 : R ){

ref1 and ref2 don't need to be added to the stackmap, because they're in the stackmap of an ancestors frame already. That's a really awesome thing, because it means that in many/most cases we can avoid having type parameters in the stackmap, which need additional info, which means some encoding overhead. */

  var v = Some(123); // needs to go into the stackmap
  
  var v2 : R = SomeClass.get_ref(v); // v2 also needs to go into the stackmap

  if _ {
    var v3 = Some(123); // -> stackmap
  } else {
    var v3 = Some(123); // -> stackmap
  }

BUT we want to avoid for the stackmap to hold references to non-live values, hence we need to reuse fields of the stackmap, which is possible when the former reference

isn't live anymore
has the same kind (from the GCs perspective, which means either valreftype or reftype). Additionally we should set non-live references to null in the stackmap if there will or might be an allocation later, before we've returned from this function (IOW, the stackmap might still be used)

  var v3 = Some(123); // -> stackmap (should use the same field of the stackmap as the above v3's)

  var vlast = Some(123);

strictly speaking, vlast doesn't need to go into the stackmap, because there's no allocation, and no function call after this allocation */

A Traversable Objects Life-Cycle

Allocation

When we allocate an object, we know whether they are value types or not. We also know the exact size of the object. We hence also know which pool we're allocating from, because our pool stores only objects of this specific size (that is, objects where 2n >= object_size < 2(n+1))

    // this has to be called from the constructor, 
    // we already know the size of the object, and hence the pool idx,
    // as well as whether we're dealing with a value type or not
 
    function alloc_specific( pool_idx:Int, is_valtype:Bool):GC_t{
        var obj:GC_t = area.pool[idx].alloc();
        if (!is_valtype){
            obj.set_light_gray();
        } else {
           // do nothing, object stays white, optionally for possible val-types (type parameters):
           obj.set_valtype();
        }
        return obj;
    }

Writing to objects I

We only write references to ref_types, obviously. When we encounter an object, and write a reference to it, we check whether the gray bit is set. If so, we don't take any further action. If not, see Write Barrier below.

    // we have to call this after writing a non-null reference to the object.
    function on_write_reference(obj:GC_t){
        if (obj.is_gray() && !obj.is_valtype){ // we might need is_valtype here for type parameters
            // do nothing
        } else {
            // triggered write barrier
            obj.set_gray();
            if (obj.is_black()){
                graystack.push(obj);
            } else {
                // do nothing
            }
        }
    }

Marking I

Now the GC starts marking and traversing objects from the roots, and eventually meets our object, which means that it's reachable. It was light-gray, now it's marked as reachable (mark-bit turned black), and the gray bit remains set - it turned into a dark-gray object. At this point it's also pushed onto the graystack.

While the object is dark-gray, it can still be written to by only checking the gray bit.

Marking II

The next transition happens, when the GC pops the object from the graystack - It traverses all the objects references (sets their mark-bit black and pushes them on the graystack, as per above ), and after that the object doesn't have unchecked refs anymore, the gray bit can be cleared. The object is black now.

@:enum abstract MarkE {
    var ref_type     = 0; // a reference
    var val_ref_type = 1; // a reference to a value type
    var val_type     = 2; // a value type
    var type_param   = 3; // either a val_type, a ref_type or a val_type_ref
}

class Marking {
    
    var C_T_desc:RefData = {
        N   :  4,
        data: [type_param,val_type,ref_type,ref_type]
    }
    var ref_data = [_, C_T_desc ,_];

    function mark(v:GC_t){
        
        var data     = ref_data[v.type_tag];
        var tp_mask  = 7;
        
        for (idx in 0...ref_d.N) switch data[idx] {
            case ref_type:     mark_ref(v,idx);
            case val_type:     // do nothing
            case val_ref_type: mark_ref_black(v,idx);
            case type_param:   // check our TP info
                switch (tp_mask & v.class_bits){
                    case ref_type:     mark_ref(v,idx);
                    case val_ref_type: mark_ref_black(v,idx);
                    case _:            // do nothing
                }
                tp_mask <<= 3;
        }
        mark_black(v);
    }
    
    function mark_ref(v:GC_t,idx:Int){
        if (v[idx] != null){
            pool.graystack.push(v[idx]);
        }
    }
}

Write Barrier

If the object is written to now, we have an issue - because black objects may never refer to non-gray objects, we have to traverse the new reference, too. We prepare for this by setting the gray flag and pushing it onto the graystack again. This scenario is called "triggering the write barrier" in the Lua GCs description.

Major Sweep

Now we reach a point where we have consumed the whole graystack - there's nothing to be done for the GC, except freeing unreachable objects. Here a slightly modified version of Mike Palls crazy optimization kicks in: We have a bitmap of markbits and we have a bitmap in our freestack, we can sweep the entire used memory by simply inverting the markbits (set, black bits are cleared, and all others are set) and writing them over the freestack bitmap. For any 64-bit item that contains at least one free (set) bit, we push the index of that bitfield on our freestacks stack. After that, we clear all the mark-bits, by simply zeroing out the complete marking bitmap.

IOW, we reinitialize our freestack during the sweep:

    
    function sweep(){
        for (pool in area.pools){
            pool.freestack.idx = 0;
            for (idx in 0...pool.committed){
                var segment:UInt64 = pool.markbits[idx];
                pool.freestack.bitmap[idx] = ~segment;
                if (segment != 0){
                    pool.freestack.stack[pool.freestack.idx] = idx;
                    pool.freestack.idx += 1;
                }
            }
        }
    }

A Non-Traversable Objects Life-Cycle

Value types are a bit different, because they don't have to be traversed.

Allocation

When we allocate them, we set the value-type flag on the object, and leave it white.

Marking

When we encounter a value type in the Mark I phase, we don't push it onto the graystack. Instead, we immediately turn the object black. As we will never write a reference to it, we'll never trigger the write barrier and push it back onto the graystack. This works both for true value types, and accidental value-types (objects with all-value-types type parameters), because we have the value-type flag, that prevents us from doing unnecessary work.

Sweeping

No difference to reference types.

After the Sweep

Write Barrier II

After the sweep phase we have only white objects left. We left the val_type bit untouched, but cleared all grey bits. This means that when we write to a traversable object, we have to check its mark bit. If the mark bit is black, we have to follow the procedure described in Writer Barrier I. This is a bit unfortunate, since directly after the sweep phase, not a single object is black, and the check for the mark bit is relatively expensive (compared to newly allocated objects, where the gray bit is set directly).

Optimization Opportunity Or Stupid?

As one optimization, we could think about introducing a post-sweep, pre-mark-II phase. In this phase we know that no single black object exists, so whenever we encounter a white traversable object, we can safely set the gray bit, and we're as good as in the newly allocated phase. When we mark objects, we turn them black but we do not clear the gray flag, and we do not remove them from the graystack (this requires a a different way of accessing the graystack, which is not hard to do, however, because it's just a bitfield, so we can process 64 candidates at a time). At some point, the GC has accumulated a certain amount of objects on the graystack, and we switch over to the regular marking-phase, which involves removing objects from the graystack and marking them black by clearing the gray flag. From now on the write barrier triggers as usual.

Generational Mode

Another optimization might be the generational mode described in the Lua GC documentation: The introduction of a minor sweep & mark mode.

Minor Mark & Sweep Mode

In this mode, the sweep phase is modified to prepare for a following minor collection - instead of marking objects white after sweeping, the GC leaves them black. It keeps track of newly allocated objects by allocating them from different regions in our allocator (we could e.g. do this by adding a second base-index per pool, that is only used during a minor collection), in which then only newly allocated objects live. The write barrier triggers as normal, so black survivors are marked dark-gray and pushed on the graystack. These are the "minor roots" - the GC traverses these (and only these) to establish which of the newly allocated objects are reachable. When there are no more dark-gray objects to process, the GC sweeps only the regions with new objects, and hence only frees unreachable newly allocated objects, the young generation, without having to move objects around.

After a number of minor collections, we start a major one by simply using the major marking mode - traversing all objects from the roots, with a subsequent major sweep.

Debug Mode

We want to make sure that we never access a freed object. We can accomplish this by checking the free-flag in our free stack on any read or write object access. While this is expensive because it constantly pollutes the cache, we obtain exact information about which kinds of objects aren't handled correctly.

For testing the thoroughness of the marking implementation, we can introduce a second, full mark-phase directly before the sweep phase, before the mutator has allocated any new object after draining the graystack. If we fail to mark objects using the actual algorithm we should hopefully be able to detect that during this phase. For the detection of certain problems, e.g. the false attribution of type parameter info, this test should check any possible references, even if they aren't marked as such.

Memory layout & allocator

We have several memory pools for objects, where the object sizes are powers of 2. Each such pool has a bitmap for the mark bits, and a bitmap for the allocator(free stack). On 64 bit systems, these pools reserve a large range of virtual memory, the actually used memory can be extended on demand. This way we obtain one large linear array per object size, which makes the mapping from an index in that array to the respective bits in the bitmaps trivial.

As an example let's take object sizes of 16,32,128 and 256 bytes, and let's assume a maximum of 2^20 (ca. one million) objects per size. We need one bit per object for each the free stack and the mark bits.

16: [192KB freestack bitmap+int array][192KB greystack bitmap+int array][128KB mark bitmap][16MB data]
32: [192KB freestack bitmap+int array][192KB greystack bitmap+int array][128KB mark bitmap][32MB data]
128: [192KB freestack bitmap+int array][192KB greystack bitmap+int array][128KB mark bitmap][128MB data]
256: [192KB freestack bitmap+int array][192KB greystack bitmap+int array][128KB mark bitmap][256MB data]

The required space for freestack,greystack and mark bits remains constant for a given maximum number of objects, only the size of data area varies.

TODO: Detailed documentation, 32bit version

Implementation

required changes in genc:

add the tpinfo bitfield to
- all generic function and constructor calls
- all closure calls
identify instantiations of
- generic classes
- generic enums
- closures
- anons where type parameters are stored
For every such instance, store the tpinfo into the object header, as described in objects & type parameters.
write barrier & friends

in release mode: Add wrapper code for conditionally handling the writing of references to objects, and for writing of TP-typed values to objects (the code checks the object kind, if it's a value type, do nothing) for a sequence of writes to the same object within the same frame (can be extended by doing more static analysis later) it'd be great to only wrap the first write to the object. If were in the post-sweep-premark II phase, don't invoke the barrier

We have to exclude accesses to Struct<T> members, and we need some metadata to mark the code in certain types as runtime-level, preallocated, so the GC has no business in dealing with them.

in debug mode: Additionally, add code that checks whether an object is alive on every access. This means creating wrapper code for both reads and writes, and it means checking every level of deep field accesses.
allocations: We store a context with a pointer to the threads allocator in a thread-local var. For every instance, we figure out the object size at compile time (except for array storage, of course), so we can directly call the allocation function for a specific pool. If we're allocating a generic type, we check the type parameters. If all type parameters are value types, we're passing is_valtype=true to the allocation function, which sets the respective bit, so the GC knows it doesn't have to traverse that object, and can mark it black immediately, when it encounters it. Otherwise that flag is clear.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GC & Allocator

Overview (most of this needs some updates)

Objects & Type Parameters

Stack map

A Traversable Objects Life-Cycle

Allocation

Writing to objects I

Marking I

Marking II

Write Barrier

Major Sweep

A Non-Traversable Objects Life-Cycle

Allocation

Marking

Sweeping

After the Sweep

Write Barrier II

Optimization Opportunity Or Stupid?

Generational Mode

Minor Mark & Sweep Mode

Debug Mode

Memory layout & allocator

Implementation

Clone this wiki locally