\status{This section is complete.}
The Mercury compiler has several backends;
in this dissertation we are concerned only with the low-level C backend.
The low-level C backend uses an abstract machine implemented using the C
preprocessor.
The abstract machine has 1024 general purpose registers and some
special registers, such as the program counter.
These registers are virtual:
they are mapped onto global variables.
Commonly used virtual registers are mapped to real machine registers.
The number of virtual registers mapped onto real machine registers depends upon
the architecture of the physical machine.
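As a rough illustration of this register mapping,
consider the following sketch;
it is not the Mercury runtime's actual header code,
and the names \code{MR\_fake\_reg} and \code{MR\_r1} and the choice of
machine register are assumptions made for this example.
\begin{verbatim}
#include <stdint.h>

typedef uintptr_t MR_Word;     /* one abstract machine word */

/* Global backing store for the general purpose virtual registers. */
MR_Word MR_fake_reg[1024];

#if defined(__GNUC__) && defined(__x86_64__)
  /* Keep a commonly used virtual register in a real machine register. */
  register MR_Word MR_mr1 __asm__("r12");
  #define MR_r1 MR_mr1
#else
  /* Otherwise fall back to a slot in the global array. */
  #define MR_r1 (MR_fake_reg[1])
#endif
\end{verbatim}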
The compiler knows the determinism of every procedure.
It will generate more efficient code for deterministic procedures.
In particular,
code with at most one solution uses the \emph{det stack} (deterministic
stack),
which operates in much the same way as a stack in an imperative
language.
The det stack has a single pointer, called the stack pointer,
which points to the top of the stack.
Code that may have more than one solution uses the \emph{nondet stack}:
when a nondeterministic procedure produces a solution, it returns control to
its caller,
but it leaves its stack frame on the nondet stack so that
the procedure's variables are available on backtracking.
Therefore,
the nondet stack has two stack pointers:
the first points to the top of the stack, where new frames are allocated,
and the second points to the current stack frame.
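For concreteness,
the three stack pointers just described might be declared as follows;
the names in this sketch are illustrative and are not necessarily those used
by the runtime
(in practice these pointers are themselves virtual registers).
\begin{verbatim}
MR_Word *MR_sp;      /* det stack: the top of the stack                   */
MR_Word *MR_maxfr;   /* nondet stack: top, where new frames are allocated */
MR_Word *MR_curfr;   /* nondet stack: the current stack frame             */
\end{verbatim}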
Mercury supports parallel conjunctions only in deterministic code;
therefore the nondet stack and its two stack pointers are largely irrelevant
to this dissertation.
\citet*{mercury_jlp} explains Mercury's execution algorithm in more detail.
\label{page:engine}
To support parallel execution,
\citet*{conway:2002:par} grouped the variables used to implement the abstract
machine's registers into a structure called an \emph{engine}.
Each engine corresponds to a POSIX Thread (pthread)
\citep{butenhof1997:pthreads}.
Multiple engines can reside in memory at the same time,
allowing each pthread to maintain its own execution state.
The number of engines that a parallel Mercury program will allocate on startup
is configurable by the user.
Once the runtime system has been started, the number of engines in use is
fixed.
In non-parallel grades a single engine is allocated statically and is known
to reside at a particular address.
However,
in parallel grades the engines' addresses are unknown at compile
time.
For each pthread,
the runtime system tracks the address of the thread's engine in a real CPU
register\footnote{
If GCC global registers are unavailable,
POSIX thread-local storage is used.
},
and therefore one fewer virtual register can be mapped to a real register.
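The following sketch shows the two alternatives mentioned in the footnote;
the configuration macro and the variable names are assumptions made purely
for illustration.
\begin{verbatim}
struct MR_Engine;      /* holds an abstract machine's execution state */

#if defined(__GNUC__) && defined(USE_GLOBAL_REGISTER)
  /* Dedicate a real machine register to the current engine's address. */
  register struct MR_Engine *MR_eng_var __asm__("r15");
  #define MR_current_engine() (MR_eng_var)
#else
  /* Otherwise use POSIX thread-local storage. */
  #include <pthread.h>
  extern pthread_key_t MR_engine_key;   /* created once at startup */
  #define MR_current_engine() \
      ((struct MR_Engine *) pthread_getspecific(MR_engine_key))
#endif
\end{verbatim}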
\citet{conway:2002:par} also introduced a new structure called a
\emph{context}, known elsewhere as a green thread.
Contexts represent computations in progress.
Unlike green threads, contexts cannot be preempted.
An engine can either be idle, or executing a context;
a context can either be running on an engine, or be suspended.
Separate computations must have separate stacks;
therefore each stack is associated with a context.
When a computation is suspended,
the registers from the engine are copied into the context,
making the engine free to load a new context.
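A minimal sketch of this suspension step is shown below,
using the \code{MR\_Word} type from the earlier sketch;
the structures, fields and function name are simplified assumptions,
not the runtime's actual definitions.
\begin{verbatim}
#include <string.h>

struct MR_Context {
    MR_Word  saved_regs[64];    /* register state saved while suspended */
    void    *resume_code;       /* where to resume execution            */
    /* ... stacks, scheduling links, ...                                */
};

struct MR_Engine {
    MR_Word            regs[64];        /* the abstract machine registers */
    void              *next_code;
    struct MR_Context *current_context;
};

void suspend_current_context(struct MR_Engine *eng)
{
    struct MR_Context *ctx = eng->current_context;

    /* Save the engine's register state into the context. */
    memcpy(ctx->saved_regs, eng->regs, sizeof eng->regs);
    ctx->resume_code = eng->next_code;

    /* The engine is now free to load and run another context. */
    eng->current_context = NULL;
}
\end{verbatim}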
Stacks account for most of a context's memory usage.
When a context finishes execution
it can either be retained by the engine or
have its storage released to the free context pool.
This decision depends on what the engine will do next:
if the engine will execute a different context or go to sleep
(because there is no work to do),
then the current context is released.
Otherwise the engine holds onto the context,
expecting to use it to execute a \emph{spark}.
Sparks represent goals that have been spawned off
but whose execution has not yet been started.
\citet*{wang:2006:hons} introduced sparks with the intention of
enabling work stealing as described by
\citet*{blumofe:1994:work-stealing,halstead:1985:multilisp}
and \citet*{kranz:1989:mul-t}.
We did not implement work stealing until partway through my Ph.D.\
candidature,
and therefore it is not discussed here in the background chapter.
We discuss work stealing in Sections~\ref{sec:rts_work_stealing}
and~\ref{sec:rts_work_stealing2}.
Sparks are very lightweight,
being only three words in size.
Therefore, compared with contexts,
they represent outstanding parallel work very efficiently;
this is beneficial, even without work stealing.
When an engine executes a spark, it converts the spark into a context.
It uses its current context if it has one
(the engine's current context is always free when the engine attempts to run
a spark);
if it does not have a context,
it allocates a new one
(from the pool of free contexts if the pool is not empty,
otherwise the engine creates a fresh context).
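Continuing the simplified structures from the earlier sketch,
this decision procedure can be outlined as follows;
the helper functions stand in for the runtime's context pool and allocation
code and are assumptions made for illustration.
\begin{verbatim}
struct MR_Context *pop_free_context_pool(void);   /* assumed helpers */
struct MR_Context *allocate_new_context(void);

struct MR_Context *context_for_spark(struct MR_Engine *eng)
{
    if (eng->current_context != NULL) {
        /* The engine's current context is free; reuse it for the spark. */
        return eng->current_context;
    }

    /* Try the pool of free contexts first. */
    struct MR_Context *ctx = pop_free_context_pool();
    if (ctx == NULL) {
        /* The pool is empty: create a brand new context. */
        ctx = allocate_new_context();
    }
    return ctx;
}
\end{verbatim}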
We will often compare our use of sparks to that of
\citet*{simonmar_2009_multicore_rts} since GHC's runtime system has a
similar structure.
One of the more significant differences
is that since Haskell is non-strict and Mercury is eager,
sparks in Mercury cannot be garbage collected and must be executed.
In contrast,
Haskell's sparks represent the speculative evaluation of values that may not
be needed, and so they can be safely discarded.
The only parallel construct in Mercury is parallel conjunction,
denoted as $(G_1~\&~\ldots~\&~G_n)$.
All the conjuncts must be \ddet or \dccmulti,
that is, they must all have exactly one solution,
or commit to exactly one solution.
This restriction greatly simplifies parallel execution in two ways:
Firstly,
it guarantees that there can never be any need
to execute $(G_2~\&~\ldots~\&~G_n)$ multiple times,
just because $G_1$ has succeeded multiple times.
(Any local backtracking inside $G_1$ will not be visible to the other
conjuncts;
bindings made by \ddet code are never retracted.)
Supporting the parallelisation of
code that might have more than one solution
would require new infrastructure for managing bindings,
and such infrastructure has a significant runtime cost,
reducing the benefits of parallel execution.
Secondly,
code that might have no solutions also requires extra infrastructure.
Additionally,
if $G_1$ may fail then
$G_2~\&~\ldots~\&~G_n$'s execution is speculative and may waste resources
since its result may never be needed.
Restricting parallelism to \ddet and \dccmulti code is not a significant
limitation:
the design of Mercury strongly encourages deterministic code,
and in our experience about 75 to 85\% of all Mercury procedures are
\ddet.
(This statistic was calculated by counting the different
determinism declarations in the source code of the Mercury system.)
Furthermore,
we expect that most programs spend an even greater fraction of their time in
\ddet code
(we know from profiling data that the Mercury compiler does).
% Table~\ref{tab:recursion_types}) shows that at least 95\% of the
% compiler's time is spent in \ddet, \dccmulti, \dsemidet or \dccnondet
% code, which is the most accuratly we have measured this.
Existing algorithms for executing nondeterministic code in parallel
have very significant overheads, generating slowdowns by integer factors.
Thus we have given priority to parallelising deterministic code,
which we can do with \emph{much} lower overhead.
We think that avoiding such slowdowns is a good idea,
even if it does mean foregoing the parallelisation of 15 to 25\% of a program.
\begin{figure}
\begin{verbatim}
/*
** A SyncTerm. One syncterm is created for each parallel conjunction.
*/
struct MR_SyncTerm_Struct {
    MR_Context          *MR_st_orig_context;
    MR_Word             *MR_st_parent_sp;
    volatile MR_Unsigned MR_st_count;
};

/*
** A Mercury Spark. A spark is created for a spawned off parallel
** conjunct.
*/
struct MR_Spark {
    MR_SyncTerm         *MR_spark_sync_term;
    MR_Code             *MR_spark_resume;
    MR_ThreadLocalMuts  *MR_spark_thread_local_mutables;
};
\end{verbatim}
\caption{Syncterms and sparks}
\label{fig:spark_and_syncterm}
\end{figure}
A Mercury engine begins the execution of $(G_1~\&~G_2~\&~\ldots~\&~G_n)$
by creating a \emph{syncterm} (synchronisation term), a data structure
representing a barrier for the whole conjunction.
The syncterm contains:
a pointer to the context that began executing the parallel conjunction
(the parent context),
a pointer to the stack frame of the procedure containing this parallel
conjunction,
and a thread-safe counter that tracks the number of conjuncts that have not
yet executed their barrier.
After initialising this syncterm, the engine then
spawns off $(G_2~\&~\ldots~\&~G_n)$ as a spark and continues by executing
$G_1$ itself.
The spark contains a pointer to the syncterm,
as well as a pointer to the code it must execute,
and another pointer to a set of thread-local mutables.\footnote{
Mercury has a module-local mutable feature that supports
per-thread mutables.
This is not relevant to this dissertation.}
Figure~\ref{fig:spark_and_syncterm} shows these two data structures;
the \code{MR\_} prefix on symbol names marks symbols belonging to the Mercury
runtime,
preventing them from clashing with any symbols in C code linked with
a Mercury program.
Chapter~\ref{chap:rts} describes how sparks are managed.
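Using the types from Figure~\ref{fig:spark_and_syncterm},
the initialisation and spawning steps might look roughly as follows;
this is a sketch only,
and the helpers for reading the current context, parent stack pointer,
thread-local mutables and spark queue are assumptions made for illustration.
\begin{verbatim}
void MR_init_syncterm(MR_SyncTerm *st, MR_Unsigned num_conjuncts)
{
    st->MR_st_orig_context = MR_current_context();  /* the parent context   */
    st->MR_st_parent_sp    = MR_parent_sp();        /* parent's stack frame */
    st->MR_st_count        = num_conjuncts;         /* conjuncts to wait for */
}

void spawn_off(MR_SyncTerm *st, MR_Code *spawn_label)
{
    struct MR_Spark spark;

    spark.MR_spark_sync_term             = st;
    spark.MR_spark_resume                = spawn_label;
    spark.MR_spark_thread_local_mutables = MR_current_thread_local_muts();

    /* A spark is only three words, so it is stored by value. */
    push_spark(local_spark_queue(), &spark);
}
\end{verbatim}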
When the engine finishes the execution of the first conjunct ($G_1$)
it executes the barrier code \joinandcontinue at the end of the conjunct.
The barrier prevents the original context from executing the code that
follows the parallel conjunction
until each of the conjuncts has been executed.
The barrier also makes several scheduling decisions;
these are described in Chapter~\ref{chap:rts}.
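The essential counting behaviour of the barrier can be sketched as follows;
this ignores the scheduling decisions described in Chapter~\ref{chap:rts},
and the helper functions are assumptions used only for illustration.
\begin{verbatim}
void MR_join_and_continue(MR_SyncTerm *st, MR_Code *cont_label)
{
    /* Atomically decrement the count of conjuncts yet to reach the barrier. */
    if (atomic_decrement_and_fetch(&st->MR_st_count) == 0) {
        /* This was the last conjunct to finish. */
        if (MR_current_context() == st->MR_st_orig_context) {
            jump_to(cont_label);    /* continue after the conjunction */
        } else {
            /* Wake the original context so it continues at cont_label. */
            schedule_context(st->MR_st_orig_context, cont_label);
        }
    } else {
        /* Other conjuncts are still running. */
        if (MR_current_context() == st->MR_st_orig_context) {
            /* The original context must wait for the barrier to complete. */
            suspend_current_context();
        }
        /* The engine is now free to look for other work, such as sparks. */
        look_for_more_work();
    }
}
\end{verbatim}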
\begin{figure}
\begin{center}
\begin{tabular}{ll}
\multicolumn{1}{c}{\textbf{Source code}} &
\multicolumn{1}{c}{\textbf{Transformed pseudo-code}} \\
\hline
& \code{~~MR\_SyncTerm ST;} \\
& \code{~~MR\_init\_syncterm(\&ST, 3);} \\
\code{~~}$($ & \code{~~spawn\_off(\&ST, Spawn\_Label\_1);} \\
\code{~~~~}$G_1$ & \code{~~}$G_1$ \\
\code{~~}$\&$ & \code{~~MR\_join\_and\_continue(\&ST, Cont\_Label);} \\
& \code{Spawn\_Label\_1:} \\
& \code{~~spawn\_off(\&ST, Spawn\_Label\_2);} \\
\code{~~~~}$G_2$ & \code{~~}$G_2$ \\
\code{~~}$\&$ & \code{~~MR\_join\_and\_continue(\&ST, Cont\_Label);} \\
& \code{Spawn\_Label\_2:} \\
\code{~~~~}$G_3$ & \code{~~}$G_3$ \\
\code{~~}$)$ & \code{~~MR\_join\_and\_continue(\&ST, Cont\_Label);} \\
& \code{Cont\_Label:} \\
\end{tabular}
\end{center}
\caption{The implementation of a parallel conjunction}
\label{fig:par_conj}
\end{figure}
Since $(G_2~\&~\ldots~\&~G_n)$ is itself a conjunction,
it is handled in a similar way:
the context executing it
first spawns off $(G_3~\&~\ldots~\&~G_n)$ as a spark that points to the sync
term created earlier,
and then executes $G_2$ itself.
Eventually, the spawned-off remainder of the conjunction
consists only of the final conjunct, $G_n$,
and the context just executes it.
Once every conjunct has synchronised using {\joinandcontinue},
the original context continues execution after the parallel conjunction.
The introduction of the barrier at the end of the conjunction can prevent
the compiler from applying the tail recursion optimisation.
This occurs when $G_n$ ends in a recursive call and the whole conjunction
is the last conjunction in a procedure's body.
We discuss this problem in more detail and provide our solution in
Chapter~\ref{chap:loop_control}.
Figure~\ref{fig:par_conj} shows an example of the code generated to execute
a parallel conjunction.
In this example,
the first conjunct creates a spark representing the
execution of the second and third conjuncts,
and the second conjunct then creates a spark representing just the third
conjunct;
the third conjunct itself does not create any sparks.