backgnd_deppar.tex


\status{This section's content is complete.}

% This is a figure since there are a number of references to it.
\begin{figure}
\begin{verbatim}
map_foldl(M, F, L, Acc0, Acc) :-
    (
        L = [],
        Acc = Acc0
    ;
        L = [H | T],
        (
            M(H, MappedH),
            F(MappedH, Acc0, Acc1)
        &
            map_foldl(M, F, T, Acc1, Acc)
        )
    ).
\end{verbatim}
%\vspace{2mm}
\caption{Parallel \mapfoldl{}}
\label{fig:mapfoldl}
%\vspace{-1\baselineskip}
\end{figure}

Mercury's mode system allows a conjunct in a sequential conjunction to consume
variables that are produced by conjuncts to its left, but not to its right.
However,
The first parallel execution support for Mercury
initially only supported independent AND-parallelism
\citep{conway:2002:par},
Communication was not allowed between conjunctions.
\citet*{wang:2011:dep-par,wang:2006:hons} relaxed this;
now parallel conjunctions have the same mode constraints as sequential
conjunctions.
Parallel conjunctions,
such as the one in Figure~\ref{fig:mapfoldl},
may now
communicate through \emph{shared variables} such as \code{Acc1}.
A shared variable is a variable that is bound within the parallel
conjunction and occurs in more than one conjunct.
Recall that in Mercury,
the code location where a variable becomes bound is known
at compile time;
therefore,
shared variables can be identified at compile time.
The compiler replaces each shared variable with a \emph{future}
\citep{halstead:1984:multilisp},
which is used to safely communicate the variable's value among conjuncts.

\begin{figure}
\begin{verbatim}
enum MR_produced {
    MR_NOT_YET_PRODUCED, 
    MR_PRODUCED
};

struct MR_Future {
    /* lock preventing concurrent accesses */
    MercuryLock             MR_fut_lock;
    /* whether this future has been signalled yet */
    enum MR_produced        MR_fut_produced;

    /* linked list of all the contexts blocked on this future */
    MR_Context              *MR_fut_suspended;
    MR_Word                 MR_fut_value;
};
\end{verbatim}
\caption{Future data structure}
\label{fig:future}
\end{figure}

Figure~\ref{fig:future} shows the future data structure,
which contains room for the value of the variable,
a flag indicating whether the variable has been produced yet,
a queue of consumer contexts waiting for the value, and a mutex.
The initial value of the future has the flag set to
\code{MR\_NOT\_YET\_PRODUCED}.

Consumers call \wait when they want to retrieve a value from the future.
This acquires the lock, and if \code{MR\_fut\_produced} is 
\code{MR\_PRODUCED}, retrieves
\code{MR\_fut\_value} before releasing the lock.
If \code{MR\_fut\_produced} was \code{MR\_NOT\_YET\_PRODUCED},
then \wait will add the current context to
the list of suspended contexts before unlocking the future.
The engine will then suspend the context and look for other work.
Once the value of the future is provided,
the context will be woken up along with the others on the list of suspended
contexts.
\wait contains an optimisation:
it will check \code{MR\_fut\_produced} before
acquiring the lock as well as after.
If it is \code{MR\_PRODUCED} before the lock is acquired,
then \wait can safely retrieve the
future's value without using the lock.
Note also that because Mercury's mode system ensures that variable
dependencies can never form cycles,
the compiler's use of futures cannot create a deadlock.

A producer will call \signal to place a value into a future and wake up any
suspended contexts.
\signal will also acquire the lock to ensure that it does not race with any \wait.
\signal writes the value of the future to memory,
before it sets 
\code{MR\_fut\_produced} to \code{MR\_PRODUCED}.
These operations are separated by a memory barrier to ensure that they are
visible to other threads in this order.
After releasing the lock, \signal will schedule the suspended contexts.
Because \signal has no outputs and is deterministic,
it must be declared as impure so that the compiler will not optimise away calls
to it.

Some readers may wonder whether futures are similar to POSIX condition
variables \citep{butenhof1997:pthreads}.
While both name their operations \emph{wait} and \emph{signal},
they are different in two significant ways.
First,
futures store a value as well as a state,
POSIX condition variables store only their state.
Second,
when a future is signalled, all its consumers are woken,
whereas only one of a POSIX condition variable's waiters is woken.
A POSIX condition variable is more similar to a semaphore.

To minimise waiting,
the compiler pushes \signal operations on each future
as far to the left into the producer conjunct as possible,
and it pushes \wait operations
as far to the right into each of the consumer conjuncts as possible.
This means not only pushing them
into the bodies of predicates called by the conjunct,
but also into the bodies of the predicates they call,
with the intention that on each execution path
each \signal is put immediately after
the primitive goal that produces the value of the variable,
and each \wait is put immediately before
the leftmost primitive goal that consumes the value of the variable.
Since the compiler has complete information
about which goals produce and consume which variables,
the only things that can stop the pushing process
are module boundaries and  higher order calls.
This is because
the compiler cannot push a \wait or \signal operation
into the body of a predicate that it does not have access to
or into a predicate that it cannot identify.

Any variable is produced exactly once along any execution path,
and therefore there is exactly one \signal operation along any path.
However, a variable may be used any number of times along an execution path,
therefore there can be any number of \wait operations along a path.
The maximum number of \wait operations along a path is equal to the number
of times the future's value is used.
Reading a future using \wait is more expensive than reading a conventional
variable.
Therefore, we want to minimise the number of \wait operations that can be
executed along a path.
The minimum number of \wait operations along a path is either zero or one,
depending on whether a variable is used on that path.
This can be done either by adding an extra variable to track the value of
the future, setting this variable's value from the first \wait operation;
or by introducing a new operation, \get whose runtime cost is lower
and that can safely be used for the second and subsequent \wait operations.
\get is useful in cases where it is difficult or impossible to track the
value of an extra variable.

\paul{The auto-parallelism analysis assumes that a wait is used
on all success paths, which is not the default.}
An implementor can choose among always using a \wait operation or
trying to optimise the second and subsequent calls to \wait or
some point on a scale between these two extremes.
There is a trade-off to be made here.
The first option has very little analysis at compile time but has worse
runtime performance,
the second option can require a lot of compile time analysis and has ideal
runtime performance.
We have chosen to use a simple method to reduce the number of \wait
operations:
When variables are replaced by futures and \wait operations are inserted,
the transformation tracks whether there is a \wait operation on every success
path leading to the current goal.
If this is true and the current goal wants to access the value of the future
then a \get operation is used instead of a wait operation.

While \signal actually makes the value of a future available,
we have to consider that \wait has the same effect
since after a call to \wait the value of a future is available and it
may not have been available earlier.
In other words,
\wait has the side effect of guaranteeing that the value of the future is
available.
\get can only be called when the value of the future is true,
therefore it is affected by \wait's side effect.
Therefore, \get must be semipure and \wait must be impure.
This prevents the compiler from optimising away or moving a call to \wait
that is necessary to make a call to \get safe.

\begin{figure}
\begin{verbatim}
map_foldl(M, F, L, Acc0, Acc) :-
    (
        L = [],
        Acc = Acc0
    ;
        L = [H | T],
        future_new(FutureAcc1),
        (
            M(H, MappedH),
            F(MappedH, Acc0, Acc1),
            future_signal(FutureAcc1, Acc1)
        &
            map_foldl_par(M, F, T, FutureAcc1, Acc)
        )
    ).

map_foldl_par(M, F, L, FutureAcc0, Acc) :-
    (
        L = [],
        future_wait(FutureAcc0, Acc0),
        Acc = Acc0
    ;
        L = [H | T],
        future_new(FutureAcc1),
        (
            M(H, MappedH),
            future_wait(FutureAcc0, Acc0),
            F(MappedH, Acc0, Acc1),
            future_signal(FutureAcc1, Acc1)
        &
            map_foldl_par(M, F, T, FutureAcc1, Acc)
        )
    ).
\end{verbatim}
%\vspace{2mm}
\caption{\mapfoldl{} with synchronisation}
\label{fig:map_foldl_sync}
%\vspace{-1\baselineskip}
\end{figure}

Given the \mapfoldl predicate in Figure~\ref{fig:mapfoldl},
this synchronisation transformation
generates the code in Figure~\ref{fig:map_foldl_sync}.