\section{Background and Related Work}
\label{sec:background}
The dominant frequent subgraph mining (FSM) algorithm over the past decade
has been gSpan~\cite{gspan}. gSpan has been shown to perform well relative
to other existing systems in terms of both computation time and memory
consumption. However, gSpan and algorithms like it are difficult to
parallelize: they perform a recursive depth-first search of the solution
space, and each search can take a variable amount of time.
Despite this, Buehrer et al. produced a shared-memory
parallel version of a gSpan-like algorithm that achieved good speedup on a
single NUMA system~\cite{buehrer2005parallel}. They did so by farming out
the depth-first search for each candidate subgraph to a separate thread,
and used a dynamic queueing system to spread the work more evenly and
address the variable mining times. In that work, they cite
memory contention as one of their main bottlenecks. It is our hope that
a distributed-memory implementation using MPI may not suffer
from that issue.
More recently, Wang et al. parallelized gSpan on a
GPU~\cite{gspancuda}, again achieving good results. They parallelized both
the subgraph mining and the generation of the canonical DFS lexicographic
ordering of nodes, relying on the built-in NVIDIA CUDA scheduler to handle
the resulting queue of work. However, using a GPU without any higher-level
parallel framework limits the problem size to the relatively small
resources of a single GPU.
The success of these two previous implementations leads to two interesting
conclusions:
\begin{itemize}
\item gSpan-like algorithms can be parallelized effectively.
\item Even if a pure MPI implementation is not as performant,
      there may be the possibility of a hybrid approach
      that uses MPI to distribute work to these single-node
      (or single-GPU) implementations; a minimal sketch of this
      idea follows the list.
\end{itemize}
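As a rough illustration of that second point, the sketch below shows one
way MPI could hand out candidate subgraphs dynamically from a master rank
to worker ranks. It is only a sketch under stated assumptions:
\texttt{mine\_candidate} is a hypothetical stand-in for a single-node
(shared-memory or GPU) miner and does not correspond to any of the cited
implementations.
\begin{verbatim}
/* Sketch of a hybrid master/worker scheme: rank 0 keeps a dynamic
 * queue of candidate subgraphs; workers pull one candidate at a time
 * and mine it locally.  mine_candidate() is a placeholder for a
 * single-node (shared-memory or GPU) gSpan-style miner. */
#include <mpi.h>
#include <stdio.h>

#define TAG_WORK 1
#define TAG_STOP 2

/* Placeholder: a real miner would expand the DFS code tree rooted
 * at this candidate subgraph. */
static void mine_candidate(int candidate_id)
{
    printf("mining candidate %d\n", candidate_id);
}

int main(int argc, char **argv)
{
    int rank, size, num_candidates = 64;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                /* master: dynamic work queue */
        int next = 0, done = 0, msg;
        MPI_Status st;
        while (done < size - 1) {
            /* A worker announces it is idle by sending its rank. */
            MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (next < num_candidates) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                done++;
            }
        }
    } else {                        /* worker: pull, mine, repeat */
        int id;
        MPI_Status st;
        for (;;) {
            MPI_Send(&rank, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
            MPI_Recv(&id, 1, MPI_INT, 0, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP)
                break;
            mine_candidate(id);
        }
    }

    MPI_Finalize();
    return 0;
}
\end{verbatim}
In this scheme the master rank plays the same role as the dynamic queue of
Buehrer et al., but across node boundaries, which is one reason a
distributed-memory version might sidestep the memory-contention bottleneck
noted above.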