#LyX 2.0 created this file. For more info see http://www.lyx.org/
\lyxformat 413
\begin_document
\begin_header
\textclass article
\begin_preamble
\usepackage{amsmath}
%\newcommand\argmin{\operatornamewithlimits{arg\,min}}
%\DeclareMathOperator{\argmin}{arg\,min} 
 \def\argmin{\mathop{\operator@font arg\,min}} 
\def\argmax{\mathop{\operator@font arg\,max}} 
%\DeclareMathOperator{\argmax}{arg\,max}
\newcommand{\gini}{\mathtt{gini}}
\newcommand{\rce}{\mathtt{rce}}
\usepackage{siunitx}
\end_preamble
\use_default_options true
\maintain_unincluded_children false
\language english
\language_package default
\inputencoding auto
\fontencoding global
\font_roman palatino
\font_sans default
\font_typewriter default
\font_default_family default
\use_non_tex_fonts false
\font_sc false
\font_osf false
\font_sf_scale 100
\font_tt_scale 100

\graphics default
\default_output_format default
\output_sync 0
\bibtex_command default
\index_command default
\paperfontsize 12
\spacing onehalf
\use_hyperref false
\papersize a4paper
\use_geometry true
\use_amsmath 1
\use_esint 1
\use_mhchem 1
\use_mathdots 1
\cite_engine basic
\use_bibtopic false
\use_indices false
\paperorientation portrait
\suppress_date false
\use_refstyle 0
\index Index
\shortcut idx
\color #008000
\end_index
\leftmargin 20page%
\topmargin 15page%
\rightmargin 15page%
\bottommargin 15page%
\secnumdepth 3
\tocdepth 3
\paragraph_separation skip
\defskip bigskip
\quotes_language english
\papercolumns 1
\papersides 1
\paperpagestyle default
\tracking_changes false
\output_changes false
\html_math_output 0
\html_css_as_file 0
\html_be_strict false
\author 1 "" 
\author 3 "Eleftherios Garyfallidis,,," 
\end_header

\begin_body

\begin_layout Section
Highly Efficient 
\begin_inset Newline newline
\end_inset

Tractography Clustering
\begin_inset CommandInset label
LatexCommand label
name "sec:Highly-Efficient-Tractography"

\end_inset

 
\end_layout

\begin_layout Subsection
Overview
\end_layout

\begin_layout Standard
Current tractography propagation algorithms can generate very large tractographi
es which are difficult to interpret and visualize.
 Clustering offers a way to reduce the complexity
 of these datasets and provide a useful segmentation; however, most proposed
 clustering algorithms are very slow and often need to calculate pairwise
 distances of size 
\begin_inset Formula $N\times N$
\end_inset

 where 
\begin_inset Formula $N$
\end_inset

 is the number of tracks.
 This number of comparisons places a heavy load on clustering algorithms, forcing
 them to be inefficient and therefore impractical for everyday analysis
 as it is difficult to compute all these distances or even store them in
 memory.
 This not only adds overhead to the use of tractography for clinical applications
 but also introduces a barrier to understanding and interpreting the quality
 of diffusion data sets.
 We show in this chapter that a stable, on average linear time clustering
 algorithm exists.
 We call this algorithm QuickBundles (QB).
 QB can be used to generate meaningful clusters in seconds with minimum
 memory consumption.
 In our approach we do not need to calculate all pairwise distances unlike
 most of the other existing methods.
 Furthermore, we can update our clustering online or in parallel.
 We show that we can generate meaningful clusters of the order of 
\begin_inset Formula $1,000$
\end_inset

 times faster than any other available method and that it can be used to
 segment from a few hundred to many millions of tracks.
 Moreover our method is multi-purpose; its results can either stand on their
 own to explore the neuroanatomy directly, or the clustering technique can
 be used as a precursor tool which reduces the dimensionality of the data,
 which can then be used as an input to other algorithms of higher order
 complexity, resulting in their greater efficiency.
 Beyond the use of this algorithm to simplify tractographies, we show here
 how it can help identify landmarks, create atlases, and compare and register
 tractographies.
\end_layout
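
\begin_layout Standard
To give a sense of scale, consider the following back-of-the-envelope estimate.
 It is only a sketch under illustrative assumptions that are ours and not
 measurements from this chapter: a tractography of 
\begin_inset Formula $N=10^{6}$
\end_inset

 tracks, and 
\begin_inset Formula $4$
\end_inset

 bytes of storage per distance.
 Keeping only the upper triangle of the symmetric distance matrix would
 then require
\begin_inset Formula 
\begin{eqnarray*}
\frac{N(N-1)}{2} & \approx & 5\times10^{11}\,\textrm{distances},\\
5\times10^{11}\times4\,\textrm{bytes} & = & 2\times10^{12}\,\textrm{bytes}\approx2\,\textrm{TB},
\end{eqnarray*}

\end_inset

 which suggests why methods that rely on the full distance matrix are usually
 restricted to subsets of the tractography.
\end_layout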

\begin_layout Subsection
Track distances and preprocessing
\begin_inset CommandInset label
LatexCommand label
name "sub:track-distances"

\end_inset


\end_layout

\begin_layout Standard
For clarity we first give brief details of various metrics for distances
 between tracks as they are integral to an understanding of the track clustering
 literature.
 Numerous distance metrics between two trajectories have been proposed in
 the literature, such as in 
\begin_inset CommandInset citation
LatexCommand cite
key "Ding2003"

\end_inset

, 
\begin_inset CommandInset citation
LatexCommand cite
key "MaddahIPMI2007"

\end_inset

, 
\begin_inset CommandInset citation
LatexCommand cite
key "zhang2005dti"

\end_inset

 with the most common being the Hausdorff distance found in 
\begin_inset CommandInset citation
LatexCommand cite
key "corouge2004towards"

\end_inset

 and many other studies.
 We mainly use a very simple symmetric distance proposed in 
\begin_inset CommandInset citation
LatexCommand cite
key "EGMB10"

\end_inset

 and 
\begin_inset CommandInset citation
LatexCommand cite
key "Visser2010"

\end_inset

 which we call Minimum average Direct-Flip 
\begin_inset Formula $\textrm{MDF}(s_{A},s_{B})$
\end_inset

 distance between track 
\begin_inset Formula $s_{A}$
\end_inset

 and track 
\begin_inset Formula $s_{B}$
\end_inset

 (see Eq.
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "eq:direct_flip_distance"

\end_inset

).
 This distance can be applied only when both tracks have the same number
 of points.
 Therefore, we assume that an initial downsampling of tracks has been implemente
d, where all segments on a track have the same length, and all tracks have
 the same number of segments.
 Under that assumption MDF is defined as: 
\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\noun off
\color none

\begin_inset Formula 
\begin{eqnarray}
\textrm{MDF}(s_{A},s_{B}) & = & \min(d_{\texttt{direct}},d_{\textrm{\texttt{flipped}}}),\,\textrm{where}\label{eq:direct_flip_distance}\\
d_{\textrm{\texttt{direct}}}(s_{A},s_{B}) & = & \frac{1}{K}\sum_{i=1}^{K}||\mathbf{x}_{i}^{A}-\mathbf{x}_{i}^{B}||_{2}\,\textrm{and}\nonumber \\
d_{\texttt{flipped}}(s_{A},s_{B}) & = & \frac{1}{K}\sum_{i=1}^{K}||\mathbf{x}_{i}^{A}-\mathbf{x}_{K-i+1}^{B}||_{2}\nonumber 
\end{eqnarray}

\end_inset


\family default
\series default
\shape default
\size default
\emph default
\bar default
\noun default
\color inherit

\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
noindent
\end_layout

\end_inset

 where 
\begin_inset Formula $K$
\end_inset

 is the number of points 
\begin_inset Formula $\mathbf{x}_{i}$
\end_inset

 on the two tracks 
\begin_inset Formula $A$
\end_inset

 and 
\begin_inset Formula $B$
\end_inset

.
\end_layout
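
\begin_layout Standard
As a small worked example of the definition above, consider two hypothetical
 planar tracks with 
\begin_inset Formula $K=3$
\end_inset

 points each (the coordinates are chosen purely for illustration): 
\begin_inset Formula $s_{A}=\{(0,0),(1,0),(2,0)\}$
\end_inset

 and 
\begin_inset Formula $s_{B}=\{(2,1),(1,1),(0,1)\}$
\end_inset

, i.e.
 two parallel tracks stored in opposite directions.
 Pairing the points in the stored order, and then with the points of 
\begin_inset Formula $s_{B}$
\end_inset

 taken in reverse order, gives
\begin_inset Formula 
\begin{eqnarray*}
d_{\texttt{direct}}(s_{A},s_{B}) & = & \frac{1}{3}\left(\sqrt{5}+1+\sqrt{5}\right)\approx1.82,\\
d_{\texttt{flipped}}(s_{A},s_{B}) & = & \frac{1}{3}\left(1+1+1\right)=1,\\
\textrm{MDF}(s_{A},s_{B}) & = & \min(1.82,\,1)=1.
\end{eqnarray*}

\end_inset

 The flipped term recovers the separation of 
\begin_inset Formula $1$
\end_inset

 between the two tracks, which is how MDF copes with tracks whose points
 are stored in opposite orders.
\end_layout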

\begin_layout Standard
In some cases it is still valid to use a family of Hausdorff distances which
 for simplicity we denote as MAM distances 
\begin_inset ERT
status open

\begin_layout Plain Layout

--
\end_layout

\end_inset

 short for Minimum, or Maximum, or Mean, Average Minimum distance (MAM).
 We mostly use the Mean version of this family (see Eq.
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "eq:mean_average_distance"

\end_inset

) but the others are potentially useful as they can weight different properties
 of the tracks.
 These distances are slower to compute than MDF but they can work with tracks
 that have different numbers of segments, a property that is useful for some applications.
 The equations below show the formulation of these distances:
\begin_inset Formula 
\begin{eqnarray}
d_{\textrm{avg}}(s_{A},s_{B}) & = & \frac{1}{K_{A}}\sum_{i=1}^{K_{A}}d(\mathbf{x}_{i}^{A},s_{B}),\nonumber \\
d_{\textrm{min}}(s_{A},s_{B}) & = & \min_{i=1,...,K_{A}}d(\mathbf{x}_{i}^{A},s_{B}),\,\textrm{and}\label{eq:mininum_distance}\\
d_{\textrm{max}}(s_{A},s_{B}) & = & \max_{i=1,...,K_{A}}d(\mathbf{x}_{i}^{A},s_{B})\,\textrm{where}\label{eq:maximum distance}\\
d(\mathbf{x},s_{B}) & = & \min_{j=1,...,K_{B}}||\mathbf{x}-\mathbf{x}_{j}^{B}||_{2}.\nonumber \\
\textrm{MAM}_{\textrm{min}}(s_{A},s_{B}) & = & \min(d_{\textrm{avg}}(s_{A},s_{B}),d_{\textrm{avg}}(s_{B},s_{A}))\label{eq:min_average_distance}\\
\textrm{MAM}_{\textrm{max}}(s_{A},s_{B}) & = & \max(d_{\textrm{avg}}(s_{A},s_{B}),d_{\textrm{avg}}(s_{B},s_{A}))\nonumber \\
\textrm{MAM}_{\textrm{avg}}(s_{A},s_{B}) & = & (d_{\textrm{avg}}(s_{A},s_{B})+d_{\textrm{avg}}(s_{B},s_{A}))/2\label{eq:mean_average_distance}
\end{eqnarray}

\end_inset


\end_layout

\begin_layout Standard

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\noun off
\color none
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
noindent
\end_layout

\end_inset

 where the number of points 
\begin_inset Formula $K_{A}$
\end_inset

 and 
\begin_inset Formula $K_{B}$
\end_inset

 on the two tracks are not necessarily the same.
 For the same threshold value 
\begin_inset Formula $\textrm{MAM}_{\textrm{min}}$
\end_inset

, 
\lang british

\begin_inset Formula $\textrm{MAM}_{\textrm{max}}$
\end_inset

 and 
\begin_inset Formula $\textrm{MAM}_{\textrm{avg}}$
\end_inset

 will give different results.
 For example,
\lang english
 
\begin_inset Formula $\textrm{MAM}_{\textrm{min}}$
\end_inset

 will bring together more short tracks with long tracks than 
\begin_inset Formula $\textrm{MAM}_{\textrm{max}}$
\end_inset

 and 
\lang british

\begin_inset Formula $\textrm{MAM}_{\textrm{avg}}$
\end_inset


\lang english
 will have an in-between effect.
 Finally, distances other than the average minimum, based on the minimum
 (see Eq.
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "eq:mininum_distance"

\end_inset

) or maximum distance (see Eq.
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "eq:maximum distance"

\end_inset

) can be used.
 However, we have not investigated them in this thesis.
\end_layout
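
\begin_layout Standard
To illustrate how the three variants can differ, here is a worked example
 with two hypothetical planar tracks (coordinates chosen by us purely for
 illustration): a short track 
\begin_inset Formula $s_{A}=\{(0,0),(1,0)\}$
\end_inset

 with 
\begin_inset Formula $K_{A}=2$
\end_inset

 and a longer track 
\begin_inset Formula $s_{B}=\{(0,1),(1,1),(5,1)\}$
\end_inset

 with 
\begin_inset Formula $K_{B}=3$
\end_inset

.
 Applying the definitions above gives
\begin_inset Formula 
\begin{eqnarray*}
d_{\textrm{avg}}(s_{A},s_{B}) & = & \frac{1}{2}\left(1+1\right)=1,\\
d_{\textrm{avg}}(s_{B},s_{A}) & = & \frac{1}{3}\left(1+1+\sqrt{17}\right)\approx2.04,\\
\textrm{MAM}_{\textrm{min}}(s_{A},s_{B}) & = & 1,\\
\textrm{MAM}_{\textrm{max}}(s_{A},s_{B}) & \approx & 2.04,\\
\textrm{MAM}_{\textrm{avg}}(s_{A},s_{B}) & \approx & 1.52.
\end{eqnarray*}

\end_inset

 In this sketch every point of the short track is close to the long track,
 but the far end of the long track is distant from the short one.
 With a threshold of, say, 
\begin_inset Formula $1.25$
\end_inset

 only 
\begin_inset Formula $\textrm{MAM}_{\textrm{min}}$
\end_inset

 would group the two tracks together.
\end_layout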

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Plain Layout
\begin_inset ERT
status open

\begin_layout Plain Layout

[th!]
\end_layout

\end_inset


\end_layout

\begin_layout Plain Layout
\noindent
\align center
\begin_inset Graphics
	filename QB/Thesis/Fig_2_distances2.png
	lyxscale 20
	scale 60
	rotateOrigin center

\end_inset


\begin_inset Caption

\begin_layout Plain Layout
Distances used in this work.
 The main distance used is minimum average direct flip (MDF) distance 
\begin_inset Formula $\textrm{MDF}=\min(d_{\textrm{\texttt{direct}}},d_{\texttt{flipped}})$
\end_inset

 which is a symmetric distance that can deal with the track bi-directionality
 problem and works on tracks which have the same number of points.
 Another distance is the mean average distance which is again symmetric
 but does not require the tracks to have the same number of points 
\begin_inset Formula $\textrm{MAM}_{\textrm{avg}}=(d_{avg}(s_{A},s_{B})+d_{avg}(s_{B},s_{A}))/2$
\end_inset

.
 The components of both distances are shown; with solid lines we draw the
 tracks, and then with dashed lines we connect the pairs of points of the
 two tracks whose distances contribute to the overall metrics.
\end_layout

\end_inset


\begin_inset CommandInset label
LatexCommand label
name "Flo:Distances_used"

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Standard
The main advantages of the MDF distance (see Eq.
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "eq:direct_flip_distance"

\end_inset

) are that it is fast to compute, it takes account of track direction issues
 through consideration of both direct and flipped tracks, and that it is
 easy to understand how it will behave, from the simplest case of parallel
 equi-length tracks to the most complicated of very divergent tracks.
 Another advantage is that it will separate short tracks from long tracks;
 a track A that is half the length of track B will be relatively poorly
 matched on MDF to B.
 We will see later in this chapter that this helps to find broken or erroneous
 tracks.
 An asset of having tracks with the same number of points is that we can
 easily do pairwise calculations on them; for example add two or more tracks
 together to create a new average track.
 We will see in the next section that track addition is a key property of
 our clustering algorithm.
 Some care should be taken when choosing the number of points
 allowed in a track (track downsampling).
 We always keep the endpoints intact and then downsample in equidistant
 segments.
 This means that short tracks will have the same number of points as long
 tracks.
 Therefore, more of the curvature of the long tracks will be lost relative to
 the short tracks, i.e.
 the short tracks will have higher resolution.
 We found empirically that this is not an important issue and that for clusterin
g purposes even downsampling to only 
\begin_inset Formula $3$
\end_inset

 points in total could be useful 
\begin_inset CommandInset citation
LatexCommand cite
key "EGMB10"

\end_inset

.
 Depending on the application fewer or more points can be used.
 


\end_layout
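
\begin_layout Standard
The separation of short from long tracks can be seen in a simple worked
 example.
 The configuration is hypothetical and chosen by us for illustration: a straight
 track 
\begin_inset Formula $s_{B}=\{(0,0),(2,0),(4,0)\}$
\end_inset

 of length 
\begin_inset Formula $L=4$
\end_inset

 and a track of half that length lying on top of it, 
\begin_inset Formula $s_{A}=\{(0,0),(1,0),(2,0)\}$
\end_inset

, both downsampled to 
\begin_inset Formula $K=3$
\end_inset

 equidistant points with the endpoints kept intact.
 Then
\begin_inset Formula 
\begin{eqnarray*}
d_{\texttt{direct}}(s_{A},s_{B}) & = & \frac{1}{3}\left(0+1+2\right)=1,\\
d_{\texttt{flipped}}(s_{A},s_{B}) & = & \frac{1}{3}\left(4+1+2\right)\approx2.33,\\
\textrm{MDF}(s_{A},s_{B}) & = & 1=\frac{L}{4}.
\end{eqnarray*}

\end_inset

 Although every point of 
\begin_inset Formula $s_{A}$
\end_inset

 lies on 
\begin_inset Formula $s_{B}$
\end_inset

, the MDF distance is a quarter of the length of the longer track, because
 corresponding points drift apart along the track.
 For this particular straight-line configuration the value 
\begin_inset Formula $L/4$
\end_inset

 holds for any number of equidistant points 
\begin_inset Formula $K$
\end_inset

.
\end_layout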

\begin_layout Subsection
Related Work
\end_layout

\begin_layout Standard
During the last 
\begin_inset Formula $10$
\end_inset

 years there have been numerous efforts from many researchers to address
 the unsupervised and supervised learning problems of brain tractography.
 As far as we know, all these methods suffer from low efficiency; however,
 they provide many useful ideas which we describe in this section.
 


\end_layout

\begin_layout Standard
Tractography clustering algorithms are rarely compared in the literature.
 Nonetheless, Moberts et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "moberts2005evaluation"

\end_inset

 are an exception.
 They evaluated different popular hierarchical clustering methods including
 a less common one, shared nearest neighbor (SNN), against a gold standard
 segmentation by physicians.
 The authors concluded that single-link clustering with mean average distance
 was the method which performed best.
 Wang et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "wang2010tractography"

\end_inset

 proposed a nonparametric Bayesian framework using a hierarchical Dirichlet
 process mixture model (HDPM).
 This is one of the very few methods not based on distances.
 In this work a track is modeled as a discrete distribution over a codebook
 of discretized orientations and voxel regions.
 The authors explain that calculating pairwise distances is very time consuming
 and therefore they avoid using them.
 Their approach automatically learns the number of clusters from data with
 Dirichlet process priors, but it is still not efficient enough for real-time
 operation.
 A disadvantage of this method is that the priors do not originate from
 anatomical knowledge.
 
\end_layout

\begin_layout Standard
Visser et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "Visser2010"

\end_inset

 used hierarchical clustering and fuzzy c-means together with recombination
 of subsets of the same tractography to reduce the effect of the large datasets
 on the distance matrix based on the MDF distance (see section
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "sub:track-distances"

\end_inset

) 
\begin_inset CommandInset citation
LatexCommand cite
key "EGMB10"

\end_inset

.
 An interesting result with this method was that they could automatically
 find the different sub-bundles of the Arcuate Fasciculus region in accordance
 with the supervised labeling described in 
\begin_inset CommandInset citation
LatexCommand cite
key "catani2005perisylvian"

\end_inset

.
 The algorithm that we present in this chapter also uses the minimum average
 direct-flip (MDF) metric as a measure of distance between tracks.
 Gerig et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "gerig2004analysis"

\end_inset

 also used hierarchical clustering with a symmetrised version of closest
 point distances, 
\begin_inset Formula $\mathrm{MA}\mathrm{M}_{\mathrm{avg}}$
\end_inset

 and 
\begin_inset Formula $\mathrm{MA}\mathrm{M}_{\mathrm{max}}$
\end_inset

 (Hausdorff).
 However, they tested their method with only two bundles: Uncinate Fasciculus
 and the Corticospinal Tract.
 
\end_layout

\begin_layout Standard
Guevara et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "Guevara2010"

\end_inset

 combined a great number of different algorithms from hierarchical clustering
 to 3D watershed on track extremities.
 They first divided the tractography into left-right hemisphere, inter-hemispher
ic and cerebellum subsets.
 They then created further subsets of different track length, used hierarchical
 clustering based on the random voxel parcels, used watershed over extremities
 and finally used hierarchical clustering to merge the different sub-bundles
 using the Hausdorff distance (see section 
\begin_inset CommandInset ref
LatexCommand ref
reference "sub:track-distances"

\end_inset

).
 This work stressed the need to divide the data set between shorter and
 longer tracks.
 Tsai et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "Tsai2007"

\end_inset

 used a combination of clustering methods based on minimum spanning trees,
 locally linear embedding and k-means.
 They were able to incorporate both local and global structures by changing
 a few parameters.
 The main advantage of this method was that it showed a way to merge a chain
 of neighbouring structures into one cluster.
 Zhang and Laidlaw 
\begin_inset CommandInset citation
LatexCommand cite
key "zhang2005dti"

\end_inset

 used an agglomerative hierarchical clustering using the same distance as
 in 
\begin_inset CommandInset citation
LatexCommand cite
key "zhang2003visualizing"

\end_inset

 and later in 
\begin_inset CommandInset citation
LatexCommand cite
key "zhang2008identifying"

\end_inset

 combined distance-based single linkage hierarchical clustering with expert
 labeling of specific bundles.
 Zvitia et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "zvitia2008adaptive"

\end_inset

, 
\begin_inset CommandInset citation
LatexCommand cite
key "Zvitia2010"

\end_inset

 used adaptive mean shift.
 This is a clustering algorithm which automatically finds the number of
 clusters, in contrast with, for example, k-means, where the user needs
 to prespecify the number of clusters.
 They also used this approach for direct registration of tractographies
 but only with tractographies from the same subject.
 El Kouby et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "ElKouby2005"

\end_inset

 created a ROI-based connectivity matrix where the 
\begin_inset Formula $i,j$
\end_inset

th entry of the matrix holds the number of tracks which connect 
\begin_inset Formula $ROI_{i}$
\end_inset

 to 
\begin_inset Formula $ROI_{j}$
\end_inset

.
 K-means was used afterwards on the rows of the matrix to cluster the tracks.
 This technique can be used for clustering bundles across subjects.
\end_layout

\begin_layout Standard
Brun et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "brun2004clustering"

\end_inset

 used the mean and covariance of the track as the feature space and normalized
 cuts based on a graph theoretic approach for the segmentation.
 Ding et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "Ding2003a"

\end_inset

 used k-nearest neighbours, another agglomerative approach, applied to correspon
ding track segments.
 Corouge et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "corouge2004towards"

\end_inset

 used different types of track distances, e.g.
 Hausdorff distances, and other geometric properties such as torsion and
 curvature, and in 
\begin_inset CommandInset citation
LatexCommand cite
key "Corouge2004"

\end_inset

 and 
\begin_inset CommandInset citation
LatexCommand cite
key "Corouge2006"

\end_inset

 used Generalized Procrustes Analysis and Principal Components Analysis
 (PCA) to analyze the shape of bundles.
\end_layout

\begin_layout Standard
O'Donnell et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "ODonnell_IEEETMI07"

\end_inset

 created a tractographic atlas using spectral embedding and expert anatomical
 labeling.
 They then automatically segmented tractographies using spectral clustering,
 expressing the tracks as points in the embedded space and assigning them
 to the closest existing atlas clusters.
 The full affinity matrix was too big to compute; therefore they used the
 Nystrom approximation, working on a subset and avoiding the generation of
 the complete affinity/distance matrix.
 Later in 
\begin_inset CommandInset citation
LatexCommand cite
key "o2009tract"

\end_inset

 they tried group analysis on prespecified bundles.
\end_layout

\begin_layout Standard
Maddah et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "Maddah_MICCA2005"

\end_inset

 used B-spline representations of tracks referenced to an atlas, and then
 the tracks were clustered based on the labeled atlas.
 Later Maddah et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "maddah2006statistical"

\end_inset

 using a similar track representation (quintic B-splines) calculated a model
 for each bundle as the average and standard deviation of that parametric
 representation.
 In that way they created an atlas which is used as a prior for expectation
 maximization (EM) clustering of the Corpus Callosum tracks into Witelson
 subdivisions 
\begin_inset CommandInset citation
LatexCommand cite
key "witelson1989hand"

\end_inset

 using population averages.
 Later, Maddah et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "Maddah_IEEEBI2008"

\end_inset

 showed that it is possible to combine spatial priors with metrics
 for the shape of the tracks in order to guide the clustering process.
\end_layout

\begin_layout Standard
Jonasson et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "jonasson2005fiber"

\end_inset

 created a large 
\begin_inset Formula $N\times N$
\end_inset

 co-occurrence matrix, where 
\begin_inset Formula $N$
\end_inset

 is the number of the fibers to cluster.
 The co-occurrence (affinity) matrix contained the number of times that
 two fibers share the same voxel.
 They then used spectral clustering.
 Jianu et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "jianu2009exploring"

\end_inset

 presented a new method for visualizing and navigating through tractography
 data combining dendrograms from hierarchical clustering along with 3D-
 and 2D-embeddings using the approximation that Chalmers
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "chalmers1996linear"

\end_inset

 introduced for the technique of Eades
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "eades1984heuristic"

\end_inset

.
\end_layout

\begin_layout Standard
Durrleman et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "Durrleman2009"

\end_inset

 introduced electrical current models of fibre bundles where a fibre is
 seen as a set of wires sending information in one direction at constant
 rate.
 Currents have good diffeomorphic properties and can be used for registration
 of bundles as shown in 
\begin_inset CommandInset citation
LatexCommand cite
key "Durrleman2009"

\end_inset

 and later in 
\begin_inset CommandInset citation
LatexCommand cite
key "durrleman2010registration"

\end_inset

.
 This methodology does not impose point-to-point or fibre-to-fibre correspondences;
 however, it is sensitive to fibre density and orientation of the bundles
 and it is computationally expensive.
\end_layout

\begin_layout Standard
Leemans and Jones 
\begin_inset CommandInset citation
LatexCommand cite
key "leemans17new"

\end_inset

 used affinity propagation (section 
\begin_inset CommandInset ref
LatexCommand ref
reference "sub:Affinity-Propagation"

\end_inset

) to cluster the fronto-occipital fibres, Cingulum and Arcuate Fasciculus
 after reducing the complexity of the data sets using additional frontal
 and occipital boolean masks on the right cerebrum.
 Results however were shown on a very small part of the entire tractography
 where clustering is a much easier problem.
 Later Malcolm et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "malcolm2009filtered"

\end_inset

 used affinity propagation to cluster a full brain tractography created
 using filtered tractography and suggested that affinity propagation is
 not suitable for group clustering.
 
\end_layout

\begin_layout Standard
Ziyan et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "ziyan2009consistency"

\end_inset

 introduced a probabilistic registration and clustering algorithm based
 on expectation maximization (EM) which creates a sharper atlas from a set
 of subjects on three bundles: Corpus Callosum, Cingulate and Fornix.
 This work used an initial spectral clustering 
\begin_inset CommandInset citation
LatexCommand cite
key "ODonnell_IEEETMI07"

\end_inset

 to label the bundles and then updated these labels iteratively while performing
 bundle-wise registration combined using polyaffine integration.
 
\end_layout

\begin_layout Standard
Often, it is useful to use some protocols in order to add prior information
 to the automated learning process.
 Protocols to manually label 
\begin_inset Formula $11$
\end_inset

 major white matter tracts were described in Wakana et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "Wakana2007NeuroImage"

\end_inset

 using ROIs to include or exclude tracks generated by deterministic tractography.
 Hua et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "Hua2008NeuroImage"

\end_inset

 used regions of interest together with probabilistic tractography in order
 to create probability maps of known fibre bundles.
 
\end_layout

\begin_layout Standard
From this short review we observe two main trends in the literature.
 The first and most common one makes use of track distances and calculates
 distance matrices.
 The most prevalent approaches here for analysing the distance matrix
 are hierarchical and spectral clustering, which are applied only to
 subsets of the initial tractography.
 The second, less common trend recommends avoiding track distances because
 the computation of the distance matrix is memory intensive.
 In this case, using Dirichlet processes, currents or connectivity-based
 parcellation seem to be viable solutions.
 However, if clustering is to be applied in clinical practice or to make neuroscientists'
 analysis more efficient and practical, we need algorithms that can provide
 useful clusters and cluster descriptors in minimum time.
 None of the papers described in this literature review provide a solution
 to this issue of efficiency and most of the methods would require from
 many hours to many days to run on a standard sized data set.
 The method we propose in this document can provide a solution to this problem
 and it is an extensive update of our preliminary work described in Garyfallidis
 et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "EGMB10"

\end_inset

.
 
\end_layout

\begin_layout Standard
Most authors agree that unsupervised learning with tractographies is a difficult
 problem as the data sets are very large, dense, cluttered with noisy tracks
 which could have no anatomic relevance and bundles which are more often
 than not tangled together in many areas.
 Furthermore, we observe that there is a strong disagreement on the number
 of clusters (from 10 to 60).
 Because of the difficulty of the problem an international contest was also
 organized by SchLab at the University of Pittsburgh (PBC Brain Connectivity Challenge
 - IEEE ICDM) in 
\begin_inset Formula $2009$
\end_inset

.
 However, the competition did not produce any directly viable solutions.
 We think that in order to find big clusters a lot of anatomical prior knowledge
 needs to be introduced in a way that is not yet established.
 Nevertheless, the clustering that we propose concentrates on reducing the
 complexity of the data rather than finding bundles with anatomical relevance.
 We believe this step is more useful at this stage of tractography analysis
 research.
\end_layout

\begin_layout Subsection
Data sets
\begin_inset CommandInset label
LatexCommand label
name "sub:QB-Data-sets"

\end_inset


\end_layout

\begin_layout Standard
We experimented with QuickBundles using simulations, 
\begin_inset Formula $10$
\end_inset

 human tractographies collected and processed by ourselves, and one tractography
 with segmented bundles which was available online.
\end_layout

\begin_layout Standard

\series bold
Simulated trajectories.

\series default
 We generated three different bundles of parametric paths sampled at 
\begin_inset Formula $200$
\end_inset

 points.
 The tracks were made from different combinations of sinusoidal and helicoidal
 functions.
 Each bundle contained 150 tracks.
 For the red bundle in Fig.
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:simulated_orbits"

\end_inset

 a pencil of helical tracks all starting at the same point on a cylinder
 was generated by linearly varying the pitch of the helices; the green bundle
 was made up from a divergent pencil of rays on a sinusoidally corrugated
 sheet; the blue bundle was similarly made from divergent rays on a sinusoidally
 corrugated sheet, with the rays undergoing sinusoidally modulated lateral
 bending over a range of amplitudes.
 The data set contained 
\begin_inset Formula $450$
\end_inset

 tracks in total.
\end_layout

\begin_layout Standard

\series bold
Human subjects.
 
\series default
We collected data from 
\begin_inset Formula $10$
\end_inset

 healthy subjects at the MRC-CBU 3T scanner (TIM Trio, Siemens), using Siemens
 advanced diffusion work-in-progress sequence, and STEAM
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "merboldt1992diffusion,MAB04"

\end_inset

 as the diffusion preparation method.
 The field of view was 
\begin_inset Formula $240\times240\,\mathrm{mm}^{2}$
\end_inset

, matrix size 
\begin_inset Formula $96\times96$
\end_inset

, and slice thickness 
\begin_inset Formula $2.5$
\end_inset


\begin_inset space ~
\end_inset

mm (no gap).
 
\begin_inset Formula $55$
\end_inset

 slices were acquired to achieve full brain coverage, and the voxel resolution
 was 
\begin_inset Formula $2.5\times2.5\times2.5\,\mathrm{mm}^{3}$
\end_inset

.
 A 
\begin_inset Formula $102$
\end_inset

-point half grid acquisition
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "Yeh2010"

\end_inset

 with a maximum 
\begin_inset Formula $b$
\end_inset

-value of 
\begin_inset Formula $4,000$
\end_inset


\begin_inset space ~
\end_inset


\begin_inset Formula $\mathrm{s/mm}^{2}$
\end_inset

 was used.
 The total acquisition time was 
\begin_inset Formula $14'\,21''$
\end_inset

 with TR=
\begin_inset Formula $8,200\textrm{\,\ ms}$
\end_inset

 and TE=
\begin_inset Formula $69\textrm{\,\ ms}$
\end_inset

.
 The experiment was approved by the Cambridge Psychology Research Ethics
 Committee (CPREC).
\end_layout

\begin_layout Standard
For the reconstruction of the real data sets we used GQI (formula 
\begin_inset CommandInset ref
LatexCommand ref
reference "eq:GQI_analytical"

\end_inset

) with diffusion sampling length 
\begin_inset Formula $1.2$
\end_inset

 and for the tractography propagation we used EuDX (Euler integration with
 trilinear interpolation, see 
\begin_inset CommandInset ref
LatexCommand ref
reference "sec:Euler-Delta-Crossings"

\end_inset

) with 
\begin_inset Formula $1$
\end_inset

 million random seeds, angular threshold 
\begin_inset Formula $60^{\circ}$
\end_inset

, total weighting 
\begin_inset Formula $0.5$
\end_inset

, propagation step size 
\begin_inset Formula $0.5$
\end_inset

 and anisotropy stopping threshold 
\begin_inset Formula $0.0239$
\end_inset

 (see Figs.
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:CloseToSelected"

\end_inset

 and 
\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:arcuate_close"

\end_inset

).
\end_layout

\begin_layout Standard

\series bold
PBC human subjects
\series default
.
 We also used a few labeled data sets (see Figs.
\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:cst_pbc"

\end_inset

, 
\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:QB_fornix"

\end_inset

), from the freely available tractography database used in the Pittsburgh
 Brain Competition Fall 
\begin_inset Formula $2009$
\end_inset

 ICDM
\begin_inset Foot
status open

\begin_layout Plain Layout
 
\begin_inset Formula $\texttt{braincompetition.org}$
\end_inset


\end_layout

\end_inset

.
 
\end_layout

\begin_layout Subsection
QuickBundles (QB) Clustering 
\end_layout

\begin_layout Subsubsection
The QB Algorithm
\end_layout

\begin_layout Standard
QB is a surprisingly simple, linear time 
\begin_inset Formula $O(N)$
\end_inset

 (see section
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "sub:Complexity"

\end_inset

), distance based clustering algorithm that we created in order to segment
 huge trajectory data sets such as those produced by current state-of-the-art
 tractography generation algorithms 
\begin_inset CommandInset citation
LatexCommand cite
key "Parker2003,WWS+08"

\end_inset

.
 In general, there are very few linear time clustering algorithms.
 Just two are well known in the literature of artificial intelligence, machine
 learning and data mining: CLARANS
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "ng2002clarans"

\end_inset

 and BIRCH
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "zhang1997birch"

\end_inset

.
 QB is different from both of these methods; we will motivate it by describing
 some aspects of BIRCH as a starting point for the presentation of QB.
\end_layout

\begin_layout Standard
BIRCH
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "kogan2007introduction"

\end_inset

 has two key components: the first is relatively simple and involves the use
 and updating of clusters' descriptors; the second is the construction of a
 tree structure in which the accumulated clusters are held.
 The latter component is aimed at maintaining efficient searchability of
 the database while balancing what is kept in memory and what is on disc
 for very large databases.
 BIRCH uses clustering descriptors which are either directly available for
 each item in the data set or are easily computed from them, e.g.
\begin_inset space ~
\end_inset

squares and products of components; these form specific vectors of a fixed
 dimension of numerical values.
 Each cluster in turn has a descriptor which is an aggregate of the properties
 of the items that belong to it (e.g.
 the sum or mean of the individual descriptor vectors).
 Proceeding by a single sweep through the dataset, items are adjoined to
 clusters on the basis of their proximity to the clusters, subject to a
 maximum cluster size, or they are added as new leaves into the hierarchical
 tree structure in which the evolving clusters are held.
 Updating steps follow which can involve the merging of previously created
 clusters in a k-means fashion
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "steinhaus1956division,macqueen1967some"

\end_inset

.
\end_layout

\begin_layout Standard
It is the linear nature of BIRCH combined with the fixed dimensionality
 of its cluster descriptors that makes it quite fast.
 However, the further steps involving reorganisation of the accumulated
 tree do add some major overheads to BIRCH's performance.
 QB capitalises on these positive features but does not try to create any
 kind of hierarchical structure for the clusters.
 Moreover, while items in BIRCH are fixed dimension vectors with no additional
 structure, in QB each item (track) is a fixed-length ordered sequence of
 points in 
\begin_inset Formula $\mathbb{R}^{3}$
\end_inset

, and QB uses metrics and amalgamations which take account of, and preserve,
 this structure.
 Furthermore, each item is either added to an existing cluster, on the basis
 of the distance between the item and the descriptors of the current set
 of clusters, or it seeds a new cluster.
 Clusters are held in a list which is extended according to need.
 
\end_layout

\begin_layout Standard
The complete QB algorithm is described in formal detail in Alg.
\begin_inset space ~
\end_inset


\begin_inset Formula $\ref{Alg:QuickBundles}$
\end_inset

 and a simple step by step visual example is given in Fig.
\begin_inset space ~
\end_inset


\begin_inset Formula $\ref{Fig:LSC_simple}$
\end_inset

.
 One of the reasons why QB has on average linear time complexity derives
 from the structure of the cluster node: we only save the sum of current
 tracks 
\begin_inset Formula $h$
\end_inset

 in the cluster and the sum is cumulative; moreover there is no recalculation
 of clusters, the tracks are passed through only once and a track is assigned
 to one cluster only.
\end_layout

\begin_layout Standard
\begin_inset Float algorithm
wide false
sideways false
status open

\begin_layout Plain Layout
\begin_inset ERT
status open

\begin_layout Plain Layout

[th!]
\end_layout

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
textbf{Input} tracks $T=
\backslash
{s_{1},...,s_{i},...,s_{N}
\backslash
}$, threshold $
\backslash
theta $
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
textbf{Output} clustering $C=
\backslash
{c_{1},...,c_{k},...,c_{M}
\backslash
}$ where cluster $c=(I,h,n)$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash

\backslash

\end_layout

\begin_layout Plain Layout

$c_{1} 
\backslash
leftarrow ([1],s_{1},1)$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout

$C 
\backslash
leftarrow 
\backslash
{c_{1} 
\backslash
}$ 
\backslash
# the first track becomes the first cluster
\backslash

\backslash

\end_layout

\begin_layout Plain Layout

$M 
\backslash
leftarrow 1$ 
\backslash
# the total number of clusters is 1 
\backslash

\backslash

\end_layout

\begin_layout Plain Layout

$
\backslash
textbf{For}$ $i=2$ to $N$ 
\backslash
textbf{Do} 
\backslash
# all tracks
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{2em} $t 
\backslash
leftarrow T_{i}$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{2em} $
\backslash
texttt{alld} 
\backslash
leftarrow 
\backslash
textbf{infinity(M)}$ 
\backslash
# distance buffer
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{2em} $
\backslash
texttt{flip} 
\backslash
leftarrow 
\backslash
textbf{zeros(M)}$ 
\backslash
# flipping check buffer
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{2em} $
\backslash
textbf{For}$ $k=1$ to $M$ 
\backslash
textbf{Do} 
\backslash
# all clusters
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{4em} $v 
\backslash
leftarrow C_{k}.h/C_{k}.n$
\backslash

\backslash
 
\end_layout

\begin_layout Plain Layout


\backslash
hspace*{4em} $d 
\backslash
leftarrow d_{
\backslash
texttt{direct}}(t,v)$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{4em} $f 
\backslash
leftarrow d_{
\backslash
texttt{flipped}}(t,v)$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{4em} $
\backslash
textbf{If}$ $f < d$ $
\backslash
textbf{Then}$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{6em} $d 
\backslash
leftarrow f$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{6em} $
\backslash
texttt{flip}_{k} 
\backslash
leftarrow 1$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{4em} $
\backslash
texttt{alld}_{k} 
\backslash
leftarrow d$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{2em} $
\backslash
textbf{EndFor}$ 
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{2em} $m 
\backslash
leftarrow 
\backslash
min(
\backslash
texttt{alld})$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{2em} $l 
\backslash
leftarrow 
\backslash
mathrm{arg min}(
\backslash
texttt{alld})$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{2em} $
\backslash
textbf{If}$ $m < 
\backslash
theta$ 
\backslash
textbf{Then} 
\backslash
# append in current cluster 
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{4em} $
\backslash
textbf{If}$ $
\backslash
texttt{flip}_{l} = 1$ $
\backslash
textbf{Then}$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{6em} $C_{l}.h 
\backslash
leftarrow C_{l}.h + 
\backslash
textbf{reverse}(t)$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{4em} $
\backslash
textbf{Else}$ 
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{6em} $C_{l}.h 
\backslash
leftarrow C_{l}.h + t$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{4em} $C_{l}.n 
\backslash
leftarrow C_{l}.n + 1$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{4em} $
\backslash
textbf{append}(C_{l}.I,i)$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{2em} $
\backslash
textbf{Else}$ 
\backslash
# create new cluster
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{4em} $c_{M+1} 
\backslash
leftarrow ([i],t,1)$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{4em} $
\backslash
textbf{append}(C,c_{M+1})$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{4em} $M 
\backslash
leftarrow M + 1$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{2em} $
\backslash
textbf{EndIf}$ 
\backslash

\backslash

\end_layout

\begin_layout Plain Layout

$
\backslash
textbf{EndFor}$ 
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
caption{QuickBundles}
\end_layout

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset CommandInset label
LatexCommand label
name "Alg:QuickBundles"

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Standard
QB creates an online list of cluster nodes.
 The cluster node is defined as 
\begin_inset Formula $c=(I,h,n)$
\end_inset

 where 
\begin_inset Formula $I$
\end_inset

 is the list of the integer indices of the tracks in that cluster, 
\begin_inset Formula $h$
\end_inset

 is an 
\begin_inset Formula $K\times3$
\end_inset

 matrix, the most important descriptor of the cluster, and 
\begin_inset Formula $n$
\end_inset

 is the number of tracks in that cluster.
 
\begin_inset Formula $h$
\end_inset

 is a matrix which can be updated online when a track is added to a cluster
 and is equal to
\begin_inset Formula 
\begin{equation}
h=\sum_{i=1}^{n}s_{i}
\end{equation}

\end_inset

where 
\begin_inset Formula $s_{i}$
\end_inset

 is the 
\begin_inset Formula $K\times3$
\end_inset

 matrix representing track 
\begin_inset Formula $i$
\end_inset

, 
\begin_inset Formula $\Sigma$
\end_inset

 represents matrix addition, and 
\begin_inset Formula $n$
\end_inset

 is the number of tracks in the cluster.
 QB assumes that all tracks have the same number of points 
\begin_inset Formula $K$
\end_inset

, therefore a downsampling of tracks, typically equidistant, is necessary
 before QB starts.
 A short summary of the algorithm goes as follows.
 
\end_layout

\begin_layout Standard
Select the first track 
\begin_inset Formula $s_{1}$
\end_inset

 and place it in the first cluster 
\begin_inset Formula $c_{1}\leftarrow([1],s_{1},1)$
\end_inset

.
 For all remaining tracks (i) go to next track 
\begin_inset Formula $s_{i}$
\end_inset

; (ii) calculate MDF distance between this track and virtual tracks of all
 existing clusters 
\begin_inset Formula $c_{k}$
\end_inset

, where a virtual track is defined on the fly as 
\begin_inset Formula $v=h/n$
\end_inset

; (iii) if the minimum MDF distance is smaller than a distance threshold
 
\begin_inset Formula $\theta$
\end_inset

 add the track to the cluster 
\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\noun off
\color none

\begin_inset Formula $c_{j}=(I,h,n)$
\end_inset


\family default
\series default
\shape default
\size default
\emph default
\bar default
\noun default
\color inherit
 with the minimum distance and update 
\begin_inset Formula $c_{j}\leftarrow(I\cup[i],h+s_{i},n+1)$
\end_inset

; otherwise create a new cluster 
\begin_inset Formula $c_{M+1}\leftarrow([i],s_{i},1)$
\end_inset

 and increase the total number of clusters 
\begin_inset Formula $M\leftarrow M+1$
\end_inset

.
 The complete algorithm is given in Alg.
\begin_inset space ~
\end_inset


\begin_inset Formula $\ref{Alg:QuickBundles}$
\end_inset

.
\end_layout
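
\begin_layout Standard
The summary above can be sketched in Python as follows.
 This is a minimal sketch rather than the reference implementation.
 It assumes that the MDF distance is the smaller of the direct and flipped
 mean point-to-point Euclidean distances, and it uses 0-based track indices.
 The names mdf and quickbundles and the dictionary keys are hypothetical.
\end_layout

```python
import numpy as np


def mdf(t, v):
    """Return (distance, flipped) between two K x 3 tracks.

    Assumed definition: mean Euclidean distance between corresponding
    points, evaluated for both orderings of t; the smaller one wins.
    """
    direct = np.mean(np.linalg.norm(t - v, axis=1))
    flipped = np.mean(np.linalg.norm(t[::-1] - v, axis=1))
    if flipped < direct:
        return flipped, True
    return direct, False


def quickbundles(tracks, theta):
    """Single-pass clustering; each cluster stores I (indices), h (sum), n."""
    clusters = []
    for i, t in enumerate(tracks):
        best, best_d, best_flip = None, np.inf, False
        for k, c in enumerate(clusters):
            d, flip = mdf(t, c["h"] / c["n"])  # virtual track v = h / n
            if d < best_d:
                best, best_d, best_flip = k, d, flip
        if best is not None and best_d < theta:
            c = clusters[best]
            # Add the track in the orientation that gave the MDF value.
            c["h"] = c["h"] + (t[::-1] if best_flip else t)
            c["n"] += 1
            c["I"].append(i)
        else:
            clusters.append({"I": [i], "h": t.astype(float).copy(), "n": 1})
    return clusters


line = np.stack([np.linspace(0, 10, 12), np.zeros(12), np.zeros(12)], axis=1)
tracks = [line, line + [0, 1, 0], line[::-1] + [0, 0.5, 0], line + [0, 30, 0]]
clusters = quickbundles(tracks, theta=5.0)
```

\begin_layout Standard
In this toy example the first three tracks, one of which is stored in reverse
 order, fall into one cluster and the distant fourth track starts a second
 cluster.
\end_layout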

\begin_layout Standard
\begin_inset Float figure
wide false
sideways true
status open

\begin_layout Plain Layout
\noindent
\align center
\begin_inset Graphics
	filename last_figures/LSC_algorithm.png
	lyxscale 60
	scale 150
	rotateOrigin center

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset Caption

\begin_layout Plain Layout
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
footnotesize{Step-by-step description of QB: Panel (i): 6 unclustered tracks
 (A-F) are presented; the distance threshold used is the MDF distance} 
\end_layout

\end_inset

(Eq.
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "eq:direct_flip_distance"

\end_inset

) 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
footnotesize{between B and E.
 The algorithm starts and in (ii) track A is selected; no clusters exist
 yet, so track A becomes the first cluster (labelled in purple) and the
 virtual track of that cluster is identical to A, as seen in (iii).
 Next, in (iv), track B is selected and we calculate the MDF distance between
 B and the virtual tracks of the existing clusters.
 At this moment there is only one cluster to compare with, so QB calculates
 MDF(B,virtual-purple); this is clearly bigger than the threshold (that being
 MDF(B,E)), therefore a new cluster is created for B and B becomes the virtual
 track of that cluster, as shown in (v).
 In (vi) the next track is selected and this is again far away from both
 purple and blue virtuals, therefore another cluster is created and B is
 the virtual of the blue cluster, as shown in (vii).
 In (viii) track D is the current track and, after we have calculated
 MDF(D,purple), MDF(D,blue) and MDF(D,green), it is clear that D belongs
 to the purple cluster, as MDF(D,purple) is the smallest and is lower than
 the threshold, as shown in (ix).
 However, we see in (x) that things change for the purple cluster because
 the virtual track is no longer made from only one track but is the average
 of D and A, shown with a dashed line.
 In (xi) E is the current track and it is assigned to the green cluster,
 as shown in (xii), because MDF(E,virtual green) = MDF(E,B) = threshold,
 and in (xiii) we see the updated virtual track for the green cluster, which
 is equal to (B+E)/2, where + means track addition.
 In (xiv) the last track is picked and compared with the virtual tracks
 of the 3 existing clusters; MDF(F,purple) is the only distance smaller
 than the threshold, therefore F is assigned to the purple cluster in (xv).
 Finally, in (xvi) the virtual purple track is updated as (D+A+F)/3.
 As there are no more tracks to select, the algorithm stops.
 We observe that all three clusters have been found and all tracks have
 been assigned successfully.}
\end_layout

\end_inset

 
\end_layout

\end_inset


\begin_inset CommandInset label
LatexCommand label
name "Fig:LSC_simple"

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Standard
Choice of orientation can become an issue when using the MDF distance and
 adding tracks together.
 This happens because the diffusion signal is symmetric around the origin.
 Therefore, the 
\begin_inset Formula $K\times3$
\end_inset

 track can equivalently have its points ordered 
\begin_inset Formula $1,\dots,K$
\end_inset

 or be flipped with order 
\begin_inset Formula $K,\dots,1$
\end_inset

; the diffusion signal does not allow us to distinguish between these two
 directions.
 A step in QB takes account of the possibility of needing to perform a flip
 of a track before adding it to a representative track according to which
 direction produced the MDF value.
 Though the appropriate orientation (direct or flip) of a track was available
 in the MDF calculation at the time it entered a cluster, we allow for the
 possibility this might not be the same later on when the virtual track
 has evolved so it will need to be recalculated.
\end_layout
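
\begin_layout Standard
A small numerical illustration of why the flip matters is given below.
 It is a sketch under the assumption that the direct and flipped distances
 are mean point-to-point Euclidean distances; the helper name direct_and_flipped
 is hypothetical.
 A straight track and the same track stored in reverse order describe the
 same path, yet summing them without flipping collapses the average to a
 single point.
\end_layout

```python
import numpy as np


def direct_and_flipped(t, v):
    """Mean point-to-point distance for both orderings of t (assumed form)."""
    direct = np.mean(np.linalg.norm(t - v, axis=1))
    flipped = np.mean(np.linalg.norm(t[::-1] - v, axis=1))
    return direct, flipped


K = 12
track = np.stack([np.linspace(0.0, 11.0, K), np.zeros(K), np.zeros(K)], axis=1)
reversed_track = track[::-1]  # same path, opposite point ordering

direct, flipped = direct_and_flipped(reversed_track, track)

# Averaging without flipping maps every point to the midpoint of the track.
naive_mean = (track + reversed_track) / 2.0
# Flipping first, as indicated by the smaller distance, recovers the track.
aligned_mean = (track + reversed_track[::-1]) / 2.0
```

\begin_layout Standard
Here the flipped distance is zero while the direct distance is not, so the
 flipped orientation is the one that should be accumulated.
\end_layout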

\begin_layout Standard
One of the reasons why QB has on average linear time complexity derives
 from the fact that we only save the sum of current tracks in the cluster
 and this is achieved cumulatively.
 QB passes through the tracks only once and each track is assigned to one
 cluster only.
 By contrast, if we were using k-means, at every iteration we would have
 to re-assign tracks to clusters and recalculate averages, which is computationally
 much more intensive.
 
\end_layout

\begin_layout Standard
QB can be extended for specific applications to contain more information
 about the clusters.
 For example, we could redefine 
\begin_inset Formula $c\leftarrow(I,h,n,h^{(2)})$
\end_inset

 to obtain second order information and in that way we could calculate the
 variance of the cluster where 
\begin_inset Formula 
\[
h^{(2)}\leftarrow(\sum_{i,j}\mathbf{x}_{ij}^{2},\sum_{i,j}\mathbf{y}_{ij}^{2},\sum_{i,j}\mathbf{z}_{ij}^{2},\sum_{i,j}\mathbf{x}_{ij}\mathbf{y}_{ij},\sum_{i,j}\mathbf{y}_{ij}\mathbf{z}_{ij},\sum_{i,j}\mathbf{x}_{ij}\mathbf{z}_{ij})
\]

\end_inset

 and 
\begin_inset Formula $\mathbf{x}_{ij},\,\mathbf{y}_{ij},\,\mathbf{z}_{ij}$
\end_inset

 are the coordinates of the 
\begin_inset Formula $j$
\end_inset

th point of the 
\begin_inset Formula $i$
\end_inset

th track in the cluster.
 Although this alternative would be very useful, as even more refined cluster
 distances could be used which take into account the additional information,
 this is not addressed in this thesis.
\end_layout
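
\begin_layout Standard
Although this extension is not pursued here, the bookkeeping it implies
 is simple and can be sketched as follows.
 The sketch is illustrative only: it accumulates the sum h and the six second-order
 sums online, one track at a time, and then recovers the per-coordinate
 variance of all points pooled over the cluster.
 The random tracks and the variable names are assumptions made for the example.
\end_layout

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 5, 12
tracks = rng.normal(size=(n, K, 3))  # stand-in for the tracks of one cluster

h = np.zeros((K, 3))  # first-order descriptor: sum of the tracks
h2 = np.zeros(6)      # sums of x^2, y^2, z^2, xy, yz, xz over all points

for s in tracks:      # online update, one track at a time
    x, y, z = s[:, 0], s[:, 1], s[:, 2]
    h += s
    h2 += [np.sum(x * x), np.sum(y * y), np.sum(z * z),
           np.sum(x * y), np.sum(y * z), np.sum(x * z)]

count = n * K                         # total number of points in the cluster
mean = h.sum(axis=0) / count          # pooled mean of x, y, z
variance = h2[:3] / count - mean**2   # pooled variance of x, y, z
```

\begin_layout Standard
The three cross terms would give the covariances in the same way, by subtracting
 the corresponding products of means.
\end_layout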

\begin_layout Standard
One of the disadvantages of most clustering algorithms is that they give
 different results with different initial conditions; for example this is
 recognised with k-means, expectation-maximization 
\begin_inset CommandInset citation
LatexCommand cite
key "dempster1977maximum"

\end_inset

 and k-centres 
\begin_inset CommandInset citation
LatexCommand cite
key "gonzalez1985clustering"

\end_inset

, where it is common practice to try a number of different random initial
 configurations.
 The same holds for QB: unless the clusters are distinct, in the sense that
 the distance between any pair of clusters is supra-threshold, different
 permutations of the same tractography will typically give a similar number
 of clusters but different underlying clusters.
 We will examine the robustness of QB in this respect in section
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "sub:Comparisons"

\end_inset

.
\end_layout

\begin_layout Subsubsection
Powerful simplifications
\end_layout

\begin_layout Standard
One of the major benefits of applying QB to tractographies is that it can
 provide meaningful simplifications and find structures that were previously
 invisible or difficult to locate because of the high density of the tractograph
y.
 We used QB for example to cluster the corticospinal tract (CST).
 This bundle was part of the datasets provided by the Pittsburgh Brain Competiti
on (PBC2009-ICDM) and it was selected by an expert.
 The result is clearly shown in Fig.
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:cst_pbc"

\end_inset

 where every partition is represented by a virtual track.
 To generate this clustering we used a tight threshold of 
\begin_inset Formula $10$
\end_inset


\begin_inset space ~
\end_inset

mm and downsampling to 
\begin_inset Formula $12$
\end_inset

 points.
 We observe that only a few virtual tracks span the full distance from bottom
 to top and that many tracks are broken (i.e.
 shorter than what was initially expected) or highly divergent.
\end_layout
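
\begin_layout Standard
The downsampling step mentioned above can be sketched as a resampling at
 equal arc-length intervals.
 This is one plausible reading of equidistant downsampling, written with
 linear interpolation; the function name downsample is hypothetical and
 the sketch is not necessarily the routine used to produce the figures.
\end_layout

```python
import numpy as np


def downsample(track, K=12):
    """Resample a track (P x 3) to K points equally spaced along its length."""
    track = np.asarray(track, dtype=float)
    segment = np.linalg.norm(np.diff(track, axis=0), axis=1)
    arclen = np.concatenate([[0.0], np.cumsum(segment)])  # length at each point
    targets = np.linspace(0.0, arclen[-1], K)             # equidistant lengths
    # Interpolate each coordinate separately as a function of arc length.
    return np.stack(
        [np.interp(targets, arclen, track[:, d]) for d in range(3)], axis=1
    )


u = np.linspace(0.0, 1.0, 200) ** 2  # unevenly spaced samples along a line
track = np.stack([10.0 * u, np.zeros(200), np.zeros(200)], axis=1)
short = downsample(track, K=12)
```

\begin_layout Standard
The end points of the track are preserved and the 12 output points are equally
 spaced even though the 200 input points were not.
\end_layout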

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Plain Layout
\begin_inset ERT
status open

\begin_layout Plain Layout

[th!]
\end_layout

\end_inset


\end_layout

\begin_layout Plain Layout
\align center
\begin_inset Graphics
	filename QB/Thesis/Fig_4_cst_simplification_relabeled.png
	lyxscale 10
	scale 30
	rotateOrigin center

\end_inset


\begin_inset Caption

\begin_layout Plain Layout
Part of the CST bundle (red) consisting of 
\begin_inset Formula $11,041$
\end_inset

 tracks labelled by an expert.
 At first glance it looks as though all tracks have a similar shape, possibly
 converge towards the bottom, and fan out towards the top.
 However, this is a misreading caused by the opaque density when all the
 tracks are visualised.
 QB can help us see the finer structure of the bundle and identify its elements.
 On the right hand side we see the 
\begin_inset Formula $14$
\end_inset

 QB representative tracks (virtuals) of the CST.
 We can now clearly see that several parts which looked homogeneous are
 actually broken bundles e.g.
 dark green (A), light blue (C), or bundles with very different shape e.g.
 light green (B).
 To cluster this bundle took 
\begin_inset Formula $0.1$
\end_inset

 seconds.
\end_layout

\end_inset


\begin_inset CommandInset label
LatexCommand label
name "Flo:cst_pbc"

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Standard
Another interesting feature of QB is that it can be used to merge or split
 different structures by changing the distance threshold.
 This is shown in Fig.
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:simulated_orbits"

\end_inset

; on the left we see simulated paths made from simple sinusoidal and helicoidal
 functions packed together.
 The colour coding is used to distinguish the three different structures.
 With a lower threshold the three different structures remain separated,
 but when we use a higher threshold the red and blue bundles are represented
 by a single cluster, shown as a purple virtual track.
\end_layout

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Plain Layout
\align center
\begin_inset Graphics
	filename last_figures/helix_phantom.png
	lyxscale 60
	scale 70
	rotateOrigin center

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset Caption

\begin_layout Plain Layout
Left: 
\begin_inset Formula $3$
\end_inset

 bundles of simulated trajectories; red, blue and green consisting of 
\begin_inset Formula $150$
\end_inset

 tracks each.
 All 
\begin_inset Formula $450$
\end_inset

 tracks are clustered together using QB.
 Middle and Right: virtual tracks using thresholds 
\begin_inset Formula $1$
\end_inset

 and 
\begin_inset Formula $8$
\end_inset

 respectively.
 At low threshold the underlying structure is reflected in a more detailed
 representation.
 At higher threshold, closer bundles merge together.
 Here the red and blue bundle have merged together in one cluster represented
 by the purple virtual track.
 
\end_layout

\end_inset


\begin_inset CommandInset label
LatexCommand label
name "Flo:simulated_orbits"

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Standard
As with the simulations shown in Fig.
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:simulated_orbits"

\end_inset

, we can see the same effect on real tracks, e.g.
 those of the fornix shown in the left panel of Fig.
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:QB_fornix"

\end_inset

.
 Different numbers of clusters can be obtained at different thresholds.
 In that way we can emphasise thinner or larger sub-bundles inside bigger
 bundles.
\end_layout

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Plain Layout
\align center
\begin_inset Graphics
	filename last_figures/LSC_simple.png
	lyxscale 30
	scale 60

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset Caption

\begin_layout Plain Layout
Left: QB clustering of the fornix bundle.
 The original fornix is shown in black (
\begin_inset Formula $1,076$
\end_inset

 tracks).
 All tracks were equidistantly downsampled to 3 points.
 With a 
\begin_inset Formula $5$
\end_inset

 mm threshold QB generates 
\begin_inset Formula $22$
\end_inset

 clusters (top right).
 With 
\begin_inset Formula $10$
\end_inset

 mm it generates 
\begin_inset Formula $7$
\end_inset

 (bottom left) and with 
\begin_inset Formula $20$
\end_inset

 mm the whole fornix is determined by one cluster only (bottom right).
 The colour encodes cluster label.
 Right: an example of a full tractography (
\begin_inset Formula $0.25\times10^{6}$
\end_inset

 tracks) being clustered using QB with a distance threshold of 
\begin_inset Formula $10$
\end_inset

 mm.
 
\begin_inset Formula $763$
\end_inset

 virtual tracks were produced which is a huge simplification of the initial
 tractography.
 Every track shown here represents an entire cluster from 
\begin_inset Formula $10$
\end_inset

 to 
\begin_inset Formula $5,000$
\end_inset

 tracks each.
 These can be thought of as fast access points to explore the entire data set.
 The colour here encodes track orientation.
 
\end_layout

\end_inset


\begin_inset CommandInset label
LatexCommand label
name "Flo:QB_fornix"

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Standard
A full tractography containing 
\begin_inset Formula $250,000$
\end_inset

 tracks was clustered using QB with a distance threshold of 
\begin_inset Formula $10$
\end_inset


\begin_inset space ~
\end_inset

mm (see Fig.
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:QB_fornix"

\end_inset

).
 We produced a useful reduction of the initial tractography leaving only
 
\begin_inset Formula $763$
\end_inset

 virtual tracks.
 Bundles smaller than 
\begin_inset Formula $10$
\end_inset

 tracks were removed.
 Every track shown here represents an entire cluster containing from 
\begin_inset Formula $10$
\end_inset

 to 
\begin_inset Formula $5,000$
\end_inset

 tracks each.
 The virtual tracks are very useful as fast access points for exploring
 the complete tractography (see Fig.
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:QB_fornix"

\end_inset

).
 
\end_layout

\begin_layout Subsubsection
Complexity and timings
\begin_inset CommandInset label
LatexCommand label
name "sub:Complexity"

\end_inset


\end_layout

\begin_layout Standard
To apply QB to a data set we need to specify three key parameters: 
\begin_inset Formula $K$
\end_inset

, the fixed number of downsampled points per track; 
\begin_inset Formula $\theta$
\end_inset

 the distance threshold, which controls the heterogeneity of clusters; and
 
\begin_inset Formula $N$
\end_inset

 the size of the subset of the tractography on which the clustering will
 be performed.
 When 
\begin_inset Formula $\theta$
\end_inset

 is higher, fewer more heterogeneous clusters are assembled, and conversely
 when 
\begin_inset Formula $\theta$
\end_inset

 is low, more clusters of greater homogeneity are created.
\end_layout

\begin_layout Standard
The complexity of QB is in the best case linear time 
\begin_inset Formula $\mathcal{O}(N)$
\end_inset

 with the number of tracks 
\begin_inset Formula $N$
\end_inset

 and worst case 
\begin_inset Formula $\mathcal{O}(N^{2})$
\end_inset

 when every cluster contains only one track.
 The average case is 
\begin_inset Formula $\mathcal{O}(MN)$
\end_inset

 where 
\begin_inset Formula $M$
\end_inset

 is the number of clusters.
 However, because 
\begin_inset Formula $M$
\end_inset

 is usually much smaller than 
\begin_inset Formula $N$
\end_inset

 (
\begin_inset Formula $M\ll N$
\end_inset

) we can neglect 
\begin_inset Formula $M$
\end_inset

 and denote it only as 
\begin_inset Formula $\mathcal{O}(N)$
\end_inset

 as is common in complexity theory.
 
\end_layout
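
\begin_layout Standard
The best and worst cases above can be checked by counting how many MDF evaluations
 a single pass performs.
 The following sketch repeats the clustering loop with a counter; it assumes
 the mean point-to-point form of the MDF distance, uses random tracks as
 stand-in data, and the function name count_mdf_evaluations is hypothetical.
\end_layout

```python
import numpy as np


def count_mdf_evaluations(tracks, theta):
    """Run a QB-style single pass; return (MDF evaluations, clusters)."""
    sums, counts, evaluations = [], [], 0
    for t in tracks:
        best, best_d, best_flip = -1, np.inf, False
        for k in range(len(sums)):
            v = sums[k] / counts[k]  # virtual track of cluster k
            direct = np.mean(np.linalg.norm(t - v, axis=1))
            flipped = np.mean(np.linalg.norm(t[::-1] - v, axis=1))
            evaluations += 1
            d = min(direct, flipped)
            if d < best_d:
                best, best_d, best_flip = k, d, flipped < direct
        if best >= 0 and best_d < theta:
            sums[best] = sums[best] + (t[::-1] if best_flip else t)
            counts[best] += 1
        else:
            sums.append(np.array(t, dtype=float))
            counts.append(1)
    return evaluations, len(sums)


rng = np.random.default_rng(0)
N, K = 200, 12
tracks = rng.normal(size=(N, K, 3))

# Infinite threshold: every track joins the first cluster (best case).
best_case = count_mdf_evaluations(tracks, theta=np.inf)
# Zero threshold: every track becomes its own cluster (worst case).
worst_case = count_mdf_evaluations(tracks, theta=0.0)
```

\begin_layout Standard
With a single cluster the pass needs N-1 evaluations, whereas with one cluster
 per track it needs N(N-1)/2, which is the quadratic worst case stated above.
\end_layout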

\begin_layout Standard
We created the following experiment to investigate this claim and we found
 empirically that the average case is actually 
\begin_inset Formula $\mathcal{O}(N)$
\end_inset

 for tractographies (see Fig.
\begin_inset space ~
\end_inset


\begin_inset Formula $\ref{Flo:Speed1}$
\end_inset

).
 In this experiment we timed the duration of QB clustering of tractographies
 containing from 
\begin_inset Formula $\num{e5}$
\end_inset

 to 
\begin_inset Formula $\num{e6}$
\end_inset

 tracks, with different initial number of points per track (
\begin_inset Formula $3,\,6,\,12$
\end_inset

 and 
\begin_inset Formula $18$
\end_inset

) and different QB thresholds (
\begin_inset Formula $10,\,15,\,20,\,25$
\end_inset


\begin_inset space ~
\end_inset

mm).
 The final factor, not shown explicitly in these diagrams, is the underlying
 structure of the data which is expressed by the resulting number of clusters.
 These results were obtained on a single thread of an Intel(R) CPU at 2.50 GHz
 on a standard PC.
 The results can be seen in Fig.
\begin_inset space ~
\end_inset


\begin_inset Formula $\ref{Flo:Speed1}$
\end_inset

.
 We see how the linearity of the QB algorithm with respect to 
\begin_inset Formula $N$
\end_inset

 only reduces slightly even when we use a very low threshold such as 
\begin_inset Formula $10$
\end_inset


\begin_inset space ~
\end_inset

mm, which can generate many thousands of clusters.
 This experiment suggests that QB is suitable for fast clustering.
\end_layout

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Plain Layout
\begin_inset ERT
status open

\begin_layout Plain Layout

[th!]
\end_layout

\end_inset


\end_layout

\begin_layout Plain Layout
\align center
\begin_inset Graphics
	filename 2x2+leg-box.png
	lyxscale 30
	scale 35
	rotateOrigin center

\end_inset


\begin_inset Caption

\begin_layout Plain Layout
Time comparisons of QB using different numbers of points per track, different
 distance thresholds and different numbers of tracks.
 QB is a very efficient algorithm whose performance is controlled by just
 three parameters: (1) the initial downsampling 
\begin_inset Formula $K$
\end_inset

 of the tracks, exemplified in four sub-diagrams: 3 points (A), 6 points
 (B), 12 points (C), 18 points (D); (2) the distance threshold 
\begin_inset Formula $\theta$
\end_inset

 in millimeters, shown in 4 colours: 10
\begin_inset space ~
\end_inset

mm (blue), 15
\begin_inset space ~
\end_inset

mm (green), 20
\begin_inset space ~
\end_inset

mm (red), 25
\begin_inset space ~
\end_inset

mm (cyan); and (3) the size 
\begin_inset Formula $N$
\end_inset

 of the random subset of the tractography, from 
\begin_inset Formula $\numrange{e5}{e6}$
\end_inset

 tracks (x-axis).
 We used a full tractography to generate these figures without removing
 or preselecting any parts.
\end_layout

\end_inset


\begin_inset CommandInset label
LatexCommand label
name "Flo:Speed1"

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Standard
Furthermore, the memory usage of QB is 
\begin_inset Formula $\mathcal{O}(M)$
\end_inset

 where 
\begin_inset Formula $M$
\end_inset

 is the number of clusters and because this is usually much smaller than
 
\begin_inset Formula $N$
\end_inset

 we consider memory consumption to be negligible.
 Because in QB we store only the indices of the tracks, even for very large
 tractographies 
\begin_inset Formula $20$
\end_inset

 or more clusterings can be stored simultaneously in the RAM of a simple
 notebook without any problems.
 Memory efficiency is therefore another feature of QB.
\end_layout

\begin_layout Standard
We compared QB with 
\begin_inset Formula $12$
\end_inset

 point tracks and distance threshold at 
\begin_inset Formula $\theta=10$
\end_inset


\begin_inset space ~
\end_inset

mm against timings reported for other state-of-the-art methods found
 in the literature (Tab.
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:timings"

\end_inset

).
 Unfortunately, timings were very rarely reported because most algorithms
 were very slow on full data sets.
 Nonetheless the speedup that QB offers is substantial, and clustering runs
 in real time on data sets of fewer than 
\begin_inset Formula $20,000$
\end_inset

 tracks (see Tab.
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:timings"

\end_inset

).
 It also holds the prospect of real-time clustering on massive tractographies
 using standard parallelization techniques (see section 
\begin_inset CommandInset ref
LatexCommand ref
reference "sub:Parallel-version"

\end_inset

).
\end_layout
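
\begin_layout Standard
The speedup values in Tab.
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:timings"

\end_inset

 appear to be the ratio of the reported timing to the QB timing for the
 same number of tracks, rounded to the nearest integer.
 Under that reading the three rows can be checked directly: 
\begin_inset Formula 
\[
\frac{30}{0.07}\approx429,\qquad\frac{14400}{14.7}\approx980,\qquad\frac{75000}{160.1}\approx468.
\]

\end_inset

 
\end_layout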

\begin_layout Standard
\begin_inset Float table
wide false
sideways false
status open

\begin_layout Plain Layout
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
small
\backslash
addtolength{
\backslash
tabcolsep}{-5pt}
\end_layout

\end_inset


\end_layout

\begin_layout Plain Layout
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="4" columns="5">
<features tabularvalignment="middle">
<column alignment="center" valignment="top" width="0">
<column alignment="center" valignment="top" width="0">
<column alignment="center" valignment="top" width="0">
<column alignment="center" valignment="top" width="0">
<column alignment="center" valignment="top" width="0">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Number of tracks (
\begin_inset Formula $N$
\end_inset

)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Algorithms
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Timings (secs)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
QB (secs)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Speedup
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $1000$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Wang et al.
 
\begin_inset CommandInset citation
LatexCommand cite
key "wang2010tractography"

\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $30$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $0.07$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $429$
\end_inset


\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $60,000$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Wang et al.
 
\begin_inset CommandInset citation
LatexCommand cite
key "wang2010tractography"

\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $14,400$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $14.7$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $980$
\end_inset


\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $400,000$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Visser et al.
 
\begin_inset CommandInset citation
LatexCommand cite
key "Visser2010"

\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $75,000$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $160.1$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $468$
\end_inset


\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset Caption

\begin_layout Plain Layout
QB run on 
\begin_inset Formula $K=12$
\end_inset

 point tracks and distance threshold at 
\begin_inset Formula $\theta=10$
\end_inset


\begin_inset space ~
\end_inset

mm compared with timings reported for other state-of-the-art methods
 found in the literature.
 Timings were very rarely reported until today as most algorithms were very
 slow on full data sets.
 Nonetheless, we can observe in this table that the speedup that QB offers
 is substantial.
 
\end_layout

\end_inset


\begin_inset CommandInset label
LatexCommand label
name "Flo:timings"

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Subsubsection
Virtual tracks, exemplar tracks and other descriptors.
\end_layout

\begin_layout Standard
The virtual tracks created by QB are useful descriptors: each one is an
 average track which can stand as the most important feature of the cluster
 that it belongs to.
 However, now that we have segmented our tractography into small bundles
 we can calculate many more potentially important descriptors for the cluster.
 The Cluster Spread (CS) for instance can be computed for any cluster 
\begin_inset Formula $c$
\end_inset

 as a vector of length 
\begin_inset Formula $K$
\end_inset

 whose 
\begin_inset Formula $j$
\end_inset

-th component is 
\begin_inset Formula $\sum_{x\in c}|x_{j}-v_{j}|^{2}/n.$
\end_inset

 Here, 
\begin_inset Formula $x_{j}$
\end_inset

 is the 
\begin_inset Formula $j$
\end_inset

-th point in the track 
\begin_inset Formula $x$
\end_inset

 in cluster 
\begin_inset Formula $c$
\end_inset

, 
\begin_inset Formula $v_{j}$
\end_inset

 is the corresponding point of the virtual track, and 
\begin_inset Formula $n$
\end_inset

 is the size of the cluster.
 CS provides a profile of the tightness or looseness of the cluster along
 the length of the virtual track.
 Many other similar or higher order statistics can be readily computed in
 an analogous fashion.
 One of the most useful features is the calculation of exemplars.
\end_layout

\begin_layout Standard

\series bold
Exemplars
\series default
.
 Another fruitful idea relating to the virtual track is to identify a corresponding
 descriptor for the bundle which actually belongs to the tractography, in
 other words to find an exemplar or medoid track.
 Virtual tracks do not necessarily coincide with real tracks as they are
 just the outcome of large amalgamations.
 There are many strategies for how to select good exemplars for the bundles.
 A very fast procedure that we use in this work is to find which real track
 from the cluster is closest (by MDF distance) to the virtual track.
 We call this exemplar track 
\begin_inset Formula $e_{1}$
\end_inset

 such that 
\begin_inset Formula $e_{1}={\displaystyle \argmin_{x\in C}}\,\mathrm{MDF}(v,x)$
\end_inset

.
 The computational complexity of finding 
\begin_inset Formula $e_{1}$
\end_inset

 is linear in cluster size, and that will be very useful if we have created
 clusterings with clusters containing more than 
\begin_inset Formula $\sim5,000$
\end_inset

 tracks (depending on system memory).
 
\end_layout

\begin_layout Standard
A different exemplar can be defined as the most similar track among all
 tracks in the bundle, which we denote by 
\begin_inset Formula $e_{2}={\displaystyle \argmin_{x\in C}}\,{\displaystyle \sum_{y\in C}}\mathrm{MDF}(y,x)$
\end_inset

, or if we want to work with tracks with possibly different numbers of points
 we could instead use 
\begin_inset Formula $e_{3}={\displaystyle \argmin_{x\in C}}\,{\displaystyle \sum_{y\in C}}\mathrm{MAM}(y,x)$
\end_inset

.
 Identification of exemplar tracks of type 
\begin_inset Formula $e_{2}$
\end_inset

 and 
\begin_inset Formula $e_{3}$
\end_inset

 will be efficient only for small bundles of fewer than 
\begin_inset Formula $\sim5,000$
\end_inset

 tracks because we need to calculate all pairwise distances in the bundle.
 Many applications of the exemplars will be discussed later.
 
\end_layout

\begin_layout Standard
In summary, a virtual (centroid) track is the average of all tracks in the
 cluster.
 We call it virtual because it does not exist in the real data set
 and to distinguish it from exemplar (medoid) tracks which are again descriptors
 of the cluster but are represented by real tracks.
 
\end_layout

\begin_layout Subsection
\begin_inset CommandInset label
LatexCommand label
name "sub:Comparisons"

\end_inset

Comparisons within- and between-subjects
\end_layout

\begin_layout Subsubsection
Comparison of clusterings
\begin_inset CommandInset label
LatexCommand label
name "sub:Comparison-of-clusterings"

\end_inset


\end_layout

\begin_layout Standard
We have found rather few systematic ways available in the literature to
 compare different clustering results for tractographies directly, beyond
 that of 
\begin_inset CommandInset citation
LatexCommand cite
key "moberts2005evaluation"

\end_inset

 who quantified the agreement between a clustering and a `gold standard'
 tractography labelled by their team.
 We have used a more symmetrical measure of agreement between two clusterings
 that does not require a prior labelled data set.
 It is called Optimised Matched Agreement (OMA).
 As with the Adjusted Rand Index
\begin_inset ERT
status open

\begin_layout Plain Layout

~
\end_layout

\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "moberts2005evaluation"

\end_inset

, OMA requires the calculation of the 
\begin_inset Formula $M\times N$
\end_inset

 cross-classification matrix 
\begin_inset Formula $X=(x_{ij})$
\end_inset

 which counts the number of streamlines in the intersection of all pairs
 of clusters, one from each of the two clusterings.
 Here 
\begin_inset Formula $\mathcal{A}=\{A_{i}:i=1\dots M\}$
\end_inset

 and 
\begin_inset Formula $\mathcal{B}=\{B_{j}:j=1\ldots N\}$
\end_inset

 are the two clusterings, and 
\begin_inset Formula $x_{ij}=|A_{i}\cap B_{j}|$
\end_inset

.
 As there is no 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
emph{a priori}
\end_layout

\end_inset

 correspondence or 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
emph{matching}
\end_layout

\end_inset

 between the clusters in 
\begin_inset Formula $\mathcal{A}$
\end_inset

 and those in 
\begin_inset Formula $\mathcal{B}$
\end_inset

, and vice versa, we need to find one empirically.
 If 
\begin_inset Formula $j=\pi(i)$
\end_inset

 is such a matching then the matched agreement is 
\begin_inset Formula $\mathrm{MA}(\pi)=\sum_{i=1}^{M}x_{i,\pi(i)}$
\end_inset

.
 A matching 
\begin_inset Formula $\pi$
\end_inset

 that yields OMA by maximising 
\begin_inset Formula $\mathrm{MA}(\pi)$
\end_inset

 can be found using the Hungarian Algorithm
\begin_inset ERT
status open

\begin_layout Plain Layout

~
\end_layout

\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "Kuhn1955"

\end_inset

.
 The interpretation of the OMA statistic is analogous to that of the well-known
 Kappa measure of inter-rater agreement
\begin_inset ERT
status open

\begin_layout Plain Layout

~
\end_layout

\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "altman1995"

\end_inset

, with the range 61% to 80% corresponding to a `good' strength of agreement.
\end_layout
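
\begin_layout Standard
As a small illustration with hypothetical numbers, suppose two clusterings
 of the same 100 streamlines each contain two clusters and give the cross-classification
 matrix 
\begin_inset Formula 
\[
X=\begin{pmatrix}40 & 10\\
5 & 45
\end{pmatrix}.
\]

\end_inset

 The matching 
\begin_inset Formula $\pi(1)=1,\,\pi(2)=2$
\end_inset

 gives 
\begin_inset Formula $\mathrm{MA}(\pi)=40+45=85$
\end_inset

, whereas the swapped matching 
\begin_inset Formula $\pi(1)=2,\,\pi(2)=1$
\end_inset

 gives 
\begin_inset Formula $\mathrm{MA}(\pi)=10+5=15$
\end_inset

.
 The first matching is therefore the optimal one.
 If we assume that the agreement is reported as a fraction of the total
 number of streamlines (the normalisation is our assumption here), this
 corresponds to an OMA of 85%.
\end_layout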

\begin_layout Standard
As well as the computational overheads in calculating the cross-classification
 matrix, a further fundamental disadvantage of these methods is that they
 do not work with clusterings of different tractographies.
 Being able to compare results of clusterings is crucial for creating stable
 brain imaging procedures, and therefore it is necessary to develop a way
 to compare different tractography clusterings or different sets of streamlines
 from the same subject or different subjects.
\end_layout

\begin_layout Standard
Although we recognise that these are difficult problems, we propose the
 following approach with three novel comparison functions which we call
 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
emph{coverage}
\end_layout

\end_inset

, 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
emph{overlap}
\end_layout

\end_inset

 and 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
emph{bundle adjacency}
\end_layout

\end_inset

 (BA).
\end_layout

\begin_layout Standard
If 
\begin_inset Formula $S$
\end_inset

 and 
\begin_inset Formula $T$
\end_inset

 are sets of streamlines, and 
\begin_inset Formula $\theta>0$
\end_inset

 is selected as a threshold, we say that 
\begin_inset Formula $s\in S$
\end_inset

 has a 
\begin_inset Formula $\theta$
\end_inset

-neighbour in 
\begin_inset Formula $T$
\end_inset

 if 
\begin_inset Formula $\min_{t\in T}\mathrm{MDF}(s,t)<\theta$
\end_inset

.
 We define the coverage of 
\begin_inset Formula $S$
\end_inset

 by 
\begin_inset Formula $T$
\end_inset

 as the fraction of streamlines in 
\begin_inset Formula $S$
\end_inset

 that have a 
\begin_inset Formula $\theta$
\end_inset

-neighbour in 
\begin_inset Formula $T$
\end_inset

: 
\begin_inset Formula 
\[
\mathrm{coverage}(S,T)=|\{s\in S:s\,\mathrm{has~a}~\theta\mathrm{-neighbour~in}~T\}|/|S|.
\]

\end_inset

 Coverage ranges between 0 (when no streamline in S has a close enough neighbour
 in T) and 1 (when every streamline in S has a neighbour in T).
\end_layout

\begin_layout Standard
We define the overlap of 
\begin_inset Formula $T$
\end_inset

 in 
\begin_inset Formula $S$
\end_inset

 as the average number of 
\begin_inset Formula $\theta$
\end_inset

-neighbours in 
\begin_inset Formula $T$
\end_inset

 for streamlines in 
\begin_inset Formula $S$
\end_inset

: 
\begin_inset Formula 
\[
\mathrm{overlap}(S,T)=\sum_{s\in S}|\{t\in T:t\,\mathrm{is~a}~\theta\mathrm{-neighbour~of}~s\}|/|S|.
\]

\end_inset

 Overlap can take any non-negative value, with higher values indicating
 possible redundancy of 
\begin_inset Formula $T$
\end_inset

 in 
\begin_inset Formula $S$
\end_inset

; if 
\begin_inset Formula $T$
\end_inset

 has several similar streamlines then this will tend to boost overlap.
\end_layout

\begin_layout Standard
BA is a symmetric measure of the similarity of the two sets of streamlines
 
\begin_inset Formula $S$
\end_inset

 and 
\begin_inset Formula $T$
\end_inset

.
 BA is the average of the 
\begin_inset Formula $\theta$
\end_inset

-coverages of 
\begin_inset Formula $T$
\end_inset

 by 
\begin_inset Formula $S$
\end_inset

 and of 
\begin_inset Formula $S$
\end_inset

 by 
\begin_inset Formula $T$
\end_inset

: 
\begin_inset Formula 
\[
\mathrm{BA}(S,T)=(\mathrm{coverage}(S,T)+\mathrm{coverage}(T,S))/2.
\]

\end_inset

 BA ranges between 0, when no streamlines of S or T have neighbours in the
 other set, and 1 when they all do.
\end_layout

\begin_layout Standard
If 
\begin_inset Formula $S$
\end_inset

 is a good approximation to 
\begin_inset Formula $T$
\end_inset

 then 
\begin_inset Formula $S$
\end_inset

 will have high coverage of 
\begin_inset Formula $T$
\end_inset

; if 
\begin_inset Formula $S$
\end_inset

 has low redundancy as an approximation to 
\begin_inset Formula $T$
\end_inset

 then the overlap of 
\begin_inset Formula $S$
\end_inset

 in 
\begin_inset Formula $T$
\end_inset

 will be low; and if 
\begin_inset Formula $S$
\end_inset

 and 
\begin_inset Formula $T$
\end_inset

 are globally similar then BA will be high.
 More details on BA are presented in section 
\begin_inset CommandInset ref
LatexCommand ref
reference "sub:Tightness-comparisons-1"

\end_inset

 and there is a detailed explanation of classification measures like MA
 in section 
\begin_inset CommandInset ref
LatexCommand ref
reference "sub:Measures-to-compare"

\end_inset

.
\end_layout
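
\begin_layout Standard
A toy example with made-up counts may help to fix these definitions.
 Suppose 
\begin_inset Formula $|S|=4$
\end_inset

 and the four streamlines of 
\begin_inset Formula $S$
\end_inset

 have 
\begin_inset Formula $2,\,1,\,3$
\end_inset

 and 
\begin_inset Formula $0$
\end_inset

 
\begin_inset Formula $\theta$
\end_inset

-neighbours in 
\begin_inset Formula $T$
\end_inset

 respectively, while all 
\begin_inset Formula $|T|=5$
\end_inset

 streamlines of 
\begin_inset Formula $T$
\end_inset

 have at least one 
\begin_inset Formula $\theta$
\end_inset

-neighbour in 
\begin_inset Formula $S$
\end_inset

.
 Then 
\begin_inset Formula 
\[
\mathrm{coverage}(S,T)=\frac{3}{4},\qquad\mathrm{overlap}(S,T)=\frac{2+1+3+0}{4}=1.5,\qquad\mathrm{BA}(S,T)=\frac{0.75+1}{2}=0.875.
\]

\end_inset

 
\end_layout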

\begin_layout Subsubsection
Robustness under reordering
\end_layout

\begin_layout Standard
One of the disadvantages of most clustering algorithms is that they give
 different results with different initial conditions; for example this is
 recognised with k-means, expectation-maximization 
\begin_inset CommandInset citation
LatexCommand cite
key "dempster1977maximum"

\end_inset

 and k-centers 
\begin_inset CommandInset citation
LatexCommand cite
key "gonzalez1985clustering"

\end_inset

 where it is common practice to try a number of different random initial
 configurations.
 The same holds for QB: unless the data form distinct clusters such that the
 distance between any pair of clusters is supra-threshold and the diameter
 of every cluster is sub-threshold, different permutations of the
 same tractography will typically yield a similar number of clusters but
 different underlying clusters.
 We will examine the robustness of QB in this respect.
\end_layout

\begin_layout Standard
As a first step we recorded the numbers of QB clusters in 
\begin_inset Formula $20$
\end_inset

 different random orderings of the tractographies of 
\begin_inset Formula $10$
\end_inset

 human subjects.
 We first removed streamlines shorter than 
\begin_inset Formula $40$
\end_inset


\begin_inset ERT
status open

\begin_layout Plain Layout

~
\end_layout

\end_inset

mm and downsampled the streamlines to 
\begin_inset Formula $12$
\end_inset

 points.
 Then we applied QB with threshold at 
\begin_inset Formula $10$
\end_inset


\begin_inset ERT
status open

\begin_layout Plain Layout

~
\end_layout

\end_inset

mm.
 The mean number of clusters was 
\begin_inset Formula $2645.9$
\end_inset

 (min 
\begin_inset Formula $1937.6$
\end_inset

; max 
\begin_inset Formula $3857.8$
\end_inset

; s.d.
\begin_inset ERT
status open

\begin_layout Plain Layout

~
\end_layout

\end_inset


\begin_inset Formula $653.8$
\end_inset

).
 There is therefore a considerable between-subject variation in this metric.
 By contrast the within-subject variability of the number of clusters across
 random orderings is rather small, with mean standard deviation 
\begin_inset Formula $12.7$
\end_inset

 (min 
\begin_inset Formula $7.3$
\end_inset

; max 
\begin_inset Formula $17.4$
\end_inset

).
 This suggests a good level of consistency in the data reduction achieved
 by QB.
\end_layout

\begin_layout Standard
Next we investigated how consistent QB clusterings are when data sets are
 re-ordered.
 Twelve different random orderings were generated for each of 
\begin_inset Formula $10$
\end_inset

 tractographies and the corresponding QB clusterings were computed with MDF
 threshold 
\begin_inset Formula $10$
\end_inset


\begin_inset ERT
status open

\begin_layout Plain Layout

~
\end_layout

\end_inset

mm.
 For each subject the 
\begin_inset Formula $66$
\end_inset

 pairings of QB clusterings were compared using
 the optimised matched agreement index and then averaged.
 Across subjects the mean OMA was 74.1% (
\begin_inset Formula $\pm0.39$
\end_inset

%) which can be interpreted as a good level of agreement
\begin_inset ERT
status open

\begin_layout Plain Layout

~
\end_layout

\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "altman1995"

\end_inset

.
\end_layout
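
\begin_layout Standard
The number of pairings per subject follows from counting the unordered pairs
 that can be formed from the twelve orderings: 
\begin_inset Formula 
\[
\binom{12}{2}=\frac{12\times11}{2}=66.
\]

\end_inset

 
\end_layout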

\begin_layout Standard
As well as checking that QB created sets of centroids with good coverage
 and overlap statistics, we went on to show that the performance of QB generalis
es to sets of streamlines different from the training set, and is superior
 to a random sample of streamlines.
 We split each of the 10 tractographies randomly into two halves 
\begin_inset Formula $T_{1}$
\end_inset

 (training set) and 
\begin_inset Formula $T_{2}$
\end_inset

 (test set).
 The QB clustering at distance threshold 
\begin_inset Formula $10$
\end_inset


\begin_inset ERT
status open

\begin_layout Plain Layout

~
\end_layout

\end_inset

mm was derived for 
\begin_inset Formula $T_{1}$
\end_inset

.
 Denote by 
\begin_inset Formula $C_{1}$
\end_inset

 and 
\begin_inset Formula $c_{1}$
\end_inset

 the set of centroids and the number of them.
 Let 
\begin_inset Formula $R_{1}$
\end_inset

 be a random subset of 
\begin_inset Formula $T_{1}$
\end_inset

 of size 
\begin_inset Formula $c_{1}$
\end_inset

.
 Using the measures described in the previous section we found that with distance
 threshold 
\begin_inset Formula $10$
\end_inset


\begin_inset ERT
status open

\begin_layout Plain Layout

~
\end_layout

\end_inset

mm the mean coverage (s.d.) of 
\begin_inset Formula $T_{1}$
\end_inset

 by 
\begin_inset Formula $C_{1}$
\end_inset

 was 
\begin_inset Formula $99.96$
\end_inset

% (
\begin_inset Formula $\pm0.007$
\end_inset

%), of 
\begin_inset Formula $T_{2}$
\end_inset

 by 
\begin_inset Formula $C_{1}$
\end_inset

 was 
\begin_inset Formula $99.31$
\end_inset

% (
\begin_inset Formula $\pm0.08$
\end_inset

%) and of 
\begin_inset Formula $T_{2}$
\end_inset

 by 
\begin_inset Formula $R_{1}$
\end_inset

 was 
\begin_inset Formula $90.49$
\end_inset

% (
\begin_inset Formula $\pm0.41$
\end_inset

%).
 The mean overlap (s.d.) at this threshold of 
\begin_inset Formula $C_{1}$
\end_inset

 in 
\begin_inset Formula $T_{1}$
\end_inset

 was 
\begin_inset Formula $2.44$
\end_inset

 (
\begin_inset Formula $\pm0.08$
\end_inset

), of 
\begin_inset Formula $C_{1}$
\end_inset

 in 
\begin_inset Formula $T_{2}$
\end_inset

 was 
\begin_inset Formula $2.44$
\end_inset

 (
\begin_inset Formula $\pm0.08$
\end_inset

), and of 
\begin_inset Formula $R_{1}$
\end_inset

 in 
\begin_inset Formula $T_{2}$
\end_inset

 was 
\begin_inset Formula $5.57$
\end_inset

 (
\begin_inset Formula $\pm0.50$
\end_inset

).
\end_layout

\begin_layout Standard
The same analyses were performed with QB clusterings for distance threshold
 
\begin_inset Formula $20$
\end_inset


\begin_inset space ~
\end_inset

mm and with closeness threshold 
\begin_inset Formula $20$
\end_inset


\begin_inset space ~
\end_inset

mm.
 (Note that although we have selected the same value here for the two thresholds,
 they do not have to be the same.) We found that with distance threshold
 
\begin_inset Formula $10$
\end_inset


\begin_inset space ~
\end_inset

mm the mean coverage (s.d.) of 
\begin_inset Formula $T_{1}$
\end_inset

 by 
\begin_inset Formula $C_{1}$
\end_inset

 was 
\begin_inset Formula $99.99$
\end_inset

% (
\begin_inset Formula $\pm0.004$
\end_inset

%), of 
\begin_inset Formula $T_{2}$
\end_inset

 by 
\begin_inset Formula $C_{1}$
\end_inset

 was 
\begin_inset Formula $99.91$
\end_inset

% (
\begin_inset Formula $\pm0.02$
\end_inset

%) and of 
\begin_inset Formula $T_{2}$
\end_inset

 by 
\begin_inset Formula $R_{1}$
\end_inset

 was 
\begin_inset Formula $95.86$
\end_inset

% (
\begin_inset Formula $\pm0.62$
\end_inset

%).
 The mean overlap (s.d.) at this threshold of 
\begin_inset Formula $C_{1}$
\end_inset

 in 
\begin_inset Formula $T_{1}$
\end_inset

 was 
\begin_inset Formula $3.54$
\end_inset

 (
\begin_inset Formula $\pm0.18$
\end_inset

), of 
\begin_inset Formula $C_{1}$
\end_inset

 in 
\begin_inset Formula $T_{2}$
\end_inset

 was 
\begin_inset Formula $3.54$
\end_inset

 (
\begin_inset Formula $\pm0.18$
\end_inset

), and of 
\begin_inset Formula $R_{1}$
\end_inset

 in 
\begin_inset Formula $T_{2}$
\end_inset

 was 
\begin_inset Formula $6.53$
\end_inset

 (
\begin_inset Formula $\pm0.93$
\end_inset

).
\end_layout

\begin_layout Standard
We conclude from these analyses that QB has good coverage and overlap properties
 with respect to the training set and to the test set of streamlines, while
 an equivalent random selection of streamlines has worse coverage and overlap.
 Moreover the performance of QB is better with the lower closeness threshold.
 The poor performance of random subsets is to be expected as they will oversampl
e in denser parts of the tractography space, and undersample in sparser
 regions.
 
\end_layout

\begin_layout Standard
\begin_inset Note Note
status open

\begin_layout Plain Layout
As mentioned earlier, QB shares the behaviour of most clustering algorithms
 in that different orderings of the tracks give rise to different clusterings.
 As a first step towards examining the robustness of QB in this respect
 we recorded the numbers of QB clusters in 
\begin_inset Formula $20$
\end_inset

 different random orderings of the tractographies  of 
\begin_inset Formula $10$
\end_inset

 human subjects acquired as described in section 
\begin_inset CommandInset ref
LatexCommand ref
reference "sub:QB-Data-sets"

\end_inset

.
 We removed tracks shorter than 
\begin_inset Formula $40$
\end_inset

 mm and downsampled the tracks to 
\begin_inset Formula $12$
\end_inset

 points.
 Then we applied QB with threshold at 
\begin_inset Formula $10$
\end_inset


\begin_inset space ~
\end_inset

mm.
 The mean number of clusters was 
\begin_inset Formula $2645.9$
\end_inset

 (min 
\begin_inset Formula $1937.6$
\end_inset

; max 
\begin_inset Formula $3857.8$
\end_inset

; s.d.
 
\begin_inset Formula $653.8$
\end_inset

).
 There is therefore a considerable between-subject variation in this metric.
 By contrast, the within-subject variability of the number of clusters across
 random orderings is rather small, with mean standard deviation 
\begin_inset Formula $12.7$
\end_inset

 (min 
\begin_inset Formula $7.3$
\end_inset

; max 
\begin_inset Formula $17.4$
\end_inset

).
 This suggests an encouraging level of robustness in terms of the numbers
 of clusters that QB creates.
 We now consider ways of measuring and comparing the contents of the clusters
 in a clustering.
\end_layout

\end_inset


\end_layout

\begin_layout Subsubsection
Measures to compare classifications
\begin_inset CommandInset label
LatexCommand label
name "sub:Measures-to-compare"

\end_inset


\end_layout

\begin_layout Standard
Considerable attention has been paid to measuring the performance of one
 or more classifiers in the context of supervised learning, see for instance
 
\begin_inset CommandInset citation
LatexCommand cite
key "Kuncheva2004"

\end_inset

.
 We now outline some of these metrics before applying them to the comparisons
 we are interested in.
 Let 
\begin_inset Formula $\mathcal{A}=\{A_{1},A_{2},\ldots,A_{m}\}$
\end_inset

 and 
\begin_inset Formula $\mathcal{B}=\{B_{1},B_{2},\ldots,B_{n}\}$
\end_inset

 be two classifications of 
\begin_inset Formula $N$
\end_inset

 items.
 Let the number of items in 
\begin_inset Formula $A_{i}$
\end_inset

 and 
\begin_inset Formula $B_{j}$
\end_inset

 be 
\begin_inset Formula $a_{i}$
\end_inset

 and 
\begin_inset Formula $b_{j}$
\end_inset

, with 
\begin_inset Formula $t_{ij}$
\end_inset

 items in the intersection 
\begin_inset Formula $A_{i}\cap B_{j}$
\end_inset

.
 There are a number of ways for measuring the similarity or dissimilarity
 of 
\begin_inset Formula $\mathcal{A}$
\end_inset

 and 
\begin_inset Formula $\mathcal{B}$
\end_inset

.
 The first two, Purity and Maximum Probability Matching, are based on
 ways we might estimate the 
\begin_inset Formula $\mathcal{A}$
\end_inset

-labels if we just have the 
\begin_inset Formula $\mathcal{B}$
\end_inset

-labelling, or vice versa.
\end_layout

\begin_layout Standard

\series bold
Purity.

\series default
 Suppose we have a probability distribution 
\begin_inset Formula $P$
\end_inset

=
\begin_inset Formula $(p_{1},p_{2},\ldots,p_{m})$
\end_inset

 such that the probability that any item has label 
\begin_inset Formula $i$
\end_inset

 is 
\begin_inset Formula $p_{i}$
\end_inset

.
 Not knowing the true label of a given item, we apply 'probability matching'
 and estimate its label from the set 
\begin_inset Formula $\{1,2,\ldots,m\}$
\end_inset

 by random selection using the same distribution 
\begin_inset Formula $P$
\end_inset

.
 Then, the probability of assigning the correct label is 
\begin_inset Formula $\sum p_{i}^{2}$
\end_inset

; this is the Purity of the distribution.
 The purity of a distribution lies in the range 
\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\noun off
\color none

\begin_inset Formula $[\frac{1}{m},1]$
\end_inset

.
 The upper limit occurs when 
\begin_inset Formula $P$
\end_inset

 assigns probability 
\begin_inset Formula $1$
\end_inset

 to just one label (i.e.
 a very pure, concentrated distribution); the lower limit occurs when all
 
\begin_inset Formula $m$
\end_inset

 labels have equal probability 
\begin_inset Formula $\frac{1}{m}$
\end_inset

.
 We now extend this to the case when we have some additional information
 about the item, namely the label that is assigned to it in a different
 classification 
\begin_inset Formula $\mathcal{B}$
\end_inset

.
\end_layout

\begin_layout Standard
If 
\begin_inset Formula $P_{\mathcal{A}|B_{j}}$
\end_inset


\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\noun off
\color none
 is the observed conditional probability distribution 
\begin_inset Formula $(p_{i|j}=\frac{t_{ij}}{b_{j}},\, i=1,\ldots,m)$
\end_inset

 of 
\begin_inset Formula $\mathcal{A}$
\end_inset

 given 
\begin_inset Formula $B_{j},$
\end_inset

 then we define the Purity of 
\begin_inset Formula $\mathcal{A}$
\end_inset

 with respect to 
\begin_inset Formula $\mathcal{B}$
\end_inset

 as 
\begin_inset Formula $\mathrm{purity}(\mathcal{A}|\mathcal{B})={\displaystyle \sum_{j=1}^{n}\frac{b_{j}}{N}\thinspace\mathrm{purity}(P_{\mathcal{A}|B_{j}})}$
\end_inset

.
 In terms of the matrix 
\begin_inset Formula $T=(t_{ij})$
\end_inset

 this is the 
\begin_inset Formula $\mathcal{B}$
\end_inset

-weighted average of the purities of the columns of 
\begin_inset Formula $T.$
\end_inset

 We similarly define 
\begin_inset Formula $\mathrm{purity}(\mathcal{B}|\mathcal{A})$
\end_inset

 and it is equal to the 
\begin_inset Formula $\mathcal{A}$
\end_inset

-weighted average of the purities of the rows of 
\begin_inset Formula $T.$
\end_inset

 In what follows we will use the symmetrised value 
\begin_inset Formula $\mathrm{purity}(\mathcal{A},\mathcal{B})=[\mathrm{purity}(\mathcal{A}|\mathcal{B})+\mathrm{purity}(\mathcal{B}|\mathcal{A})]/2$
\end_inset

.
\end_layout
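
\begin_layout Standard
As a small worked illustration (the numbers are hypothetical and chosen only
 for ease of arithmetic), suppose 
\begin_inset Formula $N=10$
\end_inset

 items are classified by 
\begin_inset Formula $\mathcal{A}$
\end_inset

 into 
\begin_inset Formula $m=2$
\end_inset

 classes and by 
\begin_inset Formula $\mathcal{B}$
\end_inset

 into 
\begin_inset Formula $n=3$
\end_inset

 classes, with overlap matrix 
\begin_inset Formula 
\[
T=(t_{ij})=\begin{pmatrix}4 & 1 & 0\\
0 & 1 & 4
\end{pmatrix},\qquad(a_{i})=(5,5),\qquad(b_{j})=(4,2,4).
\]

\end_inset

Conditioning on each 
\begin_inset Formula $B_{j}$
\end_inset

 in turn gives purities 
\begin_inset Formula $1$
\end_inset

, 
\begin_inset Formula $\tfrac{1}{2}$
\end_inset

 and 
\begin_inset Formula $1$
\end_inset

, while conditioning on each 
\begin_inset Formula $A_{i}$
\end_inset

 gives 
\begin_inset Formula $(\tfrac{4}{5})^{2}+(\tfrac{1}{5})^{2}=0.68$
\end_inset

 in both cases, so that 
\begin_inset Formula 
\[
\mathrm{purity}(\mathcal{A}|\mathcal{B})=\tfrac{4}{10}\cdot1+\tfrac{2}{10}\cdot\tfrac{1}{2}+\tfrac{4}{10}\cdot1=0.90,\qquad\mathrm{purity}(\mathcal{B}|\mathcal{A})=\tfrac{5}{10}\cdot0.68+\tfrac{5}{10}\cdot0.68=0.68,
\]

\end_inset

and the symmetrised value is 
\begin_inset Formula $(0.90+0.68)/2=0.79$
\end_inset

.
\end_layout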

\begin_layout Standard

\series bold
Maximum probability matching.

\series default
 Another way to estimate a label for each item is to assign it the label
 with maximum probability 
\begin_inset Formula $i_{\mathrm{max}}=\argmax\, p_{i}$
\end_inset

.
 The Random Accuracy in this case is 
\begin_inset Formula $p_{i_{\mathrm{max}}}=\max_{i}p_{i}$
\end_inset

.
 When we do this conditional on the 
\begin_inset Formula $\mathcal{B}$
\end_inset

-label and average over those labels, we get the Maximum Probability Matching
 of 
\begin_inset Formula $\mathcal{A}$
\end_inset

 conditional on 
\begin_inset Formula $\mathcal{B}$
\end_inset

, 
\begin_inset Formula 
\[
\mathrm{MPM}(\mathcal{A}|\mathcal{B})=\sum_{j=1}^{n}\frac{b_{j}}{N}\thinspace\max_{i}p_{i|B_{j}}.
\]

\end_inset

We define 
\begin_inset Formula $\mathrm{MPM}(\mathcal{B}|\mathcal{A})$
\end_inset

 similarly, 
\begin_inset Formula $\mathrm{MPM}(\mathcal{B}|\mathcal{A})={\displaystyle \sum_{i=1}^{m}\frac{a_{i}}{N}\thinspace\max_{j}p_{j|A_{i}}}$
\end_inset

.
 A further simplification is to use the symmetrized value
\begin_inset Formula 
\[
\mathrm{MPM}(\mathcal{A},\mathcal{B})=[\mathrm{MPM}(\mathcal{A}|\mathcal{B})+\mathrm{MPM}(\mathcal{B}|\mathcal{A})]/2.
\]

\end_inset


\end_layout
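
\begin_layout Standard
A hypothetical example may help here.
 Take 
\begin_inset Formula $N=10$
\end_inset

 items and the overlap matrix 
\begin_inset Formula 
\[
T=(t_{ij})=\begin{pmatrix}4 & 1 & 0\\
0 & 1 & 4
\end{pmatrix},\qquad(a_{i})=(5,5),\qquad(b_{j})=(4,2,4).
\]

\end_inset

Since 
\begin_inset Formula $\frac{b_{j}}{N}\max_{i}\frac{t_{ij}}{b_{j}}=\frac{1}{N}\max_{i}t_{ij}$
\end_inset

, the conditional measures reduce to sums of column maxima and row maxima
 of 
\begin_inset Formula $T$
\end_inset

 respectively: 
\begin_inset Formula 
\[
\mathrm{MPM}(\mathcal{A}|\mathcal{B})=\frac{4+1+4}{10}=0.90,\qquad\mathrm{MPM}(\mathcal{B}|\mathcal{A})=\frac{4+4}{10}=0.80,
\]

\end_inset

giving a symmetrized value of 
\begin_inset Formula $0.85$
\end_inset

.
\end_layout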

\begin_layout Standard

\series bold
Correctness and completeness 
\series default
(splitting and lumping pairs of items).

\series bold
 
\series default
For the next two metrics the focus moves to comparison of the labels assigned
 by 
\begin_inset Formula $\mathcal{A}$
\end_inset

 and 
\begin_inset Formula $\mathcal{B}$
\end_inset

 to pairs of items.
 Differences in the partitions 
\begin_inset Formula $\mathcal{A}$
\end_inset

 and 
\begin_inset Formula $\mathcal{B}$
\end_inset

 are reflected in two ways.
 Items assigned the same label by 
\begin_inset Formula $\mathcal{A}$
\end_inset

 are said to be split by 
\begin_inset Formula $\mathcal{B}$
\end_inset

 if their 
\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\noun off
\color none

\begin_inset Formula $\mathcal{B}$
\end_inset

-labels are not equal; alternatively items assigned different 
\begin_inset Formula $\mathcal{A}$
\end_inset

-labels are said to be lumped 
\family default
\series default
\shape default
\size default
\emph default
\bar default
\noun default
\color inherit
by 
\begin_inset Formula $\mathcal{B}$
\end_inset

 if they are assigned the same 
\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\noun off
\color none

\begin_inset Formula $\mathcal{B}$
\end_inset

-label.
 Note that what is lumped (split) by 
\begin_inset Formula $\mathcal{B}$
\end_inset

 will equally be split (lumped) by 
\family default
\series default
\shape default
\size default
\emph default
\bar default
\noun default
\color inherit

\begin_inset Formula $\mathcal{A}$
\end_inset

.
\end_layout

\begin_layout Standard
The total number of pairs from 
\begin_inset Formula $N$
\end_inset

 items is 
\begin_inset Formula $\mathtt{pairs}(\mathcal{A})=\binom{N}{2}=\frac{N(N-1)}{2}$
\end_inset

.
 The number of pairs assigned the same 
\begin_inset Formula $\mathcal{A}$
\end_inset

-labels is 
\begin_inset Formula ${\displaystyle \mathtt{together}(\mathcal{A})=\sum_{i=1}^{m}\binom{a_{i}}{2}}$
\end_inset

.
 The number of pairs assigned different labels is 
\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\noun off
\color none

\begin_inset Formula $\mathtt{apart}(\mathcal{A})=\mathtt{pairs}(\mathcal{A})-\mathtt{together}(\mathcal{A})$
\end_inset

.
 This can also be written as 
\begin_inset Formula ${\displaystyle \sum_{1\le i<i'\le m}a_{i}a_{i'}}$
\end_inset

, which in turn can be expressed in terms of the cumulative sums of 
\begin_inset Formula $(a_{i})$
\end_inset

; this is an efficient way of programming such sums of products over all
 pairs of unequal subscripts.
 The number of 
\begin_inset Formula $\mathcal{A}$
\end_inset

-pairs split by 
\begin_inset Formula $\mathcal{B}$
\end_inset

 is 
\begin_inset Formula 
\[
\mathtt{split}(\mathcal{A}|\mathcal{B})=\sum_{i=1}^{m}\Bigl(\sum_{1\le j<j'\le n}t_{ij}t_{ij'}\Bigr)=\mathtt{lumped}(\mathcal{B}|\mathcal{A}).
\]

\end_inset

 Similarly, 
\begin_inset Formula 
\[
\mathtt{lumped}(\mathcal{A}|\mathcal{B})=\sum_{j=1}^{n}\Bigl(\sum_{1\le i<i'\le m}t_{ij}t_{i'j}\Bigr)=\mathtt{split}(\mathcal{B}|\mathcal{A}).
\]

\end_inset


\end_layout

\begin_layout Standard
Completeness and Correctness are defined in terms of these quantities: 
\begin_inset Formula 
\[
\mathtt{completeness}(\mathcal{A}|\mathcal{B})=1-\mathtt{split}(\mathcal{A}|\mathcal{B})/\mathtt{together}(\mathcal{A})
\]

\end_inset

 and 
\begin_inset Formula 
\[
\mathtt{correctness}(\mathcal{A}|\mathcal{B})=1-\mathtt{lumped}(\mathcal{A}|\mathcal{B})/\mathtt{apart}(\mathcal{A}).
\]

\end_inset

 Symmetrized measures of completeness and correctness for 
\begin_inset Formula $\mathcal{A}$
\end_inset

 and 
\begin_inset Formula $\mathcal{B}$
\end_inset

 are defined as 
\begin_inset Formula 
\[
\mathtt{completeness}(\mathcal{A},\mathcal{B})=[\mathtt{completeness}(\mathcal{A}|\mathcal{B})+\mathtt{completeness}(\mathcal{B}|\mathcal{A})]/2
\]

\end_inset


\begin_inset Formula 
\[
\mathtt{correctness}(\mathcal{A},\mathcal{B})=[\mathtt{correctness}(\mathcal{A}|\mathcal{B})+\mathtt{correctness}(\mathcal{B}|\mathcal{A})]/2.
\]

\end_inset

For the clusterings encountered in tractography, the number of apart pairs
 in 
\begin_inset Formula $\mathcal{A}$
\end_inset

 is very high, and only a small percentage (e.g.
 
\begin_inset Formula $0.5\%$
\end_inset

) of these pairs will be lumped by 
\begin_inset Formula $\mathcal{B}$
\end_inset

.
 This is because the average cluster size is small by comparison with the
 number of clusters.
 As a consequence, the correctness measure is not a particularly useful
 metric.
 By contrast, the number of together pairs is modest, and the completeness
 measure is more sensitive.
\end_layout
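
\begin_layout Standard
To make these pair counts concrete, consider a hypothetical example with
 
\begin_inset Formula $N=10$
\end_inset

 items, counting each unordered pair of items once, and overlap matrix 
\begin_inset Formula 
\[
T=(t_{ij})=\begin{pmatrix}4 & 1 & 0\\
0 & 1 & 4
\end{pmatrix},\qquad(a_{i})=(5,5),\qquad(b_{j})=(4,2,4).
\]

\end_inset

There are 
\begin_inset Formula $\binom{10}{2}=45$
\end_inset

 pairs in total, of which 
\begin_inset Formula $\mathtt{together}(\mathcal{A})=10+10=20$
\end_inset

 and 
\begin_inset Formula $\mathtt{apart}(\mathcal{A})=25$
\end_inset

, while 
\begin_inset Formula $\mathtt{together}(\mathcal{B})=6+1+6=13$
\end_inset

 and 
\begin_inset Formula $\mathtt{apart}(\mathcal{B})=32$
\end_inset

.
 Summing products of distinct entries within each row gives 
\begin_inset Formula $\mathtt{split}(\mathcal{A}|\mathcal{B})=4+4=8$
\end_inset

, and within each column gives 
\begin_inset Formula $\mathtt{lumped}(\mathcal{A}|\mathcal{B})=0+1+0=1$
\end_inset

.
 Hence 
\begin_inset Formula 
\[
\mathtt{completeness}(\mathcal{A}|\mathcal{B})=1-\tfrac{8}{20}=0.60,\qquad\mathtt{correctness}(\mathcal{A}|\mathcal{B})=1-\tfrac{1}{25}=0.96,
\]

\end_inset


\begin_inset Formula 
\[
\mathtt{completeness}(\mathcal{B}|\mathcal{A})=1-\tfrac{1}{13}\approx0.92,\qquad\mathtt{correctness}(\mathcal{B}|\mathcal{A})=1-\tfrac{8}{32}=0.75,
\]

\end_inset

so the symmetrized values are approximately 
\begin_inset Formula $0.76$
\end_inset

 for completeness and 
\begin_inset Formula $0.86$
\end_inset

 for correctness.
\end_layout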

\begin_layout Standard

\series bold
Maximum Agreement (
\begin_inset Formula $\kappa_{\max}$
\end_inset

).
 
\series default
Our fifth metric is Cohen's 
\begin_inset Formula $\kappa$
\end_inset

, which is a well-known measure of agreement between raters on the assignment
 of a set of items to a shared classification scheme.
 It adjusts the agreements (items on which the raters agree) for the number
 of agreements that might have occurred by chance:
\end_layout

\begin_layout LyX-Code
\begin_inset Formula 
\[
\kappa=\mathtt{\frac{p_{agreement}-\mathtt{p_{chance\: agreement}}}{1-p_{chance\: agreement}}}.
\]

\end_inset


\end_layout

\begin_layout Standard
This can be simply represented in terms of the overlap matrix 
\begin_inset Formula $T=(t_{ij})$
\end_inset

 by the formula:
\begin_inset Formula 
\[
\kappa(T)=\frac{{\displaystyle \sum_{i=1}^{M}}t_{ii}/N-{\displaystyle \sum_{i=1}^{M}}r_{i}c_{i}/N^{2}}{1-{\displaystyle \sum_{i=1}^{M}}r_{i}c_{i}/N^{2}},
\]

\end_inset

where 
\begin_inset Formula $r_{i}$
\end_inset

 and 
\begin_inset Formula $c_{j}$
\end_inset

 represent the row and column totals of 
\begin_inset Formula $T$
\end_inset

.
 We have extended 
\begin_inset Formula $T$
\end_inset

 to a square matrix of size 
\begin_inset Formula $M=\max(m,n)$
\end_inset

 by adding, if necessary, rows or columns of zeros.
 When we adapt this measure to the case of comparing two clusterings we
 further need to take into account the lack of prior correspondence between
 the two sets of labels.
 The 
\begin_inset Formula $\kappa_{\max}$
\end_inset

 statistic is the result of maximising 
\begin_inset Formula $\kappa$
\end_inset

 over all possible correspondences:
\begin_inset Formula 
\[
\kappa_{\max}=\max_{\pi}\kappa(T_{\pi})=\max_{\pi}\frac{{\displaystyle \sum_{i=1}^{M}}t_{i\pi(i)}/N-{\displaystyle \sum_{i=1}^{M}}r_{i}c_{\pi(i)}/N^{2}}{1-{\displaystyle \sum_{i=1}^{M}}r_{i}c_{\pi(i)}/N^{2}},
\]

\end_inset

where 
\begin_inset Formula $T_{\pi}$
\end_inset

 is the matrix 
\begin_inset Formula $T$
\end_inset

 with columns reordered by a permutation 
\begin_inset Formula $\pi.$
\end_inset

 The principal trouble with the 
\begin_inset Formula $\kappa_{\max}$
\end_inset

 statistic is that its computation is 
\begin_inset Formula $O(M!)$
\end_inset

 if all permutations are tried.
 One way to overcome the problem caused by the size of the search set might
 be to use a randomised search strategy, for instance one based on simulated
 annealing.
 
\end_layout

\begin_layout Standard

\series bold
Matched Agreement via the Hungarian Method.
 
\series default
An alternative is to look for a simpler quantity that might be optimised.
 One obvious choice is the maximized number of agreements 
\begin_inset Formula $\mu_{\max}=\max_{\pi}{\displaystyle \sum_{i=1}^{M}}t_{i\pi(i)}$
\end_inset

 taken over all permutations 
\begin_inset Formula $\pi$
\end_inset

; this is the leading term in the numerator of 
\begin_inset Formula $\kappa_{\max}$
\end_inset

.
 Maximizing the number of agreements amongst all permutations 
\begin_inset Formula $\pi$
\end_inset

 is a classical combinatorial optimization problem (weighted assignment
 problem on a bipartite graph) that can be reformulated as a linear programming
 problem whose efficient solution by the Hungarian Method
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "Kuhn1955"

\end_inset

 is well known.
\end_layout
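
\begin_layout Standard
A small hypothetical example illustrates both quantities.
 With 
\begin_inset Formula $N=10$
\end_inset

, 
\begin_inset Formula $m=2$
\end_inset

 and 
\begin_inset Formula $n=3$
\end_inset

, the overlap matrix padded with a row of zeros to size 
\begin_inset Formula $M=3$
\end_inset

 is 
\begin_inset Formula 
\[
T=\begin{pmatrix}4 & 1 & 0\\
0 & 1 & 4\\
0 & 0 & 0
\end{pmatrix},\qquad(r_{i})=(5,5,0),\qquad(c_{j})=(4,2,4).
\]

\end_inset

With the labels matched in their given order, 
\begin_inset Formula $\kappa(T)=(0.5-0.3)/(1-0.3)\approx0.29$
\end_inset

.
 Enumerating all 
\begin_inset Formula $3!=6$
\end_inset

 permutations shows that the best correspondence is 
\begin_inset Formula $\pi=(1,3,2)$
\end_inset

, for which 
\begin_inset Formula 
\[
\sum_{i}t_{i\pi(i)}=4+4+0=8,\qquad\sum_{i}r_{i}c_{\pi(i)}/N^{2}=(20+20+0)/100=0.4,\qquad\kappa_{\max}=\frac{0.8-0.4}{1-0.4}=\tfrac{2}{3}.
\]

\end_inset

In this example the same permutation also maximizes the number of agreements,
 so that 
\begin_inset Formula $\mu_{\max}=8$
\end_inset

, i.e.
 a matched agreement of 
\begin_inset Formula $8/10$
\end_inset

.
\end_layout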

\begin_layout Standard
We have tested various published implementations of the version by Lawler
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "Lawler2001"

\end_inset

 of the Hungarian Method and have found that the one by Carpaneto et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "Carpaneto1988"

\end_inset

, implemented by them 
\begin_inset CommandInset citation
LatexCommand cite
key "CarpanetoAPC"

\end_inset

 in 
\begin_inset Formula $\textsc{Fortran}$
\end_inset

, is both fast and capable of handling assignment problems of unlimited
 size.
\end_layout

\begin_layout Standard
\begin_inset Float table
wide false
sideways false
status open

\begin_layout Plain Layout
\noindent
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="3" columns="7">
<features tabularvalignment="middle">
<column alignment="center" valignment="top" width="0">
<column alignment="center" valignment="top" width="0">
<column alignment="center" valignment="top" width="0">
<column alignment="center" valignment="top" width="0">
<column alignment="center" valignment="top" width="0">
<column alignment="center" valignment="top" width="0">
<column alignment="center" valignment="top" width="0">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Metric
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Purity
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
MPM
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Comp
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Corr
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
MA
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
MK
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Mean
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
70.8
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
79.2
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
65.5
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
99.9
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
74.1
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
74.0
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" bottomline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Mean S.D.
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" bottomline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.51
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" bottomline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.37
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" bottomline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
1.11
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" bottomline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.02
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" bottomline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.39
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" bottomline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
0.39
\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset Caption

\begin_layout Plain Layout

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\noun off
\color none
Mean and mean standard deviation of six classification comparison metrics
 for 10 different tractographies: Purity, Maximum Probability Matching (MPM),
 Completeness (Comp), Correctness (Corr), Matched Agreement (MA) and Matched
 Kappa (MK).
 For each of 
\begin_inset Formula $10$
\end_inset

 tractographies the 
\begin_inset Formula $66$
\end_inset

 pairings of QB clusterings for 
\begin_inset Formula $12$
\end_inset

 different orderings were evaluated.
 All are represented as percentages (%).
 Matched agreement uses the Hungarian Algorithm to create a one-to-one mapping
 between the clusters of each pair of clusterings; matched kappa evaluates
 Cohen's kappa using this same optimal mapping.
\end_layout

\end_inset


\begin_inset CommandInset label
LatexCommand label
name "Flo:comparison_metrics"

\end_inset


\end_layout

\end_inset

We calculated the average of each of these comparison metrics for QB clusterings
 of 
\begin_inset Formula $12$
\end_inset


\begin_inset Note Note
status open

\begin_layout Plain Layout
20?
\end_layout

\end_inset

 different orderings for each of 
\begin_inset Formula $10$
\end_inset

 tractographies (see Tab.
 
\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:comparison_metrics"

\end_inset

).
 A number of observations are worth making.
 Matched agreement and matched kappa are essentially the same metric (correlatio
n 
\begin_inset Formula $0.97$
\end_inset

).
 Of these two metrics we prefer matched agreement because it is simpler both
 to calculate and to understand.
 Correctness, for the reasons discussed above, is too insensitive 
\begin_inset Note Note
status open

\begin_layout Plain Layout
(near to ceiling)
\end_layout

\end_inset

 to be of use.
 We would therefore suggest, on the basis of the mean of the standard deviations
 across pairings, that maximum probability matching and matched agreement
 are suitable metrics for evaluating tractography clusterings.
 It is also worth noting that maximum probability matching is a simple first
 approximation to the optimal matching identified by the Hungarian method
 although it is not necessarily one-to-one.
\end_layout

\begin_layout Standard
We have noticed that these metrics are all costly to calculate in terms
 of time and memory requirements.
 Therefore, they will not be used further in this study.
 We instead look at ways to compare clusterings of tractographies that will
 work when comparing different tractographies either for the same or different
 subjects.
 These need to be based on metrics for distances between tracks, whether
 virtual tracks, exemplar tracks or raw tracks from the original tractographies.
 This is the subject of the next section.
 
\end_layout

\begin_layout Subsubsection
Bundle Adjacency
\begin_inset CommandInset label
LatexCommand label
name "sub:Tightness-comparisons-1"

\end_inset


\end_layout

\begin_layout Standard
We have found rather few systematic ways to compare different clustering
 results for tractographies in the literature
\series bold
 
\series default

\begin_inset CommandInset citation
LatexCommand cite
key "moberts2005evaluation"

\end_inset

.
 Being able to compare results of clusterings is crucial for creating stable
 brain imaging procedures.
 It is therefore necessary to develop a way to compare different clusterings
 of the same subject or different subjects.
 Although this is a difficult problem, we propose the following solution
 with a metric which we call bundle adjacency (BA).
 BA works as follows: let us assume that we have gathered the exemplar tracks
 from clustering 
\begin_inset Formula $A$
\end_inset

 in 
\begin_inset Formula $E_{A}=\{e_{1},...,e_{|E_{A}|}\}$
\end_inset

 and from clustering 
\begin_inset Formula $B$
\end_inset

 in 
\begin_inset Formula $E_{B}=\{e_{1}^{'},...,e_{|E_{B}|}^{'}\}$
\end_inset

 where 
\begin_inset Formula $|E|$
\end_inset

 denotes the number of exemplar tracks in a set 
\begin_inset Formula $E$
\end_inset

.
 The size of set 
\begin_inset Formula $E_{A}$
\end_inset

 does not need to be the same as that of 
\begin_inset Formula $E_{B}$
\end_inset

 (i.e.
 both 
\begin_inset Formula $|E_{A}|\neq|E_{B}|$
\end_inset

 and 
\begin_inset Formula $|E_{A}|=|E_{B}|$
\end_inset

 are acceptable).
 Next, we calculate all pairwise MDF distances between the two sets and
 store them in rectangular matrix 
\begin_inset Formula $D_{AB}$
\end_inset

.
 The minima of the rows of 
\begin_inset Formula $D_{AB}$
\end_inset

 provide the distance to the nearest track in 
\begin_inset Formula $B$
\end_inset

 of each track in 
\begin_inset Formula $A$
\end_inset

 (
\begin_inset Formula $E_{A\rightarrow B}$
\end_inset

) and similarly the minima of the columns of 
\begin_inset Formula $D_{AB}$
\end_inset

 provide the distance to the nearest track in 
\begin_inset Formula $A$
\end_inset

 of each track in 
\begin_inset Formula $B$
\end_inset

 (
\begin_inset Formula $E_{B\rightarrow A}$
\end_inset

).
 From these correspondences we only keep those distances that are smaller
 than a tight threshold 
\begin_inset Formula $\theta$
\end_inset

.
 Then we define BA (Bundle Adjacency) to be
\begin_inset Formula 
\begin{equation}
BA=\frac{1}{2}\left(\frac{|E_{A\rightarrow B}\leq\theta|}{|E_{A}|}+\frac{|E_{B\rightarrow A}\leq\theta|}{|E_{B}|}\right)
\end{equation}

\end_inset


\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
noindent
\end_layout

\end_inset

 where 
\begin_inset Formula $|E_{A\rightarrow B}\leq\theta|$
\end_inset

 denotes the number of exemplars from A that have a neighbour in B closer
 than 
\begin_inset Formula $\theta$
\end_inset

, and similarly 
\begin_inset Formula $|E_{B\rightarrow A}\leq\theta|$
\end_inset

 denotes the number of exemplars from B that have a neighbour in A closer than
 
\begin_inset Formula $\theta$
\end_inset

 (see a similar definition of BA in section 
\begin_inset CommandInset ref
LatexCommand ref
reference "sub:Comparison-of-clusterings"

\end_inset

).
 In other words, BA is the mean of the fraction of row minima of 
\begin_inset Formula $D_{AB}$
\end_inset

 that are less than 
\begin_inset Formula $\theta$
\end_inset

 and the fraction of column minima less than 
\begin_inset Formula $\theta$
\end_inset

.
 When 
\begin_inset Formula $BA=0$
\end_inset

 every exemplar from each set is further than 
\begin_inset Formula $\theta$
\end_inset

 from all exemplars in the other set.
 When 
\begin_inset Formula $BA=1$
\end_inset

 all exemplars from each set have a 
\begin_inset Formula $\theta$
\end_inset

-close neighbour in the other set.
 This metric is extremely useful especially when comparing tractographies
 from different subjects because it does not require 
\begin_inset Formula $|E_{A}|=|E_{B}|$
\end_inset

 which was a requirement with the metrics proposed in the previous section.
\end_layout
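
\begin_layout Standard
To illustrate with hypothetical MDF distances (in mm), suppose 
\begin_inset Formula $|E_{A}|=3$
\end_inset

, 
\begin_inset Formula $|E_{B}|=2$
\end_inset

, 
\begin_inset Formula $\theta=10$
\end_inset

 and 
\begin_inset Formula 
\[
D_{AB}=\begin{pmatrix}4 & 12\\
11 & 9\\
15 & 20
\end{pmatrix}.
\]

\end_inset

The row minima are 
\begin_inset Formula $(4,9,15)$
\end_inset

, of which two are within 
\begin_inset Formula $\theta$
\end_inset

, and the column minima are 
\begin_inset Formula $(4,9)$
\end_inset

, both within 
\begin_inset Formula $\theta$
\end_inset

.
 Therefore 
\begin_inset Formula 
\[
BA=\frac{1}{2}\left(\frac{2}{3}+\frac{2}{2}\right)=\frac{5}{6}\approx0.83.
\]

\end_inset


\end_layout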

\begin_layout Standard
We ran an experiment where we compared BA between pairs of 
\begin_inset Formula $10$
\end_inset

 subjects with their tractographies warped into MNI space (see section
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "sub:QB-Data-sets"

\end_inset

).
 This generated 
\begin_inset Formula $\binom{10}{2}=45$
\end_inset

 BA values with 
\begin_inset Formula $\theta=$
\end_inset


\begin_inset Formula $10$
\end_inset

 mm.
 We performed this experiment twice; first by only keeping the bundles with
 more than 
\begin_inset Formula $10$
\end_inset

 tracks (BA10) and secondly by only keeping the bundles with more than 
\begin_inset Formula $100$
\end_inset

 tracks (BA100).
 The average value for BA10 was 
\begin_inset Formula $47\%$
\end_inset

 and standard deviation 
\begin_inset Formula $2.6\%$
\end_inset

.
 As expected BA100 (bigger landmarks) performed better with average value
 of 
\begin_inset Formula $53\%$
\end_inset

 and standard deviation 
\begin_inset Formula $4.9\%$
\end_inset

.
 The difference between BA10 and BA100 is highly significant: Student's
 t
\begin_inset Formula $=4.692$
\end_inset

, df=88, 
\begin_inset Formula $p=1.97\times10^{-5}$
\end_inset

, two-sided; and, as a precaution against non-normality of the underlying
 distributions, Mann-Whitney U = 530, 
\begin_inset Formula $p=5.65\times10^{-5}$
\end_inset

.
 If we think that the small bundles of size 
\begin_inset Formula $<100$
\end_inset

 are more idiosyncratic or possibly more likely to reflect noise in the
 data, whereas larger bundles are more indicative of substantial structures
 and landmarks in the tractographies, then we are encouraged to see that
 on average the virtual tracks of 
\begin_inset Formula $50\%$
\end_inset

 of larger bundles of each tractography lie within 
\begin_inset Formula $10$
\end_inset

 mm of those of the other tractographies.
 This supports the notion that QB can be used to find agreements between
 different brains by concentrating on the larger (more important) clusters.
 Further evidence of this is discussed in section 
\begin_inset CommandInset ref
LatexCommand ref
reference "sub:Atlases-made-easy"

\end_inset

.
\end_layout

\begin_layout Subsection
Parallel version
\begin_inset CommandInset label
LatexCommand label
name "sub:Parallel-version"

\end_inset


\end_layout

\begin_layout Subsubsection
Algorithm
\end_layout

\begin_layout Standard
QB is a very fast algorithm; however we wanted to make it even more efficient
 so that for example it is trivial to cluster hundreds of subjects together
 and use many CPUs or computers simultaneously.
 This could be used to create an atlas of hundreds of subjects in a few
 minutes.
 Therefore, we have extended QB to a parallel version which we call pQuickBundle
s (pQB).
 This algorithm works as follows.
 We first redirect and downsample all tracks.
 Then we put all tracks together and break them into subsets.
 For every subset we assign a new thread and set QB to run on that thread.
 We have therefore many QBs running on different CPUs.
 Then we collect all individual clusterings and start merging them together.
 We can pair every two results together and merge them in a binary fashion
 or just merge all clusterings to the first clustering.
 Merging can be done in many different ways.
 We present here a simple but useful approach.
\end_layout

\begin_layout Subsubsection

\series bold
Merging
\series default
 two sets of bundles
\end_layout

\begin_layout Standard
We can merge bundles using exemplar tracks or virtual tracks.
 We first set a distance threshold 
\begin_inset Formula $\theta$
\end_inset

, usually the same as the one we used for the QBs in the previous step.
 Let us assume now that we have gathered the virtual tracks from clustering
 
\begin_inset Formula $A$
\end_inset

 in 
\begin_inset Formula $V_{A}=\{v_{1},...,v_{|V_{A}|}\}$
\end_inset

 and from clustering 
\begin_inset Formula $B$
\end_inset

 in 
\begin_inset Formula $V_{B}=\{v_{1}^{'},...,v_{|V_{B}|}^{'}\}$
\end_inset

 where 
\begin_inset Formula $|V|$
\end_inset

 denotes the number of virtual tracks of each clustering.
 
\begin_inset Formula $|V_{A}|$
\end_inset

 can be different from 
\begin_inset Formula $|V_{B}|$
\end_inset

.
 (a) For every 
\begin_inset Formula $v_{i}^{'}$
\end_inset

 in set 
\begin_inset Formula $V_{B}$
\end_inset

 we find the closest 
\begin_inset Formula $v_{j}$
\end_inset

 in set 
\begin_inset Formula $V_{A}$
\end_inset

 and store the distance between these two tracks.
 Therefore we now have a set of minimum distances from 
\begin_inset Formula $V_{B}$
\end_inset

 to 
\begin_inset Formula $V_{A}$
\end_inset

.
 The size of this set is equal to 
\begin_inset Formula $|V_{B}|$
\end_inset

.
 (b) Finally, we merge those clusters from 
\begin_inset Formula $B$
\end_inset

 whose virtual tracks have minimum distances smaller than 
\begin_inset Formula $\theta$
\end_inset

 into the corresponding clusters of 
\begin_inset Formula $A$
\end_inset

, and if a virtual track in 
\begin_inset Formula $V_{B}$
\end_inset

 has no sub-threshold neighbour in 
\begin_inset Formula $V_{A}$
\end_inset

 then its cluster becomes a new cluster in the merged clustering.
 In that way clusters from the two sets that have very similar features will
 merge together.
 Otherwise, new clusters will be created.
 Using this approach, no information loss will occur from the merge of the
 two sets of clusters.
 
\end_layout
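
\begin_layout Standard
As a hypothetical illustration of steps (a) and (b), suppose 
\begin_inset Formula $|V_{A}|=2$
\end_inset

, 
\begin_inset Formula $|V_{B}|=3$
\end_inset

 and 
\begin_inset Formula $\theta=10$
\end_inset

 mm, and that the minimum distances from 
\begin_inset Formula $v_{1}^{'},v_{2}^{'},v_{3}^{'}$
\end_inset

 to 
\begin_inset Formula $V_{A}$
\end_inset

 are 
\begin_inset Formula $3$
\end_inset

, 
\begin_inset Formula $14$
\end_inset

 and 
\begin_inset Formula $8$
\end_inset

 mm, attained at 
\begin_inset Formula $v_{2},v_{1}$
\end_inset

 and 
\begin_inset Formula $v_{2}$
\end_inset

 respectively.
 The clusters of 
\begin_inset Formula $v_{1}^{'}$
\end_inset

 and 
\begin_inset Formula $v_{3}^{'}$
\end_inset

 are then merged into the cluster of 
\begin_inset Formula $v_{2}$
\end_inset

, whereas the cluster of 
\begin_inset Formula $v_{2}^{'}$
\end_inset

, whose nearest neighbour is further than 
\begin_inset Formula $\theta$
\end_inset

, becomes a new cluster.
 The merged clustering therefore has 
\begin_inset Formula $2+1=3$
\end_inset

 clusters and contains every track of both inputs.
\end_layout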

\begin_layout Subsection
Direct applications
\end_layout

\begin_layout Standard
We found that QB has numerous applications from detecting erroneous tracks
 to creating atlases, finding landmarks and guiding registration algorithms.
 Here we present just a few of the strategies that can be further pursued.
\end_layout

\begin_layout Subsubsection
Rapidly detecting erroneous tracks
\end_layout

\begin_layout Standard
It is well known that there are different artifacts seen in tractographies
 caused by subject motion, poor voxel reconstruction, incorrect tracking
 and many other reasons.
 There is no known automatic method to detect these tracks and therefore
 remove them from the data sets.
 The idea here is to use QB to speed up the search for erroneous tracks.
 We will concentrate on tracks that loop one or more times, something that
 is considered impossible in nature.
\end_layout

\begin_layout Standard
Tracks most likely to be erroneous are those which wind more than once,
 like a spiral.
 We can detect those with the following approach: let us assume that we
 have a track 
\begin_inset Formula $s$
\end_inset

 and we want to check if it winds: (a) we perform a singular value decomposition
 on the centered track 
\begin_inset Formula $U,\mathbf{d},V=\mathtt{SVD}(s-\bar{s})$
\end_inset

; (b) project the highest singular value 
\begin_inset Formula $\mathbf{d_{0}}$
\end_inset

 to the first column of 
\begin_inset Formula $U,$
\end_inset

 
\begin_inset Formula $U_{0}$
\end_inset

 creating the first component of a two dimensional coordinate 
\begin_inset Formula $p_{x}$
\end_inset

 and the second highest 
\begin_inset Formula $\mathbf{d_{1}}$
\end_inset

 to the second column 
\begin_inset Formula $U_{1}$
\end_inset

 creating the second coordinate 
\begin_inset Formula $p_{y}$
\end_inset

; and (c) calculate the cumulative winding angle on the 2D plane; d) if
 the cumulative angle is more that 
\begin_inset Formula $400^{\circ}$
\end_inset

 it would mean that the initial track 
\begin_inset Formula $s$
\end_inset

 is winding and therefore needs to be removed (see Fig.
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:winding"

\end_inset

).
\end_layout
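
\begin_layout Standard
The winding test described above can be sketched in a few lines of Python
 with NumPy.
 This is only an illustration under stated assumptions: the function name
 winding_angle is hypothetical, and the cumulative angle is taken here to
 be the sum of the signed turning angles between consecutive projected segments,
 which is one plausible reading of the cumulative winding angle.
\end_layout

```python
import numpy as np


def winding_angle(track):
    """Cumulative turning angle, in degrees, of a 3D track projected onto
    its best-fitting plane (hypothetical helper, not the thesis code)."""
    track = np.asarray(track, dtype=float)
    centered = track - track.mean(axis=0)

    # (a) SVD of the centered track: centered = U * diag(d) * Vt
    U, d, Vt = np.linalg.svd(centered, full_matrices=False)

    # (b) 2D coordinates on the principal plane
    px = d[0] * U[:, 0]
    py = d[1] * U[:, 1]

    # (c) signed turning angle between each segment and the previous one
    seg = np.column_stack((np.diff(px), np.diff(py)))
    cross = seg[:-1, 0] * seg[1:, 1] - seg[:-1, 1] * seg[1:, 0]
    dot = (seg[:-1] * seg[1:]).sum(axis=1)
    turns = np.arctan2(cross, dot)

    # The sign depends on the arbitrary orientation of U, so report |sum|.
    return np.degrees(np.abs(turns.sum()))


def is_winding(track, threshold_deg=400.0):
    # (d) threshold taken from the text; treat it as a tunable parameter
    return winding_angle(track) > threshold_deg
```

\begin_layout Standard
Because the turning angles are signed, an S-shaped track largely cancels
 out, whereas a spiral accumulates angle in one direction.
\end_layout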

\begin_layout Standard
Winding tracks can be dangerous when we merge clusters because they could
 be close to many different clusters of different shape simultaneously.
 We found that winding tracks often form bundles with many similar tracks.
 As these are usually long tracks, they will not be removed by filters which
 remove short tracks.
 In Fig.
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:erroneous_tracks"

\end_inset

 we show an example where 
\begin_inset Formula $161$
\end_inset

 erroneous bundles were automatically detected by our winding method.
 They all had total winding angle higher than 
\begin_inset Formula $500^{\circ}$
\end_inset

.
 To cluster the initial tractography (not shown here) we used QB with a
 threshold of 
\begin_inset Formula $10$
\end_inset


\begin_inset space ~
\end_inset

mm.
 To our knowledge, this is the first automatic system for detecting outliers
 and erroneous tracks in tractography data that is based on shape characteristics
 going beyond simple track length filtering.
 The ratio of winding tracks to the total number of tracks in a data set
 could serve as an indicator of the quality of that data set.
 
\end_layout

\begin_layout Standard
We can use QB with a low threshold to reduce the number of tracks while
 avoiding embedding winding tracks into otherwise ordinary clusters and
 then run the winding algorithm just on the exemplar tracks of the bundles
 rather than the entire tractography.
 
\end_layout

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Plain Layout
\align center
\begin_inset Graphics
	filename last_figures/winding.png
	scale 50

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset Caption

\begin_layout Plain Layout
Example of detecting a possibly erroneous 3D bundle (on the left) by projecting
 its exemplar track and computing the cumulative winding angle 
\begin_inset Formula $\sum_{i=0}^{N}\omega_{i}$
\end_inset

 on the 2D plane as shown on the right, where 
\begin_inset Formula $N$
\end_inset

 is the total number of track segments.
 Usually bundles with total angle higher than 
\begin_inset Formula $400^{\circ}$
\end_inset

 are removed from the data sets as most likely to be erroneous.
\end_layout

\end_inset


\begin_inset CommandInset label
LatexCommand label
name "Flo:winding"

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Standard
QB can also simplify detection of tracks which are very dissimilar to others
 and therefore very distant from all other clusters.
 Usually, when we use a QB threshold of about 
\change_inserted 3 1319698782

\begin_inset Formula $10$
\end_inset


\change_unchanged
 mm, such tracks end up in small bundles containing only a few tracks, and
 the distance of these bundles from all other bundles will be
 much higher than average.
 This gives us another detection method for outliers.
 We could, for example, find which bundles are most distant from all other
 bundles and remove them from the data sets.
\end_layout
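
\begin_layout Standard
As an illustration of this idea, the following sketch flags bundles whose
 nearest neighbouring bundle is unusually far away.
 It assumes that a square matrix of distances between bundle representatives
 is already available; the function name isolated_bundles and the cutoff
 rule (mean plus a multiple of the standard deviation) are assumptions made
 for the example, not part of the method described in the text.
\end_layout

```python
import numpy as np


def isolated_bundles(D, n_std=2.0):
    """Indices of bundles whose nearest neighbour is much farther away
    than average (hypothetical helper; the cutoff rule is an assumption)."""
    D = np.asarray(D, dtype=float).copy()
    np.fill_diagonal(D, np.inf)       # ignore the distance of a bundle to itself
    nearest = D.min(axis=1)           # distance to the closest other bundle
    cutoff = nearest.mean() + n_std * nearest.std()
    return np.flatnonzero(nearest > cutoff)
```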

\begin_layout Standard
Finally, QB can be used to remove small or broken tracks interactively.
 For example, in Fig.
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:cst_pbc"

\end_inset

 the large red bundle was merged by an expert; with QB we can then extract
 the skeleton of the bundle and see which parts create that structure.
 Without QB it would be very difficult to work out that this bundle consists
 of many small or divergent parts.
 In this figure, strongly diverging, small and broken tracks can all be
 identified after the simplification provided by QB.
 
\end_layout

\begin_layout Standard
In summary, we have shown that QB can facilitate a fully automatic, efficient
 and robust detection system for erroneous tracks in specific bundles or
 entire tractographies.
\end_layout

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Plain Layout
\align center
\begin_inset Graphics
	filename last_figures/erroneous_tracks.png
	lyxscale 30
	scale 65
	rotateOrigin center

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset Caption

\begin_layout Plain Layout
Example with erroneous tracks detected on real data sets.
 Left: the erroneous bundles in their exact position in the data set, viewed
 from the top of the head.
 Middle: the same bundles in a sagittal view.
 Right: the area surrounded by the red box in the middle panel, slightly
 rotated and zoomed.
 Colour encodes the bundle label.
\end_layout

\end_inset


\begin_inset CommandInset label
LatexCommand label
name "Flo:erroneous_tracks"

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Subsubsection
Alignments, landmarks and atlases
\begin_inset CommandInset label
LatexCommand label
name "sub:Atlases-made-easy"

\end_inset


\end_layout

\begin_layout Standard
We have used QB to construct a robust tractographic atlas in MNI space from
 10 subjects' data sets.
 Here we explain the steps we used to achieve that.
\end_layout

\begin_layout Standard

\series bold
Alignment
\series default
.
 Tractographies were created using EuDX as described in section 
\begin_inset CommandInset ref
LatexCommand ref
reference "sub:QB-Data-sets"

\end_inset

 (see section 
\begin_inset CommandInset ref
LatexCommand ref
reference "sub:Acquisition-sequences-in-use"

\end_inset

 for acquisition details).
 The tractographies for all subjects were initially in native space and
 the goal was to warp them into MNI space using nonlinear registration.
 
\end_layout

\begin_layout Standard
Because the registration of tractographies is generally considered a difficult
 problem with a non-unique solution, we wanted to make sure we were using
 a known, well established and robust method.
 We therefore chose to use 
\begin_inset Formula $\texttt{fnirt}$
\end_inset

 with the same parameters as used in the first steps of TBSS
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "Smith2006NeuroImage"

\end_inset

.
 For that reason, FA volumes were generated from the same data sets using
 Tensor fitting with weighted least squares after skull stripping with 
\begin_inset Formula $\texttt{bet}$
\end_inset

 and parameters 
\begin_inset Formula $\texttt{-F -f .2 -g 0}$
\end_inset

.
 These FA volumes were again in native space; therefore we needed to warp
 them into MNI space.
 For this purpose, a standard FA template (
\begin_inset Formula $\texttt{FMRIB58}$
\end_inset

) from the FSL toolbox was used as the reference volume.
 However, we primarily wanted the displacements which perform a point-wise
 mapping from native space to MNI space, and we found this to be technically
 very difficult with the FSL tools, as they assume that these displacements
 will be applied only to volumetric data and not to point data such as those
 used in tractographies.
 Finally, after considerable effort, we found a combination of 
\change_inserted 1 1319032651

\begin_inset Formula $\texttt{flirt}$
\end_inset


\change_unchanged
, 
\begin_inset Formula $\texttt{invwarp}$
\end_inset

, 
\begin_inset Formula $\texttt{fnirtfileutils}$
\end_inset

 and 
\change_inserted 1 1319032687

\begin_inset Formula $\texttt{fnirtfileutils -withaff}$
\end_inset


\change_unchanged
 which gave us the correct displacements.
 The code is available in the module 
\begin_inset Formula $\texttt{dipy.external.fsl}$
\end_inset

.
 It is also important to note that we did not use eddy correction with any
 data sets of this type.
 Eddy correction is unstable with volumes at high b-values because there
 is not enough signal for guiding a correct registration with the other
 volumes at lower b-values.
\begin_inset Note Note
status collapsed

\begin_layout Plain Layout
ELEF: Perhaps I give to much technical information here? Any ideas?
\end_layout

\end_inset

 It is like trying to match two figures that have no similarities at all.
 The matching will certainly be poor and error prone.
\end_layout

\begin_layout Standard
After creating the displacements for every subject, these were applied to
 all tractographies in native space so that they were mapped into MNI space
 with voxel size 
\begin_inset Formula $1\times1\times1\,\textrm{mm}^{3}$
\end_inset

.
 Having all tractographies in MNI space is very useful because
 we can now compare them against available templates or against each other
 and calculate different statistics.
 However, this is not where we stop; we proceed to generate a tractographic
 atlas using QB clusterings.
\end_layout

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Plain Layout
\begin_inset ERT
status open

\begin_layout Plain Layout

[h!]
\end_layout

\end_inset


\end_layout

\begin_layout Plain Layout
\align center
\begin_inset Graphics
	filename last_figures/big_bundles_atlas.png
	lyxscale 50
	scale 50
	rotateOrigin center

\end_inset


\begin_inset Caption

\begin_layout Plain Layout
\begin_inset Formula $14,520$
\end_inset

 clusters created by joining the QB clusterings of 
\begin_inset Formula $10$
\end_inset

 subjects in MNI space.
 Most clusters had few tracks and only a few had many.
 The largest 
\begin_inset Formula $20\%$
\end_inset

 of the clusters contained more than 
\begin_inset Formula $90\%$
\end_inset

 of the total number of tracks.
 This indicates agreement between different subjects, which is useful for
 building a solid atlas in which the biggest bundles become landmark bundles
 and the small bundles are removed as outliers.
\begin_inset Note Note
status open

\begin_layout Plain Layout
MATTHEW: Not sure what largest bundles % is? Is the gradient at 100% then
 the number of tracks taken by the smallest bundles?
\lang british
 IAN: Will check it out.
 ELEF: largest-> those who have most of the tracks.
\end_layout

\end_inset


\end_layout

\end_inset


\begin_inset CommandInset label
LatexCommand label
name "Flo:atlas_big_bundles"

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Standard

\series bold
Tractographic Atlas.

\series default
 For all subjects: (a) load warped tractography, (b) downsample the tracks
 to have only 
\begin_inset Formula $12$
\end_inset

 points, (c) calculate and store QB clustering with a 
\begin_inset Formula $10$
\end_inset

 mm threshold, (d) merge all clusterings with 
\begin_inset Formula $10$
\end_inset

 mm threshold as explained in section 
\begin_inset CommandInset ref
LatexCommand ref
reference "sub:Parallel-version"

\end_inset

 (merging).
 When creating an atlas by merging many different subjects the most important
 issue is what you remove from the atlas as outliers.
 QB here provides a possible solution for this problem.
 If we plot the number of tracks for each cluster sorted in ascending order
 we can see an interesting pattern (see Fig.
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:atlas_big_bundles"

\end_inset

).
 In this diagram we observe that the largest 
\begin_inset Formula $20\%$
\end_inset

 of the clusters contained more than 
\begin_inset Formula $90\%$
\end_inset

 of the total number of tracks.
 This suggests that there is much agreement between the biggest bundles of
 different subjects.
 We will use this property to create a solid atlas in which we keep the
 biggest bundles (landmarks) and remove the smallest bundles (outliers).
\end_layout
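
\begin_layout Standard
The concentration of tracks in the largest clusters can be checked with
 a short calculation.
 The sketch below assumes only a list of cluster sizes (number of tracks
 per merged cluster); the function names share_in_largest and landmark_clusters
 are hypothetical and the numbers in the usage are made up for illustration.
\end_layout

```python
import numpy as np


def share_in_largest(cluster_sizes, fraction=0.2):
    """Fraction of all tracks held by the largest `fraction` of clusters."""
    sizes = np.sort(np.asarray(cluster_sizes))[::-1]   # descending order
    k = max(1, int(np.ceil(fraction * len(sizes))))
    return sizes[:k].sum() / sizes.sum()


def landmark_clusters(cluster_sizes, n_keep):
    """Indices of the n_keep largest clusters (candidate landmarks)."""
    return np.argsort(cluster_sizes)[::-1][:n_keep]


# Illustrative sizes only: two big clusters and eight very small ones.
example_sizes = [100, 80, 5, 3, 2, 2, 2, 2, 2, 2]
example_share = share_in_largest(example_sizes, 0.2)
```

\begin_layout Standard
Clusters that are not returned by the second function would be the candidates
 for removal as outliers.
\end_layout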

\begin_layout Standard

\series bold
Finding and Using Landmarks
\series default
.
 One can use this atlas or similar atlases created from more subjects in
 order to select specific structures and study these structures directly
 in different subjects without using any of the standard ROI based methods.
\end_layout

\begin_layout Standard
A simple example is given in Fig.
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:CloseToSelected"

\end_inset

.
 In the first row we see a tractographic atlas created by merging the QB
 clusterings of 
\begin_inset Formula $10$
\end_inset

 healthy subjects as described in the previous section.
 From these clusters, represented by their virtual tracks, we keep only the 
\begin_inset Formula $196$
\end_inset

 biggest clusters i.e.
 those which contain the highest number of tracks, so that we are sure there
 is enough agreement between the different tractographies.
 From these we pick, by way of example, 
\begin_inset Formula $19$
\end_inset

 virtual tracks which correspond to well known bundle structures in the
 literature: 
\begin_inset Formula $1$
\end_inset

 from Genu of Corpus Callosum (GCC), 
\begin_inset Formula $3$
\end_inset

 from the Body of Corpus Callosum (BCC), 
\begin_inset Formula $1$
\end_inset

 from the Splenium (SCC), 
\begin_inset Formula $1$
\end_inset

 from the Pons Cerebellar Peduncle (CP), 
\begin_inset Formula $1$
\end_inset

 from left Arcuate Fasciculus (ARC-L), 
\begin_inset Formula $1$
\end_inset

 from right Arcuate Fasciculus (ARC-R), 
\begin_inset Formula $1$
\end_inset

 from left Inferior Occipitofrontal Fasciculus (IFO-L), 
\begin_inset Formula $1$
\end_inset

 from right Inferior Occipitofrontal Fasciculus (IFO-R), 
\begin_inset Formula $1$
\end_inset

 from right Fornix (FX-R), 
\begin_inset Formula $1$
\end_inset

 from left Fornix (FX-L), 
\begin_inset Formula $1$
\end_inset

 from the Optic Radiation (OR), 
\begin_inset Formula $1$
\end_inset

 from left Cingulum (CGC-L), 
\begin_inset Formula $1$
\end_inset

 from right Cingulum (CGC-R), 
\begin_inset Formula $1$
\end_inset

 from left Corticospinal tract (CST-L), 
\begin_inset Formula $1$
\end_inset

 from right Corticospinal tract (CST-R), 
\begin_inset Formula $1$
\end_inset

 from left Uncinate (UNC-L) and 
\begin_inset Formula $1$
\end_inset

 from right Uncinate (UNC-R).
 These 
\begin_inset Formula $19$
\end_inset

 tracks are coloured randomly.
 On the second row we show, for the first 
\begin_inset Formula $6$
\end_inset

 of these selected representative tracks, the tracks closer than 
\begin_inset Formula $20$
\end_inset

 mm from 
\begin_inset Formula $3$
\end_inset

 arbitrarily selected subjects.
 Similarly, on the third row the tracks closer than 
\begin_inset Formula $15$
\end_inset

 mm to the next 
\begin_inset Formula $7$
\end_inset

 selected tracks.
 Finally, on the last row, we show the tracks from the same 
\begin_inset Formula $3$
\end_inset

 subjects which are closer than 
\begin_inset Formula $18$
\end_inset

 mm to the remaining selected tracks.
 The colours used for the selected tracks are automatically assigned from
 the colours of tracks picked from the atlas.
 We can see significant reliability and continuity both within and between
 subjects even though we have only selected a very small number of representativ
e tracks.
 Using a similar procedure we could create a book of bundles for every subject
 and then compare the subjects at the level of bundles.
 
\end_layout
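
\begin_layout Standard
Selecting the tracks of a subject that lie close to a landmark can be sketched
 as follows.
 The example assumes the MDF distance as defined earlier in this chapter
 (the minimum of the direct and flipped mean point-wise distances) and tracks
 that have already been downsampled to the same number of points.
 The function names mdf and tracks_close_to are hypothetical.
\end_layout

```python
import numpy as np


def mdf(s, t):
    """Minimum average direct-flip distance between two tracks that have
    the same number of points (arrays of shape (K, 3))."""
    s = np.asarray(s, dtype=float)
    t = np.asarray(t, dtype=float)
    direct = np.linalg.norm(s - t, axis=1).mean()
    flipped = np.linalg.norm(s - t[::-1], axis=1).mean()
    return min(direct, flipped)


def tracks_close_to(landmark, tracks, threshold_mm):
    """Indices of the tracks whose MDF distance to the landmark (a virtual
    track picked from the atlas) is below threshold_mm."""
    return [i for i, t in enumerate(tracks) if mdf(landmark, t) < threshold_mm]
```

\begin_layout Standard
Repeating this selection for every landmark and every subject would give
 one bundle per landmark and subject, which is the starting point for the
 book of bundles mentioned above.
\end_layout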

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Plain Layout
\begin_inset ERT
status open

\begin_layout Plain Layout

[th!]
\end_layout

\end_inset


\end_layout

\begin_layout Plain Layout
\align center
\begin_inset Graphics
	filename QB/Thesis/Fig_7_close_distance.png
	lyxscale 30
	scale 70

\end_inset


\begin_inset Caption

\begin_layout Plain Layout
A novel way to do comparisons between subjects.
 Correspondence between different subjects (last 
\begin_inset Formula $3$
\end_inset

 rows) and a few landmarks picked from the tractographic atlas generated
 by merging QB clusterings of 
\begin_inset Formula $10$
\end_inset

 subjects (top row).
 The fact that there is such a level of agreement and continuity on the last
 
\begin_inset Formula $3$
\end_inset

 rows from so few skeletal tracks offers a great prospect for implementing
 new robust ways of statistical comparisons using tractographic data sets.
\end_layout

\end_inset


\begin_inset CommandInset label
LatexCommand label
name "Flo:CloseToSelected"

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Subsubsection
QB as input to other learning methods
\end_layout

\begin_layout Standard
We found that QB is of great value as an adjunct to many less efficient
 algorithms e.g.
 hierarchical clustering, affinity propagation, nearest neighbours, spectral
 clustering and other unsupervised and supervised learning methods.
 We present here one example with QB as input to affinity propagation and
 one with QB as input to hierarchical clustering.
\begin_inset Note Note
status collapsed

\begin_layout Plain Layout
MATTHEW: I wonder if you can use QB as input to QB? I mean to try and find
 the 'global minimum' clustering - the 'mean' clustering from all possible
 clusterings with QB - independent therefore of starting point.
 This might be somehow an optimum solution, and would fit naturally with
 parallel solutions so the serial version would be the same as the parallel
 version.
 ELEF: Life is too short - After thesis! But yes you can.
\end_layout

\end_inset


\end_layout

\begin_layout Standard
Most clustering algorithms need to calculate all pairwise distances between
 tracks; meaning that for a medium sized tractography of 
\begin_inset Formula $250,000$
\end_inset

 tracks we would need 
\begin_inset Formula $232$
\end_inset

 GBytes of RAM with single floating point precision, which is not, and will
 not soon be, available in personal computers.
 A naive solution would be to use sparse matrices to approximate the distance
 matrix; however, tractographies are densely packed and produce very dense
 distance matrices.
 Therefore, this is not a viable solution.
 A straightforward solution to this problem is to use QB to first segment
 the tractography into small clusters, and then to apply other, higher
 complexity operations to the representatives (i.e.
 exemplar or virtual tracks) of these clusters in order to merge them into
 bigger clusters.
\begin_inset Note Note
status collapsed

\begin_layout Plain Layout
MATTHEW: I remember Fernando talking about large scale diffusion clustering
 methods being partially based on the ability to ignore long distances when
 clustering.
 I guess QB here gives you a reliable measure of a long distance so that
 you can zero out this entry in your connection matrix.
 ELEF: Tractographic data sets are not sparse, so standard sparse methods
  (which FP was referring to are not applicable).
\end_layout

\end_inset

 More precisely we propose to:
\end_layout

\begin_layout Enumerate
Cluster using QB as explained in section 
\begin_inset CommandInset ref
LatexCommand ref
reference "sub:Atlases-made-easy"

\end_inset

.
\end_layout

\begin_layout Enumerate
Gather virtual tracks.
\end_layout

\begin_layout Enumerate
Calculate the MDF distances between all pairs of virtual tracks.
\end_layout

\begin_layout Enumerate
Use any other clustering method to segment this much smaller distance matrix
 
\begin_inset Formula $D$
\end_inset

.
\end_layout
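
\begin_layout Standard
Steps 3 and 4 above can be sketched as follows, assuming the virtual tracks
 from step 2 are stored in one array of shape (n, K, 3).
 The function names mdf_matrix and single_linkage_labels are hypothetical.
 For step 4 the sketch uses the fact that cutting a single linkage dendrogram
 at a given height yields the connected components of the graph that links
 all pairs no farther apart than that height; it is a stand-in written with
 NumPy only and is not the hcluster implementation.
\end_layout

```python
import numpy as np


def mdf_matrix(A, B):
    """Pairwise MDF distances between two sets of tracks.

    A has shape (n, K, 3) and B has shape (m, K, 3); the result is (n, m).
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    direct = np.linalg.norm(A[:, None] - B[None], axis=-1).mean(axis=-1)
    flipped = np.linalg.norm(A[:, None] - B[None, :, ::-1], axis=-1).mean(axis=-1)
    return np.minimum(direct, flipped)


def single_linkage_labels(D, threshold):
    """Labels obtained by cutting single linkage clustering at `threshold`,
    computed as connected components with a small union-find."""
    n = D.shape[0]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    # Link every pair of virtual tracks that is close enough.
    for i, j in zip(*np.nonzero(np.triu(D <= threshold, k=1))):
        parent[find(i)] = find(j)

    roots = [find(i) for i in range(n)]
    _, labels = np.unique(roots, return_inverse=True)
    return labels
```

\begin_layout Standard
The same partition should also be obtainable from a library implementation,
 for example linkage with the single method followed by fcluster with the
 distance criterion in scipy.cluster.hierarchy.
\end_layout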

\begin_layout Standard
In Fig.
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:LSC+HC+AP"

\end_inset

 in the left panel we show a result where we used hierarchical clustering
 with single linkage for step (4) with a threshold of 
\begin_inset Formula $20$
\end_inset

 mm using the package 
\begin_inset Formula $\texttt{hcluster}$
\end_inset

 
\begin_inset CommandInset citation
LatexCommand cite
key "eads-hcluster-software"

\end_inset

.
 A known drawback of single linkage is the so-called chaining phenomenon:
 clusters may be brought together due to single elements being close to
 each other, even though many of the elements in each cluster may be very
 distant to each other.
 Chaining is usually considered a disadvantage as it is driven by local
 neighbours.
 Nevertheless, we can use this property to cluster the corpus callosum (CC)
 all together (shown in dark red at the top left of Fig.
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:LSC+HC+AP"

\end_inset

) creating a fully automatic CC detection system.
 
\begin_inset Note Note
status open

\begin_layout Plain Layout
MATTHEW: How reliable is this do you think? ELEF: No further comment at
 present - though it is based on a group of 10 subjects.
\end_layout

\end_inset

 Furthermore, we can use different cutting thresholds on the underlying
 dendrogram to amalgamate together different structures e.g.
 see the cingulum bundles in the same panel.
\end_layout

\begin_layout Standard
In the right panel of Fig.
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:LSC+HC+AP"

\end_inset

 we see the implementation of step (4) using a more recent algorithm: affinity
 propagation (AP) 
\begin_inset CommandInset citation
LatexCommand cite
key "dueck2009affinity"

\end_inset

, which we and 
\begin_inset CommandInset citation
LatexCommand cite
key "malcolm2009filtered"

\end_inset

 had earlier found impossible to use for group analysis or for clustering
 entire tractographies of many thousands of tracks.
 A small outline of how this algorithm works is given in section 
\begin_inset CommandInset ref
LatexCommand ref
reference "sub:Affinity-Propagation"

\end_inset

.
 In the bottom right panel of Fig.
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:LSC+HC+AP"

\end_inset

, we observe how nicely AP, after the simplification provided by QB, has
 clustered Arcuate, Longitudinal Occipitofrontal Fasciculus and other structures
 known from the literature.
 The input of AP was the negative distance matrix 
\begin_inset Formula $-D$
\end_inset

, the preference weights were set to matrix 
\begin_inset Formula $\mathtt{median}(-D)$
\end_inset

 and the hierarchical clustering parameter was set to 
\begin_inset Formula $20$
\end_inset


\begin_inset space ~
\end_inset

mm.
\end_layout

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Plain Layout
\begin_inset ERT
status open

\begin_layout Plain Layout

[th!]
\end_layout

\end_inset


\end_layout

\begin_layout Plain Layout
\align center
\begin_inset Graphics
	filename last_figures/LSC_with_others.png
	lyxscale 30
	scale 70

\end_inset


\begin_inset Caption

\begin_layout Plain Layout
Two examples where QB output is used to cluster an entire set of 
\begin_inset Formula $10$
\end_inset

 tractographies together and then the result is given as input to hierarchical
 clustering (HC) using single linkage on the left and to affinity propagation
 (AP) on the right.
 Colours encode cluster labels.
 On the left side we see 
\begin_inset Formula $19$
\end_inset

 clusters and on the right 
\begin_inset Formula $23$
\end_inset

.
 QB significantly facilitates the operation of the other two algorithms,
 which would not be able to cluster the entire data sets on current computers.
 Note the top left panel, where QB+HC has managed to cluster
 the entire CC as one bundle.
\end_layout

\end_inset


\begin_inset CommandInset label
LatexCommand label
name "Flo:LSC+HC+AP"

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Standard
For the hierarchical clustering we used the software 
\begin_inset Formula $\texttt{hcluster}$
\end_inset

 and for affinity propagation we used the library 
\begin_inset Formula $\texttt{scikit-learn}$
\end_inset

.
 They are both implemented in 
\begin_inset Formula $\texttt{Python}$
\end_inset

.
\end_layout

\begin_layout Subsubsection
Exemplars vs ROIs vs Masks
\end_layout

\begin_layout Standard
Medical practitioners and neuroanatomists often argue that when they use
 multiple spherical or rectangular masks to select some bundles many tracks
 are thrown away because they are small and the mask operations cannot get
 hold of them.
 Our method provides a solution to this problem as it can identify broken
 or smaller bundles inside other bigger bundles which are otherwise very
 difficult or even sometimes impossible to identify visually or with the
 use of masks.
 It offers an efficient and robust approach to unsupervised clustering of
 tractographies and facilitates tractography exploration and interpretation.
 One can now use exemplar tracks as access points into the full tractography
 and with a single click on that exemplar track obtain the entire bundle.
 Therefore, a super-bundle can be created just with a few clicks, based
 on a selection from exemplar tracks.
 
\end_layout

\begin_layout Standard
In order to create this system we implemented a 3D visualization and interaction
 system for tractographies based on QB in Python and OpenGL.
 This project is available online at 
\begin_inset Formula $\texttt{fos.me}$
\end_inset

.
\end_layout

\begin_layout Subsection
Direct Tractography Registration
\end_layout

\begin_layout Standard
Direct tractography registration is a recently described problem with only
 a small number of publications, and as far as we know there are no publicly
 available solutions.
 By direct registration we mean that no other information apart from the
 tractographies themselves is used to guide the registration.
 This is in contrast to the previous sections where we used FA registration
 mappings applied to tractographies (see section 
\begin_inset CommandInset ref
LatexCommand ref
reference "sub:Atlases-made-easy"

\end_inset

) which is also most commonly used in the literature along with other Tensor
 based methods
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "goh2006algebraic"

\end_inset

.
 
\end_layout

\begin_layout Standard
The methodologies described so far on this subject are as follows.
 Leemans et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "leemans2006multiscale"

\end_inset

 used the invariance of curvature and torsion under rigid registration along
 with Procrustes analysis to co-register different tractographies.
 Mayer et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "mayer2008bundles"

\end_inset

 used iterative closest point to register pre-selected bundles (bundles
 of interest, BOI), 
\begin_inset CommandInset citation
LatexCommand cite
key "mayerdirect"

\end_inset

 and extended it using probabilistic boosting tree classifiers for bundle
 segmentation in 
\begin_inset CommandInset citation
LatexCommand cite
key "mayer2011supervised"

\end_inset

.
 Durrleman et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "durrleman2010registration"

\end_inset

 reformulated the tracks as currents and implemented a currents based registrati
on.
 Zvitia et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "zvitia2008adaptive"

\end_inset

, 
\begin_inset CommandInset citation
LatexCommand cite
key "Zvitia2010"

\end_inset

, used adaptive mean shift clustering to extract a number of representative
 fibre modes.
 Each fibre mode was assigned to a multivariate Gaussian distribution according
 to its population thereby leading to a Gaussian Mixture model (GMM) representat
ion for the entire set of fibres.
 The registration between two fibre sets was treated as the alignment of
 two GMMs and was performed by maximizing their correlation ratio.
 A further refinement was added using RANSAC
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "fischler1981random"

\end_inset

 to obtain all 
\begin_inset Formula $12$
\end_inset

 affine parameters.
 Ziyan et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "ZiyanMICCAI07"

\end_inset

 developed a nonlinear registration algorithm based on the log-Euclidean
 polyaffine framework
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "Arsigny2009"

\end_inset

.
 However, we do not classify this approach as a direct tractography registration
 algorithm, as the authors first created scalar volumes from the tracks
 and then warped the volumes.
 Therefore, they did not register the tracks directly in their own space.
 
\end_layout

\begin_layout Standard
We now describe our algorithm and show that it is efficient and simple to
 use.
 In addition, it is completely automatic and provides a robust direct rigid
 tractography registration in seconds.
 This algorithm could be of great use when comparing healthy versus severely
 diseased brains, e.g.
 stroke or vegetative state patients, where non-rigid registration is not
 recommended because of severe asymmetries in the diseased brains.
 The algorithm relies on the ability of QB to find good representative
 descriptors.
\end_layout

\begin_layout Standard
Here we describe a simple algorithm where 
\begin_inset Formula $2$
\end_inset

 tractographies 
\begin_inset Formula $T_{A}$
\end_inset

 and 
\begin_inset Formula $T_{B}$
\end_inset

 are brought into alignment in native space.
 The main steps of this approach are:
\end_layout

\begin_layout Enumerate
All tracks shorter than 
\begin_inset Formula $100$
\end_inset

 mm or longer than 
\begin_inset Formula $300$
\end_inset

 mm are removed from the data sets.
 This reduces the size of the tractography to about 
\begin_inset Formula $1/4$
\end_inset

 of its initial size (
\begin_inset Formula $\sim200,000$
\end_inset

 tracks).
 (This filtering may have different effects depending on brain size.
 We have not investigated this question at present.)
\end_layout

\begin_layout Enumerate
Both tractographies are equidistantly downsampled so every track contains
 only 
\begin_inset Formula $12$
\end_inset

 points.
 
\end_layout

\begin_layout Enumerate
We run QB with distance threshold at 
\begin_inset Formula $10$
\end_inset

 mm for both tractographies.
\end_layout

\begin_layout Enumerate
Collect all exemplar tracks from clusters containing more than 
\begin_inset Formula $0.2\%$
\end_inset

 of total number of tracks.
 Let us assume we have these in 
\begin_inset Formula $E_{A}$
\end_inset

 and 
\begin_inset Formula $E_{B}$
\end_inset

.
\end_layout

\begin_layout Enumerate
Calculate all pairwise distances 
\begin_inset Formula $D=\mathtt{MDF}(E_{A},E_{B})$
\end_inset

 and save them in rectangular matrix 
\begin_inset Formula $D$
\end_inset

.
 
\end_layout

\begin_layout Enumerate
Define a cost function for the optimizer: the symmetric
 minimum distance 
\begin_inset Formula $\mathrm{SMD}=\sum_{i}\min_{j}D(i,j)+\sum_{j}\min_{i}D(i,j)$
\end_inset

.
\end_layout

\begin_layout Enumerate
Use modified Powell's method 
\begin_inset CommandInset citation
LatexCommand cite
key "fletcher1987practical"

\end_inset

 to minimize 
\begin_inset Formula $\mathrm{SMD}$
\end_inset

 over rigid rotations of 
\begin_inset Formula $E_{B}$
\end_inset

 starting with zeroed initial conditions.
 At each iteration of the optimization, 
\begin_inset Formula $E_{B}$
\end_inset

 will be transformed by a rigid rotation and 
\change_inserted 3 1319745183

\begin_inset Formula $\mathrm{SMD}$
\end_inset


\change_unchanged
 will be recalculated.
 To ensure smooth rotations we use the Rodrigues rotation formula.
\end_layout
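
\begin_layout Standard
As a sketch of the rotation used in the last step, and assuming (this parameterization is an illustrative assumption, not a detail given above) that each rotation is encoded by a unit axis 
\begin_inset Formula $\mathbf{k}=(k_{x},k_{y},k_{z})$
\end_inset

 and an angle 
\begin_inset Formula $\varphi$
\end_inset

, the Rodrigues rotation formula can be written as 
\begin_inset Formula 
\[
R=I+\sin\varphi\, K+(1-\cos\varphi)K^{2},\qquad K=\begin{pmatrix}0 & -k_{z} & k_{y}\\
k_{z} & 0 & -k_{x}\\
-k_{y} & k_{x} & 0
\end{pmatrix}.
\]

\end_inset

With zeroed initial conditions, 
\begin_inset Formula $\varphi=0$
\end_inset

 gives 
\begin_inset Formula $R=I$
\end_inset

, so under this assumption the optimization starts from the untransformed 
\begin_inset Formula $E_{B}$
\end_inset

, and 
\begin_inset Formula $R$
\end_inset

 varies smoothly with the parameters.
\end_layout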

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Plain Layout
\begin_inset ERT
status open

\begin_layout Plain Layout

[th!]
\end_layout

\end_inset


\end_layout

\begin_layout Plain Layout
\noindent
\align center
\begin_inset Graphics
	filename QB/Thesis/Fig_9_QB_registration2_only_landscape.png
	lyxscale 30
	scale 140

\end_inset


\begin_inset Caption

\begin_layout Plain Layout
Two tractographies from different subjects before (left) and after rigid
 registration (right) using our method.
\end_layout

\end_inset


\begin_inset CommandInset label
LatexCommand label
name "Flo:direct_registration2"

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Standard
In Fig.
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:direct_registration2"

\end_inset

 we see the result of this algorithm applied to two tractographies 
\begin_inset ERT
status open

\begin_layout Plain Layout

--
\end_layout

\end_inset

 represented with their exemplar tracks 
\begin_inset ERT
status open

\begin_layout Plain Layout

--
\end_layout

\end_inset

 depicted with orange and purple.
 We can see in the left panel that the orange tractography is misaligned
 with respect to the purple one, and in the right panel we see their improved
 alignment after applying our algorithm.
\end_layout

\begin_layout Standard

\series bold
Metric
\series default
.
 SMD is proposed here for registration of trajectory data sets, but one
 could equally use mutual information 
\begin_inset CommandInset citation
LatexCommand cite
key "maes1997multimodality"

\end_inset

 or the correlation ratio 
\begin_inset CommandInset citation
LatexCommand cite
key "roche1998correlation"

\end_inset

 for registration of volumetric data sets.
 Nonetheless, the advantage of SMD is that it comes from robust landmarks
 generated by QB which bring together local and global components.
 Initially, it was not clear if we should use SMD or just the sum of all
 distances 
\begin_inset Formula $\mathrm{SD}=\sum_{i,j}D(i,j)$
\end_inset

.
 Therefore, we performed an experiment to validate the smoothness and convexity
 of these two cost functions.
 We plotted both functions under a single-axis translation or a single-angle
 rotation of the same tractography as shown in Fig.
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:direct_registration"

\end_inset

.
 From these two diagrams we can see that, although only the SD was entirely
 convex under translations, under rotations the SD had stronger local minima,
 which is an undesirable property for registration.
 Furthermore, the SMD had steeper gradients towards the global minimum, which
 is a positive indicator for faster convergence.
 
\end_layout
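
\begin_layout Standard
To make the difference between the two cost functions concrete, consider a hypothetical 
\begin_inset Formula $2\times2$
\end_inset

 distance matrix whose values are invented purely for illustration: 
\begin_inset Formula 
\begin{align*}
D & =\begin{pmatrix}1 & 5\\
4 & 2
\end{pmatrix},\\
\mathrm{SMD} & =\underbrace{(1+2)}_{\sum_{i}\min_{j}D(i,j)}+\underbrace{(1+2)}_{\sum_{j}\min_{i}D(i,j)}=6,\\
\mathrm{SD} & =1+5+4+2=12.
\end{align*}

\end_inset

In this toy case the SMD counts only the distance from each exemplar to its nearest neighbour in the other set, whereas the SD also accumulates the distances between non-corresponding tracks (here 
\begin_inset Formula $5$
\end_inset

 and 
\begin_inset Formula $4$
\end_inset

).
\end_layout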

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Plain Layout
\begin_inset ERT
status open

\begin_layout Plain Layout

[th!]
\end_layout

\end_inset


\end_layout

\begin_layout Plain Layout
\align center
\begin_inset Graphics
	filename last_figures/metrics.png
	lyxscale 30
	scale 120

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset Caption

\begin_layout Plain Layout
Left: The metric 
\begin_inset Formula $\mathrm{SMD}$
\end_inset

 that we chose to optimize for two copies of the same tractography with
 the second copy translated (above) and rotated (below).
 This metric appears to be smooth with a single global minimum and is only
 slightly non-convex with small local minima.
 Right: Another possible candidate metric was the 
\begin_inset Formula $\mathrm{SD}$
\end_inset

.
 Although more convex on translations it had stronger local minima with
 rotations.
\end_layout

\end_inset


\begin_inset CommandInset label
LatexCommand label
name "Flo:direct_registration"

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Standard

\series bold
Experiments
\series default
.
 The first large scale experiment took place using the same tractography
 of a single individual copied and transformed 
\change_inserted 3 1319745405

\begin_inset Formula $1,000$
\end_inset


\change_unchanged
 times, with all three rotation angles ranging from 
\begin_inset Formula $-45^{\circ}$
\end_inset

 to 
\begin_inset Formula $45^{\circ}$
\end_inset

 and all x,
\begin_inset space ~
\end_inset

y,
\begin_inset space ~
\end_inset

z translations from 
\begin_inset Formula $-113$
\end_inset

 to 
\begin_inset Formula $113$
\end_inset

 mm.
 Then we registered all transformed tractographies to the static one and
 calculated all pairwise MDF distances storing them in a square matrix 
\begin_inset Formula $D$
\end_inset

.
 We would expect that if the registration was correct then the sum of all
 diagonal elements of 
\begin_inset Formula $D$
\end_inset

 would be close to 
\begin_inset Formula $0$
\end_inset

.
 This was confirmed for both cost functions, SD and SMD, which came close
 to zero 
\begin_inset Formula $99.8\%$
\end_inset

 of the time; however, SMD was always closer to perfect alignment than SD,
 having precision of more than 
\begin_inset Formula $7$
\end_inset

 decimals.
 Consequently, we chose SMD as the better cost function for direct tractography
 registration.
\begin_inset Note Note
status open

\begin_layout Plain Layout
MATTHEW: I seem to remember that registrations perform a bit differently
 when registering to transformed versions of themselves.
 I haven't got internet, but look for the AIR website (Roger Woods) and
 a comment about registrations to rotated versions of self.
 ELEF: This is just *rigid* registration we tried here.
 Do you think that this is truly an issue? We need to write a full paper
 for direct registration using QB.
\end_layout

\end_inset


\end_layout

\begin_layout Standard
We used GQI-based tractographies from 
\begin_inset Formula $10$
\end_inset

 subjects and we registered all combinations of pairs 
\begin_inset Formula $\binom{10}{2}=45$
\end_inset

.
 Comparing different tractographies is not a trivial problem; however, we
 can use the bundle adjacency (BA) metric explained in section 
\begin_inset CommandInset ref
LatexCommand ref
reference "sub:Tightness-comparisons-1"

\end_inset

.
 The mean initial BA was 
\begin_inset Formula $34.8\%\pm8.0\%$
\end_inset

 and the mean final BA after applying our direct registration method was
 
\begin_inset Formula $48.1\%\pm6.1\%$
\end_inset

.
 This was a statistically highly significant improvement (
\begin_inset Formula $t_{\text{\textrm{paired}}}(44)=11.2$
\end_inset

,
\begin_inset space ~
\end_inset


\begin_inset Formula $p\leq10^{-13}$
\end_inset

).
 In future work we plan to compare this registration method against other
 standard methods from the literature.
\begin_inset Note Note
status open

\begin_layout Plain Layout
MATTHEW: Isn't the metric used in the registration related to the metric
 used to assess the registration? Maybe you could some measure of alignment
 of the structural image instead? I mean, transform the structural image
 with the same parameters.
 ELEF: There isn't time!! This is just the first result of using QB for
 registration.
 To develop a full registration algorithm a new chapter or paper is needed.
 
\end_layout

\end_inset


\end_layout
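
\begin_layout Standard
As a rough consistency check on these figures, assume the standard paired 
\begin_inset Formula $t$
\end_inset

 statistic with 
\begin_inset Formula $n=45$
\end_inset

 pairs, mean difference 
\begin_inset Formula $\bar{d}$
\end_inset

 and standard deviation of the differences 
\begin_inset Formula $s_{d}$
\end_inset

.
 The value of 
\begin_inset Formula $s_{d}$
\end_inset

 below is implied by the reported numbers, up to rounding, and was not measured directly: 
\begin_inset Formula 
\begin{align*}
\bar{d} & =48.1\%-34.8\%=13.3\%,\\
t & =\frac{\bar{d}}{s_{d}/\sqrt{n}}\quad\Rightarrow\quad s_{d}=\frac{\bar{d}\sqrt{n}}{t}\approx\frac{13.3\%\times6.71}{11.2}\approx8.0\%.
\end{align*}

\end_inset

The degrees of freedom are 
\begin_inset Formula $n-1=44$
\end_inset

, as reported.
\end_layout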

\begin_layout Subsection
Bundle Quality Control
\end_layout

\begin_layout Standard
In many parts of this document we did not consider short tracks.
 That is perfectly valid because (a) the longer tracks are more likely to
 be used as useful landmarks when comparing or registering different subjects
 because it is more likely for them to exist in most subjects, (b) removing
 short tracks facilitates the use of distance-based clustering (no need
 to set the distance threshold manually) and interaction with the tractography,
 and (c) a user would typically want to see the overall representation of the
 tractography first and go into the details later.
 Nonetheless, after having clustered the longer tracks there are many ways
 to assign the smaller bundles to their closest longer bundles.
 For this purpose, we recommend using a distance other than MDF, for example
 the minimum version of MAM, referred to as 
\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\noun off
\color none

\begin_inset Formula $\textrm{MAM}_{\textrm{min}}$
\end_inset


\family default
\series default
\shape default
\size default
\emph default
\bar default
\noun default
\color inherit
 (see Eq.
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "eq:min_average_distance"

\end_inset

).
 
\end_layout

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Plain Layout
\begin_inset ERT
status open

\begin_layout Plain Layout

[ht!]
\end_layout

\end_inset


\end_layout

\begin_layout Plain Layout
\align center
\begin_inset Graphics
	filename last_figures/arcuate_small_fibers.png
	lyxscale 50
	scale 70
	rotateOrigin center

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset Caption

\begin_layout Plain Layout
A simple and vigorous strategy for handling short and long tracks together
 by picking a track of interest from one of our atlases.
 Colourmap encodes track length.
 A: one selected atlas track, B: 
\begin_inset Formula $245$
\end_inset

 subject tracks closer than 
\begin_inset Formula $15$
\end_inset

 mm (MDF distance), C: B tracks clustered in 
\begin_inset Formula $23$
\end_inset

 virtuals, D: 
\begin_inset Formula $3,421$
\end_inset

 tracks closer than 
\begin_inset Formula $6$
\end_inset

 mm (MAM distance) from the representatives of B are shown.
 A great number of short tracks have been brought together along with the
 tracks in B.
 In that way we managed to bring together an entire bundle consisting both
 of long and short fibres by just selecting one track.
\end_layout

\end_inset


\begin_inset CommandInset label
LatexCommand label
name "Flo:arcuate_close"

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Standard
Two simple strategies for clustering short fibres are discussed below.
 The first is for unsupervised clustering and the second is for supervised
 learning.
\end_layout

\begin_layout Standard
1.
 Cluster the long tracks using QB with distance threshold at 
\begin_inset Formula $10$
\end_inset

 mm and then cluster the short tracks (<
\begin_inset Formula $100$
\end_inset

 mm) with a lower threshold and assign them to their closest long track bundle
 from the first clustering using the 
\begin_inset Formula $\mathrm{MA}\mathrm{M}_{\mathrm{min}}$
\end_inset

 distance.
\begin_inset Note Note
status collapsed

\begin_layout Plain Layout
MATTHEW: I could imagine a short track crossing a large track at right angles
 and still being matched with the larger track.

\emph on
 
\emph default
ELEF: If there is a track like that we want to know about it!
\end_layout

\end_inset


\end_layout

\begin_layout Standard
2.
 Read the tractography of a single subject,
\change_deleted 3 1319745849
 
\change_unchanged
 use a tractographic atlas such as the one created in section 
\begin_inset CommandInset ref
LatexCommand ref
reference "sub:Atlases-made-easy"

\end_inset

 and pick one or more representative tracks from that atlas.
 Then, find the closest tracks from the subject to the selected tracks
 using MDF.
 Cluster the closest tracks found in the previous step and, for each one
 of these new skeletons, find the closest tracks using the 
\begin_inset Formula $\mathrm{MA}\mathrm{M}_{\mathrm{min}}$
\end_inset

 distance.
 We should now have an amalgamation of shorter and longer fibres in one
 cluster.
 
\end_layout

\begin_layout Standard
An example of this second strategy is shown in Fig.
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:arcuate_close"

\end_inset

.
 First, we selected a single track from the arcuate fasciculus.
 Next, we gathered all tracks closer than 
\begin_inset Formula $15$
\end_inset

 mm using the MDF distance.
 Then, we clustered these tracks into 
\begin_inset Formula $23$
\end_inset

 virtuals using QB with 
\begin_inset Formula $\theta=6.25$
\end_inset

 mm.
 Finally, we gathered all tracks within 
\begin_inset Formula $6$
\end_inset

 mm (
\begin_inset Formula $\mathrm{MA}\mathrm{M}_{\mathrm{min}}$
\end_inset

 distance) from the entire tractography.
 Using this simple strategy we were able to bring together from the entire
 data set and with minimum effort a bundle that consists of many shorter
 and longer tracks.
 
\end_layout

\begin_layout Subsection
Discussion and conclusion
\end_layout

\begin_layout Standard
In this chapter we presented a novel and powerful algorithm 
\begin_inset ERT
status open

\begin_layout Plain Layout

--
\end_layout

\end_inset

 QuickBundles (QB).
 This algorithm provides simplifications to the old problem of white matter
 anatomy packing which has recently attracted much scientific attention;
 it can also be used for any trajectory clustering problem and it is recommended
 when large data sets are involved.
 QB can be used with all types of diffusion MRI tractographies which generate
 streamlines (e.g.
 probabilistic or deterministic) and it is independent of the reconstruction
 model.
\end_layout

\begin_layout Standard
In common with mainstream clustering algorithms such as k-means, k-centers
 and expectation maximization (EM), QB is not a global clustering method.
 It can give different results under different initial conditions of the
 data set when there is no obvious distance threshold which can separate
 the clusters into meaningful bundles; for example we should expect different
 clusters under different permutations/orderings of the tracks in a densely
 packed tractography.
 However, we found that there is enough agreement even between two clusterings
 of the same tractography with different orderings.
 If the clusters are truly separable by distances then there is a global
 solution independent of orderings.
 This is often perceivable in smaller subsets of the initial tractography.
 We empirically found that this problem is minimized even with real data
 sets when a low distance threshold of about 
\begin_inset Formula $10\textrm{--}20$
\end_inset

 mm is used.
 
\end_layout

\begin_layout Standard
Furthermore, the output of QB can become input for another recent quick
 algorithm of quadratic time on average 
\begin_inset Formula $O(M^{2})$
\end_inset

 called affinity propagation where now 
\begin_inset Formula $M\ll N$
\end_inset

; therefore, the overall time stays linear in the number of tracks 
\begin_inset Formula $N$
\end_inset

.
 Other algorithms previously too slow to be used on the entire tractography
 can now be used efficiently too, e.g.
 kNN, hierarchical clustering and many others.
\end_layout
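
\begin_layout Standard
To spell out the cost of such a two-stage approach, take 
\begin_inset Formula $M$
\end_inset

 to be the number of clusters produced by QB (our reading of the text above) and suppose the second stage is quadratic in 
\begin_inset Formula $M$
\end_inset

.
 The total running time is then 
\begin_inset Formula 
\[
T(N,M)=O(N)+O(M^{2}),
\]

\end_inset

which stays 
\begin_inset Formula $O(N)$
\end_inset

 provided that 
\begin_inset Formula $M^{2}=O(N)$
\end_inset

, that is 
\begin_inset Formula $M=O(\sqrt{N})$
\end_inset

.
 As a purely hypothetical illustration (these numbers are not measurements), 
\begin_inset Formula $N=200,000$
\end_inset

 tracks reduced to 
\begin_inset Formula $M=400$
\end_inset

 clusters gives 
\begin_inset Formula $M^{2}=160,000<N$
\end_inset

.
\end_layout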

\begin_layout Standard
We saw that QB is a clustering method based on track distances,
 which runs on average in linear time 
\begin_inset Formula $O(N)$
\end_inset

 where 
\begin_inset Formula $N$
\end_inset

 is the number of tracks and with worst case 
\begin_inset Formula $O(N^{2})$
\end_inset

 when every track is a singleton cluster itself.
 QB is the fastest known tractography clustering method and even runs in
 real time on tractographies with fewer than 
\begin_inset Formula $20,000$
\end_inset

 tracks (depending on system CPU).
 We also showed that it uses a negligible amount of memory.
\end_layout

\begin_layout Standard
QB is fully automatic and very robust.
 It gives good agreement even between different subjects and can be used
 to create tractography atlases at high speed.
 Additionally, it can be used to explore multiple tractographies, find
 correspondences between tractographies, and create landmarks for registration
 or population comparisons.
 
\end_layout

\begin_layout Standard
QB can also be used to reduce the dimensionality of the data sets at the
 time of interaction, providing an alternative to ROIs through
 BOIs (bundles of interest) or TOIs (tracks of interest).
 We also showed it can be used to find 
\begin_inset Quotes eld
\end_inset

hidden
\begin_inset Quotes erd
\end_inset

 tracks not visible to the user at first glance.
 Therefore, QB opens the way to creating rapid tools for exploring tractographies
 of any size.
\begin_inset Note Note
status open

\begin_layout Plain Layout
Rotation, translation and scale invariant (check).
\end_layout

\begin_layout Plain Layout
Unlearned tracks will be added as new clusters as being very distant from
 all other clusters.
\end_layout

\begin_layout Plain Layout
Contains only one meaningful threshold i.e.
 distance threshold usually easily set in mm.
\end_layout

\begin_layout Plain Layout
Easy understand how it works when think of bundles as cylinders.
\end_layout

\begin_layout Plain Layout
Clusters hold the entire tractography information.
 Complete assignments - no fuzziness.
\end_layout

\end_inset


\end_layout

\begin_layout Standard
The main concept of this clustering method is that a cluster can be represented
 by virtual tracks which are used only during cluster comparisons and not
 updated at every iteration.
\end_layout

\begin_layout Standard
A virtual (centroid) track is the average of all tracks in the cluster.
 We call it virtual because it does not need to correspond to an actual track
 in the real data set, and to distinguish it from exemplar (medoid) tracks
 which are again descriptors of the cluster but are represented by actual
 tracks.
 
\begin_inset Note Note
status open

\begin_layout Plain Layout
MATTHEW: Could this explanation go earlier in the chapter?!!!!
\end_layout

\end_inset


\end_layout

\begin_layout Standard
The clustering creates a book of bundles/clusters which have easily obtainable
 descriptors.
 When clusters are held in a tree structure this permits upwards amalgamations
 to form bundles out of clusters, and downwards disaggregation to split
 clusters into finer sub-clusters corresponding to a lower distance threshold.
 However, we did not explore the hierarchical extension of this algorithm
 here and concentrated mostly on one-level amalgamations.
\end_layout

\begin_layout Standard
We worked mostly with long tracks but strategies for short tracks or bundles
 are straightforward and documented.
 We also showed an efficient method where QB can speed up finding erroneous
 bundles or detecting structures of specific characteristics.
\end_layout

\begin_layout Standard
We showed results with simulated data and with single and multiple real
 subjects.
 The code for QuickBundles is freely available at 
\begin_inset Formula $\texttt{dipy.org}$
\end_inset

 in module
\begin_inset Newline newline
\end_inset

 
\begin_inset Formula $\texttt{dipy.segment.quickbundles}$
\end_inset

.
\end_layout

\begin_layout Standard

\lang british
\begin_inset Note Note
status open

\begin_layout Plain Layout

\lang british
\begin_inset CommandInset bibtex
LatexCommand bibtex
bibfiles "diffusion"
options "ieeetr"

\end_inset


\end_layout

\end_inset


\end_layout

\end_body
\end_document