
\documentclass{article}

\usepackage[round]{natbib}
\usepackage[english]{babel}
\usepackage[letterpaper,top=2cm,bottom=2cm,left=3cm,right=3cm,marginparwidth=1.75cm]{geometry}

\usepackage{booktabs}
\usepackage{tabularx}
\usepackage{authblk}
\usepackage[symbol]{footmisc}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{graphicx}
\usepackage{makecell}
\usepackage{url}
% For rotating figures, tables, etc.
%  including their captions - only for supplementary figures
\usepackage{rotating}

% JK: turning this off for the moment as I keep clicking through on links
% to the bibliography while reading the text and it's intensely annoying.
% Can reinstate when we're ready to preprint
\usepackage[hidelinks]{hyperref}

% sets colours for notes to each other in the text
\usepackage{xcolor}
\newcommand{\sally}[1]{\textcolor{red}{#1}}
\newcommand{\red}[1]{\textcolor{red}{#1}}
\newcommand{\green}[1]{\textcolor{green}{#1}}
\newcommand{\blue}[1]{\textcolor{blue}{#1}}

\title{\vspace{-1.5em} \bf Towards Pandemic-Scale Ancestral Recombination Graphs of SARS-CoV-2}

\author[1]{Shing~H.~Zhan}
\author[2,3$\star$]{Anastasia~Ignatieva}
\author[1$\star$]{Yan~Wong}
\author[4]{Katherine~Eaton}
\author[1]{Benjamin~Jeffery}
\author[1]{Duncan~S.~Palmer}
\author[4]{Carmen~Lia~Murall}
\author[5]{Sarah~P.~Otto}
\author[1$\dagger$]{Jerome~Kelleher}
% \affil[1]{\normalsize Big Data Institute, Li Ka Shing Centre for Health
% Information and Discovery, University of Oxford, United Kingdom}
% \affil[2]{Department of Statistics, University of Oxford, United Kingdom}
% \affil[3]{School of Mathematics and Statistics, University of Glasgow, United Kingdom}
% \affil[4]{National Microbiology Laboratory, Public Health Agency of Canada, Canada}
% \affil[5]{Department of Zoology and Biodiversity Research Centre, University of British Columbia, Canada}
% \affil[$\star$]{Joint second author}
% \affil[$\dagger$]{Correspondence: jerome.kelleher@bdi.ox.ac.uk}
\affil[ ]{\mbox{}\vspace{-2.5em}}


\begin{document}
\maketitle

\setlength{\skip\footins}{1em}
\setlength{\footnotemargin}{0.5em}
\renewcommand{\thefootnote}{\arabic{footnote}}
\footnotetext[1]{Big Data Institute, Li Ka Shing Centre for Health
Information and Discovery, University of Oxford, United Kingdom}
\footnotetext[2]{Department of Statistics, University of Oxford, United Kingdom}
\footnotetext[3]{School of Mathematics and Statistics, University of Glasgow, United Kingdom}
\footnotetext[4]{National Microbiology Laboratory, Public Health Agency of Canada, Canada}
\footnotetext[5]{Department of Zoology and Biodiversity Research Centre, University of British Columbia, Canada}
\renewcommand{\thefootnote}{\fnsymbol{footnote}}
\footnotetext[1]{Joint second author}
\footnotetext[2]{Correspondence: jerome.kelleher@bdi.ox.ac.uk}

\vspace{-1em}
\begin{abstract}
Recombination is an ongoing and increasingly important feature of circulating
lineages of SARS-CoV-2, challenging how we represent the evolutionary history
of this virus and giving rise to new variants of potential public health
concern by combining transmission and immune evasion properties of different
lineages. Detection of new recombinant strains is challenging, with most
methods looking for breaks between sets of mutations that characterise distinct
lineages.
In addition, many basic approaches fundamental to the study of viral
evolution assume that recombination is negligible, in that a single
phylogenetic tree can represent the genetic ancestry of the
circulating strains. Here we present an initial version of
\texttt{sc2ts}, a method to automatically detect recombinants
in real time and to cohesively integrate them into a
genealogy in the form of an ancestral recombination graph (ARG),
which jointly records mutation, recombination and genetic
inheritance. We infer two ARGs under
different sampling strategies, and study their properties.
One contains 1.27 million sequences
sampled up to June 30, 2021, and the second is more sparsely sampled,
consisting of 657K sequences sampled up to June 30, 2022.
We find that both ARGs are
% better words?
highly consistent with
known features of SARS-CoV-2 evolution, recovering the basic
backbone phylogeny, mutational spectra, and recapitulating
details on the majority of known recombinant lineages.
Using the well-established and feature-rich \texttt{tskit} library,
the ARGs can also be stored concisely and processed efficiently
using standard Python tools. For example, the ARG for 1.27 million
sequences---encoding the inferred reticulate ancestry,
genetic variation, and extensive metadata---requires
58MB of storage,
and loads in less than a second.
%JK Not sure if we'd actually say it like this, but this is the
% basic gist I'd like to get across
The ability to fully integrate the effects of recombination into
downstream analyses, to quickly and automatically detect new recombinants,
and to utilise an efficient and convenient platform for computation
based on well-engineered technologies
makes \texttt{sc2ts} a promising approach.
\end{abstract}

\section{Introduction}
% Recombination is an important force in SARS-CoV-2, recombinants
% have arisen and they have spread to high frequencies
Recombination via template switching is a common feature
of the evolution of coronaviruses~\citep{Graham2010-xe,De_Klerk2022-tt},
including SARS-CoV-2
\citep{VanInsberghe2021-eu,Jackson2021-ik,Ignatieva2022-st}. By bringing
together mutations carried by different lineages, recombination plays an
important role in generating genetic diversity, with recombinant lineages
associated with adaptation to new host species and with the production of more
immune evasive variants~\citep{Graham2010-xe,De_Klerk2022-tt,Tamura2023-ab}.
Early in the COVID-19 pandemic, the levels of genetic diversity
were too low to enable the detection of distinctive recombinant strains.
By late 2020, however, the appearance and
spread of variants of concern (VoC), designated into classes such as Alpha and
Delta which harboured multiple characteristic mutations,
created the conditions required to detect
recombinant strains and their onward transmission~\citep{Jackson2021-ik}.
More recently, the high prevalence of Omicron,
with multiple co-circulating deeply divergent lineages (BA.1 to BA.5), has
accelerated the rate of coinfection and the potential for
recombination~\citep{Bal2022-hq}.
As of early 2023, multiple recombinant lineages have become
established and spread to high frequency, and accounting for recombinant
ancestry is now essential in understanding the ongoing evolution of SARS-CoV-2.

% Recombination is hard to detect
Detecting recombination in SARS-CoV-2 is difficult and identifying new
recombinant strains is a time-consuming, manual
process~\citep{Smith2023-identifying}.
Most genomic surveys for SARS-CoV-2 recombinants search for mosaic genomes that
combine specific subsets of characteristic mutations from different lineages
\citep[e.g.,][]{VanInsberghe2021-eu,Jackson2021-ik,Wertheim2022-hj,Sekizuka2022-xz}
and as a result can only identify inter-lineage recombination events.
\cite{Turakhia2022-it} presented a
phylogeny-based approach (``RIPPLES'') to identify putative recombinants among
over ten million SARS-CoV-2 genomes, without pre-specifying sets of characteristic
mutations.
RIPPLES starts from an existing phylogeny
(built assuming no recombination) and identifies candidate recombinants
by scanning for branches containing many mutations.
It then determines if these candidates would be better explained by
recombination by exhaustively breaking each
sequence into segments and attempting to find more parsimonious placements for
each segment on the phylogeny. If such placements are found, the sequence is
identified as a putative recombinant. Although it enables rapid searches for
genomic evidence of recombinants, RIPPLES relies on a SARS-CoV-2 phylogeny that
accounts for only mutations, treats recombinants \textit{post hoc}, and is an
incomplete representation of the reticulate evolutionary history of SARS-CoV-2.
As noted by the authors, a \textit{post hoc} treatment of recombination is
possible when recombinant lineages are rare and leave few descendants. However,
the proliferation of recombinant lineages is making this increasingly
untenable; for example, more than half of the sequences sampled in February 2023
are from the recombinant strain XBB and its descendants~\citep{Chen2022-pz}. This also means that
future evolution of SARS-CoV-2 is likely to involve multiple sequential
recombination events on top of existing recombinant lineages, creating a highly
reticulated genealogy.

% PARA 3: but you can't use a phylogeny when there's recomb. Need a joint
% model of mutation and recomb. ARGs, tskit, forward refs to ARG section for
% details.

It is well known that recombination distorts phylogenies~\citep{Schierup2000-fg}
and affects the results of downstream analyses, such as inference of
selection~\citep{Anisimova2003-vr}. Standard phylogenetic methods do
not account for recombination
\citep[e.g.,][]{Ronquist2012-zw,Minh2020-lr,Guindon2003-zd}, and there is
no standard method for incorporating the effects of recombination into
phylogenetic analyses.
Ancestral Recombination Graphs (ARGs) are a means of describing such
network-like ancestry~\citep{Griffiths1981-lw,Gusfield2014-qw}, but
until recently lacked software support and sufficiently scalable
inference methods to be of practical use.
However, approaches to infer ARGs now exist that can scale to tens of
thousands of human genomes and beyond
\citep{Speidel2019-yh,Kelleher2019-ba,Schaefer2021-yg,Zhang2023-lf}, dealing with levels of recombination far in excess of those seen in viral
phylogenies. The ``succinct tree sequence'' is an ARG data structure
which has led to significant computational advances across a range
of
applications~\citep{Kelleher2016-wk,Kelleher2018-xc,Kelleher2019-ba,Ralph2020-efficiently,
Wohns2022-th}, and the supporting \texttt{tskit} software library
is now widely used in population genetics applications.
The methods in \texttt{tskit} have been developed to support millions of
whole human genomes~\citep{Kelleher2019-ba}, and so it is particularly well suited
to representing large SARS-CoV-2 genealogies,
given that the GISAID database currently holds over 15 million
sequences~\citep{Shu2017-hp}.
See Section~\ref{sec:args} for more details on ARGs and the succinct
tree sequence data structure.

\begin{figure} \centering
\includegraphics[width=0.7\textwidth]{figures/overview_sc2ts.pdf}
\caption{\label{fig:overview_sc2ts}
A schematic of the \texttt{sc2ts} method.
The genetic relationships among SARS-CoV-2 genomes are
reconstructed by using the Li and Stephens
model to infer attachment paths for samples to an existing ARG (curved lines).
Each daily iteration involves three stages:
attachment of new samples to the growing ARG (A, B);
reconstruction of trees relating the samples under each attachment node (C);
and parsimony-based tree topology adjustments (D, E).
In the absence of  recombination, \texttt{sc2ts}
infers an ARG that is a single tree relating the samples (A).
When recombination is detected, \texttt{sc2ts} infers
an ARG that concisely encodes a sequence of local trees relating segments
of the sample genomes (B). Additionally,
mutation-collapsing nodes (D) and reversion-push nodes (E) are inserted to
make more parsimonious placements of mutations that should be shared or should
not be immediately reverted, respectively.}
\end{figure}

% [PARA 4: we present sc2ts based on these recent advances. Summarise
% results, with forward refs to Results sections.]
Here we present a preliminary version of \texttt{sc2ts}, a novel method
for inferring ARGs for SARS-CoV-2 at pandemic scale, in real time.
Building on the open-source \texttt{tskit} library, the method
explicitly reconstructs genealogies with both
mutation and recombination, which
can be conveniently and efficiently analysed using standard Python data science
tools. As illustrated in Figure~\ref{fig:overview_sc2ts}, inference proceeds by incrementally adding batches of sequences
grouped by collection date, with each batch processed in three phases.
First, possible paths connecting each sample to the current ARG
are inferred (allowing for recombination) using the
Li and Stephens (LS) model (Figure~\ref{fig:overview_sc2ts}A, B);
the LS ``copying process'' is a Hidden Markov Model (HMM)
approximating the effects of mutation
and recombination, widely used in large-scale genomics (Section~\ref{sec:ls}).
Then, since many samples
in a batch can share an attachment path,
we infer phylogenetic trees for each of these clusters separately using standard methods
(Figure~\ref{fig:overview_sc2ts}C; Section~\ref{sec:sample-cluster-tree-inference}).
Finally, we attach the trees for these sample clusters to the current
ARG and apply some parsimony-based heuristics to address issues
introduced by the inherent greediness of this strategy
(Figure~\ref{fig:overview_sc2ts}D, E; Section~\ref{sec:parsimony-heuristics}).
Using the current preliminary version of \texttt{sc2ts}, we infer two
large ARGs (with 1,265,685 and 657,239 samples, respectively) and study
the properties of these ARGs to illustrate the power of the method
and to inform subsequent development.
We find that these
ARGs accurately capture known phylogenetic
relationships (Section~\ref{sec:backbone_phylogeny}) and
mutational spectra (Section~\ref{sec:mutation_spectrum}),
and automatically identify  the majority of known recombinant
lineages (Sections~\ref{sec:jackson_recombs} and \ref{sec:pango_x_lineages})
with a high level of precision in the
genomic location of recombination breakpoints
(Section~\ref{sec:breakpoint_intervals})
and relationship between parental sequences
(Section~\ref{sec:parent_divergence}).
We hope that these benefits of accurate joint estimation of
genetic inheritance with mutation and recombination will generate community
interest in, and further development of, the \texttt{sc2ts} method, and
more generally interest in applying the efficient and mature software
of the \texttt{tskit} ecosystem to pandemic-scale SARS-CoV-2 data.
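
To make the first phase concrete, the following is a minimal Python sketch of a
copying-path search in the spirit of the LS model. It is an illustration under
simplifying assumptions rather than the \texttt{sc2ts} implementation: the HMM
probabilities are replaced by additive costs (a Viterbi search over costs), and
the function name \texttt{copying\_path} and the parameters
\texttt{switch\_cost} and \texttt{mismatch\_cost} are hypothetical.

\begin{verbatim}
```python
def copying_path(query, panel, switch_cost=3, mismatch_cost=1):
    """Return (path, cost): the cheapest way to copy query from panel.

    path[s] is the index of the panel sequence copied at site s. Each
    mismatch with the copied sequence costs mismatch_cost and each change
    of copied sequence (a putative recombination) costs switch_cost.
    """
    n, m = len(panel), len(query)
    # cost[j]: cheapest path so far that copies panel[j] at the current site.
    cost = [mismatch_cost * (panel[j][0] != query[0]) for j in range(n)]
    back = []  # one list of predecessor indexes per site after the first
    for s in range(1, m):
        best = min(range(n), key=cost.__getitem__)
        pointers, new_cost = [], []
        for j in range(n):
            # Either keep copying panel[j] or switch from the best predecessor.
            if cost[j] <= cost[best] + switch_cost:
                prev, c = j, cost[j]
            else:
                prev, c = best, cost[best] + switch_cost
            pointers.append(prev)
            new_cost.append(c + mismatch_cost * (panel[j][s] != query[s]))
        back.append(pointers)
        cost = new_cost
    # Trace back from the cheapest final state.
    j = min(range(n), key=cost.__getitem__)
    total = cost[j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return path, total


# Toy data: the query matches sequence 0 on the left and sequence 1 on the right.
panel = ["AAAAAAAA", "CCCCCCCC"]
path, total = copying_path("AAAACCCC", panel)
print(path, total)  # [0, 0, 0, 0, 1, 1, 1, 1] 3
```
\end{verbatim}

In this toy example a single switch (cost 3) is cheaper than four mismatches
(cost 4), so the path changes parent half way along the sequence. With a higher
\texttt{switch\_cost} the same query would instead be explained by mutations alone.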

\section{Results}
\subsection{Inferred ARGs}
The goals of this preliminary study are to illustrate the utility of
\texttt{sc2ts} and to investigate the properties of the
inferred ARGs to inform subsequent development.
We work with a
representative subset of the available data, limited to
inferences that can be performed on a single server in
a few weeks (see below for further details on timings and
computer hardware used). The cut-off dates for sampling
are arbitrary.
We inferred two ARGs, which we refer to as the
``Wide'' and ``Long'' ARGs throughout.
The Wide ARG is densely sampled but time-limited and
includes 1.27 million sequences collected up to June 30, 2021
which
pass some quality-control filters (Section~\ref{sec:data_preprocessing})
and  have a maximum delay between sampling and submission dates of 30 days
(Section~\ref{sec:filtering_time_travellers}).
For the Long ARG, we randomly sub-sample a maximum of 1,000 genomes per
day (again restricting the delay between sampling and submission to 30 days)
and include an additional year's worth of samples (to June 30, 2022).
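
The sampling rules for the two ARGs can be sketched in a few lines of Python.
This is only an illustration of the filters described above, assuming a
hypothetical record layout of (name, collection date, submission date); the
function name and its parameters are ours, and the quality-control filters are
not shown.

\begin{verbatim}
```python
import random
from collections import defaultdict
from datetime import date


def filter_samples(records, cutoff, max_delay=30, max_daily=None, seed=1):
    """Return the names of records selected for inference, in date order."""
    by_day = defaultdict(list)
    for name, collected, submitted in records:
        delay = (submitted - collected).days
        # Keep samples collected up to the cut-off and submitted promptly.
        # Treating negative delays as invalid is an assumption of this sketch.
        if collected <= cutoff and 0 <= delay <= max_delay:
            by_day[collected].append(name)
    rng = random.Random(seed)
    selected = []
    for day in sorted(by_day):
        names = by_day[day]
        # Optionally subsample days with more than max_daily samples.
        if max_daily is not None and len(names) > max_daily:
            names = rng.sample(names, max_daily)
        selected.extend(names)
    return selected


# Toy records (made up for illustration).
records = [
    ("a", date(2021, 6, 1), date(2021, 6, 10)),
    ("b", date(2021, 6, 1), date(2021, 8, 1)),   # submitted too late
    ("c", date(2021, 7, 1), date(2021, 7, 2)),   # collected after cut-off
    ("d", date(2021, 6, 1), date(2021, 6, 2)),
    ("e", date(2021, 6, 2), date(2021, 6, 3)),
]
wide_like = filter_samples(records, cutoff=date(2021, 6, 30))
long_like = filter_samples(records, cutoff=date(2021, 6, 30), max_daily=1)
```
\end{verbatim}

Under these assumptions, a Wide-style selection corresponds to leaving
\texttt{max\_daily} unset, and a Long-style selection to a later cut-off with
\texttt{max\_daily=1000}.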

\begin{table}
\begin{center}
    \begin{tabular}{llrlrl}\toprule
        &  & \multicolumn{2}{l}{Wide ARG} & \multicolumn{2}{l}{Long ARG} \\
    \cmidrule{3-6}
    \multicolumn{2}{l}{Sample filtering}
             & \multicolumn{2}{l}{collection $\leq$ 2021-06-30}
                 & \multicolumn{2}{l}{collection $\leq$ 2022-06-30} \\
           & & \multicolumn{2}{l}{max-delay=30} & \multicolumn{2}{l}{max-delay=30} \\
        &  & &                               & \multicolumn{2}{l}{max-daily=1000} \\
    \midrule

    Nodes & & 1,453,347 & & 783,231  & \\
     & \emph{Node type} & \\
    & Sample & 1,265,685 & (87.09\%) & 657,239 & (83.91\%) \\
    & Daily sample cluster tree  & 102,709 & (7.07\%) & 51,807 & (6.61\%)  \\
    & Reversion push & 40,538 & (2.79\%) & 34,358 & (4.39\%)  \\
    & Mutation collapse & 40,292 & (2.77\%) & 37,749 & (4.82\%)  \\
    & Recombination & 4,123 & (0.28\%) & 2,078 & (0.27\%) \\
    \cmidrule{2-6}
    % Edges & & 1,458,146 &  & 785,539 & \\
    %  & \emph{Parent node type} & \\
    % & Samples & 610,729 & (41.88\%) & 319,626 & (40.69\%) \\
    % & UPGMA   & 470,545 & (32.27\%) & 156,881 & (19.97\%)  \\
    % & Reversion push & 184,608 & (12.66\%) & 144,991 & (18.46\%)  \\
    % & Mutation collapse &186,218  & (12.77\%) & 160,833 & (20.47\%)  \\
    % & Recombination & 6,046 & (0.41\%) & 3,208 & (0.41\%) \\
    % \cmidrule{2-6}
    Mutations & & 1,213,193 &  & 1,062,072 & \\
    & Per node per genome & 0.83  & $\pm$1.40 & 1.36 & $\pm$1.72 \\
    & Per sample per genome & 0.77  & $\pm$1.39 & 1.38 & $\pm$1.77 \\
    & Per site per ARG & 41.23  & $\pm$108.16    & 36.10 & $\pm$80.03\\
    \cmidrule{2-6}
    % size as reported by ls -lh
    \multicolumn{2}{l}{Compressed size (inc metadata)} & 58MB & & 37MB&  \\
    \multicolumn{2}{l}{Bytes/sample (exc metadata)}  & 8.29  & & 10.83 \\
    Load time & & 0.9s & & 0.5s & \\
    \bottomrule
\end{tabular}
\end{center}
\caption{\label{tab:args}Summary of the inferred ARGs. Nodes are classified
as either samples or by the inference process that produced them
(see Sections~\ref{sec:sample-cluster-tree-inference},
\ref{sec:parsimony-heuristics} and \ref{sec:treatment_recombinants} for
details).
The mean and standard deviation ($\pm$)
are reported for the number of mutations per node, sample and site.
}
\end{table}

The properties of the inferred ARGs are summarised in Table~\ref{tab:args}.
The majority of the nodes in the ARGs represent sample genomes
(Wide ARG: 87\%, Long ARG: 84\%), with the remainder mostly representing
the ancestral sequences inferred
from daily sample clusters (Section~\ref{sec:sample-cluster-tree-inference})
and parsimony heuristics (Section~\ref{sec:parsimony-heuristics}).
Both ARGs contain 29,422 sites (Section \ref{sec:data_preprocessing}),
and a large number of mutations.
The average number of mutations per site is high, although some
of this may be explained by outlier sites with artefactually high
mutation counts (see Figure~\ref{fig:breakpoint-distribution}).
Despite this, the number of mutations per node
is small. In the Wide ARG, for example, we have a mean of 0.77
mutations per sampled genome, demonstrating that most
added samples fit into the ARG parsimoniously.
See Section~\ref{sec:mutation_spectrum}
for more analysis of the patterns of inferred mutations.
Recombination plays a relatively minor role, with $<0.3\%$
of nodes in the ARGs representing inferred recombination events.
Of these recombination nodes, the majority are ancestral
to only one sample (Wide ARG: 63.1\%, Long ARG: 63.3\%).
We analyse signals of recombination in Sections~\ref{sec:jackson_recombs},
\ref{sec:breakpoint_intervals}, \ref{sec:parent_divergence},
and \ref{sec:pango_x_lineages}.
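
The percentages and per-site averages quoted above follow directly from the
counts in Table~\ref{tab:args}. As a quick check, the short Python sketch below
recomputes them for the Wide ARG. The variable and function names are ours and
purely illustrative, and the mutation total is the one reported in
Figure~\ref{fig:mutational_spectra}B.

\begin{verbatim}
```python
# Counts for the Wide ARG, copied from the summary table.
NUM_SITES = 29_422
wide_nodes = {
    "Sample": 1_265_685,
    "Daily sample cluster tree": 102_709,
    "Reversion push": 40_538,
    "Mutation collapse": 40_292,
    "Recombination": 4_123,
}
wide_mutations = 1_213_193


def node_percentages(counts):
    """Express each node count as a percentage of all nodes."""
    total = sum(counts.values())
    return {kind: round(100 * n / total, 2) for kind, n in counts.items()}


total_nodes = sum(wide_nodes.values())
percentages = node_percentages(wide_nodes)
mutations_per_site = round(wide_mutations / NUM_SITES, 2)
mutations_per_node = round(wide_mutations / total_nodes, 2)
```
\end{verbatim}

The node counts sum to the 1,453,347 nodes reported in the table, and the
derived values agree with the tabulated 87.09\% of sample nodes, 0.28\% of
recombination nodes, 41.23 mutations per site and 0.83 mutations per node.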

Table~\ref{tab:args} also summarises some of the computational
properties of the inferred ARGs.
The ARGs are encoded as a ``succinct tree sequence'' using
the \texttt{tskit} library, which provides an extensive
suite of operations for constructing and analysing ARGs
(Section~\ref{sec:args}). For example, the Wide ARG,
which contains complete genomes (with imputed missing data)
for around 1.27 million samples, along with extensive sample
and debugging metadata, requires only 58MB of space (compressed
using the \texttt{tszip} utility). The majority of this
space is used by the metadata, which when discarded results in
an encoding that requires an average of only 8.29 bytes per
SARS-CoV-2 genome stored.
Loading these ARGs takes less than a second, and they can be interactively
analysed using Jupyter notebooks~\citep{Kluyver2016-jupyter}
on a standard laptop. The majority
of the analyses in this preprint can be carried out in seconds
with the \texttt{tskit} Python API, using a few gigabytes of RAM.

Inferring these ARGs does require substantial computation.
The Wide ARG required 17 days to infer on a server with
128 threads and 512 GB RAM (2x AMD EPYC 7502 @ 2.5GHz). The Long ARG
required 23 days on a (much older) machine with 40 threads and 256 GB RAM (2x
Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz). The majority of the time is spent
on running the LS HMM (Section~\ref{sec:ls}) to find the copying
path for each sequence. Since a copying path is found separately for
each sequence, the process
is highly amenable to distribution across multiple machines.
We anticipate that further development of the
inference methods, combined with scaling out across multiple servers, will
enable inferences at the full pandemic scale.


\subsection{Backbone phylogeny}
\label{sec:backbone_phylogeny}
% FIXME the development of thought is quite muddled here, we should rearrange
% in later revisions. We should discuss the expected lack of recombination
% and our likely false +s in one paragraph.
Compared to organisms like humans that recombine in every generation,
recombination is relatively rare in SARS-CoV-2, with recombination nodes accounting
for $<$0.3\% of the inferred ancestry (Table~\ref{tab:args}). As a result,
relationships between strains can often be represented by a
single phylogenetic tree, particularly when looking at
a subset of strains. We expect ARGs to be particularly treelike
early in the pandemic, when co-infection was less likely and divergence between
lineages relatively low.

\begin{figure} \centering
\includegraphics[width=\textwidth]{figures/cophylogeny_wide.pdf}
\caption{\label{fig:cophylogeny}
Tanglegram comparing a local tree from the Wide ARG
(sampled to mid-2021) and an ``all-time'' global Nextstrain tree
(downloaded on 2023-01-21).
Phylogenies are pruned down to those samples present as tips in both datasets, with the horizontal axis representing time (tips end at the sample collection date). Light grey
lines match the corresponding samples between the two trees; black circles
indicate identical sample partitions between the two trees. Terminal branches
are colour-coded according to the Pango lineage status assigned to the tip
samples.
The tanglegram was generated using the Neighbor-Net algorithm
\citep{Scornavacca2011-mg} implemented in Dendroscope version 3.8.5
\citep{Huson2012-ys}. See Figure~\ref{fig:cophylogeny_long} for the equivalent
cophylogeny for the Long ARG.}
\end{figure}

A classic tree-based summary of SARS-CoV-2 ancestry is provided
by the Nextstrain project \citep{Hadfield2018-ef}. The trees
available from Nextstrain are based on small subsamples of the
dataset, and early in the pandemic tend to be restricted to
a sample of strains that were not thought to be recombinants.
For validation purposes, we compare our ARGs with
a downloaded Nextstrain tree, restricted to the time period
covered by each ARG. To enable this, we ``simplify''~\citep{Kelleher2018-xc}
the \texttt{sc2ts} ARGs to a backbone containing only those samples
present in the Nextstrain tree. This results in a small set of shared
samples (Wide ARG: 180, Long ARG: 88), none of which are
assigned to Pango recombinant lineages by Nextclade (see Section
\ref{sec:pango_x_lineages}).

The \texttt{sc2ts} backbone phylogenies for these Nextstrain subsamples contain small amounts of recombination,
with 7 recombination nodes in the Wide ARG backbone (8 for the Long ARG
backbone). However, these recombination events involve only minor, local topological rearrangements,
with recombination occurring only between close relatives.
(The majority of these recombinations are likely
false-positives, as discussed in Section~\ref{sec:false_positives}.)

Figure~\ref{fig:cophylogeny} compares the backbone phylogeny of the Wide ARG with
a GISAID global ``all-time'' tree from Nextstrain.
We illustrate the backbone phylogeny by visualising
a single tree in the middle of the viral genome,
although other regions of the
genome show almost identical topologies.
It is clear that the backbone topology of the \texttt{sc2ts} tree shows very close
agreement with the Nextstrain tree. The sample genomes cluster by their
assigned Pango lineage status, and many variants and their descendants form
identical monophyletic clades in both the trees (e.g., the Alpha and Delta VoC
clades, labelled).
Figure~\ref{fig:cophylogeny_long} shows the same comparison for the Long ARG, with similar results.
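
One simple way to quantify this kind of agreement is to count the clades that
are shared by two trees over the same tips. The Python sketch below does this
for trees written as nested tuples of tip labels. It assumes that an
``identical sample partition'' means an identical set of tips below an internal
node; it is not the procedure used by Dendroscope to draw
Figure~\ref{fig:cophylogeny}, and the function names are hypothetical.

\begin{verbatim}
```python
def clades(tree):
    """Return the set of tip-label sets found below each internal node.

    A tree is a nested tuple of tip labels, e.g. (("a", "b"), "c").
    """
    found = set()

    def visit(node):
        if isinstance(node, tuple):
            # Internal node: its clade is the union of its children's tips.
            tips = frozenset().union(*(visit(child) for child in node))
            found.add(tips)
            return tips
        return frozenset([node])

    visit(tree)
    return found


def shared_partitions(tree_a, tree_b):
    """Clades (as sets of tips) present in both trees."""
    return clades(tree_a) & clades(tree_b)


# Two toy trees that disagree only on the grouping within {a, b, c}.
tree_1 = ((("a", "b"), "c"), ("d", "e"))
tree_2 = ((("a", "c"), "b"), ("d", "e"))
shared = shared_partitions(tree_1, tree_2)
```
\end{verbatim}

Here the two toy trees share the clades \{a, b, c\}, \{d, e\} and the root,
but not \{a, b\}, which mirrors the small disagreements near the tips that a
tanglegram makes visible.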

Figure~\ref{fig:cophylogeny} also reveals some notable differences between the
trees. Firstly, the \texttt{sc2ts} tree is generally less well resolved,
particularly in early 2020 when sampling density was much lower than
later in the pandemic. Resolution early in the pandemic could be improved
by using a tree inferred using classical phylogenetic
approaches for the first few months of the pandemic, before the scale of
data began to overwhelm these methods. Indeed, this is the approach
taken by UShER~\citep{Turakhia2021-ur}. Using a pre-existing tree for
the early stages of the pandemic would be straightforward in \texttt{sc2ts},
and the main reason we did not do this for the initial
version under consideration here was to evaluate the algorithm's
performance with sparse data. Given
the simplicity of the algorithm, tree inference for the early pandemic
is surprisingly good.
The second difference between the \texttt{sc2ts} and Nextstrain trees
that we would like to highlight is the presence of a few non-identical sample partitions
near the tips (e.g.\ criss-crossing assignments within the Alpha clade). It is unclear which
aspects of the phylogenetic reconstruction algorithms are driving these
differences, and more study is required to characterise and address them.
Finally, some branch lengths differ substantially between the \texttt{sc2ts}
and Nextstrain trees.
As discussed in Section~\ref{sec:node_dating}, the dating of nodes
other than samples in \texttt{sc2ts} is currently quite crude,
but there are likely straightforward expedients that would yield
substantial improvements.


\subsection{Mutational spectrum}
\label{sec:mutation_spectrum}
The ARGs inferred by \texttt{sc2ts} and represented using the \texttt{tskit}
library (Section~\ref{sec:args}) are a joint estimate of the genealogy with recombination and mutation. Unlike most approaches to
phylogenetic analysis, mutations are included in the \texttt{tskit}
data model alongside
the topological representation of genetic inheritance.
This has many
advantages, for example allowing us to compute statistics of the observed
sequences efficiently~\citep{Kelleher2016-wk,Ralph2020-efficiently} and
to provide high levels of data compression~\citep{Kelleher2019-ba}.
The same idea has recently been used to represent
SARS-CoV-2 data in UShER's ``mutation annotated tree''
format~\citep{Turakhia2021-ur}.

\begin{figure} \centering
\newcolumntype{Y}{>{\centering\arraybackslash}X}
\begin{tabularx}{\textwidth}{cY}
\hspace*{-2.5mm}\includegraphics[width=0.32\textwidth]{figures/mutational_spectra.pdf}
&
\begin{tabular}[b]{llrlr}\toprule
            & \multicolumn{2}{l}{Wide ARG} & \multicolumn{2}{l}{Long ARG} \\
% \midrule
    \cmidrule{2-5}
Total      & \multicolumn{2}{l}{1,213,193} & \multicolumn{2}{l}{1,062,072} \\
Private     & 758,903 & (62.55\%) &  767,111 & (72.23\%)\\
    \cmidrule{2-5}
Transitions & 873,487   & (72.00\%) & 783,773  & (73.80\%) \\
Transversions & 326,053 & (26.88\%) & 270,333  & (25.45\%) \\
    \cmidrule{2-5}
Insertions  & 6,191  & (0.51\%) & 2,814   & (0.26\%) \\
Deletions   & 7,462  & (0.62\%) & 5,152   & (0.49\%) \\
    \cmidrule{2-5}
Recurrent   & 74,719 & (6.16\%) & 50,099 & (4.72\%) \\
Of which reversions  & 72,617 & (5.99\%) & 48,226 & (4.54\%) \\
\bottomrule
\vspace{-2mm}
\end{tabular}\\
(A) & (B)\\
\end{tabularx}
\caption{\label{fig:mutational_spectra}
(A) Mutational spectrum in
the Wide ARG compared to \cite{Yi2021-sc}. Mutations are
categorised by type (i.e., inherited state $>$ derived state). The percentages
of each mutation type from the Wide ARG are represented by blue bars and the
percentages from Yi et al.\ by orange bars, with the darker colours representing
one direction (e.g., C$>$U) and the lighter colours the reverse (e.g., U$>$C).
(B) Summary of mutations in the Long and Wide ARGs. Private mutations
occur on terminal branches. Insertions are mutations in which the
inherited state is the gap state ``-'' and the derived state is a
nucleotide, and vice versa for deletions. See Section~\ref{sec:args}
for precise definitions of mutations, and the recurrent and reversion
classifications.
}
\end{figure}

The properties of the mutations inferred in the Wide and Long
ARGs are summarised in Figure~\ref{fig:mutational_spectra}B. In both
cases we have a large number of mutations, and a majority of these
(Wide ARG: 62.55\%, Long ARG: 72.23\%) are private to a single
sample, i.e.\ on terminal branches.
Although the average number
of mutations per sample is small
(Wide ARG: 0.77, Long ARG: 1.38; Table~\ref{tab:args}),
the average number per site
is large in both ARGs (Wide ARG: 41.23; Long ARG: 36.10; Table~\ref{tab:args}).
However, the much lower median counts (Wide ARG: 14; Long ARG: 13)
suggest that these high mutation counts are
partly driven by some hypermutable sites (e.g., site 28,271 has over 7,000
mutations in the Wide ARG; Figure~\ref{fig:breakpoint-distribution}), which
may be artefactual.

The current version of \texttt{sc2ts} infers a
large number of ``reversion'' mutations, with 5.99\% of all mutations in the Wide
ARG (Long ARG: 4.54\%) reverting the state change of the immediately
ancestral mutation (see Section~\ref{sec:args} for precise definitions).
These are symptomatic of both data quality issues
such as ``time travellers'' (Section~\ref{sec:filtering_time_travellers})
and problematic sites (e.g., 28,271 as discussed in
Section~\ref{sec:breakpoint_intervals}; see also
Figure~\ref{fig:pango_XB_gisaid_graph} for an example of multiple
reversions at this site), as well as indicating opportunities
for improvement in tree building heuristics
(Section~\ref{sec:parsimony-heuristics}).
For example, the current ``reversion push'' operation
only eliminates reversions in the newly added portion of the tree
and not reversions around earlier nodes created by this algorithm.
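
The notion of a reversion used here can be illustrated with a small Python
sketch. We assume a simplified representation in which each mutation records
its site, its inherited and derived states, and the index of the closest
ancestral mutation at the same site (if any). The class and function names are
hypothetical, and the precise definitions used in our analyses are those given
in Section~\ref{sec:args}.

\begin{verbatim}
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Mutation:
    site: int
    inherited: str         # state before the mutation
    derived: str           # state after the mutation
    parent: Optional[int]  # index of the closest ancestral mutation at this site


def is_reversion(mutations, index):
    """True if the mutation undoes the state change of its parent mutation."""
    mut = mutations[index]
    if mut.parent is None:
        return False
    parent = mutations[mut.parent]
    return parent.site == mut.site and mut.derived == parent.inherited


def count_reversions(mutations):
    return sum(is_reversion(mutations, j) for j in range(len(mutations)))


# Toy mutations at two made-up sites.
toy = [
    Mutation(site=10, inherited="A", derived="T", parent=None),
    Mutation(site=10, inherited="T", derived="A", parent=0),  # reverts 0
    Mutation(site=10, inherited="A", derived="T", parent=1),  # reverts 1
    Mutation(site=20, inherited="C", derived="T", parent=None),
    Mutation(site=10, inherited="T", derived="G", parent=0),  # new state
]
num_reversions = count_reversions(toy)
```
\end{verbatim}

In the toy data, two of the five mutations are reversions. A chain such as
A to T, T to A and back to T at one site is the kind of pattern that a
problematic site could generate.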

Despite the presence of some artefactual mutations,
the properties of the mutations inferred by \texttt{sc2ts} largely follow
established results.
In Figure~\ref{fig:mutational_spectra}A we compare the mutational spectrum
in the Wide ARG to the results of \cite{Yi2021-sc},
who reconstructed a SARS-CoV-2 phylogeny of
over 350,000 genomes sampled globally from 2019-12-24 to 2021-01-12
and classified the mutations occurring along the phylogeny.
We categorised all single nucleotide
mutations in the Wide ARG by type (defined by the inherited and derived states),
excluding mutations inherited by only a single sample (which are
more likely to be sequencing errors).
Similarly, we took the data for single nucleotide mutations from
\citet[][\url{https://github.com/ju-lab/SC2_evol_signature}]{Yi2021-sc}, excluding
mutations occurring along terminal branches, and tallied them up by type.
Figure~\ref{fig:mutational_spectra}A shows that the mutational spectrum from the
Wide ARG (based on 448,825 mutations) matches that reported by \citet[based on
92,344 mutations]{Yi2021-sc}. In both spectra, C-to-U mutations and G-to-U
mutations occur more frequently than U-to-C and U-to-G, respectively. Similar
results are obtained when including the mutations inherited by only a single
sample or those occurring on terminal branches (data not shown).
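
The tallying step described above can be sketched as follows. This is a minimal
illustration, assuming each mutation is available as a hypothetical
(inherited state, derived state, number of inheriting samples) tuple, and that
thymine may be written as either T or U in the input; it is not the exact code
used to produce Figure~\ref{fig:mutational_spectra}A.

\begin{verbatim}
```python
from collections import Counter

BASES = set("ACGU")


def mutation_spectrum(mutations, min_samples=2):
    """Tally single-nucleotide mutations by type, e.g. "C>U".

    Mutations involving the gap state (insertions and deletions) are
    skipped, as are mutations inherited by fewer than min_samples samples.
    """
    counts = Counter()
    for inherited, derived, num_samples in mutations:
        inherited = inherited.replace("T", "U")
        derived = derived.replace("T", "U")
        if inherited in BASES and derived in BASES and num_samples >= min_samples:
            counts[f"{inherited}>{derived}"] += 1
    return counts


def as_percentages(counts):
    total = sum(counts.values())
    return {kind: 100 * n / total for kind, n in counts.items()}


# Toy input (made up for illustration).
toy = [
    ("C", "U", 5),
    ("C", "T", 3),   # same type as C>U
    ("U", "C", 2),
    ("G", "U", 1),   # inherited by a single sample: excluded
    ("-", "A", 4),   # insertion: excluded
    ("A", "-", 2),   # deletion: excluded
]
spectrum = mutation_spectrum(toy)
spectrum_pct = as_percentages(spectrum)
```
\end{verbatim}

Setting \texttt{min\_samples=1} in this sketch corresponds to including the
mutations inherited by only a single sample.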

\subsection{Early recombinants}
\label{sec:jackson_recombs}
RNA viruses are known to recombine at high rates when cells are co-infected
\citep{Simon2011-rna}, and recombination has been widely
documented to be commonplace in animal and human coronaviruses
\citep{Su2016-epidemiology}. While recombination in SARS-CoV-2
was shown early on to be frequent \emph{in vitro}
\citep{Gribble2021-coronavirus}, the relatively slow accumulation
of genetic diversity early in the pandemic hampered efforts to
detect recombinant strains. A number of early studies relying on
analysing patterns of linkage disequilibrium and searching for
mosaic genomes carrying characteristic mutations of different
lineages either failed to detect recombination or posited
that this occurred at low rates \citep[e.g.,~][]{Nie2020-phylogenetic,Tang2020-origin,VanInsberghe2021-eu,Varabyou2021-rw}.
The first clear evidence of recombinant lineages was presented by
\citet{Jackson2021-ik}, who performed a careful analysis of sequences
circulating in the UK in late 2020 to early 2021 and found
evidence of multiple independent recombination events and onward
transmission.
By searching for samples combining genomic segments from Alpha (B.1.1.7) and
from the parental lineage B.1.1 based on a list of 22 Alpha-defining mutations,
they found 16 recombinant sequences from 8 putative
origins (groups A to D and four singletons).
These findings are closely replicated in both the Wide and Long ARGs.

\begin{table} \centering
\begin{tabular}{lllr@{--}lr@{+}l}
\toprule
Group        & Sequences & Method & \multicolumn{2}{c}{Interval}
    & \multicolumn{2}{r}{Parent lineages} \\
\midrule
A (XA)       & 4   & Jackson        &  21,256&21,615 & B.1.177&B.1.1.7 \\
             &     &\texttt{sc2ts} &  21,256&22,228 & B.1.177.18&B.1.1.7 \\
\cmidrule{3-7}
B            & 2   & Jackson        &  6,529&6,955 & B.1.36&B.1.1.7  \\
             &     &\texttt{sc2ts} &  6,529&6,955 & B.1.36&B.1.1.7  \\
\cmidrule{3-7}
C            & 3   &Jackson        &  25,997&27,443 &  B.1.1.7&B.1.221 \\
             &     & \texttt{sc2ts} &  25,997&27,973 &  B.1.1.7&B.1.221 \\
\cmidrule{3-7}
D            & 3   & Jackson        &  21,576&23,064 &  B.1.36.17&B.1.1.7 \\
             &     & \texttt{sc2ts} &  22,445&23,064 &  B.1.36.39&B.1.1.7 \\
\bottomrule
\end{tabular}
\caption{\label{tab:jackson}Comparison of recombination breakpoint
intervals and parent lineages for Groups A--D
reported by \cite{Jackson2021-ik} with the corresponding
recombination events in the Wide ARG.
The second column gives the number of sequences in the group,
limited to the samples considered by \citet{Jackson2021-ik}.
The 3SEQ~\citep{boni2007exact} coordinates
reported by Jackson et al.\ have been altered as follows: we
add one to both left and right coordinates to correspond to the
\texttt{tskit} definition of inheritance on either side of a breakpoint,
and add one to the right coordinate to make the intervals
right-exclusive. See Section~\ref{sec:treatment_recombinants}
for a precise definition of sequence inheritance at recombination
events and the corresponding breakpoint intervals.
Details for all 16 sequences are
given in Table~\ref{tab:jackson_supplement}.}
\end{table}

The Wide ARG contains 15 of these 16 recombinant sequences
(sample MILK-103C712 was removed during preprocessing; see
Section~\ref{sec:data_preprocessing}).
Table~\ref{tab:jackson} shows the groups of sequences identified by Jackson
et al.\ as likely independent recombination events with onward transmission.
In each case we have a corresponding recombination node in the Wide ARG,
from which all the sequences in the group descend. The parent lineages
and breakpoint intervals agree closely (see
Section~\ref{sec:breakpoint_intervals} for more details on breakpoint
intervals).
For groups B, C and D,
these recombination nodes form clades consisting only of the identified
sequences.
Group A sequences were subsequently given the Pango XA designation
following onward transmission,
and there are 44 XA designated samples in the Wide ARG (including the
4 sequences analysed by Jackson et al.). The Group A recombination
node forms a monophyletic clade of these 44 samples.
Table~\ref{tab:jackson_supplement} gives the details for each of the
16 sequences individually, showing generally strong concordance
in mosaic structure and parent lineages
(including sample CAMC-CB7AB3, which is inferred to have two breakpoints under both
methods).
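
The coordinate adjustment described in the caption of Table~\ref{tab:jackson}
amounts to a simple shift, which we restate here for convenience (the symbols
$l$ and $r$ are introduced only for this illustration). If 3SEQ reports an
interval with left and right coordinates $(l, r)$, the values shown in the
``Jackson'' rows are
\begin{equation*}
(l, r) \mapsto (l + 1,\; r + 2),
\end{equation*}
where the right coordinate receives one unit from the shift to the
\texttt{tskit} inheritance convention and a second from making the interval
right-exclusive. Read in reverse, the Group B entry 6,529--6,955 would
correspond to a reported 3SEQ interval of 6,528--6,953.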

The Long ARG contains 5 of the sequences: two each from groups A and B
and sample QEUH-1067DEF. These cluster under three recombination nodes, as expected,
and have the same breakpoint intervals and parent lineage assignments
as in the Wide ARG.
The recombination nodes for Group B and sample QEUH-1067DEF are ancestral only
to the sequences involved.
The recombination node for group A forms a monophyletic clade of all
5 XA samples present in the Long ARG
(Figure~\ref{fig:pango-simple-origin-graph}A).

\subsection{Recombination breakpoint intervals}
\label{sec:breakpoint_intervals}
It is rarely possible to be precise about the position on the genome at which
a recombinant sequence switches from inheriting from one parent to another.
Even if we observe the recombinant sequence before subsequent
divergence occurs (during onward transmission), there is no way to identify the exact breakpoint if the two parent sequences
are similar.
Here we define the
interval within which a particular breakpoint may have occurred
as the genome coordinates over which the
sequences for the left and right parent nodes are identical. The right-hand
extreme of the breakpoint interval is chosen by the LS HMM Viterbi algorithm
(Section~\ref{sec:ls}), and the left endpoint is then derived by directly
comparing the parent sequences. See Section~\ref{sec:treatment_recombinants} for
a precise definition of breakpoints and their intervals.
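
As an illustration of this definition, the interval can be written down explicitly
under one assumed convention (the symbols below are introduced here for
exposition only and are not taken from the \texttt{sc2ts} implementation).
Suppose that a breakpoint at coordinate $x$ means that the recombinant copies the
left parent at positions $j < x$ and the right parent at positions $j \geq x$,
and let $a_j$ and $b_j$ denote the states of the left and right parents at
position $j$. If $r$ is the rightmost breakpoint position reported by the
HMM, then every breakpoint $x$ with
\begin{equation*}
\ell \leq x \leq r, \qquad
\ell = 1 + \max \{\, j < r : a_j \neq b_j \,\},
\end{equation*}
yields the same recombinant sequence, because the parents are identical at
all positions $\ell \leq j < r$ (if the parents do not differ anywhere to the
left of $r$, $\ell$ is taken to be the start of the genome). Under this reading, the right-exclusive
interval is $[\ell, r + 1)$, with width $r - \ell + 1$; for instance,
$r = 22{,}227$ gives a right-exclusive endpoint of 22,228.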

\begin{figure}
\centering
\includegraphics[width=\textwidth]{figures/wide_arg_recombination_intervals.pdf}
\caption{\label{fig:breakpoint-distribution}
Distribution of recombination breakpoints and mutations along the genome in
the Wide ARG. Top panel shows the intervals for 1,769 breakpoints associated
with 1,522 recombination nodes with at least two descending samples, plotted along the genome
as line segments (coloured by interval width). The inset histogram shows the
distribution of these interval widths (truncated at 10kb).
The bottom panel shows the number of intervals that span
each site along the genome (left axis, orange) and the number of mutations
per site (right axis, blue).
The top-ten sites by mutation count are annotated.
See Figure~\ref{fig:long_arg_breakpoint_distribution} for the equivalent plot
for the Long ARG.}
\end{figure}

Figure~\ref{fig:breakpoint-distribution} shows the distribution of breakpoint
intervals and patterns of recurrent mutation along the genome in the Wide ARG
(see Figure~\ref{fig:long_arg_breakpoint_distribution} for the same information
for the Long ARG). We focus on the Wide ARG here as it covers roughly the same time
period as the analyses of \cite{Turakhia2022-it}, facilitating comparisons
of the results. To reduce the effect of artefactual recombinants, we
consider only the breakpoints associated with the 1,522 recombination nodes
that are ancestral to more than one sample. The mean length of these intervals
is 1,685 bases (median 962), and the length distribution is summarised in the
inset histogram in Figure~\ref{fig:breakpoint-distribution}.

Although further work is required to filter
out spuriously inferred recombination events (see
Section~\ref{sec:false_positives}),
we can draw some preliminary conclusions from Figure~\ref{fig:breakpoint-distribution}.
The number of intervals spanning a site (orange curve) is lower at the ends of the genome,
as expected due to the lack of information about recombination with few flanking sites.
In addition, the number of intervals spanning a site often
drops near the beginning of a gene.
This is particularly apparent at the ORF8/N gene interface,
with the N gene containing fewer potential recombination breakpoints
than other genes, in agreement with the results of \cite{Turakhia2022-it}.
Noticeable declines in the number of spanning intervals are also
seen near the beginning of S, M, ORF7a, and ORF8.
These declines are sometimes associated with hypermutated sites (e.g., 27,384 and 28,271
near the beginning of ORF7a and N, respectively), as expected because sites that
undergo mutation at high rates are more likely to differ between the
parents of a recombinant and so provide information about the location of a breakpoint.
This pattern may, however, also be an artefact of sequencing errors causing sites
to appear different between the parents when they are not
(see the discussion of potential errors at site 28,271 below).
In other cases, however, the drop in intervals does not appear to
coincide with hypermutability and may reflect shifts in the actual
rate of recombination between genes.
Indeed, template switching is known to occur at hotspots,
which often involve transcription-regulatory sequences preceding genes
in SARS-CoV-2~\citep{Yang2021-characterizing}.
The rate of recombination may also depend on the relative abundance
of different subgenomic RNA intermediates that span different genes,
affecting the availability of templates and influencing the rate of
homologous recombination~\citep{Kim2020-gt,Zou2021-sars}.
%\sally{[Sally: KATHERINE, WHAT DO YOU THINK?  These not-full-length subgenomic RNA intermediates may provide a lot of homologous templates for some genes more than other (e.g., fig 3B here).]}

It is important to note that the precise endpoints of intervals can be
somewhat arbitrary, because they are defined by the sequence differences
that happen to be present in the recombinant's parents.
Thus, breakpoint intervals will tend to be truncated at sites that are
hypermutable, either due to increased information about parentage
or spurious inferences caused by sequencing errors.
It is therefore helpful to compare the number of intersecting intervals
with the number of mutations per site (Figure~\ref{fig:breakpoint-distribution}).
For instance, position 28,271 (located in the ORF8-N intergenic
region) has the largest number of mutations and appears as an endpoint
of 66 intervals.
The 7,572 mutations (including 5,782 insertions and 1,605 deletions)
at this site are an indicator that this
homopolymeric region may be prone to sequencing errors
and potentially should be included in the list of ``problematic sites'' that
are excluded from analysis (Section~\ref{sec:data_preprocessing}).
On the other hand, 58 of the breakpoint intervals have endpoints
within one base of site 27,972, which has undergone 770 mutation events
(including 425 C$>$T and 314 T$>$C). The C$>$T mutation has the effect of
truncating ORF8, and it has been posited that the truncation is neutral
or advantageous for transmission, and disadvantageous within-host
\citep{Jungreis2021-dh}, suggesting that the high rate of recurrent mutation
at this site may be due to selection.

% JK Dropping this for now:
% We further consider the breakpoint intervals within the context of
% recombination breakpoint sequence motifs. It is hypothesised that certain
% palindromic breakpoint sequences (specifically, CAGAC and CAGAT) promote
% template switching during replication in SARS-CoV-2 via formation of
% base-paired stem loops in the genome structure \citep{Gallaher2020-lb}. There
% are 87 occurrences of these two breakpoint sequences in the Wuhan-Hu-1/2019
% reference sequence (Supplementary Table 5). We observe that XX (XX\%) of the
% breakpoint intervals of the HMM-consistent recombinants span at least one of
% the breakpoint sequences.

\subsection{Divergence between recombinant parents}
\label{sec:parent_divergence}
% Paragraph 2, what do we plot, specifically?
In this section we explore the detailed ancestral relationships between
recombinant parents by investigating the patterns of divergence between
them.
We focus on the Long ARG because it covers time
periods where substantial recombination is known to have occurred
and contains samples from 33 Pango X lineages (i.e., those inferred
to have recombinant ancestry; see Section~\ref{sec:pango_x_lineages} for
further analysis).
Figure~\ref{fig:recomb_mrcas} shows the estimated date of the most recent
common ancestor (MRCA) of the parent nodes for each recombination breakpoint,
plotted against the divergence between these parents (i.e., the
total branch length from the parents to their MRCA in the trees to
the immediate left and right of the breakpoint).
As in Section~\ref{sec:breakpoint_intervals}, in these plots we
exclude breakpoints associated with ``singleton'' recombination nodes
(those ancestral to only one sample).
Larger points distinguish those breakpoints which occur in nodes
ancestral to 5 or more samples, comprising 316 breakpoints from
291 recombination nodes. The criterion of 5 descendants matches the
minimum number required to designate a new Pango
lineage~\citep{Rambaut2020-dw}.
Note that as each plotted point represents a breakpoint, a recombination
node with more than one breakpoint (e.g., with 3 or more parents,
comprising ${\sim}10 \%$ of the
recombination nodes in
Figure~\ref{fig:recomb_mrcas}) will be represented by several points.
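
The divergence measure used here can be stated compactly; the notation below is
introduced for illustration and is an assumption about how the quantity is
computed within a single local tree. Writing $t_u$ for the time of node $u$
(measured backwards from the present), and letting $p_1$ and $p_2$ be the two
parent nodes at a breakpoint with MRCA $m$ in a given local tree, the
divergence is
\begin{equation*}
d(p_1, p_2) = (t_m - t_{p_1}) + (t_m - t_{p_2}) = 2 t_m - t_{p_1} - t_{p_2}.
\end{equation*}
For example, two parents that are each dated 10 weeks more recently than their
MRCA have $d = 20$ weeks. How the values from the trees to the left and
right of the breakpoint are combined is not specified by this sketch.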

\begin{figure} \centering
\includegraphics[width=\textwidth]{figures/recombination_node_mrcas.pdf}
\caption{\label{fig:recomb_mrcas}
Date of common ancestry between the parents on either side of recombination
breakpoints, as a function of the divergence time between the parents.
MRCAs of parents associated with Pango-designated recombinants (XA, etc.)
are identified in orange. Larger symbols represent breakpoints in recombination
nodes ancestral to five or more samples.
Horizontal dotted lines show the four most common MRCA nodes,
which tend to be associated with major outbreaks and with many immediate children.
The stacked histogram shows the distribution of parental divergence times, ranging
from parents that diverged only a few days earlier to much more
divergent parent lineages. See Figure~\ref{fig:recomb_mrcas_voc_breakdown}
for equivalent plots broken down by parental VoC classification.
}
\end{figure}

% Paragraph 3, what do we say about the dates/nodes of the MRCAs?
The dates of the MRCAs of recombinant parents are
concentrated in several horizontal bands in Figure~\ref{fig:recomb_mrcas}. These are
largely due to a few MRCAs shared by many
recombinants (the top four are indicated in the figure).
These shared MRCAs lie near the root of large expansions,
and the majority are associated with large polytomies, likely
indicating a rapid and under-sampled expansion of a clade (e.g., the major
Delta and Omicron waves).

% Paragraph 4, what do we say about the divergence times?
Figure~\ref{fig:recomb_mrcas} shows that there is a
large range in divergence times among parents of detected recombinants, with a broad
spread of times from about 10 to 80 weeks prior to the recombination event
and a minor additional peak involving recombination between lineages that diverged
${\sim}95$ weeks ago, corresponding to common ancestors which trace back to early 2020.
This latter group should
contain, for example, Delta--Omicron recombinants. More widely, we can
further classify the breakpoints by the VoC combinations of their
parent lineages. Considering only the Alpha, Delta, and Omicron VoC classes,
such a classification reveals that the majority of breakpoints have two Delta or
two Omicron parents, and that Omicron and Delta are the variants associated
with the most recombination (Figure~\ref{fig:recomb_mrcas_voc_breakdown}).
This may reflect sampling intensity, the
prevalence of cases (which increases the
chance of coinfection and recombination), or possible heterogeneity in
recombination probabilities among lineages.

The time estimates in these figures should be
treated with a degree of caution,
because non-sample nodes are crudely dated in the current version
of \texttt{sc2ts} (see Section~\ref{sec:node_dating}, in particular
for discussion of how these dates might be improved using existing
methods).
Nevertheless, it is clear
that \texttt{sc2ts} can identify recombination between lineages
that are only a few weeks diverged.

\subsection{Recombinant Pango lineages}
\label{sec:pango_x_lineages}
In this section we focus on the detailed ancestry of samples that have been
previously identified as recombinants, i.e., designated as belonging
to a Pango lineage with a name starting with an ``X''.
We focus primarily on the Long ARG,
which contains many more recombinants (the status
of Pango X lineages in the Wide ARG is briefly
summarised in Section~\ref{sec-pango_x_wide_arg}).
Designation of samples to Pango lineages is not a straightforward task,
and there can be significant variation between methods \citep{deBernardiSchneider2023-sars}.
Here, we consider two different assignments of Pango lineages to samples in
the Long ARG: those provided by Nextclade and those provided by GISAID.
% GISAID lineage calls come from the pangoLEARN algorithm.
Using the Nextclade assignments, the Long ARG contains 749 samples from
33 Pango X lineages (711 samples from 33 lineages when we
remove singleton recombinants).
In contrast, using the GISAID designations we
have 515 samples from 38 X lineages
(511 samples from 35 lineages when we filter singleton recombinants). Of the GISAID designations,
28 are shared with Nextclade (26 after filtering singletons).
This variation in classifications highlights the uncertainty that exists when
assigning Pango X lineages to samples, and is important to keep in mind
when interpreting the results here.

We focus here on the 749 Nextclade-designated recombinant samples.
There are two samples designated XP which do not descend from a recombination node,
and a likely explanation for the absence of a corresponding recombination event
in the Long ARG is that the characteristic multibase deletion for XP
(\url{https://github.com/cov-lineages/pango-designation/issues/481}) is masked
during our preprocessing (see Section~\ref{sec:data_preprocessing} for details and potential
improvements). Of the remaining X designated samples, 38 are singleton recombinants,
descending from recombination nodes that are ancestral only to that sample
(10 are labelled XZ; 6 are XE;  3 each from XN and XK; 2 each from XC, XS, XV, XQ and XAB;
and 1 sample from each of XB, XM, XJ, XAF, XAH and XAJ). Such samples are likely to
be enriched for sequencing errors and lineage designation artefacts, and so we
exclude them from further analysis in this section. A further 79 samples
(XN: 53, XZ: 16, XAJ: 6, XE: 1, XAD: 1, XAH: 1, XAK: 1) trace back
to a most recent recombination node that is likely to be a false positive
(row C in Table~\ref{tab:false_positive}, see Section~\ref{sec:false_positives}).
For simplicity these samples are likewise excluded from further analyses.

The remaining 630 samples (31 Pango lineages) trace back to 50 different
most recent recombination nodes, summarised
in Table~\ref{tab:pango-recombinants}.
These fall into three classes: single origin, multiple origin,
and multiple nested origins, which we discuss in the following sections.

\subsubsection{Single origin}
In the absence of genealogical information, a reasonable initial assumption is
that all sequences
assigned to a given Pango X lineage are descendants of a single recombinant
sequence, arising as a result of a mixed infection followed by onward
transmission. We would expect our ARGs to reveal evolutionary histories of this
nature, where all the samples assigned to a given recombinant lineage
trace back to one recombination node, representing a single originating
recombination event. In the Long ARG, 16 of the 31 Pango recombinant lineages identified by Nextclade
fall into this category (Table~\ref{tab:pango-recombinants}).

\begin{figure}
\centering
\includegraphics[width=\textwidth]{figures/Pango_XA_XAG_XD_nxcld_tight_graph.pdf}
\caption{\label{fig:pango-simple-origin-graph} Examples of non-nested
Nextclade Pango X lineages. (A) Subgraph for XA in the Long ARG: all five samples designated
as XA by Nextclade, together with their ancestral lineages, are shown outwards to the nearest
sampled viral genome; dotted lines show ARG continuation. Vertical position of nodes does not
show absolute time, but relative rank (parents above children). Nodes are coloured by Nextclade
Pango designation; smaller symbols are non-sample nodes inserted by \texttt{sc2ts}, whose
Pango status is imputed. Genomic regions inherited by the recombination node are shown;
breakpoints correspond to the rightmost breakpoint position inferred by \texttt{sc2ts}.
(B) Equivalent subgraph for the 17 XAG samples in the Long ARG, with abbreviated labelling. Where
non-XAG samples are ancestral to further unplotted samples, the number of unplotted descendant samples
is marked as ``+1'', ``+7'', etc.
(C) Equivalent subgraphs for both origination events involving the four XD samples in the Long ARG.
Details of the mutations and sample node identities for all three plots are provided in
supplementary Figures \ref{fig:pango_XA_gisaid_graph}, \ref{fig:pango_XAG_gisaid_graph}, and \ref{fig:pango_XD_gisaid_graph}, which also provide alternative GISAID Pango designations.
} \end{figure}

One of the simplest examples is XA, corresponding to group A
of \citet{Jackson2021-ik} as discussed in Section~\ref{sec:jackson_recombs}. 
Figure~\ref{fig:pango-simple-origin-graph}A
shows the exact relationships inferred by \texttt{sc2ts} as a subgraph of the Long ARG.
Here, paths have been traced from all
Nextclade-identified XA samples (in red) to the closest other sample nodes in the ARG.
Sample nodes are plotted as larger circles, but the subgraph also includes intermediate,
non-sample nodes (i.e., inserted by \texttt{sc2ts}, see Sections~\ref{sec:sample-cluster-tree-inference} and \ref{sec:parsimony-heuristics}). Dotted lines show where this subgraph links
to the rest of the ARG. Above recombination nodes, only ancestral nodes are shown, meaning
that the subgraph is not extended to show additional descendants of recombinant parents.

It is clear from the XA subgraph that all the samples labelled XA by Nextclade trace to a
single originating recombination node, whose genome is a composite of a B.1.177.18 lineage
on the left of the genome and a B.1.1.7 lineage on the right. In the subgraph we show
the rightmost genomic position for the recombination breakpoint, here at position 22,227
(corresponding to a breakpoint strictly less than 22,228, see Table~\ref{tab:jackson}).

%% Discuss panel B where XAG shares a common recombinant origin with samples from several other X lineages

A more complex single-origin case is XAG, illustrated in
Figure~\ref{fig:pango-simple-origin-graph}B. Here, the XAG samples
all trace back to the same most recent recombination node
(combining BA.1 on the left and BA.2.9 on the right), but
we infer this recombination event to also be the originating event for
all the recombinant samples designated XAA,
and some, but not all, of those identified as XAB, XQ, and XU by Nextclade.

The classification of originating recombination events is dependent on
accurate designation of Pango lineages to samples. It is therefore important
to note that if the GISAID Pango designations are used, many of the samples
marked here as XAB are reclassified as BA.2 and XAG becomes fully monophyletic
(although still not an immediate descendant of the originating recombination node,
see Supplementary Figure \ref{fig:pango_XAG_gisaid_graph}). This is an
independent confirmation of the uncertainty in designation of these XAG-related
samples.

Six of the 16 Nextclade designated lineages (XA, XAC, XAE, XF, XK, and XS) are of the
basic (XA) type with no other Pango designations among their descendants.
The remaining ten are of the XAG type with multiple Pango
designated lineages as additional descendants of the originating recombination event.
In some cases,
these may, however, be a result of erroneous Pango designations.
Table~\ref{tab:pango-recombinants} also shows the official
Pango designated parent lineages and \texttt{sc2ts} inferred parent lineages,
which extensively agree (although \texttt{sc2ts} provides a more precise
parent designation).

\subsubsection{Independent multiple origins}
The \texttt{sc2ts} inference process has no pre-defined knowledge
of Pango X lineage assignments, and there is therefore no particular
requirement that all the samples assigned to a given lineage must trace back to
a single recombination event. Using Nextclade designations, 15 Pango X lineages are
inferred to have multiple
recombinant origins, such that their samples trace to more than one most recent
recombination node in the ARG. Of these, 11 are cases where the
recombinants are independent rather than nested (i.e., there is no overlap in the list
of descendant samples for each recombination node). Most have a single ``main''
recombination event from which the majority of the corresponding recombinant samples
descend and which agrees with the official Pango designated parent lineages
(see Table~\ref{tab:pango-recombinants}, but note that
in XJ and XU there are too few Pango X samples to decide on a ``main'' clade).

Figure~\ref{fig:pango-simple-origin-graph}C shows a simple multiple-origin example, consisting of
the 4 samples labelled XD by Nextclade in the Long ARG. The left hand subgraph
(containing three XD samples, all sampled in France) has an earliest
sample (strain France/HDF-biopath-7747831001/2022) dated 2022-02-26, while the right hand
subgraph has a single XD sample (strain Turkey/HSGM-F12594/2021) dated 2021-12-30.
Both involve an Omicron lineage being inserted into the middle of a Delta genome, but
the breakpoints in each case are slightly different: the start of the Omicron insertion
in the French samples has an estimated rightmost position of 21641bp
(and a leftmost of 21619, not shown) with the insertion end occurring at a rightmost position of 25584
(and a leftmost of 25470). By contrast, the Omicron insertion in the Turkish sample
is inferred to have occurred from position 21619--22578 to position 23605--24130.
The breakpoint difference, the different geographical locations, the time between the
samples, and the fact that the two samples differ at 23 nucleotide positions
together suggest that these may indeed represent independent Delta--Omicron recombinants.
The canonical XD definition is based entirely on samples from northern Europe, particularly
France (\url{https://github.com/cov-lineages/pango-designation/issues/444}) so it seems
plausible that the earlier Turkish sample has been mislabelled as XD by Nextclade.
Indeed, GISAID does not label any of these samples XD (see
Figure~\ref{fig:pango_XD_gisaid_graph} which gives exact mutations and sample identifiers).
Investigation of other multiple-origin examples reveals somewhat similar patterns,
suggesting that most of the simple multiple origin examples are due to incorrect
Pango labelling.

\subsubsection{Nested recombinant origins}
As well as cases where Pango X lineage origins are attributed to independent
recombination events, four Pango X lineages in the Long ARG have Nextclade-designated
samples whose ancestry involves further recombination events
(marked by~\textdagger ~in Table~\ref{tab:pango-recombinants}; the most
complex appears to be XAB). Figure~\ref{fig:complex_origins_graph} plots the earliest example,
XB,  which is present in both the Wide and Long ARGs. The subgraph shows a recombination
between a B.1 sample and a B.1.627 sample that leads not only to all the XB-labelled samples
but also to a ``hairball'' of further recombination nodes whose descendants are often not
identified as recombinants by Nextclade (plotted on the left, in blue). A similar
pattern is seen when examining XB in the Wide ARG (see discussion below).

\begin{figure} \centering
\includegraphics[width=\textwidth]{figures/Pango_XB_nxcld_tight_graph.pdf}
\caption{\label{fig:complex_origins_graph}  A subgraph of the Long ARG showing
nested recombination events involving Pango lineage XB. All XB samples
trace to a single recombination node (top centre), but three further recombinations
also descend from this node. The samples descending from these nested
recombinations include 9 that are assigned by Nextclade to
various non-recombinant pre-Alpha lineages (blue).}
\end{figure}

Note that in the Long ARG, the nested recombination events account for only one
XB sample (pink upper left, with 7 mutations above it); moreover, this sample is
not identified as XB by GISAID (Figure~\ref{fig:pango_XB_gisaid_graph}),
indicating some uncertainty in lineage assignment in this part of the ARG.
Also note that the number of mutations on the lineages immediately above and
below the recombination node (totalling 18+2) is rather large, suggesting
that the sampled recombinant which induced the recombination node in the Long ARG
is only distantly related to the true originating recombinant. This could account
for complex and potentially artefactual relationships around these nodes and
is likely to be due to undersampling of the XB outbreak. Investigating examples
of nested recombinant origins, and identifying which (if any) of the nested
recombination events may be artefactual, is an important area of future research.

\subsubsection{Wide ARG}
\label{sec-pango_x_wide_arg}
%% Quick overview of the wide ARG.
Because the Wide ARG is restricted to data collected prior
to mid-2021, it contains samples from only three Pango-designated recombinant
lineages: XA, XB, and XC. Both Nextclade and GISAID designate 44
samples as XA, while 237 samples are designated
as XB by Nextclade (231 by GISAID), and 6 as XC by Nextclade (none by GISAID).
After removing singleton recombinants, XA numbers remain unchanged, but XB is
reduced to 235 Nextclade-designated samples (229 GISAID) and XC is reduced to 4
(none in the GISAID designations). We confirm that all samples
designated as XA, XB, or XC by any method have one or more recombination nodes
in their ancestry.

%% Textual description of the 3 recombinant lineages present in the wide ARG
As in the Long ARG, all XA samples in the Wide ARG trace back to a
unique originating recombination node, consistent with Figure \ref{fig:pango-simple-origin-graph}A.
This is the product of a recombination
between a B.1.177.18 sample (specifically the strain Wales/ALDP-115BF41/2021)
which contributed the majority of the genome from the start
to a rightmost position of 22227, and an unknown (inserted) node with imputed
Pango lineage B.1.1.7, which contributed the remaining right hand portion. The
recombination node has five immediate children: four sample leaves (strains
Wales/ALDP-11CF93B/2021, Wales/ALDP-125C4D7/2021, Wales/LIVE-DFCFFE/2021, and
Wales/ALDP-130BB95/2021) and an inserted node which is the
ancestor of all other XA samples in the dataset. The geographical clustering inferred by \texttt{sc2ts}
for these samples matches the findings of \citet{Jackson2021-ik}.

For XB, all samples trace back to an originating node which is the product of a
recombination between a B.1 sample (specifically the strain
England/CAMB-7B47D/2020, which contributed the majority of
the genome from the start up to a rightmost position of 23604bp), and two
UPGMA nodes with imputed Pango lineages B.1.627 (up to a rightmost position of
27389bp) and B.1.36.8 (the remaining fragment of the genome). Figure~\ref{fig:complex_origins_graph}
shows that in the Long ARG the equivalent recombination node has only 2 parents,
with no involvement of  B.1.36.8; it is possible that the third parent in
the Wide ARG is artefactual. As in the Long ARG, additional
non-X-designated samples such as B.1.634 also descend from this recombination,
and there are also a small number of nested recombination nodes. However, all but
one of these nested recombinations are unimportant, being ancestral to negligible fractions
of the Nextclade-designated XB samples. The one exception accounts for about 17\% of the
designated XB nodes and involves a recombination between descendants of the
originating XB recombination. More specifically, strain
USA/TX-HMH-MCoV-43092/2021 is inferred to be a recombinant
between an XB grandparent and its XB grandchild, involving an intermediate UPGMA node.
It seems likely to be an artefactual recombination event caused by undersampling, but it could also reflect true recombination between closely related lineages.

For the few XC-labelled samples, the Wide ARG identifies more than one
originating recombination node. However, as none of the samples designated as
XC by Nextclade are designated as XC by GISAID, these patterns could be due to
mislabelling, and a greater number of XC samples would be needed to draw reasonable
conclusions.

% JK: removing this for now

% \subsection{Novel recombinants}
% Next, we examine recombination nodes in the
% Long ARG (sampled to mid-2022) that represent previously unnamed putative
% recombinant sequences, which do not have a Pango recombinant designation. For
% this proof-of-concept study, we have arbitrarily picked 12 recombination nodes
% that seem plausible (Supplementary Table 4). All these nodes involve Omicron
% subvariants (BA.1, BA.2, BA.4, and BA.5). These recombination nodes (1) were
% inserted into the Long ARG on or after January 8, 2022; (2) have at least 10
% descendent samples; and (3) have no mutational differences (including immediate
% reversions) from their parent nodes.

% TODO: Node  740761 (USA/NC-CDC-LC0668306/2022) is a sister clade proposed1006
% in the UShER public phylogeny. Node 628656 (Scotland/QEUH-37794BB/2022) is a
% sister clade of XAC, and annotated as ``miscBA2BA1PostSpike''.

\subsection{False positive recombinants}
\label{sec:false_positives}
The Long and Wide ARGs both contain several recombination nodes that are
ancestral to a large number of samples.
These inferences of recombination may well be artefacts
caused by the appearance of new variants of concern carrying
more than the expected number of mutations~\citep{otto2021origins}.
Table~\ref{tab:false_positive} shows details of four recombinants
in the Long ARG that are likely false positives and have come to be
the ancestors of a large number of samples. These are the top-four recombinants
from the Long ARG in terms of numbers of descendant samples
(the fifth-largest has substantially fewer, at 14,869 descendant samples),
labelled A--D.
For each row, we show the sample's Pango lineage
designation and the temporal rank of that sample out of all
samples with that designation. Thus, rows C and D are the earliest
samples seen in the Long ARG from the BA.2.9 and BA.3 lineages,
and A and B are respectively the third and eighth earliest samples among 12829 samples
assigned by Nextclade to the B.1.617.2
lineage (Delta VoC). Also shown is the overall ``cost'' of the Viterbi
solution computed by the LS HMM (Section~\ref{sec:ls}), which
is $3 \times$ number of recombinations + number of additional mutations
(for a mismatch ratio of $k=3$).
This column shows that each of these samples was a large
``distance'' from the current ARG when it was added. The mean
HMM cost over all 2,078 recombinants in the Long ARG is 8.39
(median: 7), and the mean cost for the  763 non-singleton
recombinants is 8.5 (median: 7). Thus, the recombinants in
Table~\ref{tab:false_positive} are in the tail of this distribution.
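
As a purely illustrative decomposition of these costs, write $R$ for the number
of recombination switches and $M$ for the number of additional mutations on a
Viterbi path. The split between $R$ and $M$ is not reported for the individual
rows here, so the values of $R$ and $M$ below are hypothetical. With a mismatch
ratio of $k=3$ the cost is
\begin{align*}
\text{cost} = kR + M = 3R + M ,
\end{align*}
so that a cost of 17 (row A) could arise, for example, from $R=1$ and $M=14$,
or from $R=2$ and $M=11$. By comparison, a recombinant at the median cost of 7
with a single switch ($R=1$) would require only $M=4$ additional mutations.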

\begin{table} \centering
\begin{tabular}{llllll}
\toprule
& Strain & Descendants & Lineage & Rank in lineage&HMM Cost \\
\midrule
A&Germany/HH-RKI-I-061284/2021 & 178,405    & B.1.617.2   & 3 / 12829& 17  \\
B&India/ILSGS00961/2021        & 177,649    & B.1.617.2   & 8 / 12829& 25   \\
C&Denmark/DCGC-281594/2021     & 127,227    & BA.2.9      & 1 / 10897& 32 \\
D&USA/NJ-GBW-EWR000001/2021    & 127,230    & BA.3        & 1 / 15& 25 \\
\bottomrule
\end{tabular}
\caption{\label{tab:false_positive}
False positive recombination events. Details shown are for the top
four recombination nodes in the Long ARG ordered by number of
descendant samples. See the text for details on the remaining columns.
Rows are labelled A, B, C and D for ease of reference.}
\end{table}

Occasional evolutionary leaps, in which a large number of mutations
are acquired in sudden jumps,
are a signature feature of
SARS-CoV-2~\citep{Corey2021-sars,otto2021origins,Nielsen2023-host}.
Such ``saltations'' naturally present challenges to \texttt{sc2ts} and the
current HMM parameterization of three mutations per recombination (mismatch
ratio). The first few sequences from these new lineages will be a poor match to
the existing ARG, and the HMM will therefore
search for ways to reduce the number of mutations required by recombining
segments.
This appears to be the case for the Delta VoC,
corresponding to rows A and B of Table~\ref{tab:false_positive},
which emerged and quickly rose to high prevalence in early-mid 2021,
carrying 30 characteristic mutations compared to the reference sequence
\citep{McCrone2022-context}, including nine mutations in the S gene
not seen in earlier VoCs.
The origins of the Delta VoC
are illustrated in Figure~\ref{fig:false_positives},
which shows the large numbers of mutations involved,
and the multiple closely related recombinations inferred
before tree-like behaviour is resumed at the ancestor of
90\% of Delta samples (node \texttt{tsk261771}).
Similar patterns around the emergence of Delta are seen in the Wide ARG
(not shown).

Given these considerations, it  is important to note that the
number of ultimate descendants is not an entirely reliable
indicator of the quality of inferred recombinants.
Further study is required to systematically identify such false
positive recombinants, and to update the topology with a
more parsimonious explanation of the data.

\section{Discussion}
% PARA 1. SARS-CoV-2 is still important, and the vast hordes of data provide an
% unprecedented opportunity to study the evolution of a virus. Even if the
% public health emergency is officially over, it doesn't mean it's not coming
% back and we should be prepared. Recombination is now a vital factor, and
% present a way in which it can be incorporated. Our results are accurate.
Although the COVID-19 pandemic is no longer considered a global emergency
by the WHO, the prevalence of SARS-CoV-2 continues to be high worldwide. This fuels
proliferation of many variants, with more than 600 Pango-designated lineages
circulating globally in the last three months (January to March, 2023;
\url{https://gisaid.org/}; accessed on 2023-03-27).
High prevalence increases the risk of coinfection,
providing opportunities for new phenotypically
distinct recombinants to emerge and spread. Phylogenetic approaches have
been central to responses to the pandemic thus
far~\citep{Attwood2022-ab,Bloom2023-fitness,Abbas2022-reconstruction,
Mclaughlin2022-genomic}.
However, with the rise to high frequency of recombinant
lineages \citep[e.g.\ XBB;][]{Tamura2023-ab},
it is imperative that these methods are updated to incorporate the
effects of recombination, so that future public health interventions are
not based on incomplete and potentially biased evolutionary models.
Here we have introduced the first method to infer an evolutionary
history that jointly estimates genealogies with both mutation and
recombination at pandemic scale and illustrated how this single
structure accurately captures results derived by many different means.

% PARA 2. Here are things that could be improved in the short term. Here's a
% few more, longer term ideas.
Nonetheless, \texttt{sc2ts} is currently ``alpha'' quality software, and
we caution against over-interpreting current results. As we have sought
to illustrate throughout, there are some clear areas for improvement.
The pipeline used to identify and
mask erroneous sites in the input alignments is simplistic, and, among
other issues, results in multi-base indels being marked as missing data
(Section~\ref{sec:data_preprocessing}). A more sophisticated
approach~\citep[e.g.,][]{Aksamentov2021-hj} would
likely yield significant improvements and reduce the effect of sites with
artefactually high levels of mutation (e.g., site 28271;
Section~\ref{sec:breakpoint_intervals}).
Using a pre-existing tree built using state-of-the-art phylogenetic methods
for the early stages of the pandemic (Section~\ref{sec:backbone_phylogeny}),
and minor adaptations to standard node-dating methods
(Section~\ref{sec:node_dating}) should help resolve the most notable issues with
the inferred backbone phylogeny (Figure~\ref{fig:cophylogeny}).
Trees constructed from daily sample clusters have a surprisingly large
influence on the overall ARG topology
(Section~\ref{sec:sample-cluster-tree-inference}), and so using a more
sophisticated tree building approach should yield clear improvements.
The unrealistically large number of reversion mutations
(Section~\ref{sec:mutation_spectrum}) may be reduced by
improvements to the current parsimony
heuristics (Section~\ref{sec:parsimony-heuristics}).
The ``time-traveller'' samples, whose
recorded collection dates are months (or years) too early,
are a major source of errors
(Section~\ref{sec:filtering_time_travellers}).
% Hmm, not sure about this. Wouldn't this also rule out new major lineages
% also (which tend to show up with a bunch of new mutations?) I guess it
% depends on the threshold, and we could set it according to previous
% experience with new lineages?
While it is unclear how
we might solve this problem in general, some simple solutions such
as filtering out sequences that exceed a given cost in the LS HMM
(i.e., number of mutations and recombination switches) may work well
in practice. Such an approach would also reduce the impact of sequences
with high levels of sequencing error (which currently contribute a large number of
mutations).
Taken together, these and other relatively minor improvements should
enable inference over much larger subsets of the dataset and
give a clearer picture of the combined processes of recombination and
mutation over the pandemic so far.

% What are the more blue-sky type things? Basically these are things we don't
% really want to do for the first "real" version of sc2ts.
An attractive feature of \texttt{sc2ts} is that the most difficult part of
the inference problem---finding likely recombinant paths through the existing ARG
for new samples---is solved exactly under a well-defined statistical model,
using established HMM methodology (Section~\ref{sec:ls}).
The implementation currently uses
a single, arbitrarily chosen, maximum likelihood path via the
Viterbi algorithm, but there are numerous ways in which the HMM machinery
could be extended in order to explore the set of possible paths, or to
quantify the uncertainty around it. Similarly, the current parameterization
of the HMM with a single mismatch ratio is simplistic, and it would
be straightforward to condition on per-site mutation rates (and nucleotide-dependent
state transitions, with some additional development).
Recombination breakpoints for the ARG are currently inserted at the
right-most extent of the possible interval
(Section~\ref{sec:breakpoint_intervals}).
More likely locations for the
breakpoint could be chosen within the interval, for example based
on sequence motifs~\citep{Gallaher2020-mr, Yang2020-ct}.
% Hint: there's probably a bunch of other stuff you could do with this
It is likely that the basic machinery of finding matches and quantifying
the uncertainty around them under a well-defined statistical model
in large ARGs would have many applications besides those explored here.

% PARA 3. Handling the data is hard, so methods for making this easier
% are important. Classical phylo approaches don't scale to millions.
% The ability to quickly develop new analyses, at scale, will be crucial.
% Sc2ts is largely composed of other parts being reused, with mainly being
% concerned with masking input alignments. We wrote no C at all. This kind
% of code reuse and ease of analysis is a major bonus.
The vast volume of whole genome sequence data generated during the pandemic
has presented classical phylogenetic methods and software with major
difficulties~\citep{Hodcroft2021-wt}.
Standard interchange formats such as FASTA, Newick and VCF were simply not designed
to deal with millions of samples, and their limitations have come sharply
into focus~\citep{Turakhia2021-ur,de2023maximum}.
Replacements that can scale to millions of genomes have had to be developed
at speed, usually focusing on compiled programming languages to maximise
performance. Here, however, we have developed a new method based on
an existing data structure and library infrastructure, designed
from the beginning to scale to millions of
samples~\citep{Kelleher2016-wk,Kelleher2019-ba}.
The \texttt{sc2ts} package is written entirely in Python
and by reusing existing high-performance components can infer
recombinant viral ancestry at unprecedented scale.
Similarly, all of the analyses shown here are written in Python,
using the \texttt{tskit} API, mostly running in a few seconds
on standard laptop computers (see the Data Availability
section for details of the corresponding Jupyter notebooks).
Retooling methods to scale up rapidly with expanding data and to encode recombination
promises to improve tracking during this and any future pandemic.

\section{Methods}
Sc2ts (pronounced ``scoots'', optionally) is a method for inferring
Ancestral Recombination Graphs (ARGs; see Section~\ref{sec:args})
from densely sampled pandemic-scale data
in real time, in which recombination occurs at a low but significant rate.
% The overall algorithm is summarised in Figure~\ref{fig:overview_sc2ts},
% and with further details in subsequent subsections.
The basic idea is to incrementally update an ARG each day
with the sequences collected on that day (Figure~\ref{fig:overview_sc2ts}).
The first step is to find likely ``copying paths'' under the Li and Stephens model
for each sequence in the daily batch to the current ARG (Section~\ref{sec:ls}).
These copying paths
will mostly consist of a new sample sequence copying from a node in the ARG
with a small number of mutations, and often many samples
from a daily batch will copy from the same ARG node.
The second step is then to ``resolve'' these implied polytomies by using
standard tree-building techniques (Figure~\ref{fig:overview_sc2ts}C;
Section~\ref{sec:sample-cluster-tree-inference}).
This greedy update strategy inevitably leads to unparsimonious
topologies, and the third step is to then increase the
overall parsimony of the inferred ARG by making some simple topological
updates (Figure~\ref{fig:overview_sc2ts}C,D;
Section~\ref{sec:parsimony-heuristics}).
Recombination is inferred as an integral and ongoing part of this
process, requiring only a few additional steps to facilitate
later analysis (Section~\ref{sec:treatment_recombinants}).
The result of inference for a given day is then
a genealogy recording genetic inheritance as well as
mutation and recombination events
for all of the sequences inserted into the ARG up to that day,
which can be conveniently and efficiently analysed using the
mature and feature-rich
\texttt{tskit}
library~\citep{Kelleher2018-xc,Ralph2020-efficiently,Tskit2023-tskit}.

\subsection{Ancestral Recombination Graphs}
\label{sec:args}
The term ``Ancestral Recombination Graph'' was introduced by
Griffiths and colleagues~\citep{Griffiths1991-two,Griffiths1998-ancestral}
and originally defined as an alternative formulation of the coalescent
with recombination stochastic process~\citep{Hudson1983-properties}.
Subsequently, the term ARG came to be used in a more general way to
describe not just realisations of this model, but any
recombinant genetic ancestry~\citep{Minichiello2006-mapping,Zhang2023-lf}.
While there is some subtlety in the details~\citep{Wong2023-efficient},
we can think of an ARG as being any graph that encodes the
reticulate genetic ancestry of a sample of colinear sequences under
the influence of recombination.
This definition encompasses various types of graphs often
described using the broader term of phylogenetic networks.

The ``succinct tree sequence'' is an ARG data structure
that is both general (in terms of the types of ancestry that can
be described) and computationally efficient~\citep{Wong2023-efficient}.
Originally developed to facilitate large-scale coalescent
simulations~\citep{Kelleher2016-wk}, the methods have been
extended and applied to forward-time
simulations~\citep{Kelleher2018-xc,Haller2018-tree},
calculation of population genetics statistics~\citep{Ralph2020-efficiently}
and ARG inference~\citep{Kelleher2019-ba,Wohns2022-th}.
The succinct tree sequence is based on a simple tabular representation,
which defines a set of nodes, edges, sites and mutations. A \emph{node}
represents a particular genome, which may be an observed sample
or an inferred genetic ancestor. The genetic inheritance between
a pair of nodes along a segment of genome is defined by
the \emph{edge} $(\ell, r, p, c)$, which states that
child node $c$ inherited its genome from parent node $p$
from left coordinate $\ell$ to right coordinate $r$. A \emph{site}
defines a position on the genome
and the ancestral state (allele) at that site.
A \emph{mutation} records the site and node IDs where a mutation
occurs and the derived state (allele).
In addition to these basic elements of the data model, the \texttt{tskit} library
supports additional tables and fields,
the ability to associate arbitrary metadata with
table rows, and facilities to record provenance
information~\citep{Tskit2023-tskit}.
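
To make the edge notation concrete, the following sketch shows how a single
recombinant might be encoded. The node IDs (parents 1 and 2, child 3), the
breakpoint position $x$ and the sequence length $L$ are hypothetical and chosen
only for illustration. If child node 3 inherits the left part of its genome
from parent 1 and the remainder from parent 2, the edge table would contain
the two rows
\begin{align*}
(\ell, r, p, c) &= (0, x, 1, 3), \\
(\ell, r, p, c) &= (x, L, 2, 3).
\end{align*}
Edge intervals in \texttt{tskit} are half-open (left-inclusive and
right-exclusive), so the two rows do not overlap at $x$. The local trees to the
left of $x$ therefore have node 1 as the parent of node 3, while those
from $x$ onwards have node 2 as its parent. A recombinant with three parents
would, in the same way, be encoded by three such edges over adjacent,
non-overlapping intervals.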

Tskit supports arbitrarily complex patterns of mutation at a particular
site, and it is useful
to define some terminology to classify them.
A mutation's ``parent''
is the first mutation encountered (at that site) on the path to
root from the mutation's node in the local tree corresponding to
the site's position. If no other mutation is encountered on the
path to root, the mutation's parent is null.
The ``derived state'' of a mutation is the allelic state inherited by nodes
in the subtree rooted at the mutation's node (in the local tree), assuming
there are no subsequent descendant mutations.
The ``inherited state'' of a mutation is the derived state of the
parent mutation, if it exists, or the site's ancestral state otherwise.
A ``recurrent mutation'' is a mutation with a non-null parent,
and a ``reversion'' is a recurrent mutation that reverses the state
change of its parent. For example, if we have two mutations
$a$ and $b$, such that $a$ is the parent of $b$,
with state transitions (inherited state $\rightarrow$ derived state)
$a: A\rightarrow T$ and $b: T\rightarrow A$, we define $b$ as a reversion.

Given the node, edge, site and mutation tables we can
efficiently construct the local
genealogical trees along the genome (arising from recombination)
and perform a range of calculations efficiently by
reasoning about the differences between these local
trees~\citep{Kelleher2016-wk,Ralph2020-efficiently}. These
algorithms have led to performance increases of several orders
of magnitude over the state-of-the-art in a range of
applications~\citep{Kelleher2016-wk,Kelleher2018-xc,Kelleher2019-ba,
Ralph2020-efficiently,Baumdicker2022-ep}.
The succinct tree sequence encoding
is also very concise, allowing, for example, for millions of
complete human genomes to potentially be stored in a few gigabytes of
space~\citep{Kelleher2019-ba}.

The \texttt{tskit} library~\citep{Tskit2023-tskit} is a liberally
licensed open source toolkit that provides a comprehensive suite
of tools for working with ARGs. Based on core functionality written
in C, it provides interfaces in C, Python and Rust. The Python interface
is based on NumPy~\citep{Harris2020-array} and provides a convenient
platform for interactive analysis of large-scale data using, for
example, Jupyter notebooks~\citep{Kluyver2016-jupyter} and taking
advantage of the analysis tools in the burgeoning PyData ecosystem.
(It is possible to access the toolkit from R via the \texttt{reticulate}
package, and the \texttt{slendr} library~\citep{Petr2022-slendr}
also provides some native R support. A full R interface would be a
valuable addition to the ecosystem.)
Tskit is mature software, widely used in population genetics, and
has been incorporated into several downstream
applications~\citep[e.g.,][]{Haller2019-slim,Speidel2019-yh,
Terasaki2021-geonomics,
Fan2022-genealogical,Korfmann2022-weak,
Mahmoudi2022-bayesian,Petr2022-slendr,Rasmussen2022-espalier,
Zhang2023-lf}.
It is important to note that
this ecosystem for storing and manipulating ARGs can
generally be used to efficiently record and analyse SARS-CoV-2 genealogies reconstructed
using other methods, not only the \texttt{sc2ts} approach that we describe here.
Note also that there is no requirement that recombination be present,
and the methods are also very efficient when working with a single tree.

\subsection{The Li and Stephens model}
\label{sec:ls}
The Li and Stephens (LS) model \citep{Li2003-ib} is an approximation of the
coalescent with recombination~\citep{Hudson1983-properties} which captures
many of the important features of the joint processes of mutation and
recombination. It is a Hidden Markov Model (HMM) in which a focal genome
is modelled as a
sequence of nucleotides that are probabilistically emitted as
an imperfect mosaic of a set of reference genomes
(Figure~\ref{fig:ls_diagram}).
The LS model is used in a wide variety
of applications in genomics, including modern approaches to
statistical genotype phasing and imputation
\citep{Delaneau2019-wl,Browning2021-cg,Browning2018-nk,Rubinacci2020-pa},
and estimation of parameters such as
recombination rates \citep[e.g.,][]{Hinch2011-tz}
and intensity of selection within and across hosts in viral
sequence data \citep[e.g.,][]{Palmer2019-wa}.
See~\cite{Mcvean2019-linkage} for further review and discussion
of the LS model.

\begin{figure} \centering
\includegraphics[width=0.5\textwidth]{figures/Li_S.pdf}
\caption{\label{fig:ls_diagram} A schematic of the Li and Stephens (LS)
model, in which a focal sequence (bottom) is described as an
imperfect mosaic of the sequences in a reference panel.
Black crosses along the focal sequence show sequencing
errors or mutations.
In the standard formulation, at site $\ell$, the recombination probability is $r_\ell$,
the mutation probability is $\mu_\ell$ and $n$
denotes the size of the reference panel.
The Viterbi algorithm can be used to find a
``copying path'' through the reference panel for a given focal sequence that
maximises the likelihood under these parameters. Unseen states in the reference panel are shown as coloured lines enclosed by
the grey box. The black arrow describes the true path through the data which leads to the emitted
focal sequence below. Examples of transition and
emission probabilities along this trajectory are shown by the red and blue
arrows, respectively.
}
\end{figure}

% The LS HMM is governed by the transition and
% emission matrices, $Q$ and $E$, respectively.
%
% These two
% processes, transition and emission (which encode recombination and mutation,
% respectively), define the generative process of the HMM, by which an imperfect
% mosaic is emitted from a reference panel.
 % of samples and nodes from the ARG
% inserted at earlier collection dates (shown in the lower panel). Here, the
%
% Note that in practice, we do not know the pattern of
% colours along the focal sequence and must infer it using the Viterbi algorithm.

The generative process of the LS model is summarised in
Figure~\ref{fig:ls_diagram}. Here, a transition matrix, $Q$, governs the
process of switching (recombining) between members of the reference panel (the
hidden states). An emission matrix, $E$, allows for differences between the
focal sequence and the hidden state from which it is copied (due to mutation as
well as sequencing error).
Both $E$ and $Q$ may be a function of the reference panel members,
but transitions are generally assumed to be independent of the
hidden states (Figure~\ref{fig:ls_diagram}, pink panel).
This assumption dramatically increases performance as the state space drops to two states (i.e., switching or not switching).
Emissions may also be a function of the nucleotide states,
but in our RNA virus case
we assume that mutations occur from all possible alleles
($A,C,G,U$ and a gap in the alignment, $-$) to any other with
equal probability $\mu_\ell/4$.
This is reasonable for rapidly evolving pathogens,
but we note that setting the number of alleles at site $\ell$, $a_\ell$,
to the number of distinct alleles observed across all analysed
samples (giving a per-allele probability of $\mu_\ell/(a_\ell-1)$) will often be more appropriate
(Figure~\ref{fig:ls_diagram}, blue panel).
We use the Viterbi algorithm~\citep{Viterbi1967-ol}
to find the most likely copying path, given $Q$, $E$, and a set of reference sequences.
Throughout, we refer to the probability of mismatching to a
member of the reference panel at site $\ell$ as $\mu_\ell$,
and the probability of recombining between two members of the reference panel
between site $\ell-1$ and site $\ell$ as $r_\ell$. For convenience, $\mu_\ell$ is commonly referred to as the `mutation probability', but we note that this probability of
mismatch also encompasses various other error modes that result in a mismatch
(such as sequencing and alignment errors). Note also that it is these probabilities,
not the rates of mutation and recombination, that are required to fully define
the HMM; see~\cite{Donnelly2010-coalescent} for a discussion of how these
parameters relate to the coalescent process.
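
Under the assumptions above, the entries of $Q$ and $E$ at site $\ell$ can be
written out explicitly. This is a sketch of the standard LS parameterisation;
the indices $i,j$ (members of the reference panel) and $x,y$ (allelic states)
are notation introduced here for illustration only:
\begin{align*}
Q_{ij} &=
\begin{cases}
(1 - r_\ell) + r_\ell/n & \text{if } i = j,\\
r_\ell/n & \text{if } i \neq j,
\end{cases}
&
E_{xy} &=
\begin{cases}
1 - \mu_\ell & \text{if } x = y,\\
\mu_\ell/4 & \text{if } x \neq y.
\end{cases}
\end{align*}
Each row sums to one: $(1 - r_\ell) + r_\ell/n + (n-1)r_\ell/n = 1$ for $Q$
over the $n$ reference panel members, and
$(1-\mu_\ell) + 4\mu_\ell/4 = 1$ for $E$ over the five allelic states.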


% That said, we can approximately relate these probabilities to the coalescent
% process in a population with constant effective population size ($N_e$), which
% LS is approximating to guide intuition. We may consider the probability of
% mutation to represent the probability of mutation along the branches joining
% the focal sequence to the closest member of a coalescent tree containing a
% reference panel with $n$ members, assuming a Poisson process along the branch.
% The expected length of these two branches is $4/n$ in units of $N_e$
% generations, and so the mismatch probability represents the probability of
% observing a mutation or error mode, along these branches. We can then
% approximate the probability of observing a mismatch as: \begin{align*} \mu_\ell
% \approx 1 - \exp{\left(-\frac{4\nu}{n}\right)} \end{align*} where $\nu$ is the
% mutation rate per $\mbox{ploidy} \times N_e$ generations. In the case of
% recombination, suppose that there is some constant underlying recombination
% rate, $\rho$. Assuming that recombination events occur as a Poisson process,
% then the probability of detecting a recombination event before coalescence
% between the focal sequence and the reference panel, between adjacent variant
% sites is \begin{align*} r_\ell = 1 - \exp\left(-\frac{\rho\left(m_{i-1} -
% m_{i}\right)}{n}\right), \end{align*} where $m_j$ is the nucleotide position of
% variant $j$.

The probabilities of these competing processes of mismatch and recombination are
usually controlled by the site-specific parameters $\mu_\ell$ and $r_\ell$,
respectively. For this work we used a slightly different formulation, which
uses one parameter, the mismatch ratio (MMR), to control the relative
importance of mutation and recombination in the HMM. Specifically, an MMR value
of $k$ will prefer $k$ mismatches (mutations) to a single recombination that
 results in copying from a template with no mismatches. To map between
recombination probability and mutation probability for a particular mismatch
ratio, we simply consider the two paths that we wish to be equally likely and
rearrange for the mutation or recombination probability. Consider the simple
case where we assume $\mu_\ell=\mu$ and $r_\ell=r$. Without loss of generality,
consider a region of length $m$. Up until this region, two paths are equally
likely, so we can re-scale by the likelihood of observing the focal sequence up
to the site before the region starts, $c$. For an MMR of $k$, we only need to
consider two of the $n$ members of the reference panel.
The first, $\mathcal{P}_1$,
has no mismatches in the region but requires a recombination to switch into it; the second,
$\mathcal{P}_2$, requires no switch but has $k$ (randomly chosen) mismatches in the
region.
The probability of tracking along each of those paths is given by:
\begin{align*}
\mathbb{P}[\mathcal{P}_1] &= \frac{cr}{n}
\alpha^{m-1}\left(1-\mu\right)^m
&\text{recombine to a template with no mismatches,}\\
\mathbb{P}[\mathcal{P}_2] &= c
\alpha^{m}
% \alpha \alpha^{m-1}
\left(1-\mu\right)^{m-k}\mu^k &\text{stay in a template with $k$ mismatches,}
\end{align*}
% Note: brackets here emphasise the two different components
% of alpha: (1-r) = don't leave this reference member,
% and r/n = move out, but back to yourself.
where $\alpha = (1 - r) + r/n$ represents the probability of a sequence not recombining at a
given position with any of the other $n-1$ members of the reference panel
(Figure~\ref{fig:ls_diagram}, red panel) and $c$ is the
likelihood of the path up to this point (assumed to be
equal by construction).
We then
set these path probabilities to be equal, and rearrange to relate $r$ and $\mu$ to one another:
\begin{align*} r = \frac{n\mu^k}{\left(1-\mu\right)^k + \left(n-1\right)\mu^k},
\quad\quad \mu = \frac{1}{\sqrt[k]{\frac{n}{r} - (n-1)} + 1}. \end{align*}
Thus, for lower MMR values ($k$ here),
recombination is increasingly favoured over multiple mutations in a specific ancestral genome.
We use an MMR value of $k=3$ in this work,
because of the relatively high rate of recombination relative
to mutation typical of
coronaviruses~\citep{amoutzias2022remarkable}.
Exploring different mismatch ratios and more sophisticated parameterizations
of the HMM are important avenues for future work.
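
The rearrangement relating $r$ and $\mu$ can be spelled out as follows.
Setting $\mathbb{P}[\mathcal{P}_1] = \mathbb{P}[\mathcal{P}_2]$ and cancelling
the common factor $c\,\alpha^{m-1}\left(1-\mu\right)^{m-k}$ gives
\begin{align*}
\frac{r}{n}\left(1-\mu\right)^k
&= \alpha\mu^k
= \left(1 - \frac{(n-1)r}{n}\right)\mu^k, \\
r\left[\left(1-\mu\right)^k + \left(n-1\right)\mu^k\right] &= n\mu^k, \\
\left(\frac{1-\mu}{\mu}\right)^k &= \frac{n}{r} - (n-1),
\end{align*}
where the second line yields the expression for $r$ and the third, after
taking the $k$th root, yields the expression for $\mu$. Note that the region
length $m$ and the prefix likelihood $c$ cancel, so the relationship depends
only on $k$ and $n$. As a numerical illustration with hypothetical values
(not necessarily those used in \texttt{sc2ts}), taking $k=3$, $\mu=0.01$ and
$n=1000$ gives
\begin{align*}
r = \frac{1000 \times 10^{-6}}{0.99^3 + 999 \times 10^{-6}}
= \frac{10^{-3}}{0.971298}
\approx 1.03 \times 10^{-3}.
\end{align*}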

% A varying degree of missingness is present in the genetic data, that is, for a
% given sample, a subset of nucleotides may be coded as `unknown' at a set of
% positions. As a result, we had to consider missingness in our implementation of
% the LS Viterbi algorithm.

% To deal with this missingness, we could explicitly model the missingness
% structure and generative process, or impute the missing data and revert back to
% our model without missingness. We choose the latter. To do this, we estimate a
% most likely path through the data assuming that missing nucleotides in the
% focal sequence are uninformative to the probability of any given path through
% the reference data. We then simply fill in the missing data in the focal
% sequence by copying it from the relevant positions in the most likely path. To
% ensure that missing data is not considered in determining the most likely
% copying path, we set the emission probability from any state $(A,C,G,U,-)$ to
% `unknown' as 1 (though we could have chosen any constant). As a result,
% emissions of unknown nucleotides will not contribute to differences in path
% probabilities. Note that by iteratively imputing missing data in focal
% sequences in this way, the reference panel being used for any newly observed
% focal sequence is guaranteed to be devoid of missingness.

% One approach would be to `impute' all missing data beforehand using external
% software to hard-call or define a distribution over the possible missing
% nucleotides and use that in the determination of path probabilities. Another,
% which we carry out within \texttt{sc2ts}, is to assume sites with missing
% nucleotides are uninformative to estimating the most likely path through the
% data which gives rise to the focal sequence. If we are able to estimate the
% most likely copying path among these paths, a reasonable imputation method is
% to fill in the missing data in the focal sequence by copying it from the
% relevant positions in the most likely path. To ensure that missing data is
% not considered in determining the most likely copying path, we set the
% emission probability from any state $(A,C,G,U,-)$ to `unknown' as 1 (though
% we could have chosen any constant). As a result, emissions of unknown
% nucleotides will not contribute to differences in path probabilities. Note
% that by iteratively imputing missing data in focal sequences in this way, the
% reference panel being used for any newly observed focal sequence is
% guaranteed to be devoid of missingness.

Genetic data for SARS-CoV-2 contains substantial amounts of missingness
(nucleotides coded as `unknown' or masked out
as described in Section~\ref{sec:data_preprocessing}),
and it is important to account for this missingness in a systematic way.
In \texttt{sc2ts} we automatically impute missing data as samples are added
into the ARG, using the LS model.
To do this, we assume that sites with missing nucleotides are uninformative
to the path probability, by setting the emission probability from any
state $(A,C,G,U,-)$ to `unknown' equal to 1 (though we could have chosen any constant).
As a result, emissions of unknown nucleotides will not contribute to
differences in path probabilities.
Once the most likely copying path is determined, we then attach the
sample to the ARG (see Figure~\ref{fig:overview_sc2ts} and following subsections).
For each newly attached sample we encode its nucleotide sequence by
recording mutations where it differs from the nucleotide sequence
of its parental node, but (importantly) ignoring sites with missing data
in the new sample.
Thus, once the newly added sequence has been attached to the ARG
any missing data is imputed from its parental node. There is therefore
no missing data in the ARG; all missing bases are ``hard called''
at attachment time.
Note that this approach is equivalent to
using state-of-the-art imputation
methods~\citep[e.g.,][]{Browning2018-nk,Delaneau2019-wl} with a reference
panel consisting of all sequences in the ARG,
since these methods are also based on the LS model.
% Could say more here, but let's leave it.
Evaluating the accuracy of missing data imputation
using \texttt{sc2ts} is an important facet of future work.
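
To see why a constant emission probability for missing data does not affect
the choice of copying path, consider the following sketch. The notation is
introduced here for illustration only and does not necessarily match that of
Section~\ref{sec:ls}. Let $h = (h_1, \dots, h_m)$ be a copying path for a focal
sequence $s = (s_1, \dots, s_m)$, let $e_\ell(h_\ell)$ denote the emission
probability at site $\ell$, let $t_\ell(h_{\ell-1}, h_\ell)$ denote the
transition probability, and let $\mathcal{O}$ be the set of sites at which $s$
is observed. Setting $e_\ell(h_\ell) = c$ for every $\ell \notin \mathcal{O}$
gives
\begin{equation*}
P(s, h) = \prod_{\ell=1}^{m} t_\ell(h_{\ell-1}, h_\ell) \, e_\ell(h_\ell)
= c^{\,m - |\mathcal{O}|} \prod_{\ell=1}^{m} t_\ell(h_{\ell-1}, h_\ell)
\prod_{\ell \in \mathcal{O}} e_\ell(h_\ell),
\end{equation*}
where $t_1(h_0, h_1)$ is read as the initial probability of $h_1$.
The factor $c^{\,m - |\mathcal{O}|}$ is identical for every path, so the
maximising path is the same for any constant $c > 0$, and $c = 1$ is simply
the most convenient choice.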

In \texttt{sc2ts}, we use the efficient ARG-based implementation of the
LS Viterbi algorithm from \texttt{tsinfer} \citep{Kelleher2019-ba} to find
the most likely copying path for each sample sequence
among all sequences (sampled and inferred) in the current ARG.
In the majority of cases, with non-recombinant sample sequences,
the most likely solution is to copy from one
of the nodes in the ARG that minimises the number of mutations required
to insert the focal sequence. Importantly, because the reference panel here consists of
\emph{every} node in the ARG, we can match to either older sample
sequences or internal nodes representing inferred ancestral sequences
(see subsequent subsections for details about how these are added).
Thus, when no recombination is present, the LS Viterbi algorithm is
implementing a version of parsimony, in which we are guaranteed to
find a sequence that minimises the number of additional mutations required
to incorporate a newly-added sample into the ARG.
Recombination is then inferred when the most likely solution to the LS
HMM is to copy from more than one ARG node along the genome, for a
given sample sequence.

The Viterbi algorithm enables us to find a path through
the reference panel from among the $n^m$ paths that is provably at the
optimum, under the LS model.
We can solve this massive optimisation problem exactly because the ARG-based
implementation of the LS HMM used in \texttt{tsinfer}~\citep{Kelleher2019-ba}
scales approximately
logarithmically with reference panel size (as opposed to linearly,
for standard matrix-based approaches).
This efficient HMM algorithm is the main reason for \texttt{tsinfer}'s
scalability, and here allows us to find closely matching
sequences and recombination paths among millions of SARS-CoV-2
genomes exactly under a well-defined statistical model.
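
As a sketch of the underlying recursion, the Viterbi algorithm for the
standard matrix-based haploid formulation of the LS model can be written as
follows. The symbols $V_\ell(j)$, $e_\ell(j)$ and $r$ are illustrative
notation, and the exact parameterisation used in \texttt{sc2ts} may differ
(Section~\ref{sec:ls}). We take $n$ to be the number of reference sequences
and $m$ the number of sites, consistent with the $n^m$ paths above. Let
$V_\ell(j)$ be the probability of the most likely path ending in reference
sequence $j$ at site $\ell$, let $e_\ell(j)$ be the corresponding emission
probability, and let $r$ be the probability of recombination between adjacent
sites. Then
\begin{equation*}
V_\ell(j) = e_\ell(j) \max\left\{
\left(1 - r + \frac{r}{n}\right) V_{\ell-1}(j), \;
\frac{r}{n} \max_{k} V_{\ell-1}(k)
\right\}
\end{equation*}
for $\ell = 2, \dots, m$ and $j = 1, \dots, n$, with $V_1(j) = e_1(j) / n$.
Because $\max_k V_{\ell-1}(k)$ is shared by all $j$, evaluating this recursion
directly costs $O(nm)$ time, which corresponds to the linear scaling in
reference panel size of standard matrix-based approaches.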

It is important to note that the Viterbi algorithm only returns \emph{one of}
the copying paths that maximise the likelihood under the given mutation and
recombination parameters. There may be many such paths, from which we choose
one arbitrarily. Also, the present choice of using a single mismatch-ratio
parameter to control the likelihood of recombination vs mutation may lead to
relatively flat likelihood spaces where many different paths have equal
likelihood.
Established HMM methodology offers many ways
to reason about and explore the space of possible matches, which may be a
fruitful avenue for future work.
% Could talk about running the HMM in reverse here, but is there much
% point? It's confusing the main point which is that there lots of
% ways in which this could be done properly.
Examples include stochastic traceback~\citep[e.g.,][]{rasmussen2014genome}
through the collection of paths at the global optimum to glean further
information about the likelihood surface, and determine whether there are
downstream implications for our conclusions. Here, we have considered the
Viterbi algorithm to make statements about the most likely paths through the
data. The machinery used here can be modified to run the forwards and backwards
algorithms that determine the probability of observing a focal sequence,
integrating over all possible paths through the data, under the LS
model~\citep{Palmer2023-efficient}.
This presents an opportunity to estimate parameters of interest
under the LS model at pandemic scale.

\subsection{Tree inference from HMM daily sample clusters}
\label{sec:sample-cluster-tree-inference}
With tens of thousands of samples being added to the ARG per day,
there are often clusters of hundreds of sequences attaching to the same node
(or more generally, recombinant path;
see Section~\ref{sec:treatment_recombinants}).
While some of these samples will require no extra mutations
(because they are identical to the attachment node), in general there
will be complex patterns of shared mutations among the samples
reflecting their evolutionary relationships. A natural way to infer
these within-cluster evolutionary relationships is to use standard
tree-building algorithms.
We can infer a likely tree relating the
samples in a cluster independently of the other samples in a
daily batch and then attach the tree (and mutations)
to the ARG at the node identified by the HMM.

We currently use the UPGMA algorithm~\citep{Michener1957-tr}
as implemented in SciPy~\citep{Pauli2020-scipy} to build trees from sample
clusters, and then map mutations back to this topology using maximum parsimony.
We chose this approach mainly for simplicity, and because of the
speed and reliability of the SciPy implementation.
An issue with the UPGMA algorithm is that it generates a strictly
binary tree, creating internal nodes
supported by no informative site (i.e., having no mutation immediately
ancestral to them). We avoid such false precision by post-processing
to remove unsupported internal nodes, representing the relationship
between $k$ identical descendants of a node as a polytomy of size $k$.
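
For reference, the UPGMA update can be sketched as follows. We assume, purely
for illustration, that $d(i, j)$ is the number of sites at which samples $i$
and $j$ differ; the distance measure is not specified above. At each step the
two clusters $A$ and $B$ separated by the smallest distance are merged, and
the distance from the merged cluster to any other cluster $C$ is
\begin{equation*}
d(A \cup B, C) = \frac{|A| \, d(A, C) + |B| \, d(B, C)}{|A| + |B|}.
\end{equation*}
Each merge creates one internal node, so $k$ samples always yield $k - 1$
internal nodes, whether or not any mutation supports them. For example, three
identical samples produce two internal nodes carrying no mutations; after
post-processing these are removed and the three samples form a polytomy of
size three.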

There are well-known issues with using such a simple algorithm for inferring
evolutionary relationships~\citep{Felsenstein2004-inferring}.
Table~\ref{tab:args} shows that
this within-cluster tree building has a significant influence on the
overall ARG topology,
and therefore applying more sophisticated
tree building methods that keep track of the required mutations
(rather than inferring them post-hoc by parsimony) is a likely avenue
for improvements in overall inference quality.

\subsection{Parsimony-increasing heuristics}
\label{sec:parsimony-heuristics}
Attaching trees built from the clusters of samples that copy from
a particular node (or path of nodes for recombinants,
see Section~\ref{sec:treatment_recombinants}) under the
HMM is an inherently greedy strategy
and can produce inferences that are clearly unparsimonious.
The final step in adding a daily batch of samples to the ARG
is therefore to perform some local updates that target specific
types of parsimony violations in the just-updated regions of the
ARG. There are currently two parsimony-increasing operations
applied, which we refer to as ``mutation collapsing'' and ``reversion
pushing'' (Figure~\ref{fig:overview_sc2ts}D, E).

Given a newly attached node, mutation collapsing inspects its siblings from
previous sample days to check if any of them share (a subset of) the mutations that
it carries. If so, we increase the overall parsimony of the inference
by creating a new node representing the ancestor that carried
those shared mutations and make that new node the parent of the
siblings carrying those shared mutations (Figure \ref{fig:overview_sc2ts}D). The patterns of
shared mutations between siblings can be complex, and the current
implementation uses a simple greedy strategy for choosing
the particular mutations to collapse.
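
The saving from a single collapse can be quantified with a simple count
(the notation is for illustration only). Suppose a newly attached node $v$ and
a sibling $w$ carry mutation sets $M_v$ and $M_w$ on the edges to their common
parent, with shared mutations $S = M_v \cap M_w$. Before collapsing, these two
edges carry $|M_v| + |M_w|$ mutations. After inserting a new node carrying $S$
as the parent of $v$ and $w$, the count becomes
\begin{equation*}
|S| + |M_v \setminus S| + |M_w \setminus S| = |M_v| + |M_w| - |S|,
\end{equation*}
a saving of $|S|$ mutations at the cost of one additional node.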

The reversion push operation inspects a newly added node to see
if any of its mutations are ``immediate reversions''; that is,
are reversions of a mutation that occurred on the new node's
immediate parent. We increase the overall parsimony of the
inference by ``pushing in'' a new node which descends from the
original parent, and carries all its mutations except those
causing the reversions on the newly added node (Figure \ref{fig:overview_sc2ts}E).

Table~\ref{tab:args} shows that nodes generated by
these operations constitute
roughly the same fraction of the total in both the Long and Wide ARGs,
and contribute significantly to the overall topology.
These nodes are also being chosen by the
LS HMM as likely choices of parent (data not shown),
demonstrating that the heuristics are successfully
capturing features of real sequences.
However, they are both simple
greedy operations, just examining the local parts of the
ARG topology affected by newly added samples. Because the inferred
ARGs still contain a large number of reversion mutations which
are likely to be mostly artefactual (Section~\ref{sec:mutation_spectrum}),
it is clear that there
is room for improvement and that further parsimony-increasing
heuristics
(e.g., resolving reversions beyond those on immediately adjacent edges)
would likely be of benefit.

\subsection{Treatment of recombinants}
\label{sec:treatment_recombinants}
A sample sequence is designated as a recombinant if the most likely
path inferred by the LS HMM for that sample contains at least
one switch between parents. Recombinant sequences are mostly treated
identically to non-recombinants, as we simply need to reason about
a path of parent nodes along genome intervals rather than a
single parent over the whole genome, which is naturally handled by
the succinct tree sequence data structure and \texttt{tskit} library
(Section~\ref{sec:args}). To facilitate
analysis and to help understand the robustness of recombinants
we perform some additional steps in \texttt{sc2ts}.

The LS HMM may infer identical paths and patterns of mutations
for multiple samples in a daily batch, and so we
create a ``recombination'' node (marked with a specific ``flags'' value)
for each distinct recombinant. (This node is not strictly necessary
but makes it convenient to find recombinants for subsequent analysis.)
Variation within a cluster
of recombinant sequences is handled in the same way as non-recombinants
(see Section~\ref{sec:sample-cluster-tree-inference} above for details).
When the Viterbi algorithm implementation used by the LS HMM infers
recombinant ancestry for a given genome, the point at which inheritance
switches from one parent node to another is the last possible
position. The left-most extent of the breakpoint interval is derived
by sequence comparison between the parents, as described in
Section~\ref{sec:breakpoint_intervals}.

For a particular recombination node in an ARG, a breakpoint
is defined as the location at which inheritance switches from one parent
to another.
In the \texttt{tskit} encoding (see Section~\ref{sec:args}) inheritance between
nodes is defined by edges $(\ell, r, p, c)$, which state that child node $c$
inherits from parent node $p$
over the half-closed genome interval $[\ell, r)$.
For simplicity, suppose that a recombination node $u$ inherits from two
parents $p_1$ and $p_2$ with a breakpoint of $x$.
In the ARG, this is defined by two edges
$(0, x, p_1, u)$ and $(x, L, p_2, u)$ where $L$ is the length of the genome.
Since inheritance intervals are half-closed,
$u$ inherits all positions up to $x$ (exclusive) from parent $p_1$
and all positions from $x$ (inclusive) to the end of the genome from parent
$p_2$. We then define the breakpoint \emph{interval} $[b_\ell, b_r)$ as the
half-closed interval defining the range of possible values for $x$, such
that $b_\ell \leq x < b_r$.
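
As a toy example (the numbers are hypothetical), take $L = 10$ and $x = 4$.
The edges $(0, 4, p_1, u)$ and $(4, 10, p_2, u)$ then state that $u$ inherits
positions $0, 1, 2, 3$ from $p_1$ and positions $4, \dots, 9$ from $p_2$.
With integer genome coordinates, a breakpoint interval constrains $x$ as
\begin{equation*}
[b_\ell, b_r) = [2, 6) \quad \Longrightarrow \quad x \in \{2, 3, 4, 5\},
\end{equation*}
so that $x = 4$ is one of four breakpoint positions compatible with this
interval.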

The LS HMM machinery and the interpretation of inferred recombinant paths and
breakpoint intervals are central parts of \texttt{sc2ts}, and there are many
ways to extend and improve them. For example, the current parameterization of using
a single ``mismatch ratio'' is very simplistic (Section~\ref{sec:ls}) and likely
results in a flat likelihood space where many recombinant paths have equal
probability of being chosen.
Post-processing the match results to produce
more parsimonious breakpoints may also be a worthwhile avenue for development.
In particular, we may choose to insert breakpoints for the ARG that are chosen
from within the possible interval, rather than the current approach of taking
the rightmost value. Cases where we have more than two parents may either be
the ``stacking'' of multiple recombination events or instances where
the HMM has chosen to switch to a third sequence rather than back to
the original parent (where this is equally parsimonious). Many putative
recombination events, however, will represent poor quality data, where
a recombinant copying path happens to be a more likely explanation
of a highly divergent sequence.
A thorough analysis of the behaviour of the LS model in the context
of a pandemic-scale ARG may lead to significant improvements in our
ability to identify recombinants and to filter poor quality data.

\subsection{Node dating}
\label{sec:node_dating}
The approach to assigning a date to nodes in \texttt{sc2ts} is currently
ad-hoc, and the inferred timing of events from the ARGs reported
here should be treated with caution (e.g., Figure~\ref{fig:recomb_mrcas}).
Sample nodes (those corresponding to observed sample sequences) are the
most accurately dated, as we use the reported collection date for
these nodes. These are not entirely accurate, but our data filtering
criteria should remove the most egregious errors (see
Section~\ref{sec:filtering_time_travellers}). Other nodes in the ARG
are dated by splitting the time between the attached samples
and the chosen parent nodes equally (in the case of
trees inferred from daily sample clusters,
Section~\ref{sec:sample-cluster-tree-inference}) or by adding arbitrary
small values when creating new nodes using parsimony rules
(Section~\ref{sec:parsimony-heuristics}). Because the ARGs for SARS-CoV-2 are
very treelike, with recombination nodes constituting a tiny fraction
of the overall topology (Table~\ref{tab:args}), existing methods
\citep[e.g.,][]{to2016fast} could likely be adapted to accurately
date the vast majority of the nodes.

\subsection{Data preprocessing}
\label{sec:data_preprocessing}
The findings of this study are
based on sequences and metadata available on GISAID (\url{https://gisaid.org/})
up to 2022-08-22 and accessible at
\url{https://doi.org/10.55876/gis8.230329cd}.
We~removed sequences
if they had ambiguous collection dates, were collected before 2020-01-01
or were isolated from a non-human host.
We aligned sequences to the Wuhan-Hu-1/2019 reference sequence
(GenBank: MN908947.3) using Nextclade v2.3.0~\citep{Aksamentov2021-hj} (dataset tag
2022-07-26T12:00:00Z). We also excluded sequences if they had a
``bad'' quality control status
in any of the four Nextclade columns (``qc.missingData.status'',
``qc.mixedSites.status'', ``qc.frameShifts.status'' and ``qc.stopCodons.status'').

We encode ambiguous nucleotide letters (i.e.,
not A, C, G, T, or a gap) in the pairwise genome alignments as missing data
(N). Problematic bases in the alignments, which had two or more Ns or
gaps within a distance of seven bases, are masked as missing data following
the approach used in the ``faToVCF'' tool used by
UShER~\citep{Turakhia2021-ur}.
Sites that are masked by this process are
treated as missing data by the LS HMM (Section~\ref{sec:ls}).
In addition, we entirely exclude 481 problematic sites flagged as prone to
sequencing errors or as highly homoplasic
(\url{https://github.com/W-L/ProblematicSites_SARS-CoV2/},
accessed 2022-09-22).

Although the current masking strategy is simple and robust,
there are significant disadvantages because it excludes, for example, any
deletions of length greater than one base. Exploring more sophisticated
masking strategies is an important route for future improvements.

\subsection{Filtering ``time travellers''}
\label{sec:filtering_time_travellers}
A major source of error in early versions of \texttt{sc2ts} was
the existence of ``time traveller'' sequences: those with erroneously
early collection dates.
For example, an Alpha sample purportedly collected in 2020 from the United States,
before Alpha appeared in the United Kingdom (USA/MN-Mayo-1563/2020),
produced significant topological distortions in
inferred ARGs.
Hence, to exclude such potential ``time travellers'' we employ two filters.

The first filter is a simple threshold on
the time delay between the collection date and submission date.
After some preliminary analysis we settled on a maximum submission delay
of 30 days when building the ARGs described here.
The second filter is to remove any sequence with a collection date that
pre-dates the
time to the most recent common ancestor (tMRCA) of its corresponding clade.
We obtained the tMRCA for each clade from a Nextstrain GISAID reference tree
(downloaded on 2022-08-22), and we used the lower bound of the 95\%
confidence interval of the tMRCA of each clade as the minimum date cut-off.
This excludes a further 618 samples not covered by the maximum submission
delay filter.
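
Stated compactly, with notation introduced here for illustration, let $c_i$
and $s_i$ be the collection and submission dates of sequence $i$, and let
$\tau_{k(i)}$ be the lower bound of the 95\% confidence interval of the tMRCA
for the clade $k(i)$ to which sequence $i$ is assigned. A sequence passes both
filters when
\begin{equation*}
s_i - c_i \leq 30 \text{ days} \quad \text{and} \quad c_i \geq \tau_{k(i)}.
\end{equation*}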

As the results of \texttt{sc2ts} are sensitive to the existence of
time-travellers, an important aspect of future work is to find better ways
to identify them. One possibility is to use the LS HMM itself to flag
overly divergent sequences and to exclude them from attachment to the ARG.
We might also estimate collection dates by adding these potential time travellers
back to the ARG, allowing an automated assessment of collection date discrepancies.

\subsection{Imputation of Pango lineage for non-sample nodes}

\begin{figure} \centering
\includegraphics[width=.7\textwidth]{figures/imputation.pdf}
\caption{\label{fig:imputation}A schematic of the iterative procedure to impute
Pango lineage for inserted, non-sample nodes. Here, three nodes (question
marks) have unknown Pango lineage (A). The lineage for node N1 can be directly
copied from its parent (L2), which has an identical sequence. The lineage for node N2 must be inferred from
that of the parent (L2) plus the lineage-defining mutation (red X)
on the connecting edge. The lineage for node N3 can then be copied from
that of node N2.}
\end{figure}
While the Pango lineages of sample nodes are imported directly from
GISAID metadata, the lineage status of internal nodes inserted into
the \texttt{sc2ts} ARG must be imputed. We do this using
the list of lineage-defining mutations (based on 90\% consensus
of the sequences analysed) from the
COVID-CG website \citep[][\url{https://covidcg.org/};
accessed on 2022-11-04]{Chen2021-zc}.

For a given non-sample node $u$, if the Pango lineage of its parent or
one of its children is already known, and there are no lineage-defining mutations
on the connecting edge, then $u$ copies this Pango lineage exactly.
Otherwise, the lineage for $u$ is inferred by
matching its full set of mutations against the COVID-CG list.
We apply these two steps to the internal nodes of the ARG iteratively,
as illustrated in Figure~\ref{fig:imputation}, until all internal nodes
are assigned a lineage (where possible---note that sometimes a lineage cannot
be assigned to the children of a recombination node).

This method is fast, as the lineages for most of the internal nodes can be
imputed by copying from the surrounding nodes
(Wide ARG: 80\% of nodes,
Long ARG: 66\%),
and significantly more efficient than extracting
the haplotypes for each internal node and using existing Pangolin assignment tools
\citep{OToole2021-assignment}. The
accuracy depends on the quality of the list of lineage-defining mutations,
as well as the source of lineage designation for the sample nodes: we obtain
slightly different results when using those recorded on GISAID (which uses
pangoLEARN), and those assigned by Nextclade. To gauge the accuracy of
imputation, we have used our method to re-impute the lineage designations of each
sample node using the surrounding information; this results in
99\% of nodes being assigned the same lineage as per the source metadata in the
Wide ARG, and 98\% in the Long ARG.

\section{Acknowledgements}
We gratefully acknowledge all data contributors, i.e., the Authors and their
Originating laboratories responsible for obtaining the specimens, and their
Submitting laboratories for generating the genetic sequence and metadata and
sharing via the GISAID Initiative, on which this research is based.
% Not using these now:
% Also, we
% thank Dr. Morag Graham (the National Microbiology Laboratory, Winnipeg, Canada)
% for kindly providing the genomic coordinates of the breakpoint sequence motifs.
SHZ is supported by the Janssen-Oxford Translational Genomics Fellowship. JK,
YW, and BJ are supported by the Robertson Foundation. AI is supported by the Wellcome Trust.

Computation used the Oxford Biomedical Research Computing (BMRC) facility, a
joint development between the Wellcome Centre for Human Genetics and the Big
Data Institute supported by Health Data Research UK and the NIHR Oxford
Biomedical Research Centre. The views expressed are those of the author(s) and
not necessarily those of the NHS, the NIHR or the Department of Health.

\section{Data Availability}
\label{sec-data-availability}
The source code for \texttt{sc2ts}, together with the notebooks and code used to produce the
results described here, is available on GitHub:
\begin{itemize}
\item \url{https://github.com/jeromekelleher/sc2ts/}
\item \url{https://github.com/jeromekelleher/sc2ts-paper/}
\end{itemize}

Details of the GISAID data used are available at
\url{https://doi.org/10.55876/gis8.230329cd} and included in the
Supplemental Table (GISAID EPI SET PDF).

The inferred ARGs described here are available on request to those with
the appropriate GISAID data access.

The mapping of \texttt{tskit} IDs to strain and EPI\_ISL identifiers for the
subgraph plots in Supplementary Figures \ref{fig:pango_XA_gisaid_graph}, \ref{fig:pango_XAG_gisaid_graph}, \ref{fig:pango_XD_gisaid_graph} and \ref{fig:pango_XB_gisaid_graph} is at
\url{https://github.com/jeromekelleher/sc2ts-paper/blob/main/data/Subgraph_sample_mapping.txt}.

% \bibliographystyle{abbrvnat}
\bibliographystyle{refstyle}
\bibliography{paper}

\clearpage
\renewcommand\thefigure{S\arabic{figure}}
\renewcommand{\theHfigure}{S\arabic{figure}}
\setcounter{figure}{0}
\renewcommand\thetable{S\arabic{table}}
\renewcommand\theHtable{S\arabic{table}}
\setcounter{table}{0}
\section*{Supplementary Material}

\begin{figure}[h] \centering
\includegraphics[width=\textwidth]{figures/supp_cophylogeny_long.pdf}
\caption{\label{fig:cophylogeny_long}Tanglegram equivalent to that in Figure~\ref{fig:cophylogeny},
but for the Long ARG (i.e., subsampled to mid-2022).}
\end{figure}

% FIXME there's too much vspace between the rows here. Probably best to redo
% the formatting.
\begin{table} \centering
\begin{tabular}{l|c|c|c|c|c} \hline
\multicolumn{1}{c}{} & \multicolumn{3}{c}{\textbf{Jackson et al. (2021)}} &
\multicolumn{2}{c}{\textbf{sc2ts (Wide ARG)}} \\ \hline
\textbf{Sample} &
\textbf{Group} & \textbf{Parents} & \thead{Breakpoint \\
interval(s)} &
\textbf{Parents} & \thead{Breakpoint \\ interval(s)} \\
\hline ALDP-11CF93B & A &
    \thead{B.1.177 \\ Alpha} & 21,256--21,615 &
    \thead{B.1.177.18 \\ Alpha} & 21,256--22,228 \\
ALDP-125C4D7 & A &
    \thead{B.1.177 \\ Alpha} & 21,256--21,615 &
    \thead{B.1.177.18 \\ Alpha} & 21,256--22,228 \\
ALDP-130BB95 & A &
    \thead{B.1.177 \\ Alpha} & 21,256--21,615 &
    \thead{B.1.177.18 \\ Alpha} & 21,256--22,228 \\
LIVE-DFCFFE & A &
    \thead{B.1.177 \\ Alpha} & 18,999--20,296 &
    \thead{B.1.177.18 \\ Alpha} & 21,256--22,228 \\
QEUH-CCCB30 & B &
    \thead{B.1.36.28 \\ Alpha} & 6,529--6,955 &
    \thead{B.1.36 \\ Alpha} & 6,529--6,955 \\
QEUH-CD0F1F & B &
    \thead{B.1.36.28 \\ Alpha} & 6,529--6,955 &
    \thead{B.1.36 \\ Alpha} & 6,529--6,955 \\
MILK-1166F52 & C &
    \thead{Alpha \\ B.1.221.1} & 25,997--27,443 &
    \thead{Alpha \\ B.1.221} & 25,997--27,973 \\
MILK-11C95A6 & C &
    \thead{Alpha \\ B.1.221.1} & 25,997--27,443 &
    \thead{Alpha \\ B.1.221} & 25,997--27,973 \\
QEUH-109B25C & C &
    \thead{Alpha \\ B.1.221.1} & 25,997--27,443 &
    \thead{Alpha \\ B.1.221} & 25,997--27,973 \\
MILK-126FE1F & D &
    \thead{B.1.36.39 \\ Alpha} & 20,704--23,064 &
    \thead{B.1.36.39 \\ Alpha} & 22,445--23,064 \\
RAND-12671E1 & D &
    \thead{B.1.36.39 \\ Alpha} & 20,704--23,064 &
    \thead{B.1.36.39 \\ Alpha} & 22,445--23,064 \\
RAND-128FA33 & D &
    \thead{B.1.36.39 \\ Alpha} & 20,704--23,064 &
    \thead{B.1.36.39 \\ Alpha} & 22,445--23,064 \\
CAMC-CBA018 & n/a &
    \thead{B.1.177 \\ Alpha} & 20,390--21,256 &
    \thead{B.1.177 \\ Alpha} & 17,616--21,256 \\
CAMC-CB7AB3 & n/a &
    \thead{Alpha \\ B.1.177 \\ Alpha} & \thead{3,268--4,476 \\ 20,390--21,256} &
    \thead{Alpha \\ B.1.177 \\ Alpha} & \thead{3,268--5,389\\17,616--21,256} \\
MILK-103C712 & n/a &
    \thead{B.1.177.17 \\ Alpha} & \thead{409--446 \\ 26,802--27,878} &
    n/a & n/a \\
QEUH-1067DEF & n/a &
    \thead{Alpha \\ B.1.177.9} & 10,524--10,871 &
    \thead{Alpha \\ B.1.177} & 7,729--10,871 \\ \hline
\end{tabular}
\caption{\label{tab:jackson_supplement}
Recombinant sequences involving the
Alpha (B.1.1.7) variant reported by \cite{Jackson2021-ik} have recombinant
ancestry in the Wide ARG. The breakpoint intervals and Pango lineage
assignments of the parents were taken from Table 2 (3SEQ results) of Jackson et
al., except the Pango lineage assignments of the parents of group B
recombinants, which were taken from Table 1 (motif-based results).
3SEQ interval coordinate modifications are described in the caption for
Table~\ref{tab:jackson}.}
\end{table}

\begin{figure}
\centering
\includegraphics[width=\textwidth]{figures/long_arg_recombination_intervals.pdf}
\caption{\label{fig:long_arg_breakpoint_distribution}
Distribution of recombination breakpoints and mutations along the genome in
the Long ARG.
Top panel shows the intervals for 851 breakpoints associated
with 763 recombination nodes with at least two descending samples, plotted along the genome
as line segments (coloured by interval width).
Other details as described in Figure~\ref{fig:breakpoint-distribution}.}
\end{figure}


\begin{figure} \centering
\includegraphics[width=\textwidth]{figures/supp_recombination_node_mrcas.pdf}
\caption{\label{fig:recomb_mrcas_voc_breakdown}  Divergence between parent
lineages for recombination events within and among different VoC categories.
There are 78 Alpha+Alpha recombination breakpoints corresponding to 75 recombination nodes
(25 breakpoints / 24 nodes with $\geq5$ descendant samples).
148 breakpoints from 142 nodes are Delta+Delta recombinations (45 breakpoints / 45 nodes with  $\geq5$ descendants),
and 148 breakpoints from 142 nodes are Omicron+Omicron (71 breakpoints / 68 nodes with $\geq5$ descendants).
The equivalent figures for Alpha+Delta are 9 / 8 (3 / 2),
for Alpha+Omicron are 2 / 1 (0 / 0),
and for Delta+Omicron are 20 / 13 (6 / 3).
Note only recombination breakpoints involving lineages
classified into Alpha, Delta, and Omicron VoC categories are plotted above:
all other breakpoints are omitted.
}
\end{figure}

\begin{sidewaystable}
% \begin{footnotesize}
\centering
\begin{tabular}{p{1cm}p{1.2cm}p{4.2cm}lll}
\toprule
Focal Pango  & Num origins & Num focal samples (further split by origin, \textdagger=nested) & Official Pango parents & \textbf{Main clade}: sc2ts parents & \textbf{Main clade}: additional descendants \\
\midrule
\bfseries XA & 1 & \textbf{5} & B.1.1.7 + B.1.177 & B.1.1.7 + B.1.177.18 &  \\
\bfseries XF & 1 & \textbf{2} & B.1.617.2* + BA.1* & AY.4 + BA.1 &  \\
\bfseries XG & 1 & \textbf{32} & BA.1* + BA.2* & BA.1.17 + BA.2 & XAB: 1/48 \\
\bfseries XH & 1 & \textbf{11} & BA.1* + BA.2* & BA.1.20 + BA.2.9 & XAF: 34/35, B.1.1.529: 2, XE: 3/163 \\
\bfseries XK & 1 & \textbf{3} & BA.1* + BA.2* & BA.1.1.1 + BA.2 &  \\
\bfseries XL & 1 & \textbf{10} & BA.1* + BA.2* & BA.1.17.2 + BA.2 & XAB: 1/48, XU: 1/3 \\
\bfseries XR & 1 & \textbf{8} & BA.1.1* + BA.2* & BA.1.1 + BA.2 & XQ: 9/12, XAB: 2/48 \\
\bfseries XS & 1 & \textbf{4} & B.1.617.2* + BA.1.1* & AY.36 + BA.1.1 &  \\
\bfseries XT & 1 & \textbf{1} & BA.1* + BA.2* & BA.2.23 + Unknown & BA.2.23: 1 \\
\bfseries XV & 1 & \textbf{1} & BA.1* + BA.2* & BA.1.1 + BA.2 + Unknown & BA.2: 8 \\
\bfseries XW & 1 & \textbf{11} & BA.1* + BA.2* & BA.1.1.15 + BA.2 & XN: 2/13 \\
\bfseries XY & 1 & \textbf{6} & BA.1* + BA.2* & BA.1.1 + BA.2 & XAF: 1/35 \\
\bfseries XAA & 1 & \textbf{8} & BA.1* + BA.2* & BA.1 + BA.2.9 & XAB: 38/48, XAG: 17/17, XU: 1/3, XQ: 2/12 \\
\bfseries XAC & 1 & \textbf{27} & BA.1* + BA.2* & BA.1.17.2 + BA.2.3 &  \\
\bfseries XAE & 1 & \textbf{9} & BA.1* + BA.2* & BA.1 + BA.2 &  \\
\bfseries XAG & 1 & \textbf{17} & BA.1* + BA.2* & BA.1 + BA.2.9 & XAB: 38/48, XAA: 8/8, XU: 1/3, XQ: 2/12 \\
\bfseries XB & 2 & 57 (\textbf{57}, 1)\textdagger & B.1.631 + B.1.634 & B.1 + B.1.627 & B.1.634: 3, B.1.631: 7 \\
\bfseries XC & 2 & 3 (1, \textbf{2}) & AY.29 + B.1.1.7 & AY.103 + B.1.1.7 &  \\
\bfseries XD & 2 & 4 (1, \textbf{3}) & B.1.617.2* + BA.1* & AY.4 + BA.1.15 &  \\
\bfseries XJ & 2 & 2 (1, 1) & BA.1* + BA.2* & No main clade & No main clade \\
\bfseries XAD & 2 & 5 (1, \textbf{4}) & BA.1* + BA.2* & BA.1.1 + BA.2 &  \\
\bfseries XAF & 2 & 35 (\textbf{34}, 1) & BA.1* + BA.2* & BA.1.20 + BA.2.9 & XH: 11/11, B.1.1.529: 2, XE: 3/163 \\
\bfseries XAH & 2 & 12 (\textbf{10}, 2) & BA.1* + BA.2* & BA.1 + BA.2.10 & XZ: 61/66, XAD: 1/5 \\
\bfseries XQ & 3 & 12 (2, \textbf{9}, 1) & BA.1.1* + BA.2* & BA.1.1 + BA.2 & XR: 8/8, XAB: 2/48 \\
\bfseries XU & 3 & 3 (1, 1, 1) & BA.1* + BA.2* & No main clade & No main clade \\
\bfseries XM & 4 & 47 (2, 1, \textbf{40}, 4) & BA.1.1* + BA.2* & BA.1.1 + BA.2 &  \\
\bfseries XN & 4 & 13 (2, 2, \textbf{7}, 2) & BA.1* + BA.2* & BA.1 + BA.2 &  \\
\bfseries XAJ & 4 & 5 (1, 1, \textbf{3}, 1)\textdagger & BA.2.12.1* + BA.4* & Unknown & BA.5: 1 \\
\bfseries XE & 5 & 163 (3, \textbf{155}, 1, 3, 2)\textdagger & BA.1* + BA.2* & BA.1.17.2 + BA.2 &  \\
\bfseries XZ & 5 & 66 (1, 1, \textbf{61}, 2, 1) & BA.1* + BA.2* & BA.1 + BA.2.10 & XAH: 10/12, XAD: 1/5 \\
\bfseries XAB & 8 & 48 (1, \textbf{38}, 1, 2, 2, 2, 18, 2)\textdagger & BA.1* + BA.2* & BA.1 + BA.2.9 & XAA: 8/8, XAG: 17/17, XU: 1/3, XQ: 2/12 \\
\bottomrule
\end{tabular}
\caption{\label{tab:pango-recombinants}
Summary of the Pango X-lineages in the Long ARG (excluding XP and XAK whose samples are entirely filtered out, see text). In cases of multiple origins, most X-lineages have a single ``main clade'' (in bold).
Sc2ts inferred parents for the main clade are based on Nextclade designations, imputed where necessary.
Official parents are taken from the Pango designation alias key; an asterisk denotes the named lineage together with all of its descendant lineages. Note that B.1.617.2 is the origin of the Delta VoC, which includes all AY.* classes. Additional descendants within the main clade are summarised giving their Nextclade Pango designation and clade count / total ARG count (the latter being omitted for non-recombinant designations).
}
\end{sidewaystable}

\begin{figure} \centering
\includegraphics[width=0.5\textwidth]{figures/Pango_XA_gisaid_large_graph.pdf}
\caption{\label{fig:pango_XA_gisaid_graph}
Detailed version of Figure~\ref{fig:pango-simple-origin-graph}A. All single nucleotide mutations are listed
with the inherited nucleotide state, followed by the reference genome position, followed by the derived
nucleotide state (after mutation). Recurrent mutations (see Section~\ref{sec:args}) are highlighted in bold,
with reversions indicated by lowercase nucleotide letters. Sample nodes are shown with \texttt{tskit} IDs,
which can be mapped to GISAID EPI ISL identifiers and strain names using the supplementary file \protect\path{Subgraph_sample_mapping.txt}. In contrast to
Figure~\ref{fig:pango-simple-origin-graph}A, Pango lineages shown here are those assigned by GISAID rather
than Nextclade; however, in the specific case of XA, Nextclade and GISAID agree exactly on the lineage designations.
}
\end{figure}

\begin{sidewaysfigure} \centering
\includegraphics[width=\textwidth]{figures/Pango_XAG_gisaid_large_graph.pdf}
\caption{\label{fig:pango_XAG_gisaid_graph}
Detailed version of Figure~\ref{fig:pango-simple-origin-graph}B, with node and mutation labels as in
Figure~\ref{fig:pango_XA_gisaid_graph} (see the supplementary file \protect\path{Subgraph_sample_mapping.txt}
to map \texttt{tskit} IDs to GISAID EPI ISL identifiers and strain names).
Note that the GISAID Pango designations assign two additional samples to XAG, which are labelled as XAB
by Nextclade in Figure~\ref{fig:pango-simple-origin-graph}B; with these included, XAG is fully monophyletic.
However, unlike Nextclade, the GISAID designations also mark many of the samples in this subgraph
as non-recombinants (designating them BA.2), and in general we find that Nextclade assignments
agree more closely with our ARG structure than GISAID assignments do.
}
\end{sidewaysfigure}

\begin{figure} \centering
\includegraphics[width=0.6\textwidth]{figures/Pango_XD_gisaid_large_graph.pdf}
\caption{\label{fig:pango_XD_gisaid_graph}
Detailed version of Figure~\ref{fig:pango-simple-origin-graph}C, with node and mutation labels as in
Figure~\ref{fig:pango_XA_gisaid_graph} (see the supplementary file \protect\path{Subgraph_sample_mapping.txt}
to map \texttt{tskit} IDs to GISAID EPI ISL identifiers and strain names).
Note that GISAID does not designate any nodes as XD in the Long ARG, hence no recombinant Pango lineages
are marked in this plot. From inspection of the samples, we believe the GISAID designations to be erroneous
in this case.
}
\end{figure}

\begin{sidewaysfigure} \centering
\includegraphics[width=\textwidth]{figures/Pango_XB_gisaid_large_graph.pdf}
\caption{\label{fig:pango_XB_gisaid_graph}
Detailed version of Figure~\ref{fig:complex_origins_graph}, with node and mutation labels as in
Figure~\ref{fig:pango_XA_gisaid_graph} (see the supplementary file \protect\path{Subgraph_sample_mapping.txt}
to map \texttt{tskit} IDs to GISAID EPI ISL identifiers and strain names).
The GISAID XB designations agree with the Nextclade ones except in two cases: an unplotted
singleton recombinant nested within the main grouping, and the node \texttt{tsk285180}, which is marked BA.1
in this plot but labelled XB by Nextclade.
}
\end{sidewaysfigure}

\begin{figure} \centering
\includegraphics[width=1\textwidth]{figures/false_positive_top2_nxcld_large_graph.pdf}
\caption{\label{fig:false_positives}
Subgraph of the Long ARG, focusing on two likely false positive
recombination nodes at the start of the Delta wave
(``focal'' nodes, in red, corresponding to rows A and B in
Table~\ref{tab:false_positive}).
The path to node \texttt{tsk261771} has been expanded:
this node (in gold) is the ancestor of ${\sim}89.8\%$ of Delta samples in the Long ARG,
and represents a large polytomy with 107 immediate children. Most of the remaining Delta samples
(8.4\% of all Delta samples) are descendants of the node \texttt{tsk232088} on the far right,
a sibling of the earliest focal recombination node.
The recombination path to an early B.1 MRCA, which we suspect is incorrect, is also shown.
Note the large amount of additional recombination (black nodes) among close
descendants of the focal nodes.
% Why say this here and not at for the other subgraph figs?
%See supplementary file \protect\path{Subgraph_sample_mapping.txt} to  map tskID to EPI\_ISL and strain name.
}
\end{figure}


\end{document}