forked from jeromekelleher/sc2ts-paper
-
Notifications
You must be signed in to change notification settings - Fork 0
/
paper.tex
2352 lines (2208 loc) · 129 KB
/
paper.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
\documentclass{article}
\usepackage[round]{natbib}
\usepackage[english]{babel}
\usepackage[letterpaper,top=2cm,bottom=2cm,left=3cm,right=3cm,marginparwidth=1.75cm]{geometry}
\usepackage{booktabs}
\usepackage{tabularx}
\usepackage{authblk}
\usepackage[symbol]{footmisc}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{graphicx}
\usepackage{makecell}
\usepackage{url}
% For rotating figures, tables, etc.
% including their captions - only for supplementary figures
\usepackage{rotating}
% JK: turning this off for the moment as I keep clicking through on links
% to the bibliography while reading the text and it's intensely annoying.
% Can reinstate when we're ready to preprint
\usepackage[hidelinks]{hyperref}
% sets colours for notes to each other in the text
\usepackage{xcolor}
\newcommand{\sally}[1]{\textcolor{red}{#1}}
\newcommand{\red}[1]{\textcolor{red}{#1}}
\newcommand{\green}[1]{\textcolor{green}{#1}}
\newcommand{\blue}[1]{\textcolor{blue}{#1}}
\title{\vspace{-1.5em} \bf Towards Pandemic-Scale Ancestral Recombination Graphs of SARS-CoV-2}
\author[1]{Shing~H.~Zhan}
\author[2,3$\star$]{Anastasia~Ignatieva}
\author[1$\star$]{Yan~Wong}
\author[4]{Katherine~Eaton}
\author[1]{Benjamin~Jeffery}
\author[1]{Duncan~S.~Palmer}
\author[4]{Carmen~Lia~Murall}
\author[5]{Sarah~P.~Otto}
\author[1$\dagger$]{Jerome~Kelleher}
% \affil[1]{\normalsize Big Data Institute, Li Ka Shing Centre for Health
% Information and Discovery, University of Oxford, United Kingdom}
% \affil[2]{Department of Statistics, University of Oxford, United Kingdom}
% \affil[3]{School of Mathematics and Statistics, University of Glasgow, United Kingdom}
% \affil[4]{National Microbiology Laboratory, Public Health Agency of Canada, Canada}
% \affil[5]{Department of Zoology and Biodiversity Research Centre, University of British Columbia, Canada}
% \affil[$\star$]{Joint second author}
% \affil[$\dagger$]{Correspondence: [email protected]}
\affil[ ]{\mbox{}\vspace{-2.5em}}
\begin{document}
\maketitle
\setlength{\skip\footins}{1em}
\setlength{\footnotemargin}{0.5em}
\renewcommand{\thefootnote}{\arabic{footnote}}
\footnotetext[1]{Big Data Institute, Li Ka Shing Centre for Health
Information and Discovery, University of Oxford, United Kingdom}
\footnotetext[2]{Department of Statistics, University of Oxford, United Kingdom}
\footnotetext[3]{School of Mathematics and Statistics, University of Glasgow, United Kingdom}
\footnotetext[4]{National Microbiology Laboratory, Public Health Agency of Canada, Canada}
\footnotetext[5]{Department of Zoology and Biodiversity Research Centre, University of British Columbia, Canada}
\renewcommand{\thefootnote}{\fnsymbol{footnote}}
\footnotetext[1]{Joint second author}
\footnotetext[2]{Correspondence: [email protected]}
\vspace{-1em}
\begin{abstract}
Recombination is an ongoing and increasingly important feature of circulating
lineages of SARS-CoV-2, challenging how we represent the evolutionary history
of this virus and giving rise to new variants of potential public health
concern by combining transmission and immune evasion properties of different
lineages. Detection of new recombinant strains is challenging, with most
methods looking for breaks between sets of mutations that characterise distinct
lineages.
In addition, many basic approaches fundamental to the study of viral
evolution assume that recombination is negligible, in that a single
phylogenetic tree can represent the genetic ancestry of the
circulating strains. Here we present an initial version of
\texttt{sc2ts}, a method to automatically detect recombinants
in real time and to cohesively integrate them into a
genealogy in the form of an ancestral recombination graph (ARG),
which jointly records mutation, recombination and genetic
inheritance. We infer two ARGs under
different sampling strategies, and study their properties.
One contains 1.27 million sequences
sampled up to June 30, 2021, and the second is more sparsely sampled,
consisting of 657K sequences sampled up to June 30, 2022.
We find that both ARGs are
% better words?
highly consistent with
known features of SARS-CoV-2 evolution, recovering the basic
backbone phylogeny, mutational spectra, and recapitulating
details on the majority of known recombinant lineages.
Using the well-established and feature-rich \texttt{tskit} library,
the ARGs can also be stored concisely and processed efficiently
using standard Python tools. For example, the ARG for 1.27 million
sequences---encoding the inferred reticulate ancestry,
genetic variation, and extensive metadata---requires
58MB of storage,
and loads in less than a second.
%JK Not sure if we'd actually say it like this, but this is the
% basic gist I'd like to get accross[
The ability to fully integrate the effects of recombination into
downstream analyses, to quickly and automatically detect new recombinants,
and to utilise an efficient and convenient platform for computation
based on well-engineered technologies
makes \texttt{sc2ts} a promising approach.
\end{abstract}
\section{Introduction}
% Recombination is an important force in SARS-CoV-2, recombinants
% have arisen and they have spread to high frequencies
Recombination via template switching is a common feature
of the evolution of coronaviruses~\citep{Graham2010-xe,De_Klerk2022-tt},
including SARS-CoV-2
\citep{VanInsberghe2021-eu,Jackson2021-ik,Ignatieva2022-st}. By bringing
together mutations carried by different lineages, recombination plays an
important role in generating genetic diversity, with recombinant lineages
associated with adaptation to new host species and with the production of more
immune evasive variants~\citep{Graham2010-xe,De_Klerk2022-tt,Tamura2023-ab}.
Early in the COVID-19 pandemic, the levels of genetic diversity
were too low to enable the detection of distinctive recombinant strains.
By late 2020, however, the appearance and
spread of variants of concern (VoC), designated into classes such as Alpha and
Delta which harboured multiple characteristic mutations,
created the conditions required to detect
recombinant strains and their onward transmission~\citep{Jackson2021-ik}.
More recently, the high prevalence of Omicron,
with multiple co-circulating deeply divergent lineages (BA.1 to BA.5), has
accelerated the rate of coinfection and the potential for
recombination~\citep{Bal2022-hq}.
In early 2023, multiple recombinant lineages have successfully
established and spread to high frequency, and accounting for recombinant
ancestry is now essential in understanding the ongoing evolution of SARS-CoV-2.
% Recombination is hard to detect
Detecting recombination in SARS-CoV-2 is difficult and identifying new
recombinant strains is a time-consuming, manual
process~\citep{Smith2023-identifying}.
Most genomic surveys for SARS-CoV-2 recombinants search for mosaic genomes that
combine specific subsets of characteristic mutations from different lineages
\citep[e.g.,][]{VanInsberghe2021-eu,Jackson2021-ik,Wertheim2022-hj,Sekizuka2022-xz}
and as a result can only identify inter-lineage recombination events.
\cite{Turakhia2022-it} presented a
phylogeny-based approach (``RIPPLES'') to identify putative recombinants among
over ten million SARS-CoV-2 genomes, without pre-specifying sets of characteristic
mutations.
RIPPLES finds candidate recombinants by using an existing phylogeny
(built assuming no recombination) and finding potential recombinants
by scanning for branches containing many mutations.
It then determines if these candidates would be better explained by
recombination by exhaustively breaking each
sequence into segments and attempting to find more parsimonious placements for
each segment on the phylogeny. If such placements are found, the sequence is
identified as a putative recombinant. Although it enables rapid searches for
genomic evidence of recombinants, RIPPLES relies on a SARS-CoV-2 phylogeny that
accounts for only mutations, treats recombinants \textit{post hoc}, and is an
incomplete representation of the reticulate evolutionary history of SARS-CoV-2.
As noted by the authors, a \textit{post hoc} treatment of recombination is
possible when recombinant lineages are rare and leave few descendants. However,
the proliferation of recombinant lineages is making this increasingly
untenable; for example, more than half of the sequences sampled in February 2023
are from the recombinant strain XBB and its descendants~\citep{Chen2022-pz}. This also means that
future evolution of SARS-CoV-2 is likely to involve multiple sequential
recombination events on top of existing recombinant lineages, creating a highly
reticulated genealogy.
% PARA 3: but you can't use a phylogeny when there's recomb. Need a joint
% model of mutation and recomb. ARGs, tskit, forward refs to ARG section for
% details.
It is well known that recombination distorts phylogenies~\citep{Schierup2000-fg}
and affects the results of downstream analyses, such as inference of
selection~\citep{Anisimova2003-vr}. Standard phylogenetic methods do
not account for recombination
\citep[e.g.,][]{Ronquist2012-zw,Minh2020-lr,Guindon2003-zd}, and there is
no standard method for incorporating the effects of recombination into
phylogenetic analyses.
Ancestral Recombination Graphs (ARGs) are a means of describing such
network-like ancestry~\citep{Griffiths1981-lw,Gusfield2014-qw}, but
until recently lacked software support and sufficiently scalable
inference methods to be of practical use.
However, approaches to infer ARGs now exist that can scale to tens of
thousands of human genomes and beyond
\citep{Speidel2019-yh,Kelleher2019-ba,Schaefer2021-yg,Zhang2023-lf}, dealing with levels of recombination far in excess of those seen in viral
phylogenies. The ``succinct tree sequence'' is an ARG data structure
which has led to significant computational advances across a range
of
applications~\citep{Kelleher2016-wk,Kelleher2018-xc,Kelleher2019-ba,Ralph2020-efficiently,
Wohns2022-th}, and the supporting \texttt{tskit} software library
is now widely used in population genetics applications.
The methods in \texttt{tskit} have been developed to support millions of
whole human genomes~\citep{Kelleher2019-ba}, and so it is particularly well suited
to representing large SARS-CoV-2 genealogies,
which currently encompasses over 15 million sequences in the GISAID
database~\citep{Shu2017-hp}.
See Section~\ref{sec:args} for more details on ARGs and the succinct
tree sequence data structure.
\begin{figure} \centering
\includegraphics[width=0.7\textwidth]{figures/overview_sc2ts.pdf}
\caption{\label{fig:overview_sc2ts}
A schematic of the \texttt{sc2ts} method.
The genetic relationships among SARS-CoV-2 genomes is
reconstructed by using the Li and Stephens
model to infer attachment paths for samples to an existing ARG (curved lines).
Each daily iteration involves three stages:
attachment of new samples to the growing ARG (A, B);
reconstruction of trees relating the samples under each attachment node (C);
and parsimony-based tree topology adjustments (D, E).
In the absence of recombination, \texttt{sc2ts}
infers an ARG that is a single tree relating the samples (A).
When recombination is detected, \texttt{sc2ts} infers
an ARG that concisely encodes a sequence of local trees relating segments
of the sample genomes (B). Additionally,
mutation-collapsing nodes (D) and reversion-push nodes (E) are inserted to
make more parsimonious placements of mutations that should be shared or should
not be immediately reverted, respectively.}
\end{figure}
% [PARA 4: we present sc2ts based on these recent advances. Summarise
% results, with forward refs to Results sections.]
Here we present a preliminary version of \texttt{sc2ts}, a novel method
for inferring ARGs for SARS-CoV-2 at pandemic scale, in real time.
Building on the open-source \texttt{tskit} library, the method
explicitly reconstructs genealogies with both
mutation and recombination, which
can be conveniently and efficiently analysed using standard Python data science
tools. As illustrated in Figure~\ref{fig:overview_sc2ts}, inference is based on incrementally adding batches of sequences
based on their collection dates and proceeds in three phases.
First, possible paths connecting each sample to the current ARG
are inferred (allowing for recombination) using the
Li and Stephens (LS) model (Figure~\ref{fig:overview_sc2ts}A, B);
the LS ``copying process'' is a Hidden Markov Model (HMM)
approximating the effects of mutation
and recombination, widely used in large-scale genomics (Section~\ref{sec:ls}).
Then, since many samples
in a batch can share an attachment path,
we infer phylogenetic trees for each of these clusters separately using standard methods
(Figure~\ref{fig:overview_sc2ts}C; Section~\ref{sec:sample-cluster-tree-inference}).
Finally, we attach the trees for these sample clusters to the current
ARG and apply some parsimony-based heuristics to address issues
introduced by the inherent greediness of this strategy
(Figure~\ref{fig:overview_sc2ts}D, E; Section~\ref{sec:parsimony-heuristics}).
Using the current preliminary version of \texttt{sc2ts}, we infer two
large ARGs (with 1,265,685 and 657,239 samples, respectively) and study
the properties of these ARGs to illustrate the power of the method
and to inform subsequent development.
We find that these
ARGs accurately capture known phylogenetic
relationships (Section~\ref{sec:backbone_phylogeny}) and
mutational spectra (Section~\ref{sec:mutation_spectrum}),
and automatically identify the majority of known recombinant
lineages (Sections~\ref{sec:jackson_recombs} and \ref{sec:pango_x_lineages})
with a high level of precision in the
genomic location of recombination breakpoints
(Section~\ref{sec:breakpoint_intervals})
and relationship between parental sequences
(Section~\ref{sec:parent_divergence}).
We hope that these benefits of accurate joint estimation of
genetic inheritance with mutation and recombination will generate community
interest and development of the \texttt{sc2ts} method, and
more generally in applying the efficient and mature software
of the \texttt{tskit} ecosystem to pandemic-scale SARS-CoV-2 data.
\section{Results}
\subsection{Inferred ARGs}
The goals of this preliminary study are to illustrate the utility of
\texttt{sc2ts} and to investigate the properties of the
inferred ARGs to inform subsequent development.
We work with a
representative subset of the available data, limited to
inferences that can be performed on a single server in
a few weeks (see below for further details on timings and
computer hardware used). The cut-off dates for sampling
are arbitrary.
We inferred two ARGs, which we refer to as the
``Wide'' and ``Long'' ARGs throughout.
The Wide ARG is densely sampled but time-limited and
includes 1.27 million sequences collected up to June 30, 2021
which
pass some quality-control filters (Section~\ref{sec:data_preprocessing})
and have a maximum delay between sampling and submission dates of 30 days
(Section~\ref{sec:filtering_time_travellers}).
For the Long ARG, we randomly sub-sample a maximum of 1,000 genomes per
day (again restricting the delay between sampling and submission to 30 days)
and include an additional year's worth of samples (to June 30, 2022).
\begin{table}
\begin{center}
\begin{tabular}{llrlrl}\toprule
& & \multicolumn{2}{l}{Wide ARG} & \multicolumn{2}{l}{Long ARG} \\
\cmidrule{3-6}
\multicolumn{2}{l}{Sample filtering}
& \multicolumn{2}{l}{collection $\leq$ 2021-06-30}
& \multicolumn{2}{l}{collection $\leq$ 2022-06-30} \\
& & \multicolumn{2}{l}{max-delay=30} & \multicolumn{2}{l}{max-delay=30} \\
& & & & \multicolumn{2}{l}{max-daily=1000} \\
\midrule
Nodes & & 1,453,347 & & 783,231 & \\
& \emph{Node type} & \\
& Sample & 1,265,685 & (87.09\%) & 657,239 & (83.91\%) \\
& Daily sample cluster tree & 102,709 & (7.07\%) & 51,807 & (6.61\%) \\
& Reversion push & 40,538 & (2.79\%) & 34,358 & (4.39\%) \\
& Mutation collapse & 40,292 & (2.77\%) & 37,749 & (4.82\%) \\
& Recombination & 4,123 & (0.28\%) & 2,078 & (0.27\%) \\
\cmidrule{2-6}
% Edges & & 1,458,146 & & 785,539 & \\
% & \emph{Parent node type} & \\
% & Samples & 610,729 & (41.88\%) & 319,626 & (40.69\%) \\
% & UPGMA & 470,545 & (32.27\%) & 156,881 & (19.97\%) \\
% & Reversion push & 184,608 & (12.66\%) & 144,991 & (18.46\%) \\
% & Mutation collapse &186,218 & (12.77\%) & 160,833 & (20.47\%) \\
% & Recombination & 6,046 & (0.41\%) & 3,208 & (0.41\%) \\
% \cmidrule{2-6}
Mutations & & 1,231,193 & & 1,062,072 & \\
& Per node per genome & 0.83 & $\pm$1.40 & 1.36 & $\pm$1.72 \\
& Per sample per genome & 0.77 & $\pm$1.39 & 1.38 & $\pm$1.77 \\
& Per site per ARG & 41.23 & $\pm$108.16 & 36.10 & $\pm$80.03\\
\cmidrule{2-6}
% size as reported by ls -lh
\multicolumn{2}{l}{Compressed size (inc metadata)} & 58MB & & 37MB& \\
\multicolumn{2}{l}{Bytes/sample (exc metadata)} & 8.29 & & 10.83 \\
Load time & & 0.9s & & 0.5s & \\
\bottomrule
\end{tabular}
\end{center}
\caption{\label{tab:args}Summary of the inferred ARGs. Nodes are classified
as either samples or by the inference process that produced them
(see Sections~\ref{sec:sample-cluster-tree-inference},
\ref{sec:parsimony-heuristics} and \ref{sec:treatment_recombinants} for
details).
The mean and standard deviation ($\pm$)
are reported for the number of mutations per node, sample and site.
}
\end{table}
The properties of the inferred ARGs are summarised in Table~\ref{tab:args}.
The majority of the nodes in the ARGs represent sample genomes
(Wide ARG: 87\%, Long ARG: 84\%), with the remainder mostly representing
the ancestral sequences inferred
from daily sample clusters (Section~\ref{sec:sample-cluster-tree-inference})
and parsimony heuristics (Section~\ref{sec:parsimony-heuristics}).
Both ARGs contain 29,422 sites (Section \ref{sec:data_preprocessing}),
and a large number of mutations.
The average number of mutations per site is high, although some
of this may be explained by outlier sites with artefactually high
mutation counts (see Figure~\ref{fig:breakpoint-distribution}).
Despite this, however, the number of mutations per node
is small. In the Wide ARG, for example, we have a mean of 0.77
mutations per sampled genome, demonstrating that most
added samples fit into the ARG parsimoniously.
See Section~\ref{sec:mutation_spectrum}
for more analysis of the patterns of inferred mutations.
Recombination plays a relatively minor role, with $<0.3\%$
of nodes in the ARGs representing inferred recombination events.
Of these recombination nodes, the majority are ancestral
to only one sample (Wide ARG: 63.1\%, Long ARG: 63.3\%).
We analyse signals of recombination in Sections~\ref{sec:jackson_recombs},
\ref{sec:breakpoint_intervals}, \ref{sec:parent_divergence},
and \ref{sec:pango_x_lineages}.
Table~\ref{tab:args} also summarises some of the computational
properties of the inferred ARGs.
The ARGs are encoded as a ``succinct tree sequence'' using
the \texttt{tskit} library, which provides an extensive
suite of operations for constructing and analysing ARGs
(Section~\ref{sec:args}). For example, the Wide ARG
which contains complete genomes (with imputed missing data)
for around 1.2 million samples, along with extensive sample
and debugging metadata, requires only 58MB of space (compressed
using the \texttt{tszip} utility). The majority of this
space is used by the metadata, which when discarded results in
an encoding that requires an average of only 8.29 bytes per
SARS-CoV-2 genome stored.
Loading these ARGs takes less than a second, and they can be interactively
analysed using Jupyter notebooks~\citep{Kluyver2016-jupyter}
on a standard laptop. The majority
of the analyses in this preprint can be carried out in seconds
with the \texttt{tskit} Python API, using a few gigabytes of RAM.
Inferring these ARGs does require substantial computation.
The Wide ARG required 17 days to infer on a server with
128 threads and 512 GB RAM (2x AMD EPYC 7502 @ 2.5GHz). The Long ARG
required 23 days on a (much older) machine with 40 threads and 256 GB RAM (2x
Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz). The majority of the time is spent
on running the LS HMM (Section~\ref{sec:ls}) to find the copying
path for each sequence, and the process
is therefore highly amenable to distributing across multiple machines.
We therefore anticipate that further development of the
inference methods and scaling out across multiple servers will
enable inferences at the full pandemic scale.
\subsection{Backbone phylogeny}
\label{sec:backbone_phylogeny}
% FIXME the development of thought is quite muddled here, we should rearrange
% in later revisions. We should discuss the expected lack of recombination
% and our likely false +s in one paragraph.
Compared to organisms like humans that recombine in every generation,
recombination is relatively rare in SARS-CoV-2, with recombination nodes accounting
for $<$0.3\% of the inferred ancestry (Table~\ref{tab:args}). As a result,
relationships between strains can often be represented by a
single phylogenetic tree, particularly when looking at
a subset of strains. We expect ARGs to be particularly treelike
early in the pandemic, when co-infection was less likely and divergence between
lineages relatively low.
\begin{figure} \centering
\includegraphics[width=\textwidth]{figures/cophylogeny_wide.pdf}
\caption{\label{fig:cophylogeny}
Tanglegram comparing a local tree from the Wide ARG
(sampled to mid-2021) and an ``all-time'' global Nextstrain tree
(downloaded on 2023-01-21).
Phylogenies are pruned down to those samples present as tips in both datasets, with the horizontal axis representing time (tips end at the sample collection date). Light grey
lines match the corresponding samples between the two trees; black circles
indicate identical sample partitions between the two trees. Terminal branches
are colour-coded according to the Pango lineage status assigned to the tip
samples.
The tanglegram was generated using the Neighbor-Net algorithm
\citep{Scornavacca2011-mg} implemented in Dendroscope version 3.8.5
\citep{Huson2012-ys}. See Figure~\ref{fig:cophylogeny_long} for the equivalent
cophylogeny for the Long ARG.}
\end{figure}
A classic tree-based summary of SARS-CoV-2 ancestry is provided
by the Nextstrain project \citep{Hadfield2018-ef}. The trees
available from Nextstrain are based on small subsamples of the
dataset, and early in the pandemic tend to be restricted to
a sample of strains that were not thought to be recombinants.
For validation purposes, we compare our ARGs with
a downloaded Nextstrain tree, restricted to the time period
covered by each ARG. To enable this, we ``simplify''~\citep{Kelleher2018-xc}
the \texttt{sc2ts} ARGs to a backbone containing only those samples
present in the Nextstrain tree. This results in a small set of shared
samples (Wide ARG: 180, Long ARG: 88), none of which are
assigned to Pango recombinant lineages by Nextclade (see Section
\ref{sec:pango_x_lineages}).
The \texttt{sc2ts} backbone phylogenies for these Nextstrain subsamples contain small amounts of recombination,
with 7 recombination nodes in the Wide ARG backbone (8 for the Long ARG
backbone). However, the recombination events involve minor, local topological rearrangements
where recombination only occurs between close relatives.
(The majority of these recombinations are likely
false-positives, as discussed in Section~\ref{sec:false_positives}.)
Figure~\ref{fig:cophylogeny} compares the backbone phylogeny of the Wide ARG with
a GISAID global ``all-time'' tree from Nextstrain.
We illustrate the backbone phylogeny by visualising
a single tree in the middle of the viral genome,
although other regions of the
genome show almost identical topologies.
It is clear that the backbone topology of the \texttt{sc2ts} tree shows very close
agreement with the Nextstrain tree. The sample genomes cluster by their
assigned Pango lineage status, and many variants and their descendants form
identical monophyletic clades in both the trees (e.g., the Alpha and Delta VoC
clades, labelled).
Figure~\ref{fig:cophylogeny_long} shows the same comparison for the Long ARG, with similar results.
Figure~\ref{fig:cophylogeny} also reveals some notable differences between the
trees. Firstly, the \texttt{sc2ts} tree is generally less well resolved,
particularly in early 2020 when sampling density was much lower than
later in the pandemic. Resolution early in the pandemic could be improved
by using a tree inferred using classical phylogenetic
approaches for the first few months of the pandemic, before the scale of
data began to overwhelm these methods. Indeed, this is the approach
taken by UShER~\citep{Turakhia2021-ur}. Using a pre-existing tree for
the early stages of the pandemic would be straightforward in \texttt{sc2ts},
and the main reason we did not do this for the initial
version under consideration here was to evaluate the algorithm's
performance with sparse data. Given
the simplicity of the algorithm, tree inference for the early pandemic
is surprisingly good.
The second difference between the \texttt{sc2ts} and Nextstrain trees
that we would like to highlight are a few non-identical sample partitions
near the tips (e.g.\ criss-crossing assignments within the Alpha clade). It is unclear what
particular differences between the phylogenetic reconstruction algorithms are driving these
differences, and more study is required to characterise and address them.
Finally, some branch lengths differ substantially between the \texttt{sc2ts}
and Nextstrain trees.
As discussed in Section~\ref{sec:node_dating}, the dating of nodes
other than samples in \texttt{sc2ts} is currently quite crude,
but there are likely straightforward expedients that would yield
substantial improvements.
\subsection{Mutational spectrum}
\label{sec:mutation_spectrum}
The ARGs inferred by \texttt{sc2ts} and represented using the \texttt{tskit}
library (Section~\ref{sec:args}) are a joint estimate of the genealogy with recombination and mutation. Unlike most approaches to
phylogenetic analysis, mutations are included in the \texttt{tskit}
data model alongside
the topological representation of genetic inheritance.
This has many
advantages, for example allowing us to compute statistics of the observed
sequences efficiently~\citep{Kelleher2016-wk,Ralph2020-efficiently} and
to provide high levels of data compression~\citep{Kelleher2019-ba}.
The same idea has recently been used to represent
SARS-CoV-2 data in UShER's ``mutation annotated tree''
format~\citep{Turakhia2021-ur}.
\begin{figure} \centering
\newcolumntype{Y}{>{\centering\arraybackslash}X}
\begin{tabularx}{\textwidth}{cY}
\hspace*{-2.5mm}\includegraphics[width=0.32\textwidth]{figures/mutational_spectra.pdf}
&
\begin{tabular}[b]{llrlr}\toprule
& \multicolumn{2}{l}{Wide ARG} & \multicolumn{2}{l}{Long ARG} \\
% \midrule
\cmidrule{2-5}
Total & \multicolumn{2}{l}{1,213,193} & \multicolumn{2}{l}{1,062,072} \\
Private & 758,903 & (62.55\%) & 767,111 & (72.23\%)\\
\cmidrule{2-5}
Transitions & 873,487 & (72.00\%) & 783,773 & (73.80\%) \\
Transversions & 326,053 & (26.88\%) & 270,333 & (25.45\%) \\
\cmidrule{2-5}
Insertions & 6,191 & (0.51\%) & 2,814 & (0.26\%) \\
Deletions & 7,462 & (0.62\%) & 5,152 & (0.49\%) \\
\cmidrule{2-5}
Recurrent & 74,719 & (6.16\%) & 50,099 & (4.72\%) \\
Of which reversions & 72,617 & (5.99\%) & 48,226 & (4.54\%) \\
\bottomrule
\vspace{-2mm}
\end{tabular}\\
(A) & (B)\\
\end{tabularx}
\caption{\label{fig:mutational_spectra}
(A) Mutational spectrum in
the Wide ARG compared to \cite{Yi2021-sc}. Mutations are
categorised by type (i.e., inherited state $>$ derived state). The percentages
of each mutation type from the Wide ARG are represented by blue bars and the
percentages from Yi et al.\ by orange bars, with the darker colours representing
one direction (e.g., C$>$U) and the lighter colours the reverse (e.g., U$>$C).
(B) Summary of mutations in the Long and Wide ARGs. Private mutations
occur on terminal branches. Insertions are mutations in which the
inherited state is the gap state ``-'' and the derived state is a
nucleotide, and vice versa for deletions. See Section~\ref{sec:args}
for precise definitions of mutations, and the recurrent and reversion
classifications.
}
\end{figure}
The properties of the mutations inferred in the Wide and Long
ARGs are summarised in Figure~\ref{fig:mutational_spectra}B. In both
cases we have a large number of mutations, and a majority of these
(Wide ARG: 62.55\%, Long ARG: 72.23\%) are private to a single
sample, i.e.\ on terminal branches.
Although the average number
of mutations per sample is small
(Wide ARG: 0.77, Long ARG: 1.38; Table~\ref{tab:args}),
the average number per site
is large in both ARGs (Wide ARG: 41.23; Long ARG: 36.10; Table~\ref{tab:args}).
However, a lower median count (Wide ARG: 14; Long ARG: 13),
suggests that these high mutation counts are
partly driven by some hypermutable sites (e.g., site 28,271 has over 7,000
mutations in the Wide ARG; Figure~\ref{fig:breakpoint-distribution}), which
may be artefactual.
The current version of \texttt{sc2ts} infers a
large number of ``reversion'' mutations, with 5.99\% of all mutations in the Wide
ARG (Long ARG: 4.54\%) reverting the state change of the immediately
ancestral mutation (see Section~\ref{sec:args} for precise definitions).
These are symptomatic of both data quality issues
such as ``time travellers'' (Section~\ref{sec:filtering_time_travellers})
and problematic sites (e.g., 28,271 as discussed in
Section~\ref{sec:breakpoint_intervals}; see also
Figure~\ref{fig:pango_XB_gisaid_graph} for an example of multiple
reversions at this site), as well as indicating opportunities
for improvement in tree building heuristics
(Section~\ref{sec:parsimony-heuristics}).
For example, the current ``reversion push'' operation
only eliminates reversions in the newly added portion of the tree
and not reversions around earlier nodes created by this algorithm.
Despite the presence of some artefactual mutations,
the properties of the mutations inferred by \texttt{sc2ts} largely follow
established results.
In Figure~\ref{fig:mutational_spectra}A we compare the mutational spectrum
in the Wide ARG to the results of \cite{Yi2021-sc},
who reconstructed a SARS-CoV-2 phylogeny of
over 350,000 genomes sampled globally from 2019-12-24 to 2021-01-12
and classified the mutations occurring along the phylogeny.
We categorised all single nucleotide
mutations in the Wide ARG by type (defined by the inherited and derived states),
excluding mutations inherited by only a single sample (which are
more likely to be sequencing errors).
Similarly, we took the data for single nucleotide mutations from
\citet[][\url{https://github.com/ju-lab/SC2_evol_signature}]{Yi2021-sc}, excluding
mutations occurring along terminal branches, and tallied them up by type.
Figure~\ref{fig:mutational_spectra}A shows that the mutational spectrum from the
Wide ARG (based on 448,825 mutations) matches that reported by \citet[based on
92,344 mutations]{Yi2021-sc}. In both spectra, C-to-U mutations and G-to-U
mutations occur more frequently than U-to-C and U-to-G, respectively. Similar
results are obtained when including the mutations inherited by only a single
sample or those occurring on terminal branches (data not shown).
\subsection{Early recombinants}
\label{sec:jackson_recombs}
RNA viruses are known to recombine at high rates when cells are co-infected
\citep{Simon2011-rna}, and recombination has been widely
documented to be commonplace in animal and human coronaviruses
\citep{Su2016-epidemiology}. While recombination in SARS-CoV-2
was shown early on to be frequent \emph{in-vitro}
\citep{Gribble2021-coronavirus}, the relatively slow accumulation
of genetic diversity early in the pandemic hampered efforts to
detect recombinant strains. A number of early studies relying on
analysing patterns of linkage disequilibrium and searching for
mosaic genomes carrying characteristic mutations of different
lineages either failed to detect recombination or posited
that this occurred at low rates \citep[e.g.,~][]{Nie2020-phylogenetic,Tang2020-origin,VanInsberghe2021-eu,Varabyou2021-rw}.
The first clear evidence of recombinant lineages was presented by
\citet{Jackson2021-ik}, who performed a careful analysis of sequences
circulating in the UK in late 2020 to early 2021 and found
evidence of multiple independent recombination events and onward
transmission.
By searching for samples combining genomic segments from Alpha (B.1.1.7) and
from the parental lineage B.1.1 based on a list of 22 Alpha-defining mutations,
they found 16 recombinant sequences from 8 putative
origins (groups A to D and four singletons).
These findings are closely replicated in both the Wide and Long ARGs.
\begin{table} \centering
\begin{tabular}{lllr@{--}lr@{+}l}
\toprule
Group & Sequences & Method & \multicolumn{2}{c}{Interval}
& \multicolumn{2}{r}{Parent lineages} \\
\midrule
A (XA) & 4 & Jackson & 21,256&21,615 & B.1.177&B.1.1.7 \\
& &\texttt{sc2ts} & 21,256&22,228 & B.1.177.18&B.1.1.7 \\
\cmidrule{3-7}
B & 2 & Jackson & 6,529&6,955 & B.1.36&B.1.1.7 \\
& &\texttt{sc2ts} & 6,529&6,955 & B.1.36&B.1.1.7 \\
\cmidrule{3-7}
C & 3 &Jackson & 25,997&27,443 & B.1.1.7&B.1.221 \\
& & \texttt{sc2ts} & 25,997&27,973 & B.1.1.7&B.1.221 \\
\cmidrule{3-7}
D & 3 & Jackson & 21,576&23,064 & B.1.36.17&B.1.1.7 \\
& & \texttt{sc2ts} & 22,445&23,064 & B.1.36.39&B.1.1.7 \\
\bottomrule
\end{tabular}
\caption{\label{tab:jackson}Comparison of recombination breakpoint
intervals and parent lineages for Groups A-D
reported by \cite{Jackson2021-ik} with the corresponding
recombination events in the Wide ARG.
The second column gives the number of sequences in the group,
limited to the samples considered by \citet{Jackson2021-ik}.
The 3SEQ~\citep{boni2007exact} coordinates
reported by Jackson et al.\ have been altered as follows: we
add one to both left and right coordinates to correspond to the
\texttt{tskit} definition of inheritance on either side of a breakpoint,
and add one to the right coordinate to make the intervals
right-exclusive. See Section~\ref{sec:treatment_recombinants}
for a precise definition of sequence inheritance at recombination
events and the corresponding breakpoint intervals.
Details for all 16 sequences are
given in Table~\ref{tab:jackson_supplement}.}
\end{table}
The Wide ARG contains 15 of these 16 recombinant sequences
(sample MILK-103C712 was removed during preprocessing; see
Section~\ref{sec:data_preprocessing}).
Table~\ref{tab:jackson} shows the groups of sequences identified by Jackson
et al.\ as likely independent recombination events with onward transmission.
In each case we have a corresponding recombination node in the Wide ARG,
from which all the sequences in the group descend. The parent lineages
and breakpoint intervals agree closely (see
Section~\ref{sec:breakpoint_intervals} for more details on breakpoint
intervals).
For groups B, C and D,
these recombination nodes form clades consisting only of the identified
sequences.
Group A sequences were subsequently given the Pango XA designation
following onward transmission,
and there are 44 XA designated samples in the Wide ARG (including the
4 sequences analysed by Jackson et al.). The Group A recombination
node forms a monophyletic clade of these 44 samples.
Table~\ref{tab:jackson_supplement} shows the details for each of the
16 sequences individually and showing generally a strong concordance
in mosaic structure and parent lineages
(including sample CAMC-CB7AB3, which is inferred to have two breakpoints under both
methods).
The Long ARG contains 5 of the sequences: two each from groups A and B
and sample QEUH-1067DEF. These cluster under three recombination nodes, as expected,
and have identical breakpoint intervals and parental lineage assignments
to those of Wide ARG.
The recombination nodes for Group B and sample QEUH-1067DEF are ancestral only
to the sequences involved.
The recombination node for group A forms a monophyletic clade of all
5 XA samples present in the Long ARG
(Figure~\ref{fig:pango-simple-origin-graph}A)
\subsection{Recombination breakpoint intervals}
\label{sec:breakpoint_intervals}
It is rarely possible to be precise about the position on the genome at which
a recombinant sequence switches from inheriting from one parent to another.
Even if we observe the recombinant sequence before subsequent
divergence occurs (during onward transmission), there is no way to identify the exact breakpoint if the two parent sequences
are similar.
Here we define the
interval within which a particular breakpoint may have occurred
as the genome coordinates over which the
sequences for the left and right parent nodes are identical. The right-hand
extreme of the breakpoint interval is chosen by the LS HMM Viterbi algorithm
(Section~\ref{sec:ls}), and the left endpoint is then derived by directly
comparing the parent sequences. See Section~\ref{sec:treatment_recombinants} for
a precise definition of breakpoints and their intervals.
\begin{figure}
\centering
\includegraphics[width=\textwidth]{figures/wide_arg_recombination_intervals.pdf}
\caption{\label{fig:breakpoint-distribution}
Distribution of recombination breakpoints and mutations along the genome in
the Wide ARG. Top panel shows the intervals for 1,769 breakpoints associated
with 1,522 recombination nodes with at least two descending samples, plotted along the genome
as line segments (coloured by interval width). The inset histogram shows the
distribution of these interval widths (truncated at 10kb).
The bottom panel shows the number of intervals that span
each site along the genome (left axis, orange) and the number of mutations
per site (right axis, blue).
The top-ten sites by mutation count are annotated.
See Figure~\ref{fig:long_arg_breakpoint_distribution} for the equivalent plot
for the Long ARG.}
\end{figure}
Figure~\ref{fig:breakpoint-distribution} shows the distribution of breakpoint
intervals and patterns of recurrent mutation along the genome in the Wide ARG
(see Figure~\ref{fig:long_arg_breakpoint_distribution} for the same information
for the Long ARG). We focus on the Wide ARG here as it covers roughly the same time
period as the analyses of \cite{Turakhia2022-it}, facilitating comparisons
of the results. To reduce the effect of artefactual recombinants, we
consider only the breakpoints associated with the 1,522 recombination nodes
that are ancestral to more than one sample. The mean length of these intervals
is 1,685 bases (median 962), and the length distribution is summarised in the
inset histogram in Figure~\ref{fig:breakpoint-distribution}.
Although further work is required to filter
out spuriously inferred recombination events (see
Section~\ref{sec:false_positives})
we can draw some preliminary conclusions from Figure~\ref{fig:breakpoint-distribution}.
The number of intervals spanning a site (orange curve) is lower at the ends of the genome,
as expected due to the lack of information about recombination with few flanking sites.
In addition, the number of intervals spanning a site often
drops near the beginning of a gene.
This is particularly apparent at the ORF8/N gene interface,
with the N gene containing fewer potential recombination breakpoints
than other genes, in agreement with the results of \cite{Turakhia2022-it}.
Noticeable declines in the number of spanning intervals are also
seen near the beginning of S, M, ORF7a, and ORF8.
These declines are sometimes associated with hypermutated sites (e.g., 27,384 and 28,271
near the beginning of ORF7a and N, respectively), as expected because sites that
undergo mutation at high rates are more likely to differ between the
parents of a recombinant and so provide information about the location of a breakpoint.
This pattern may, however, also be an artefact of sequencing errors causing sites
to appear different between the parents when they are not
(see the discussion of potential errors at site 28,271 below).
In other cases, however, the drop in intervals does not appear to
coincide with hypermutability and may reflect shifts in the actual
rate of recombination between genes.
Indeed, template switching is known to occur at hotspots,
which often involve transcription-regulatory sequences preceding genes
in SARS-CoV-2~\citep{Yang2021-characterizing}.
The rate of recombination may also depend on the relative abundance
of different subgenomic RNA intermediates that span different genes,
affecting the availability of templates and influencing the rate of
homologous recombination~\citep{Kim2020-gt,Zou2021-sars}.
%\sally{[Sally: KATHERINE, WHAT DO YOU THINK? These not-full-length subgenomic RNA intermediates may provide a lot of homologous templates for some genes more than other (e.g., fig 3B here).]}
It is important to note that the precise endpoints of intervals can be
somewhat arbitrary, because they are defined by the sequence differences
that happen to be present in the recombinant's parents.
Thus, breakpoint intervals will tend to be truncated at sites that are
hypermutable, either due to increased information about parentage
or spurious inferences caused by sequencing errors.
It is therefore helpful to compare the number of intersecting intervals
with the number of mutations per site (Figure~\ref{fig:breakpoint-distribution}).
For instance, position 28,271 (located in the ORF8-N intergenic
region) has the largest number of mutations and appears as an endpoint
of 66 intervals.
The 7,572 mutations (including 5,782 insertions and 1,605 deletions)
at this site are an indicator that this
homopolymeric region may be prone to sequencing errors
and potentially should be included in the list of ``problematic sites'' that
are excluded from analysis (Section~\ref{sec:data_preprocessing}).
On the other hand, 58 of the breakpoint intervals have endpoints
within one base of site 27,972, which has undergone 770 mutation events
(including 425 C$>$T, 314 T$>$C). The C$>$T mutation has the effect of
truncating ORF8, and it has been posited that the truncation is neutral
or advantageous for transmission, and disadvantageous within-host
\citep{Jungreis2021-dh}, suggesting that the high rate of recurrent mutation
at this site may be due to selection.
% JK Dropping this for now:
% We further consider the breakpoint intervals within the context of
% recombination breakpoint sequence motifs. It is hypothesised that certain
% palindromic breakpoint sequences (specifically, CAGAC and CAGAT) promote
% template switching during replication in SARS-CoV-2 via formation of
% base-paired stem loops in the genome structure \citep{Gallaher2020-lb}. There
% are 87 occurrences of these two breakpoint sequences in the Wuhan-Hu-1/2019
% reference sequence (Supplementary Table 5). We observe that XX (XX\%) of the
% breakpoint intervals of the HMM-consistent recombinants span at least one of
% the breakpoint sequences.
\subsection{Divergence between recombinant parents}
\label{sec:parent_divergence}
% Paragraph 2, what do we plot, specifically?
In this section we explore the detailed ancestral relationships between
recombinant parents by investigating the patterns of divergence between
them.
We focus on the Long ARG because it covers time
periods where substantial recombination is known to have occurred
and contains samples from 33 Pango X lineages (i.e., those inferred
to have recombinant ancestry; see Section~\ref{sec:pango_x_lineages} for
further analysis).
Figure~\ref{fig:recomb_mrcas} shows the estimated date of the most recent
common ancestor (MRCA) of the parent nodes for each recombination breakpoint,
plotted against the divergence between these parents (i.e., the
total branch length from the parents to their MRCA in the trees to
the immediate left and right of the breakpoint).
As in Section~\ref{sec:breakpoint_intervals}, in these plots we
exclude breakpoints associated with ``singleton'' recombination nodes
(those ancestral to only one sample).
Larger points distinguish those breakpoints which occur in nodes
ancestral to 5 or more samples, comprising 316 breakpoints from
291 recombination nodes. The criterion of 5 descendants matches the
minimum number required to designate a new Pango
lineage~\citep{Rambaut2020-dw}.
Note that as each plotted point represents a breakpoint, a recombination
node with more than one breakpoint (e.g., with 3 or more parents,
comprising ${\sim}10 \%$ of the
recombination nodes in
Figure~\ref{fig:recomb_mrcas}) will be represented by several points.
\begin{figure} \centering
\includegraphics[width=\textwidth]{figures/recombination_node_mrcas.pdf}
\caption{\label{fig:recomb_mrcas}
Date of common ancestry between the parents on either side of recombination
breakpoints, as a function of the divergence time between the parents.
MRCAs of parents associated with Pango designated recombinants (XA, etc)
are identified in orange. Larger symbols represent breakpoints in recombination
nodes ancestral to five or more samples.
Horizontal dotted lines show the four most common MRCA nodes,
which tend to be associated with major outbreaks and with many immediate children.
The stacked histogram shows the distribution of parental divergence times, ranging
from parents that have diverged only a few days ago, to much more
divergent parent lineages. See Figure~\ref{fig:recomb_mrcas_voc_breakdown}
for equivalent plots broken down by parental VoC classification.
}
\end{figure}
% Paragraph 3, what do we say about the dates/nodes of the MRCAs?
The date of the MRCA of recombinant parents is
concentrated in several banded rows in Figure~\ref{fig:recomb_mrcas}. These are
largely due to a few MRCAs shared by many
recombinants (the top four are indicated in the figure).
These shared MRCAs lie near the root of large expansions,
and the majority are associated with large polytomies, likely
indicating a rapid and under-sampled expansion of a clade (e.g. major
Delta and Omicron waves).
% Paragraph 4, what do we say about the divergence times?
Figure~\ref{fig:recomb_mrcas} shows that there is a
large range in divergence times among parents of detected recombinants, with a broad
spread of times from about 10 to 80 weeks prior to the recombination event
and a minor additional peak involving recombination between lineages that diverged
${\sim}95$ weeks ago, corresponding to common ancestors which trace back to early 2020.
This latter group should
contain, for example Delta-Omicron recombinants. More widely, we can
further classify the breakpoints by the VoC combinations of their
parent lineages. Considering only the Alpha, Delta, and Omicron VoC classes,
such a classification reveals that the majority of breakpoints have two Delta or
two Omicron parents, and that Omicron and Delta are the variants associated
with the most recombination (Figure~\ref{fig:recomb_mrcas_voc_breakdown}).
This may reflect either sampling intensity, the
prevalence of cases (which increases the
chance of coinfection and recombination), or possible heterogeneity in
recombination probabilities among lineages.
The time estimates in these figures should be
treated with a degree of caution,
because non-sample nodes are crudely dated in the current version
of \texttt{sc2ts} (see Section~\ref{sec:node_dating}, in particular
for discussion of how these dates might be improved using existing
methods).
Nevertheless, it is clear
that \texttt{sc2ts} can identify recombination between lineages
that are only a few weeks diverged.
\subsection{Recombinant Pango lineages}
\label{sec:pango_x_lineages}
In this section we focus on the detailed ancestry of samples that have been
previously identified as recombinants, i.e., designated as belonging
to a Pango lineage with a name starting with an ``X''.
We focus primarily on the Long ARG
which contains many more recombinants (the status
of Pango X lineages in the Wide ARG is briefly
summarised in Section~\ref{sec-pango_x_wide_arg}).
Designation of samples to Pango lineages is not a straightforward task,
and there can be significant variation between methods \citep{deBernardiSchneider2023-sars}.
Here, we consider two different assignments of Pango lineages to samples in
the Long ARG, those provided by Nextclade and by GISAID.
% GISAID lineage calls come from the pangoLEARN algorithm.
Using the Nextclade assignments, the Long ARG contains 749 samples from
33 Pango X lineages (711 samples from 33 lineages when we
remove singleton recombinants).
In contrast, using the GISAID designations we
have 515 samples from 38 X lineages
(511 samples from 35 lineages when we filter singleton recombinants). Of the GISAID designations,
28 are shared with Nextclade (26 after filtering singletons).
This variation in classifications highlights the uncertainty that exists when
assigning Pango X lineages to samples, and is important to keep in mind
when interpreting the results here.
We focus here on the 749 Nextclade-designated recombinant samples.
There are two samples designated XP which do not descend from a recombination node,
and a likely explanation for the absence of a corresponding recombination event
in the Long ARG is that the characteristic multibase deletion for XP
(\url{https://github.com/cov-lineages/pango-designation/issues/481}) is masked
during our preprocessing (see Section~\ref{sec:data_preprocessing} for details and potential
improvements). Of the remaining X designated samples, 38 are singleton recombinants,
descending from recombination nodes that are ancestral only to that sample
(10 are labelled XZ; 6 are XE; 3 each from XN and XK; 2 each from XC, XS, XV, XQ and XAB;
and 1 sample from each of XB, XM, XJ, XAF, XAH and XAJ). Such samples are likely to
be enriched for sequencing errors and lineage designation artefacts, and so we
exclude them from further analysis in this section. A further 79 samples
(XN: 53, XZ: 16, XAJ: 6, XE:1, XAD: 1, XAH: 1, XAK: 1) trace back
to a most recent recombination node that is likely to be a false positive
(row C in Table~\ref{tab:false_positive}, see Section~\ref{sec:false_positives}).
For simplicity these samples are likewise excluded from further analyses.
The remaining 630 samples (31 Pango lineages) trace back to 50 different
most recent recombination nodes, summarised
in Table~\ref{tab:pango-recombinants}.
These fall into three classes: single origin, multiple origin,
and multiple nested origins, which we discuss in the following sections.
\subsubsection{Single origin}
In the absence of genealogical information, a reasonable initial assumption is
that all sequences
assigned to a given Pango X lineage are descendants of a single recombinant
sequence, arising as a result of a mixed infection followed by onward
transmission. We would expect our ARGs to reveal evolutionary histories of this
nature, where all the samples assigned to a given recombinant lineage
trace back to one recombination node, representing a single originating
recombination event. In the Long ARG, 16 of the 31 Pango recombinants lineages identified by Nextclade
fall into this category (Table~\ref{tab:pango-recombinants})
\begin{figure}
\centering
\includegraphics[width=\textwidth]{figures/Pango_XA_XAG_XD_nxcld_tight_graph.pdf}
\caption{\label{fig:pango-simple-origin-graph} Examples of non-nested
Nextclade Pango X lineages. (A) Subgraph for XA in the Long ARG: all five samples designated
as XA by Nextclade, together with their ancestral lineages, are shown outwards to the nearest
sampled viral genome; dotted lines show ARG continuation. Vertical position of nodes does not
show absolute time, but relative rank (parents above children). Nodes are coloured by Nextclade
Pango designation; smaller symbols are non-sample nodes inserted by \texttt{sc2ts}, whose
Pango status is imputed. Genomic regions inherited by the recombination node are shown;
breakpoints correspond to the rightmost breakpoint position inferred by \texttt{sc2ts}.
(B) Equivalent subgraph for the 17 XAG samples in the Long ARG, with abbreviated labelling. Where
non-XAG samples are ancestral to further unplotted samples, the number of unplotted descendant samples
is marked as ``+1'', ``+7'', etc.
(C) Equivalent subgraphs for both origination events involving the four XD samples in the Long ARG.
Details of the mutations and sample node identities for all three plots are provided in
supplementary Figures \ref{fig:pango_XA_gisaid_graph}, \ref{fig:pango_XAG_gisaid_graph}, and \ref{fig:pango_XD_gisaid_graph}, which also provide alternative GISAID Pango designations.
} \end{figure}
One of the simplest examples is XA, corresponding to group A
of \citet{Jackson2021-ik} as discussed in Section~\ref{sec:jackson_recombs}.
Figure~\ref{fig:pango-simple-origin-graph}A
shows the exact relationships inferred by \texttt{sc2ts} as a subgraph of the Long ARG.
Here, paths have been traced from all
Nextclade-identified XA samples (in red) to the closest other sample nodes in the ARG.
Sample nodes are plotted as larger circles, but the subgraph also includes intermediate,
non-sample nodes (i.e., inserted by \texttt{sc2ts}, see Sections~\ref{sec:sample-cluster-tree-inference} and \ref{sec:parsimony-heuristics}). Dotted lines show where this subgraph links
to the rest of the ARG. Above recombination nodes, only ancestral nodes are shown, meaning
that the subgraph is not extended to show additional descendants of recombinant parents.
It is clear from the XA subgraph that all the samples labelled XA by Nextclade trace to a
single originating recombination node, whose genome is a composite of a B.1.177.18 lineage
on the left of the genome and a B.1.1.7 lineage on the right. In the subgraph we show
the rightmost genomic position for the recombination breakpoint, here at position 22,227
(corresponding to a breakpoint strictly less than 22,228, see Table~\ref{tab:jackson})
%% Discuss panel B where XAG shares a common recombinant origin with samples from several other X lineages
A more complex single-origin case is XAG, illustrated in
Figure~\ref{fig:pango-simple-origin-graph}B. Here, the XAG samples
all trace back to the same most recent recombination node
(combining BA.1 on the left and BA.2.9 on the right), but
we infer this recombination event to also be the originating event for
all the recombinant samples designated XAA,
and some, but not all, of those identified as XAB, XQ, and XU by Nextclade.
The classification of originating recombination events is dependent on
accurate designation of Pango lineages to samples. It is therefore important