- Quick usage
- Examples
- Supported Input Data
- Parameter Descriptions
- Flye output
- Repeat graph
- Flye benchmarks
- Algorithm Description
usage: flye (--pacbio-raw | --pacbio-corr | --pacbio-hifi | --nano-raw |
--nano-corr | --subassemblies) file1 [file_2 ...]
--genome-size SIZE --out-dir PATH
[--threads int] [--iterations int] [--min-overlap int]
[--meta] [--plasmids] [--trestle] [--polish-target]
[--keep-haplotypes] [--debug] [--version] [--help]
[--resume] [--resume-from] [--stop-after]
Assembly of long reads with repeat graphs
optional arguments:
-h, --help show this help message and exit
--pacbio-raw path [path ...]
PacBio raw reads
--pacbio-corr path [path ...]
PacBio corrected reads
--pacbio-hifi path [path ...]
PacBio HiFi reads
--nano-raw path [path ...]
ONT raw reads
--nano-corr path [path ...]
ONT corrected reads
--subassemblies path [path ...]
high-quality contigs input
-g size, --genome-size size
estimated genome size (for example, 5m or 2.6g)
-o path, --out-dir path
Output directory
-t int, --threads int
number of parallel threads [1]
-i int, --iterations int
number of polishing iterations [1]
-m int, --min-overlap int
minimum overlap between reads [auto]
--asm-coverage int reduced coverage for initial disjointig assembly [not
set]
--plasmids rescue short unassembled plasmids
--meta metagenome / uneven coverage mode
--keep-haplotypes do not collapse alternative haplotypes
--trestle enable Trestle [disabled]
--polish-target path run polisher on the target sequence
--resume resume from the last completed stage
--resume-from stage_name
resume from a custom stage
--stop-after stage_name
stop after the specified stage completed
--debug enable debug output
-v, --version show program's version number and exit
Input reads can be in FASTA or FASTQ format, uncompressed or compressed with gz. Currently, PacBio (raw, corrected, HiFi) and ONT reads (raw, corrected) are supported. Expected error rates are <30% for raw, <3% for corrected, and <1% for HiFi. Note that Flye was primarily developed to run on raw reads. Additionally, the --subassemblies option performs a consensus assembly of multiple sets of high-quality contigs. You may specify multiple files with reads (separated by spaces). Mixing different read types is not yet supported. The --meta option enables the mode for metagenome/uneven coverage assembly.
You must provide an estimate of the genome size as input, which is used for solid k-mers selection. Standard size modifiers are supported (e.g. 5m or 2.6g). In the case of metagenome assembly, the expected total assembly size should be provided.
To reduce memory consumption for large genome assemblies, you can use a subset of the longest reads for initial disjointig assembly by specifying --asm-coverage option. Typically, 40x coverage is enough to produce good disjointigs.
You can run Flye polisher as a standalone tool using --polish-target option.
You can try Flye assembly on these ready-to-use datasets:
The original dataset is available at the
PacBio website.
We coverted the raw bas.h5
file to the FASTA format for the convenience.
wget https://zenodo.org/record/1172816/files/E.coli_PacBio_40x.fasta
flye --pacbio-raw E.coli_PacBio_40x.fasta --out-dir out_pacbio --genome-size 5m --threads 4
with 5m
being the expected genome size, the threads argument being optional
(you may adjust it for your environment), and out_pacbio
being the directory
where the assembly results will be placed.
The dataset was originally released by the Loman lab.
wget https://zenodo.org/record/1172816/files/Loman_E.coli_MAP006-1_2D_50x.fasta
flye --nano-raw Loman_E.coli_MAP006-1_2D_50x.fasta --out-dir out_nano --genome-size 5m --threads 4
Flye was tested on raw PacBio reads (P5C3 and P6C4) with error rate ~15%. Note that Flye assumes that the input files represent PacBio subreads, e.g. adaptors and noise are trimmed and multiple passes of the same insertion sequence are separated. This is typically handled by PacBio instruments/toolchains, however we saw examples of incorrect third-party raw -> fastq conversions, which resulted into incorrectly trimmed data. In case Flye is failing to get reasonable assemblies, make sure that your reads are properly preprocessed.
Flye now supports assembly of PacBio HiFi protocol via --pacbio-hifi
option.
The expected read error is <1%.
We performed our benchmarks with raw ONT reads (R7-R9) with error rate ~15%. Due to the biased error pattern, per-nucleotide accuracy is usually lower for ONT data than with PacBio data, especially in homopolymer regions.
While Flye was designed for assembly of raw reads (and this is the recommended way),
it also supports error-corrected PacBio/ONT reads as input (use the corr
option).
The parameters are optimized for error rates <3%. If you are getting highly
fragmented assembly - most likely error rates in your reads are higher. In this case,
consider to assemble using the raw reads instead.
--subassemblies
input mode generates a consensus of multiple high quality contig assemblies
(such as produced by different short/long read assemblers). The expected error rate
is <1%. You might want to skip the polishing stage with --iterations 0
argument
(however, it might still be helpful to correct small structural errors).
Flye works directly with base-called raw reads and does not require any prior error correction. Flye automatically detects chimeric reads or reads with low quality ends, so you do not need to curate them before the assembly. However, it is always worth checking for possible contamination in the reads, since it may affect the automatic selection of estimated parameters for solid kmers and genome size / coverage.
You must provide an estimate of the genome size as input, which is used for solid k-mers selection. The estimate could be rough (e.g. withing 0.5x-2x range) and does not affect the other assembly stages. Standard size modificators are supported (e.g. 5m or 2.6g)
This sets a minimum overlap length for two reads to be considered overlapping. In the latest Flye versions, this parameter is chosen automatically based on the read length distribution (reads N90) and does not require manual setting. Typical value is 3k-5k (and down to 1k for datasets with shorter read length). Intuitively, we want to set this parameter as high as possible, so the repeat graph is less tangled. However, higher values might lead to assembly gaps.
In some rare cases it makes sense to manually increase minimum overlap for assemblies of big genomes with long reads and high coverage.
Metagenome assembly mode, that is designed for highly non-uniform coverage and
is sensitive to underrepresented sequence at low coverage (as low as 2x).
In some examples of simple metagenomes, we observed that the normal (isolate)
Flye mode assembled more contigious bacterial
consensus sequence, while the metagenome mode was slightly more fragmented, but
revealed strain mixtures. For relatively complex metagenome --meta
mode
is the recommended way.
By default, Flye (and metaFlye) collapses graph structures caused by
alternative haplotypes (bubbles, superbubbles, roundabouts) to produce
longer consensus contigs. The option --keep-haplotypes
retains
the alternative paths on the graph, producing less contigouos, but
more detailed assembly.
Trestle is an extra module that resolves simple repeats of
multipicity 2 that were not bridged by reads. Depending on the
datasets, it might resolve a few extra repeats, which is helpful
for small (bacterial genomes). Use --trestle
option to enable the module.
On large genomes, the contiguity improvements are usually minimal,
but the computation might take a lot of time.
Typically, assemblies of large genomes at high coverage require
a hundreds of RAM. For high coverage assemblies, you can reduce memory usage
by using only a subset of longest reads for initial contig extension
stage (usually, the memory bottleneck). The parameter --asm-coverage
specifies the target coverage of the longest reads. For a typicall assembly, 30x
is enough to produce good initial contigs. Regardless of this parameter,
all reads will be used at the later pipeline stages.
Polishing is performed as the final assembly stage. By default, Flye runs one polishing iteration. Additional iterations might correct a small number of extra errors (due to improvements on how reads may align to the corrected assembly). If the parameter is set to 0, the polishing is not performed.
Use --resume
to resume a previous run of the assembler that may have terminated
prematurely (using the same output directory).
The assembly will continue from the last previously completed step.
You might also resume from a particular stage with --resume-from stage_name
,
where stage_name
is a choice of assembly, consensus, repeat, trestle, polishing
.
For example, you might supply different sets of reads for different stages.
The main output files are:
assembly.fasta
- Final assembly. Contains contigs and possibly scaffolds (see below).assembly_graph.{gfa|gv}
- Final repeat graph. Note that the edge sequences might be different (shorter) than contig sequences, because contigs might include multiple graph edges (see below).assembly_info.txt
- Extra information about contigs (such as length or coverage).
Each contig is formed by a single unique graph edge. If possible, unique contigs are extended with the sequence from flanking unresolved repeats on the graph. Thus, a contig fully contains the corresponding graph edge (with the same id), but might be longer then this edge. This is somewhat similar to unitig-contig relation in OLC assemblers. In a rare case when a repetitive graph edge is not covered by the set of "extended" contigs, it will be also output in the assembly file.
Sometimes it is possible to further order contigs into scaffolds based on the
repeat graph structure. These ordered contigs will be output as a part of scaffold
in the assembly file (with a scaffold_
prefix). Since it is hard to give a reliable estimate of the
gap size, those gaps are represented with the default 100 Ns. assembly_info.txt
file (below) contains additional information about how scaffolds were formed.
Extra information about contigs/scaffolds is output into the assembly_info.txt
file.
It is a tab-delimited table with the columns as follows:
- Contig/scaffold id
- Length
- Coverage
- Is circular, (Y)es or (N)o
- Is repetitive, (Y)es or (N)o
- Multiplicity (based on coverage)
- Alternative group
- Graph path (graph path corresponding to this contig/scaffold).
Scaffold gaps are marked with ??
symbols, and *
symbol denotes a
terminal graph node.
Alternative contigs (representing alternative haplotypes) will have the same
alt. group ID. Primary contigs are marked by *
The Flye algorithms are using repeat graph as a core data structure. In difference to de Bruijn graphs which require exact k-mer matches, repeat graphs are built using approximate sequence matches, thus can tollerate higher noise of SMS reads.
The edges of repeat graph represent genomic sequence, and nodes define the junctions. All edges are classified into unique and repetitive. The genome traverses the graph in an unknown way, so as each unique edge appears exactly once in this traversal. Repeat graphs are useful for repeat analysis and resolution - which are one of the key genome assembly challenges.
Above is an example of a repeat graph of a bacterial assembly. Each edge is labeled with its id, length and coverage. Repetitive edges are shown in color, and unique edges are black. Note that each edge is represented in two copies: forward and reverse complement (marked with +/- signs), therefore the entire genome is represented in two copies as well.
In this example, there are two unresolved repeats: (i) a red repeat of multiplicity two and length 35k and (ii) a green repeat cluster of multiplicity three and length 34k - 36k. As the repeats remained unresolved, there are no reads in the dataset that cover those repeats in full. Five unique edges will correspond to five contigs in the final assembly.
Repeat graphs produced by Flye could be visualized using AGB or Bandage.
Repeat graph before repeat resolution could be found in
the 20-repeat/graph_before_rr.gv
file.
Genome | Data | Asm.Size | NG50 | CPU time | RAM |
---|---|---|---|---|---|
E.coli | PB 50x | 4.6 Mb | 4.6 Mb | 2 h | 2 Gb |
C.elegans | PB 40x | 102 Mb | 3.6 Mb | 100 h | 31 Gb |
A.thaliana | PB 75x | 120 Mb | 9.5 Mb | 100 h | 46 Gb |
D.melanogaster | ONT 30x | 139 Mb | 10.6 Mb | 130 h | 31 Gb |
D.melanogaster | PB 120x | 142 Mb | 18.8 Mb | 150 h | 75 Gb |
Human NA12878 | ONT 35x (rel6) | 2.9 Gb | 33.2 Mb | 2500 h | 714 Gb |
Human CHM13 T2T | ONT 120x (rel3) | 2.9 Gb | 75.1 Mb | 5000 h | 871 Gb |
Human HG002 | PB CCS 30x | 2.9 Gb | 27.5 Mb | 1400 h | 272 Gb |
Human CHM1 | PB 100x | 2.8 Gb | 21.5 Mb | 2700 h | 676 Gb |
HMP mock | PB meta 7 Gb | 66 Mb | 2.6 Mb | 60 h | 72 Gb |
Zymo Even | ONT meta 14 Gb | 64 Mb | 0.6 Mb | 60 h | 129 Gb |
Zymo Log | ONT meta 16 Gb | 23 Mb | 1.3 Mb | 100 h | 76 Gb |
The assemblies generated using Flye 2.7 could be downloaded from Zenodo.
All datasets were run with default parameters for the corresponding read type
with the following exceptions: CHM13 T2T was run with --min-overlap 10000 --asm-coverage 50
;
CHM1 was run with --asm-overage 40
.
This is a brief description of the Flye algorithm. Please refer to the manuscript for more detailed information. The draft contig extension is organized as follows:
- K-mer counting / erroneous k-mer pre-filtering
- Solid k-mer selection (k-mers with sufficient frequency, which are unlikely to be erroneous)
- Contig extension. The algorithm starts from a single read and extends it with a next overlapping read (overlaps are dynamically detected using the selected solid k-mers).
Note that we do not attempt to resolve repeats at this stage, thus the reconstructed contigs might contain misassemblies. Flye then aligns the reads on these draft contigs using minimap2 and calls a consensus. Afterwards, Flye performs repeat analysis as follows:
- Repeat graph is constructed from the (possibly misassembled) contigs
- In this graph all repeats longer than minimum overlap are collapsed
- The algorithm resolves repeats using the read information and graph structure
- The unbranching paths in the graph are output as contigs
If enabled, after resolving bridged repeats, Trestle module attempts to resolve simple unbridged repeats (of multiplicity 2) using the heterogeneities between repeat copies. Finally, Flye performs polishing of the resulting assembly to correct the remaining errors:
- Alignment of all reads to the current assembly using minimap2
- Partition the alignment into mini-alignments (bubbles)
- Error correction of each bubble using a maximum likelihood approach
The polishing steps could be repeated, which might slightly increase quality for some datasets.