Flye manual

Quick usage

usage: flye (--pacbio-raw | --pacbio-corr | --pacbio-hifi | --nano-raw |
         --nano-corr | --subassemblies) file1 [file_2 ...]
         --genome-size SIZE --out-dir PATH

         [--threads int] [--iterations int] [--min-overlap int]
         [--meta] [--plasmids] [--trestle] [--polish-target]
         [--keep-haplotypes] [--debug] [--version] [--help] 
         [--resume] [--resume-from] [--stop-after]

Assembly of long reads with repeat graphs

optional arguments:
  -h, --help            show this help message and exit
  --pacbio-raw path [path ...]
                        PacBio raw reads
  --pacbio-corr path [path ...]
                        PacBio corrected reads
  --pacbio-hifi path [path ...]
                        PacBio HiFi reads
  --nano-raw path [path ...]
                        ONT raw reads
  --nano-corr path [path ...]
                        ONT corrected reads
  --subassemblies path [path ...]
                        high-quality contigs input
  -g size, --genome-size size
                        estimated genome size (for example, 5m or 2.6g)
  -o path, --out-dir path
                        Output directory
  -t int, --threads int
                        number of parallel threads [1]
  -i int, --iterations int
                        number of polishing iterations [1]
  -m int, --min-overlap int
                        minimum overlap between reads [auto]
  --asm-coverage int    reduced coverage for initial disjointig assembly [not
                        set]
  --plasmids            rescue short unassembled plasmids
  --meta                metagenome / uneven coverage mode
  --keep-haplotypes     do not collapse alternative haplotypes
  --trestle             enable Trestle [disabled]
  --polish-target path  run polisher on the target sequence
  --resume              resume from the last completed stage
  --resume-from stage_name
                        resume from a custom stage
  --stop-after stage_name
                        stop after the specified stage completed
  --debug               enable debug output
  -v, --version         show program's version number and exit

Input reads can be in FASTA or FASTQ format, uncompressed or compressed with gz. Currently, PacBio (raw, corrected, HiFi) and ONT reads (raw, corrected) are supported. Expected error rates are <30% for raw, <3% for corrected, and <1% for HiFi. Note that Flye was primarily developed to run on raw reads. Additionally, the --subassemblies option performs a consensus assembly of multiple sets of high-quality contigs. You may specify multiple files with reads (separated by spaces). Mixing different read types is not yet supported. The --meta option enables the mode for metagenome/uneven coverage assembly.

You must provide an estimate of the genome size as input, which is used for solid k-mer selection. Standard size modifiers are supported (e.g. 5m or 2.6g). In the case of metagenome assembly, the expected total assembly size should be provided.

To reduce memory consumption for large genome assemblies, you can use a subset of the longest reads for the initial disjointig assembly by specifying the --asm-coverage option. Typically, 40x coverage is enough to produce good disjointigs.

You can run the Flye polisher as a standalone tool using the --polish-target option.

Examples

You can try Flye assembly on these ready-to-use datasets:

E. coli P6-C4 PacBio data

The original dataset is available at the PacBio website. We converted the raw bas.h5 file to the FASTA format for convenience.

wget https://zenodo.org/record/1172816/files/E.coli_PacBio_40x.fasta
flye --pacbio-raw E.coli_PacBio_40x.fasta --out-dir out_pacbio --genome-size 5m --threads 4

Here, 5m is the expected genome size, the --threads argument is optional (adjust it for your environment), and out_pacbio is the directory where the assembly results will be placed.

E. coli Oxford Nanopore Technologies data

The dataset was originally released by the Loman lab.

wget https://zenodo.org/record/1172816/files/Loman_E.coli_MAP006-1_2D_50x.fasta
flye --nano-raw Loman_E.coli_MAP006-1_2D_50x.fasta --out-dir out_nano --genome-size 5m --threads 4
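
If you launch Flye from a script, the two example commands above can be assembled programmatically. The following is a minimal sketch, not part of Flye: build_flye_command is a hypothetical helper, and it only uses flags that appear in the usage text above.

```python
import shlex

# Read-type flags listed in the usage text above.
READ_TYPE_FLAGS = {
    "pacbio-raw", "pacbio-corr", "pacbio-hifi",
    "nano-raw", "nano-corr", "subassemblies",
}


def build_flye_command(read_type, read_files, out_dir, genome_size, threads=None):
    """Return the argument list for a Flye run (hypothetical helper)."""
    if read_type not in READ_TYPE_FLAGS:
        raise ValueError(f"unknown read type: {read_type}")
    if not read_files:
        raise ValueError("at least one read file is required")
    cmd = ["flye", f"--{read_type}", *read_files,
           "--out-dir", out_dir, "--genome-size", genome_size]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    return cmd


cmd = build_flye_command(
    "nano-raw", ["Loman_E.coli_MAP006-1_2D_50x.fasta"], "out_nano", "5m", threads=4
)
# Print the command in a form that can be pasted into a shell.
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Keeping the command as a list (rather than a single string) avoids quoting problems when file names contain spaces.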

Supported Input Data

PacBio data

Flye was tested on raw PacBio reads (P5C3 and P6C4) with error rate ~15%. Note that Flye assumes that the input files represent PacBio subreads, i.e. adaptors and noise are trimmed and multiple passes of the same insert sequence are separated. This is typically handled by PacBio instruments/toolchains; however, we have seen incorrect third-party raw -> fastq conversions that resulted in incorrectly trimmed data. If Flye fails to produce reasonable assemblies, make sure that your reads are properly preprocessed.

Flye now supports assembly of PacBio HiFi reads via the --pacbio-hifi option. The expected read error rate is <1%.

Oxford Nanopore data

We performed our benchmarks with raw ONT reads (R7-R9) with error rate ~15%. Due to the biased error pattern, per-nucleotide accuracy is usually lower for ONT data than for PacBio data, especially in homopolymer regions.

Error-corrected reads input

While Flye was designed for assembly of raw reads (and this is the recommended way), it also supports error-corrected PacBio/ONT reads as input (use the --pacbio-corr or --nano-corr option). The parameters are optimized for error rates <3%. If you are getting a highly fragmented assembly, the error rates in your reads are most likely higher. In this case, consider assembling the raw reads instead.

Consensus of multiple contig sets

The --subassemblies input mode generates a consensus of multiple high-quality contig assemblies (such as those produced by different short/long read assemblers). The expected error rate is <1%. You might want to skip the polishing stage with the --iterations 0 argument (however, polishing might still help to correct small structural errors).

Input data preparation

Flye works directly with base-called raw reads and does not require any prior error correction. Flye automatically detects chimeric reads or reads with low-quality ends, so you do not need to curate them before the assembly. However, it is always worth checking for possible contamination in the reads, since it may affect the automatic selection of estimated parameters for solid k-mers and genome size / coverage.

Parameter descriptions

Estimated genome size (required)

You must provide an estimate of the genome size as input, which is used for solid k-mer selection. The estimate can be rough (e.g. within a 0.5x-2x range) and does not affect the other assembly stages. Standard size modifiers are supported (e.g. 5m or 2.6g).
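
To make the size modifiers concrete, here is a minimal sketch of how a string such as 5m or 2.6g can be turned into a number of bases. It illustrates the notation only and is not Flye's own parser; parse_genome_size is a hypothetical name, and support for a "k" suffix is an assumption (the text above only mentions m and g).

```python
def parse_genome_size(text):
    """Convert a size string such as '5m' or '2.6g' to a number of bases."""
    # 'm' and 'g' come from the manual; 'k' is an assumption added for completeness.
    multipliers = {"k": 10**3, "m": 10**6, "g": 10**9}
    text = text.strip().lower()
    if not text:
        raise ValueError("empty genome size")
    suffix = text[-1]
    if suffix in multipliers:
        # round() guards against floating-point noise in values like 2.6 * 10**9.
        return int(round(float(text[:-1]) * multipliers[suffix]))
    return int(text)


print(parse_genome_size("5m"))
print(parse_genome_size("2.6g"))
```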

Minimum overlap length

This sets a minimum overlap length for two reads to be considered overlapping. In the latest Flye versions, this parameter is chosen automatically based on the read length distribution (read N90) and does not require manual setting. Typical values are 3k-5k (and down to 1k for datasets with shorter read length). Intuitively, we want to set this parameter as high as possible, so that the repeat graph is less tangled. However, higher values might lead to assembly gaps.
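
For reference, the read N90 mentioned above is the length L such that reads of length L or longer contain 90% of all bases. The sketch below computes it from a list of read lengths. How Flye maps N90 to the final minimum overlap is not described here, so the code deliberately stops at the statistic itself; read_nx is a hypothetical name.

```python
def read_nx(lengths, fraction=0.9):
    """Return the length L such that reads >= L hold `fraction` of all bases."""
    if not lengths:
        raise ValueError("no read lengths given")
    target = fraction * sum(lengths)
    running = 0
    # Walk from the longest read down until the target share of bases is reached.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length


lengths = [10000, 8000, 6000, 4000, 2000]
print(read_nx(lengths))        # N90
print(read_nx(lengths, 0.5))   # N50, for comparison
```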

In some rare cases it makes sense to manually increase minimum overlap for assemblies of big genomes with long reads and high coverage.

Metagenome mode

Metagenome assembly mode is designed for highly non-uniform coverage and is sensitive to underrepresented sequences at low coverage (as low as 2x). In some examples of simple metagenomes, we observed that the normal (isolate) Flye mode assembled more contiguous bacterial consensus sequences, while the metagenome mode was slightly more fragmented but revealed strain mixtures. For relatively complex metagenomes, the --meta mode is recommended.

Haplotype mode

By default, Flye (and metaFlye) collapses graph structures caused by alternative haplotypes (bubbles, superbubbles, roundabouts) to produce longer consensus contigs. The option --keep-haplotypes retains the alternative paths on the graph, producing a less contiguous but more detailed assembly.

Trestle

Trestle is an extra module that resolves simple repeats of multiplicity 2 that were not bridged by reads. Depending on the dataset, it might resolve a few extra repeats, which is helpful for small (bacterial) genomes. Use the --trestle option to enable the module. On large genomes, the contiguity improvements are usually minimal, but the computation might take a lot of time.

Reduced contig assembly coverage

Typically, assemblies of large genomes at high coverage require hundreds of GB of RAM. For high-coverage assemblies, you can reduce memory usage by using only a subset of the longest reads for the initial contig extension stage (usually the memory bottleneck). The parameter --asm-coverage specifies the target coverage of the longest reads. For a typical assembly, 30x is enough to produce good initial contigs. Regardless of this parameter, all reads will be used at the later pipeline stages.
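
The idea of taking the longest reads up to a target coverage can be sketched as follows. This is an illustration under simple assumptions, not Flye's implementation: select_longest_reads is a hypothetical helper, and coverage is approximated as total selected read length divided by the genome size estimate.

```python
def select_longest_reads(read_lengths, genome_size, target_coverage):
    """Pick the longest reads until they reach target_coverage * genome_size bases.

    read_lengths maps read id -> read length; returns the selected read ids.
    """
    budget = genome_size * target_coverage
    selected, total = [], 0
    by_length = sorted(read_lengths.items(), key=lambda item: item[1], reverse=True)
    for read_id, length in by_length:
        if total >= budget:
            break
        selected.append(read_id)
        total += length
    return selected


# Toy numbers: a 100 bp "genome" and a target coverage of 8x.
reads = {"r1": 500, "r2": 300, "r3": 200, "r4": 100}
print(select_longest_reads(reads, genome_size=100, target_coverage=8))
```

If the dataset has less coverage than requested, the loop simply returns all reads.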

Number of polishing iterations

Polishing is performed as the final assembly stage. By default, Flye runs one polishing iteration. Additional iterations might correct a small number of extra errors (due to improvements on how reads may align to the corrected assembly). If the parameter is set to 0, the polishing is not performed.

Re-starting from a particular assembly stage

Use --resume to resume a previous run of the assembler that may have terminated prematurely (using the same output directory). The assembly will continue from the last previously completed step.

You might also resume from a particular stage with --resume-from stage_name, where stage_name is a choice of assembly, consensus, repeat, trestle, polishing. For example, you might supply different sets of reads for different stages.

Flye output

The main output files are:

  • assembly.fasta - Final assembly. Contains contigs and possibly scaffolds (see below).
  • assembly_graph.{gfa|gv} - Final repeat graph. Note that the edge sequences might be different (shorter) than contig sequences, because contigs might include multiple graph edges (see below).
  • assembly_info.txt - Extra information about contigs (such as length or coverage).

Each contig is formed by a single unique graph edge. If possible, unique contigs are extended with the sequence from flanking unresolved repeats on the graph. Thus, a contig fully contains the corresponding graph edge (with the same id), but might be longer than this edge. This is somewhat similar to the unitig-contig relation in OLC assemblers. In the rare case when a repetitive graph edge is not covered by the set of "extended" contigs, it will also be output in the assembly file.

Sometimes it is possible to further order contigs into scaffolds based on the repeat graph structure. These ordered contigs will be output as part of a scaffold in the assembly file (with a scaffold_ prefix). Since it is hard to give a reliable estimate of the gap size, those gaps are represented with the default 100 Ns. The assembly_info.txt file (below) contains additional information about how scaffolds were formed.
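
If a downstream tool needs the individual contigs of a scaffold, the 100-N gaps can be used as split points. The following is a minimal sketch; split_scaffold is a hypothetical helper, and it assumes that any run of at least 100 Ns is a scaffold gap rather than ambiguous sequence.

```python
import re


def split_scaffold(sequence, min_gap=100):
    """Split a scaffold sequence into contigs at runs of >= min_gap N characters."""
    gap_pattern = "[Nn]{%d,}" % min_gap
    # Drop empty pieces that appear if the sequence starts or ends with a gap.
    return [part for part in re.split(gap_pattern, sequence) if part]


scaffold = "ACGT" + "N" * 100 + "GGCC"
print(split_scaffold(scaffold))
```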

Extra information about contigs/scaffolds is output into the assembly_info.txt file. It is a tab-delimited table with the columns as follows:

  • Contig/scaffold id
  • Length
  • Coverage
  • Is circular, (Y)es or (N)o
  • Is repetitive, (Y)es or (N)o
  • Multiplicity (based on coverage)
  • Alternative group
  • Graph path (graph path corresponding to this contig/scaffold).

Scaffold gaps are marked with ?? symbols, and the * symbol denotes a terminal graph node.

Alternative contigs (representing alternative haplotypes) will have the same alt. group ID. Primary contigs are marked by *.
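
The table described above can be loaded with a few lines of Python. This is a sketch under assumptions: the column order follows the list above, lines starting with # are treated as a header and skipped, and the sample rows are made up for illustration and are not real Flye output. ContigInfo and read_assembly_info are hypothetical names.

```python
import csv
import io
from typing import NamedTuple


class ContigInfo(NamedTuple):
    seq_id: str
    length: int
    coverage: float
    circular: bool
    repetitive: bool
    multiplicity: int
    alt_group: str
    graph_path: str


def read_assembly_info(handle):
    """Parse a tab-delimited assembly_info.txt-style table into ContigInfo records."""
    records = []
    for row in csv.reader(handle, delimiter="\t"):
        # Assumption: header/comment lines start with '#'.
        if not row or row[0].startswith("#"):
            continue
        records.append(ContigInfo(
            row[0], int(row[1]), float(row[2]),
            row[3] == "Y", row[4] == "Y",
            int(row[5]), row[6], row[7],
        ))
    return records


# Made-up rows for illustration only.
sample = (
    "#id\tlength\tcoverage\tcircular\trepeat\tmultiplicity\talt_group\tgraph_path\n"
    "contig_1\t4600000\t50\tY\tN\t1\t*\t1\n"
    "scaffold_2\t120000\t48\tN\tN\t1\t*\t2,??,-3\n"
)
records = read_assembly_info(io.StringIO(sample))
for rec in records:
    print(rec.seq_id, rec.length, "circular" if rec.circular else "linear")
```

For a real run, replace the StringIO object with open("assembly_info.txt") inside the output directory.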

Repeat graph

The Flye algorithms use the repeat graph as a core data structure. In contrast to de Bruijn graphs, which require exact k-mer matches, repeat graphs are built using approximate sequence matches and thus can tolerate the higher noise of SMS reads.

The edges of a repeat graph represent genomic sequence, and nodes define the junctions. All edges are classified as either unique or repetitive. The genome traverses the graph in an unknown way, such that each unique edge appears exactly once in this traversal. Repeat graphs are useful for repeat analysis and resolution, which is one of the key genome assembly challenges.

Graph example

Above is an example of a repeat graph of a bacterial assembly. Each edge is labeled with its id, length and coverage. Repetitive edges are shown in color, and unique edges are black. Note that each edge is represented in two copies: forward and reverse complement (marked with +/- signs), therefore the entire genome is represented in two copies as well.

In this example, there are two unresolved repeats: (i) a red repeat of multiplicity two and length 35k and (ii) a green repeat cluster of multiplicity three and length 34k - 36k. Since the repeats remained unresolved, we can infer that no reads in the dataset cover them in full. Five unique edges will correspond to five contigs in the final assembly.

Repeat graphs produced by Flye can be visualized using AGB or Bandage.

The repeat graph before repeat resolution can be found in the 20-repeat/graph_before_rr.gv file.

Flye benchmarks

| Genome          | Data            | Asm.Size | NG50    | CPU time | RAM    |
|-----------------|-----------------|----------|---------|----------|--------|
| E.coli          | PB 50x          | 4.6 Mb   | 4.6 Mb  | 2 h      | 2 Gb   |
| C.elegans       | PB 40x          | 102 Mb   | 3.6 Mb  | 100 h    | 31 Gb  |
| A.thaliana      | PB 75x          | 120 Mb   | 9.5 Mb  | 100 h    | 46 Gb  |
| D.melanogaster  | ONT 30x         | 139 Mb   | 10.6 Mb | 130 h    | 31 Gb  |
| D.melanogaster  | PB 120x         | 142 Mb   | 18.8 Mb | 150 h    | 75 Gb  |
| Human NA12878   | ONT 35x (rel6)  | 2.9 Gb   | 33.2 Mb | 2500 h   | 714 Gb |
| Human CHM13 T2T | ONT 120x (rel3) | 2.9 Gb   | 75.1 Mb | 5000 h   | 871 Gb |
| Human HG002     | PB CCS 30x      | 2.9 Gb   | 27.5 Mb | 1400 h   | 272 Gb |
| Human CHM1      | PB 100x         | 2.8 Gb   | 21.5 Mb | 2700 h   | 676 Gb |
| HMP mock        | PB meta 7 Gb    | 66 Mb    | 2.6 Mb  | 60 h     | 72 Gb  |
| Zymo Even       | ONT meta 14 Gb  | 64 Mb    | 0.6 Mb  | 60 h     | 129 Gb |
| Zymo Log        | ONT meta 16 Gb  | 23 Mb    | 1.3 Mb  | 100 h    | 76 Gb  |

The assemblies generated using Flye 2.7 can be downloaded from Zenodo. All datasets were run with default parameters for the corresponding read type, with the following exceptions: CHM13 T2T was run with --min-overlap 10000 --asm-coverage 50; CHM1 was run with --asm-coverage 40.
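
The NG50 column in the benchmark table uses the standard definition: the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the estimated genome size (rather than half of the assembly size, as in N50). A minimal sketch of that calculation, with ng50 as a hypothetical helper name and toy numbers:

```python
def ng50(contig_lengths, genome_size):
    """Return NG50: contig length at which the running total reaches genome_size / 2."""
    half = genome_size / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half:
            return length
    # The assembly covers less than half of the genome estimate.
    return 0


print(ng50([50, 30, 20, 10], genome_size=120))
```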

Algorithm Description

This is a brief description of the Flye algorithm. Please refer to the manuscript for more detailed information. The draft contig extension is organized as follows:

  • K-mer counting / erroneous k-mer pre-filtering
  • Solid k-mer selection (k-mers with sufficient frequency, which are unlikely to be erroneous)
  • Contig extension. The algorithm starts from a single read and extends it with a next overlapping read (overlaps are dynamically detected using the selected solid k-mers).
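
The first two steps above can be illustrated with a toy k-mer counter. This is a simplified sketch and not Flye's implementation: it counts k-mers on the forward strand only, and the fixed min_count threshold is a hypothetical stand-in for Flye's frequency-based selection, which also takes the genome size estimate into account.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(counts, min_count):
    """Keep k-mers seen at least min_count times; rare k-mers are likely errors."""
    return {kmer for kmer, n in counts.items() if n >= min_count}


reads = ["ACGTAC", "CGTACG", "TTGCAA"]
counts = count_kmers(reads, k=3)
print(sorted(solid_kmers(counts, min_count=2)))
```

In this toy input, the k-mers shared by the first two reads pass the threshold, while those unique to the third read are filtered out.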

Note that we do not attempt to resolve repeats at this stage, thus the reconstructed contigs might contain misassemblies. Flye then aligns the reads on these draft contigs using minimap2 and calls a consensus. Afterwards, Flye performs repeat analysis as follows:

  • Repeat graph is constructed from the (possibly misassembled) contigs
  • In this graph all repeats longer than minimum overlap are collapsed
  • The algorithm resolves repeats using the read information and graph structure
  • The unbranching paths in the graph are output as contigs

If enabled, after resolving bridged repeats, the Trestle module attempts to resolve simple unbridged repeats (of multiplicity 2) using the heterogeneities between repeat copies. Finally, Flye performs polishing of the resulting assembly to correct the remaining errors:

  • Alignment of all reads to the current assembly using minimap2
  • Partition the alignment into mini-alignments (bubbles)
  • Error correction of each bubble using a maximum likelihood approach

The polishing steps could be repeated, which might slightly increase quality for some datasets.