Canu v2.0
These are release notes for Canu version 2.0, which was released on March 18th, 2020. Canu is specialized for assembly of single-molecule high-noise sequences. Full documentation can be found at http://canu.readthedocs.org/.
This release provides a stable, tested, and documented version of the software. The binary distributions should work on any relatively recent version of the respective OS and are the recommended way to install Canu. The source code distribution contains everything you need to create a binary distribution for your own specific OS.
Citation
- Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Research. (2017).
- Koren S, Rhie A, Walenz BP, Dilthey AT, Bickhart DM, Kingan SB, Hiendleder S, Williams JL, Smith TPL, Phillippy AM. De novo assembly of haplotype-resolved genomes with trio binning. Nature Biotechnology. (2018).
- Nurk S, Walenz BP, Rhie A, Vollger MR, Logsdon GA, Grothe R, Miga KH, Eichler EE, Phillippy AM, Koren S. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. bioRxiv. (2020).
Minimum Requirements
- 8GB minimum memory; 16GB strongly suggested
- GCC 4.5 (for compilation only); GCC 7 or newer strongly recommended
- Perl 5.12.0, or File::Path 2.08
- Java SE 8
- macOS 10.10 Yosemite (for macOS/Darwin binaries only)
- gnuplot 5.2 (optional, for generating diagnostic graphs)
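The version minimums above can be checked before building from source. The following is a minimal sketch and not part of Canu: `version_ge` is a hypothetical helper that compares dotted numeric versions, and the commands used to query each tool's version are assumptions about a typical installation.

```shell
#!/bin/sh
# version_ge A B: succeed if dotted numeric version A >= B.
# Hypothetical helper; handles purely numeric components only.
version_ge() {
    a=$1
    b=$2
    while [ -n "$a" ] || [ -n "$b" ]; do
        # Leading component of each version; a missing component counts as 0.
        ha=${a%%.*}
        hb=${b%%.*}
        [ -n "$ha" ] || ha=0
        [ -n "$hb" ] || hb=0
        if [ "$ha" -gt "$hb" ]; then return 0; fi
        if [ "$ha" -lt "$hb" ]; then return 1; fi
        # Drop the component that was just compared.
        case $a in *.*) a=${a#*.} ;; *) a= ;; esac
        case $b in *.*) b=${b#*.} ;; *) b= ;; esac
    done
    return 0
}

# Example checks against the minimums listed above (compilation only needs GCC).
if command -v gcc >/dev/null 2>&1; then
    gcc_version=$(gcc -dumpversion)
    version_ge "$gcc_version" 4.5 || echo "GCC $gcc_version is older than 4.5"
    version_ge "$gcc_version" 7 || echo "GCC $gcc_version: 7 or newer is recommended"
fi
if command -v perl >/dev/null 2>&1; then
    perl_version=$(perl -e 'printf "%vd", $^V')
    version_ge "$perl_version" 5.12.0 || echo "Perl $perl_version is older than 5.12.0"
fi
```

On systems where `gcc` is an alias for another compiler, the reported version may not be a GCC version, so treat any warning as a prompt to check by hand.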
Installation
Users can download Canu as source code or as pre-compiled binaries. The binary distribution is the recommended install method, assuming it is available for your platform. The source code package needs to be compiled and installed before it can be used.
To install from a binary distribution (recommended installation method):
tar -xJf canu-2.0.*.tar.xz
To install from source code (the file can be named either canu-v2.0.tar.gz or just v2.0.tar.gz, depending on how it is downloaded):
gunzip -dc canu-v2.0.tar.gz | tar -xf -
cd canu-2.0/src
make -j 8
cd ..
In both cases, canu is installed in directory canu-2.0/OS-ARCH (operating system and architecture), for example, canu-2.0/Linux-amd64. You can run the assembler with:
canu-2.0/*/bin/canu
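Because the platform directory name varies, a script may prefer to resolve the glob once. A minimal sketch, assuming the layout described above; `find_canu` is a hypothetical helper and not part of the distribution:

```shell
#!/bin/sh
# find_canu ROOT: print the first executable matching ROOT/*/bin/canu.
# Hypothetical helper, not part of Canu.
find_canu() {
    for candidate in "$1"/*/bin/canu; do
        if [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    echo "no canu executable found under $1" >&2
    return 1
}

# Example: confirm the installed binary runs by asking for its version.
if canu_bin=$(find_canu canu-2.0); then
    "$canu_bin" -version
fi
```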
Changes
This release introduces support for PacBio HiFi assembly and includes several major bug fixes.
Canu v2.0 IS NOT compatible with assemblies started with any previous version.
- Support for HiFi data using option '-pacbio-hifi'. Full details in the preprint HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads.
- Numerous improvements to contig construction that produce longer, more correct contigs:
** Detect bubbles during contig construction and prevent them from shattering heterozygous genomes.
** Detect and remove short branches during contig construction.
** Detect reads that are not fully covered by overlaps and exclude them from contigs.
- Option 'stopOnReadQuality' is enabled by default, but no longer aborts if there are too many short reads.
- Option 'minInputCoverage' will stop the assembly if the input read coverage is below this value, default 10. This supplements 'stopOnLowCoverage', which stops if read coverage is below some value after input, after correction or after trimming.
- Option 'maxInputCoverage', default 200, will randomly down-sample input reads to this coverage. It replaces option 'readSamplingCoverage' ('readSamplingBias' still exists).
- Write intermediate Mhap outputs to the stageDirectory if it is set.
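The two input coverage options can be read as a check on estimated input coverage, that is, total input bases divided by genome size. The sketch below illustrates only that arithmetic under the defaults stated above. It is not Canu's implementation, and `coverage_action` is a hypothetical helper.

```shell
#!/bin/sh
# coverage_action TOTAL_BASES GENOME_SIZE [MIN_COV] [MAX_COV]
# Prints "stop", "downsample", or "keep" for the estimated input coverage.
# Hypothetical helper; it only mirrors the behaviour described in these notes.
coverage_action() {
    total_bases=$1
    genome_size=$2
    min_cov=${3:-10}    # default of minInputCoverage
    max_cov=${4:-200}   # default of maxInputCoverage
    # Compare by multiplication to avoid rounding from integer division.
    if [ "$total_bases" -lt $(( min_cov * genome_size )) ]; then
        echo stop
    elif [ "$total_bases" -gt $(( max_cov * genome_size )) ]; then
        echo downsample
    else
        echo keep
    fi
}

coverage_action 40000000 5000000      # 8x of a 5 Mbp genome
coverage_action 150000000 5000000     # 30x
coverage_action 1500000000 5000000    # 300x
```

On the command line, both options are passed as key=value pairs alongside the read type, for example `canu -p asm -d asm-dir genomeSize=5m minInputCoverage=10 maxInputCoverage=200 -pacbio-hifi reads.fastq.gz`. The prefix, directory, genome size, and file name in that command are placeholders.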
Bug Fixes
- Multiple fixes to read positioning during contig construction (Assertion 'cnt > 0' failed.)
- Possibly fix a weird error reading overlapper output that resulted in out of memory errors (terminate called after throwing an instance of 'std::bad_alloc').
- A variety of bug fixes that nobody will really care about (unless your assembly crashed, in which case you already know it's fixed) and will be tedious to list, so they aren't listed.
Known Issues
See the issues page for up-to-date open issues, or to report a problem.
- Large memory usage and runtime for long reads (e.g., Nanopore) when using the overlapper=ovl algorithm, and during Overlap Error Adjustment. The -fast option enables a significantly faster algorithm, but may produce slightly less contiguous assemblies on genomes larger than 1 Gbp in size. It is recommended for nanopore genomes smaller than 1 Gbp.
- No support for trio binning of HiFi data. As a workaround, specify the HiFi data as -pacbio-raw and run only the haplotyping step (-haplotype) followed by assembly of the partitioned reads.
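The trio binning workaround can be laid out as two stages. The sketch below only prints the commands, so it can be reviewed before anything is run. Several parts are assumptions to verify against the Canu documentation: the parental read options (`-haplotypeMat`, `-haplotypePat`), the location of the partitioned reads (`haplotype/haplotype-NAME.fasta.gz` inside the haplotyping directory), and the choice of `-pacbio-hifi` for the second stage. All file names, prefixes, and the genome size are placeholders, and `print_trio_hifi_workaround` is a hypothetical helper.

```shell
#!/bin/sh
# Print (do not run) the two-stage workaround described above.
# Arguments: CANU_BIN GENOME_SIZE HIFI_READS MATERNAL_READS PATERNAL_READS
print_trio_hifi_workaround() {
    canu_bin=$1
    genome_size=$2
    hifi_reads=$3
    mat_reads=$4
    pat_reads=$5

    # Stage 1: haplotyping only, with the HiFi reads passed as -pacbio-raw.
    echo "$canu_bin -haplotype -p asm -d trio genomeSize=$genome_size" \
         "-haplotypeMat $mat_reads -haplotypePat $pat_reads" \
         "-pacbio-raw $hifi_reads"

    # Stage 2: assemble each partitioned read set (output path is assumed).
    for hap in Mat Pat; do
        echo "$canu_bin -p asm-$hap -d trio-$hap genomeSize=$genome_size" \
             "-pacbio-hifi trio/haplotype/haplotype-$hap.fasta.gz"
    done
}

print_trio_hifi_workaround canu 3.1g child.hifi.fastq.gz mother.fastq.gz father.fastq.gz
```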
See the FAQ for many suggestions, including suggestions for specific data types, e.g., Nanopore r9 reads.
Legal
Canu is derived from Celera Assembler and includes code from many other projects. Most, but not all, of the code is GPL licensed. See the README.licenses file and individual source code files for details.