Merge branch 'v0.0.1-changelog' into 'master'
Added changelog and updated readme

See merge request machine-learning/dorado!110
iiSeymour committed Oct 5, 2022
2 parents 4b67720 + 0a0d53d commit b1ae8d0
Showing 2 changed files with 107 additions and 10 deletions.
31 changes: 31 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,31 @@
# Changelog

All notable changes to Dorado will be documented in this file.

# [0.0.1] (05 Oct 2022)

We are excited to announce the first binary release of Dorado. This Dorado release introduces important new features such as support for modified base calling, and significant improvements to basecalling performance and usability, taking it to the state of the art for speed and accuracy.

## Major changes
* d3ddd1f078adc5b52ebfbb7d6aa5ee71acb0b7fb, 37e28f7b3d70dda469f3c498dcbe1ea5df722936 - Support for modified base calling; performance enhancements.
* dd79bf5fb4b005052eb46969cfadd8ef2af8378e, 2a7fc176a5c0075a6fbf95dd3f7a41d52e420963, 465cb4a29e8cfd45b74064f13eb5c152fa2fa1c6, 56482fbd364a8d2cacb608b13b3a7f1792a604e3 - Support for basecalling on M1 family Apple Silicon.
* bd6014edc8de374645ade284dd103eccbfa481db - Support for basecalling on systems with multiple Nvidia GPUs.
* 41fdb1189a4677c6932a4c4467d69c73407dfaaa - Support for POD5 file format.
* 075065447d1a273f3101037c1578647bc2ad8b1e - Addition of new “Quantile”-based read scaling algorithm for higher accuracy.
* 8acf2baa35932a9c42a419b6b620f92e25a87bba - Upgrade to torch 1.12.1
* 9955d0de71d36e279b44e46545f5dfb6c742f224 - Added fast int8-quantization optimisation for LSTM networks with layer size of 96 or 128
* f2e993d3961a52072cf43b0f327dbc21029c3aad - New cuBLAS-based implementation for LSTMs leading to state-of-the-art performance.
* 6ec50dc5cecc65f0ff940420c0de152ba561f85c - Major rearchitecture of CUDA model runners for higher basecalling speed and lower GPU memory utilisation.
* a0a197f4950d390221b6ffc82ccd8ce012c3c765 - Accuracy improvements to handling of short reads (<1Kb) with an upgraded padding strategy.
* d01bf04f7dd84b14753790fae83ba20b7776f498 - Ability to download basecalling models from Dorado.
* 7c7e59c6d65464f0eee40bf3d9f6885aae27839b - Support for SAM output

## Minor Changes
* 0e89d633d66f36256ad437ca0a3b64ff9eb0b1a1 - Automatic selection of batch size if user does not specify.
* 6afceea0195c07b02a40e43e0a395c3d82d44add - Dorado version added to SAM output, including commit hash
* 339b2fc5d7eee5be7f8289d51422a70ad06f6d58 - Scaling information recorded in SAM output
* afbfab92f8207b9a67aae0aa87478c4b95e647b8 - Timestamps added to SAM output
* 9ec2d970a0e5dee739daaafdfd08c20844140cb3 - Support for multi-threaded read scaling, preventing CPU bottlenecks and improving basecall speed.
* 7cbdbe04e76edf7d704e28263d64dddd6ab7d375 - Support for multi-threaded POD5 reading for higher data ingestion rate and improved performance.
* 5a33e83512343e9fd36470fa84fa36c97211672b - Automatic querying of M1 device type for detection of optimal basecalling parameters.
* 42703d0c02638633b44f68c0fc53534a9566b634 - Basecalling progress (Number of reads basecalled) printed out to terminal on Linux.
86 changes: 76 additions & 10 deletions README.md
@@ -1,29 +1,87 @@
# Dorado

This is a *preview version* of Dorado, a Libtorch Basecaller for Oxford Nanopore Reads. This software is in alpha preview stage and being released for early evaluation. It is subject to change. If you encounter any problems building or running Dorado please [report an issue](https://github.com/nanoporetech/dorado/issues).
Dorado is a high-performance, easy-to-use, open source basecaller for Oxford Nanopore reads.

## Downloading Dorado
## Features

We will be publishing pre-built releases in the next few days.
* One executable with sensible defaults, automatic hardware detection and configuration.
* Runs on Apple silicon (M1-family) and Nvidia GPUs including multi-GPU with linear scaling.
* Modified basecalling (Remora models).
* [POD5](https://github.com/nanoporetech/pod5-file-format) support for highest basecalling performance.
* Based on libtorch, the C++ API for PyTorch.
* Multiple custom optimisations in CUDA and Metal for maximising inference performance.

This is an alpha release of Dorado. The software is being released for evaluation. If you encounter any problems building or running Dorado please [report an issue](https://github.com/nanoporetech/dorado/issues).


## Installation

- [dorado-0.0.1-linux-x64](https://nanoporetech.box.com/shared/static/h8eqc9htxk938jzpl4fch2rqlm48yeb0.gz)
- [dorado-0.0.1-osx-arm64](https://nanoporetech.box.com/shared/static/e33yelum810yv09mao5mr7tgh8gj2e5f.gz)
- [dorado-0.0.1-windows-x64](https://nanoporetech.box.com/shared/static/vb2zxg30e5jl9dbit4eqtlvkb5p377te.zip)

## Running

To run Dorado, download a model and point it to POD5 files. Fast5 files are supported but will not be as performant.

```
$ dorado download --model [email protected]
$ dorado basecaller [email protected] pod5/ > calls.sam
$ dorado basecaller [email protected] pod5s/ > calls.sam
```

For unaligned BAM output, pipe Dorado's SAM output through samtools:

```
$ dorado basecaller [email protected] pod5s/ | samtools view -b > calls.bam
```
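A pipe like the one above only reports a missing tool once the run has already started. The following is a minimal sketch of a pre-flight check, assuming a bash shell; the helper name `require_tools` is hypothetical and not part of Dorado or samtools.

```shell
#!/usr/bin/env bash
# Hypothetical helper: return 0 if every named tool is on PATH,
# otherwise report the first missing one on stderr and return 1.
require_tools() {
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing required tool: $tool" >&2
            return 1
        fi
    done
}

# Check both ends of the pipe before starting a long basecalling run.
if require_tools dorado samtools; then
    echo "ready to basecall"
fi
```

The check uses only the shell built-in `command -v`, so it does not depend on any particular Dorado or samtools version.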

## Platforms

Dorado has been tested on the following systems:

| Platform | GPU/CPU |
| -------- | -------------------- |
| Windows | x86 |
| Apple | M1, M1 Max, M1 Ultra |
| Linux | (G)V100, A100 |
| Platform | GPU/CPU |
| -------- | ---------------------------- |
| Windows | (G)V100, A100 |
| Apple | M1, M1 Pro, M1 Max, M1 Ultra |
| Linux | (G)V100, A100 |

Systems not listed above that have Nvidia GPUs with >=8GB VRAM and a Volta or later architecture have not been widely tested but are expected to work. If you encounter problems running on your system please [report an issue](https://github.com/nanoporetech/dorado/issues).

## Roadmap

Other Platforms may work, if you encounter problems with running on your system please [report an issue](https://github.com/nanoporetech/dorado/issues)
Dorado is still in alpha stage and not feature-complete. The following features form the core of our roadmap:

1. DNA Barcode multiplexing
2. Duplex basecalling
3. Alignment (output aligned BAMs)
4. Python API

## Performance tips

1. For optimal performance Dorado requires POD5 file input. Please [convert your Fast5 files](https://github.com/nanoporetech/pod5-file-format) before basecalling.
2. Dorado will automatically detect your GPUs' free memory and select an appropriate batch size. If you know what you're doing, you can use the `--batch` parameter to tune batch size.
3. Dorado will automatically run in multi-GPU (`'cuda:all'`) mode. If you have a heterogeneous collection of GPUs, select the faster GPUs using the `--device` flag (e.g. `--device "cuda:0,2"`). Not doing so will have a detrimental impact on performance.
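As an illustration of the device and batch tips, here is a minimal bash sketch that assembles a command line and prints it without running Dorado. The helper names `build_device_arg` and `build_basecaller_cmd` are hypothetical, `<model>` is a placeholder for a downloaded model, and the batch size of 256 is an arbitrary value rather than a recommendation. The flag spellings are taken from the tips above, and placing the options after the positional arguments is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: join GPU indices into a device string such as "cuda:0,2".
build_device_arg() {
    local IFS=,
    echo "cuda:$*"
}

# Hypothetical helper: assemble (but do not run) a basecaller command line.
# Arguments: model, input directory, device string, batch size.
build_basecaller_cmd() {
    local model=$1 input=$2 device=$3 batch=$4
    printf 'dorado basecaller %s %s --device "%s" --batch %s\n' \
        "$model" "$input" "$device" "$batch"
}

# Select GPUs 0 and 2 only, leaving out a slower GPU 1.
device=$(build_device_arg 0 2)
build_basecaller_cmd "<model>" pod5s/ "$device" 256
# prints: dorado basecaller <model> pod5s/ --device "cuda:0,2" --batch 256
```

Printing the command first makes it easy to review the device list before committing to a long run.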

## Available basecalling models

To download all available dorado models run:

```
$ dorado download --model all
```

The following models are currently available:

* [email protected]
* [email protected]
* [email protected]
* [email protected]
* [email protected]
* [email protected]
* [email protected]
* [email protected]
* [email protected]

## Developer quickstart

@@ -51,3 +109,11 @@ The project uses pre-commit to ensure code is consistently formatted, you can se
$ pip install pre-commit
$ pre-commit install
```

### Licence and Copyright
(c) 2022 Oxford Nanopore Technologies Ltd.

Dorado is distributed under the terms of the Oxford Nanopore
Technologies, Ltd. Public License, v. 1.0. If a copy of the License
was not distributed with this file, You can obtain one at
http://nanoporetech.com
