Taiyaki, initial commit

nanoporetech · Mar 12, 2019 · 411de95 · 411de95
commit 411de95
Show file tree

Hide file tree

Showing 111 changed files with 10,140 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,18 @@
+*.pyc
+*~
+*.sw?
+*.so
+*.egg-info
+.eggs
+build
+dist
+RESULTS
+taiyaki/ctc/ctc.c
+taiyaki/squiggle_match/squiggle_match.c
+taiyaki/version.py
+venv
+run
+.cache
+.tox
+*.fasta
+*.hdf5
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,11 @@
+#  Taiyaki
+Version numbers: major.minor.patch
+* Major version bump indicates a substantial change, e.g. file formats or removal of functionality.
+* Minor version bump indicates a change in functionality that may affect users.
+* Patch version bump indicates bug-fixes or minor improvements not expected to affect users.
+
+## v3.0.0
+Initial release:
+* Prepare data for training basecallers by remapping signal to reference sequence
+* Train neural networks for flip-flop basecalling and squiggle prediction
+* Export basecaller models for use in Guppy
diff --git a/FILE_FORMATS.md b/FILE_FORMATS.md
@@ -0,0 +1,110 @@
+This document describes file formats used in the Taiyaki package.
+
+## Fast5 files
+
+The package reads single or multi-read fast5 files using a wrapper around the **ont_fast5_api** package.
+
+## Strand lists
+
+Strand lists should be tab-separated text files having columns ('filename' or 'filename_fast5' and not both) and / or 'read_id'.
+If a strand list file is supplied as an optional argument to a script, then
+1. If no column 'read_id' is present, then all files with names in the column 'filename' or 'filename_fast5' are read.
+2. If no column 'filename' or 'filename_fast5' is present, then all reads with read_ids in the 'read_id' column are are read from files in the directory specified.
+3. If there is a ('filename' or 'filename_fast5') column and a 'read_id' column, then the strand list is regarded as a list of pairs (filename, read_id).
+
+## Per-read parameter files
+
+The script **bin/generate_per_read_parameters.py** creates a tsv file with columns ('UUID', 'trim_start', 'trim_end', 'shift', 'scale') which give instructions for
+handling of each read. The shift and scale parameters are chosen so that
+
+    y = (current_in_pA - shift)/scale
+
+is standardised: that is, so that roughly, mean(**y**)=0 and std(**y**)=1 (more robust statistics are used by the script to generate the parameters).
+
+    UUID				trim_start	trim_end	shift			scale
+    6a8a74ff-5316-41d8-825d-a018af4242bf	200	50	85.43114135742188	15.168887446289057
+    906f26ce-367a-4d3c-b279-ca86f6db7255	200	50	97.36762817382814	15.927331818603507
+    90b3c72f-ac34-4337-b33a-2fecd0216b99	200	50	82.36786376953125	15.076369731445316
+
+We expect users to find reasons to generate their own per-read-parameter files or to modify the ones generated by this script.
+
+## Reference files
+
+Files to store the reference for each read are used as an ingredient for remapping.
+
+These are fasta files where the comment line for each sequence is the UUID:
+
+    >6a8a74ff-5316-41d8-825d-a018af4242bf
+    GTGCTTGTGGGGTATTGCTCAAGAAATTTTTGCCCAGATCAATGTTCTGGAGATTTTACCCAATGT.....
+    >906f26ce-367a-4d3c-b279-ca86f6db7255
+    AATCCTGCCTCTAAAGAAAGAAAAAAAAAAATCAGCTAGGTGTAGCCATAGGCAGCTGTAGTCCCA.....
+
+## Mapped signal files
+
+Data for training is stored in mapped signal files.
+The class **HDF5** in **taiyaiki/mapped_signal_files.py** provides an API for reading and writing these files, and also
+methods for checking that a file conforms to the specification.
+
+The files are HDF5 files with the following structure.
+
+    HDF5_file/
+      ├── attribute: version (integer)
+      └── group: Reads/
+          ├── group: <read_id_1>
+          ├── group: <read_id_2>
+          ├── group: <read_id_3>
+          .
+          .
+
+
+Each read_id is a UUID, and the data in each read group is:
+
+|   **name**        |**attribute/dataset** | **type**  | **description**                                                    |
+|-------------------|----------------------|-----------|--------------------------------------------------------------------|
+|   alphabet        |  attr                | str       | e.g. 'ACGT' for DNA. May include modified bases in future releases |
+| collapse_alphabet |  attr                | str       | canonical base for each base in 'alphabet'.                        |
+| shift_frompA      |  attr                | float     | shift parameter - see 'per-read-parameter files' above             |
+| scale_frompA      |  attr                | float     | scale parameter - see 'per-read-parameter files' above             |
+| range             |  attr                | float     | see equation below                                                 |
+| offset            |  attr                | float     | see equation below                                                 |
+| digitisation      |  attr                | float     | see equation below                                                 |
+| Dacs              |  dataset             | int16     | signal data representing current through pore (see equation below) |
+| Ref_to_signal     |  dataset             | int32     | Ref_to_signal[n] = location in Dacs associated with Reference[n]   |
+| Reference         |  dataset             | int16     | alphabet[Reference[n]] is the nth base in the reference sequence   |
+| mapping_score     | attr (optional)      | str       | score associated with mapping of ref to signal                     |
+| mapping_method    | str (optional)       | str       | short description of mapping method                                |
+
+
+The current in pA is calculated from the integers in Dacs by the equation
+
+    current = (Dacs + offset ) * range / digitisation
+
+
+## Chunk logs
+
+During training, **bin/train_flipflop.py** generates (input,output) pairs of (signal,sequence) for network training.
+We refer to each of these (signal,sequence) pairs as a chunk. Some chunks are rejected rather than being fed into the
+training loop, either because the required data could not be found or because they are filtered out. For example, we
+filter out chunks which contain very long slips (where many bases pass through the pore in a short time) because we
+expect them to make training more difficult.
+
+With the option **--chunk_logging_threshold 0**, the scripts **bin/train_flipflop.py** and **bin/train_squiggle.py** produce chunk logs.
+
+These are tab-separated text files which describe the chunks selected and rejected, giving the training loss for each chunk that was
+used in training, and a reason for rejection for those chunks which were not used.
+
+Before training starts, 1000 chunks are sampled to determine the baseline for filtering. These chunks are recorded (when **--chunk_logging_threshold 0**) at the start of the
+chunk log file, and they can be distinguished from those generated in the training loop because they are not marked as rejected, but do not have a loss associated with them.
+
+The script **misc/plot_chunk_log.py** can be used to plot the data in this file.
+
+## Model files
+
+* Neural network descriptions (with parameters not specified) are needed as an input to training. These are python files: an example is given in the directory **models**.
+* It is also possible to use the result of earlier training runs as a starting point: in this case use a **.checkpoint** file (see below). 
+* Trained network files are much larger than the python files which define the structure of a network. For example, **bin/train_flipflop.py** saves trained models at each checkpoint and at the end of training in two different formats:
+    * **.params** files store the model parameters in a flat pytorch structure.
+    * **.checkpoint** files can be used to read a network directly into a pytorch function using **torch.load()**.
+    * The script **bin/dump_json.py** transforms a **.checkpoint** file into a **json**-based format which can be used by Guppy.
+    * **bin/prepare_mapped_reads.py** needs a trained flip-flop network to use for remapping. This is in the **.checkpoint** format, and an example can be found in the **models** directory.
+