Skip to content
This repository has been archived by the owner on Jan 13, 2022. It is now read-only.

Commit

Permalink
Taiyaki, initial commit
Browse files Browse the repository at this point in the history
  • Loading branch information
tmassingham-ont committed Mar 12, 2019
0 parents commit 411de95
Show file tree
Hide file tree
Showing 111 changed files with 10,140 additions and 0 deletions.
18 changes: 18 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
*.pyc
*~
*.sw?
*.so
*.egg-info
.eggs
build
dist
RESULTS
taiyaki/ctc/ctc.c
taiyaki/squiggle_match/squiggle_match.c
taiyaki/version.py
venv
run
.cache
.tox
*.fasta
*.hdf5
11 changes: 11 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
# Taiyaki
Version numbers: major.minor.patch
* Major version bump indicates a substantial change, e.g. file formats or removal of functionality.
* Minor version bump indicates a change in functionality that may affect users.
* Patch version bump indicates bug-fixes or minor improvements not expected to affect users.

## v3.0.0
Initial release:
* Prepare data for training basecallers by remapping signal to reference sequence
* Train neural networks for flip-flop basecalling and squiggle prediction
* Export basecaller models for use in Guppy
110 changes: 110 additions & 0 deletions FILE_FORMATS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,110 @@
This document describes file formats used in the Taiyaki package.

## Fast5 files

The package reads single or multi-read fast5 files using a wrapper around the **ont_fast5_api** package.

## Strand lists

Strand lists should be tab-separated text files having columns ('filename' or 'filename_fast5' and not both) and / or 'read_id'.
If a strand list file is supplied as an optional argument to a script, then
1. If no column 'read_id' is present, then all files with names in the column 'filename' or 'filename_fast5' are read.
2. If no column 'filename' or 'filename_fast5' is present, then all reads with read_ids in the 'read_id' column are are read from files in the directory specified.
3. If there is a ('filename' or 'filename_fast5') column and a 'read_id' column, then the strand list is regarded as a list of pairs (filename, read_id).

## Per-read parameter files

The script **bin/generate_per_read_parameters.py** creates a tsv file with columns ('UUID', 'trim_start', 'trim_end', 'shift', 'scale') which give instructions for
handling of each read. The shift and scale parameters are chosen so that

y = (current_in_pA - shift)/scale

is standardised: that is, so that roughly, mean(**y**)=0 and std(**y**)=1 (more robust statistics are used by the script to generate the parameters).

UUID trim_start trim_end shift scale
6a8a74ff-5316-41d8-825d-a018af4242bf 200 50 85.43114135742188 15.168887446289057
906f26ce-367a-4d3c-b279-ca86f6db7255 200 50 97.36762817382814 15.927331818603507
90b3c72f-ac34-4337-b33a-2fecd0216b99 200 50 82.36786376953125 15.076369731445316

We expect users to find reasons to generate their own per-read-parameter files or to modify the ones generated by this script.

## Reference files

Files to store the reference for each read are used as an ingredient for remapping.

These are fasta files where the comment line for each sequence is the UUID:

>6a8a74ff-5316-41d8-825d-a018af4242bf
GTGCTTGTGGGGTATTGCTCAAGAAATTTTTGCCCAGATCAATGTTCTGGAGATTTTACCCAATGT.....
>906f26ce-367a-4d3c-b279-ca86f6db7255
AATCCTGCCTCTAAAGAAAGAAAAAAAAAAATCAGCTAGGTGTAGCCATAGGCAGCTGTAGTCCCA.....

## Mapped signal files

Data for training is stored in mapped signal files.
The class **HDF5** in **taiyaiki/mapped_signal_files.py** provides an API for reading and writing these files, and also
methods for checking that a file conforms to the specification.

The files are HDF5 files with the following structure.

HDF5_file/
├── attribute: version (integer)
└── group: Reads/
├── group: <read_id_1>
  ├── group: <read_id_2>
  ├── group: <read_id_3>
.
.


Each read_id is a UUID, and the data in each read group is:

| **name** |**attribute/dataset** | **type** | **description** |
|-------------------|----------------------|-----------|--------------------------------------------------------------------|
| alphabet | attr | str | e.g. 'ACGT' for DNA. May include modified bases in future releases |
| collapse_alphabet | attr | str | canonical base for each base in 'alphabet'. |
| shift_frompA | attr | float | shift parameter - see 'per-read-parameter files' above |
| scale_frompA | attr | float | scale parameter - see 'per-read-parameter files' above |
| range | attr | float | see equation below |
| offset | attr | float | see equation below |
| digitisation | attr | float | see equation below |
| Dacs | dataset | int16 | signal data representing current through pore (see equation below) |
| Ref_to_signal | dataset | int32 | Ref_to_signal[n] = location in Dacs associated with Reference[n] |
| Reference | dataset | int16 | alphabet[Reference[n]] is the nth base in the reference sequence |
| mapping_score | attr (optional) | str | score associated with mapping of ref to signal |
| mapping_method | str (optional) | str | short description of mapping method |


The current in pA is calculated from the integers in Dacs by the equation

current = (Dacs + offset ) * range / digitisation


## Chunk logs

During training, **bin/train_flipflop.py** generates (input,output) pairs of (signal,sequence) for network training.
We refer to each of these (signal,sequence) pairs as a chunk. Some chunks are rejected rather than being fed into the
training loop, either because the required data could not be found or because they are filtered out. For example, we
filter out chunks which contain very long slips (where many bases pass through the pore in a short time) because we
expect them to make training more difficult.

With the option **--chunk_logging_threshold 0**, the scripts **bin/train_flipflop.py** and **bin/train_squiggle.py** produce chunk logs.

These are tab-separated text files which describe the chunks selected and rejected, giving the training loss for each chunk that was
used in training, and a reason for rejection for those chunks which were not used.

Before training starts, 1000 chunks are sampled to determine the baseline for filtering. These chunks are recorded (when **--chunk_logging_threshold 0**) at the start of the
chunk log file, and they can be distinguished from those generated in the training loop because they are not marked as rejected, but do not have a loss associated with them.

The script **misc/plot_chunk_log.py** can be used to plot the data in this file.

## Model files

* Neural network descriptions (with parameters not specified) are needed as an input to training. These are python files: an example is given in the directory **models**.
* It is also possible to use the result of earlier training runs as a starting point: in this case use a **.checkpoint** file (see below).
* Trained network files are much larger than the python files which define the structure of a network. For example, **bin/train_flipflop.py** saves trained models at each checkpoint and at the end of training in two different formats:
* **.params** files store the model parameters in a flat pytorch structure.
* **.checkpoint** files can be used to read a network directly into a pytorch function using **torch.load()**.
* The script **bin/dump_json.py** transforms a **.checkpoint** file into a **json**-based format which can be used by Guppy.
* **bin/prepare_mapped_reads.py** needs a trained flip-flop network to use for remapping. This is in the **.checkpoint** format, and an example can be found in the **models** directory.

Loading

0 comments on commit 411de95

Please sign in to comment.