diff --git a/.gitignore b/.gitignore
new file mode 100755
index 0000000..3eec164
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,18 @@
+*.pyc
+*~
+*.sw?
+*.so
+*.egg-info
+.eggs
+build
+dist
+RESULTS
+taiyaki/ctc/ctc.c
+taiyaki/squiggle_match/squiggle_match.c
+taiyaki/version.py
+venv
+run
+.cache
+.tox
+*.fasta
+*.hdf5
diff --git a/CHANGELOG.md b/CHANGELOG.md
new file mode 100644
index 0000000..369c702
--- /dev/null
+++ b/CHANGELOG.md
@@ -0,0 +1,11 @@
+# Taiyaki
+Version numbers: major.minor.patch
+* Major version bump indicates a substantial change, e.g. file formats or removal of functionality.
+* Minor version bump indicates a change in functionality that may affect users.
+* Patch version bump indicates bug-fixes or minor improvements not expected to affect users.
+
+## v3.0.0
+Initial release:
+* Prepare data for training basecallers by remapping signal to reference sequence
+* Train neural networks for flip-flop basecalling and squiggle prediction
+* Export basecaller models for use in Guppy
\ No newline at end of file
diff --git a/FILE_FORMATS.md b/FILE_FORMATS.md
new file mode 100644
index 0000000..270ea68
--- /dev/null
+++ b/FILE_FORMATS.md
@@ -0,0 +1,110 @@
+This document describes the file formats used in the Taiyaki package.
+
+## Fast5 files
+
+The package reads single- or multi-read fast5 files using a wrapper around the **ont_fast5_api** package.
+
+## Strand lists
+
+Strand lists should be tab-separated text files with a column 'filename' or 'filename_fast5' (but not both) and / or a column 'read_id'.
+If a strand list file is supplied as an optional argument to a script, then
+1. If no column 'read_id' is present, then all files with names in the column 'filename' or 'filename_fast5' are read.
+2. If no column 'filename' or 'filename_fast5' is present, then all reads with read_ids in the 'read_id' column are read from files in the directory specified.
+3. If there is a ('filename' or 'filename_fast5') column and a 'read_id' column, then the strand list is regarded as a list of pairs (filename, read_id).
+
+## Per-read parameter files
+
+The script **bin/generate_per_read_params.py** creates a tsv file with columns ('UUID', 'trim_start', 'trim_end', 'shift', 'scale') which give instructions for
+handling of each read. The shift and scale parameters are chosen so that
+
+    y = (current_in_pA - shift) / scale
+
+is standardised: that is, so that, roughly, mean(**y**)=0 and std(**y**)=1 (more robust statistics are used by the script to generate the parameters).
+
+    UUID                                  trim_start  trim_end  shift              scale
+    6a8a74ff-5316-41d8-825d-a018af4242bf  200         50        85.43114135742188  15.168887446289057
+    906f26ce-367a-4d3c-b279-ca86f6db7255  200         50        97.36762817382814  15.927331818603507
+    90b3c72f-ac34-4337-b33a-2fecd0216b99  200         50        82.36786376953125  15.076369731445316
+
+We expect users to find reasons to generate their own per-read-parameter files or to modify the ones generated by this script.
+
+## Reference files
+
+Files storing the reference for each read are used as an ingredient for remapping.
+
+These are fasta files where the comment line for each sequence is the read's UUID:
+
+    >6a8a74ff-5316-41d8-825d-a018af4242bf
+    GTGCTTGTGGGGTATTGCTCAAGAAATTTTTGCCCAGATCAATGTTCTGGAGATTTTACCCAATGT.....
+    >906f26ce-367a-4d3c-b279-ca86f6db7255
+    AATCCTGCCTCTAAAGAAAGAAAAAAAAAAATCAGCTAGGTGTAGCCATAGGCAGCTGTAGTCCCA.....
+
+## Mapped signal files
+
+Data for training is stored in mapped signal files.
+The class **HDF5** in **taiyaki/mapped_signal_files.py** provides an API for reading and writing these files, and also
+methods for checking that a file conforms to the specification.
+
+The files are HDF5 files with the following structure.
+
+    HDF5_file/
+    ├── attribute: version (integer)
+    └── group: Reads/
+        ├── group: <read_id_0>
+        ├── group: <read_id_1>
+        ├── group: <read_id_2>
+        .
+        .
+
+
+Each read_id is a UUID, and the data in each read group is:
+
+| **name**          | **attribute/dataset** | **type** | **description**                                                     |
+|-------------------|-----------------------|----------|---------------------------------------------------------------------|
+| alphabet          | attr                  | str      | e.g. 'ACGT' for DNA. May include modified bases in future releases  |
+| collapse_alphabet | attr                  | str      | canonical base for each base in 'alphabet'.                         |
+| shift_frompA      | attr                  | float    | shift parameter - see 'per-read parameter files' above              |
+| scale_frompA      | attr                  | float    | scale parameter - see 'per-read parameter files' above              |
+| range             | attr                  | float    | see equation below                                                  |
+| offset            | attr                  | float    | see equation below                                                  |
+| digitisation      | attr                  | float    | see equation below                                                  |
+| Dacs              | dataset               | int16    | signal data representing current through pore (see equation below) |
+| Ref_to_signal     | dataset               | int32    | Ref_to_signal[n] = location in Dacs associated with Reference[n]   |
+| Reference         | dataset               | int16    | alphabet[Reference[n]] is the nth base in the reference sequence   |
+| mapping_score     | attr (optional)       | str      | score associated with mapping of ref to signal                     |
+| mapping_method    | attr (optional)       | str      | short description of mapping method                                |
+
+
+The current in pA is calculated from the integers in Dacs by the equation
+
+    current = (Dacs + offset) * range / digitisation
+
+
+## Chunk logs
+
+During training, **bin/train_flipflop.py** generates (input, output) pairs of (signal, sequence) for network training.
+We refer to each of these (signal, sequence) pairs as a chunk. Some chunks are rejected rather than being fed into the
+training loop, either because the required data could not be found or because they are filtered out. For example, we
+filter out chunks which contain very long slips (where many bases pass through the pore in a short time) because we
+expect them to make training more difficult.
+
+With the option **--chunk_logging_threshold 0**, the scripts **bin/train_flipflop.py** and **bin/train_squiggle.py** produce chunk logs.
+
+These are tab-separated text files which describe the chunks selected and rejected, giving the training loss for each chunk that was
+used in training, and a reason for rejection for those chunks which were not used.
+
+Before training starts, 1000 chunks are sampled to determine the baseline for filtering. These chunks are recorded (when **--chunk_logging_threshold 0** is set) at the start of the
+chunk log file, and they can be distinguished from those generated in the training loop because they are not marked as rejected but do not have a loss associated with them.
+
+The script **misc/plot_chunk_log.py** can be used to plot the data in this file.
+
+## Model files
+
+* Neural network descriptions (with parameters not specified) are needed as an input to training. These are python files: an example is given in the directory **models**.
+* It is also possible to use the result of earlier training runs as a starting point: in this case use a **.checkpoint** file (see below).
+* Trained network files are much larger than the python files which define the structure of a network.
For example, **bin/train_flipflop.py** saves trained models at each checkpoint and at the end of training in two different formats: + * **.params** files store the model parameters in a flat pytorch structure. + * **.checkpoint** files can be used to read a network directly into a pytorch function using **torch.load()**. + * The script **bin/dump_json.py** transforms a **.checkpoint** file into a **json**-based format which can be used by Guppy. + * **bin/prepare_mapped_reads.py** needs a trained flip-flop network to use for remapping. This is in the **.checkpoint** format, and an example can be found in the **models** directory. + diff --git a/LICENCE.txt b/LICENCE.txt new file mode 100644 index 0000000..aed9377 --- /dev/null +++ b/LICENCE.txt @@ -0,0 +1,323 @@ +Oxford Nanopore Technologies, Ltd. Public License Version 1.0 +============================================================= + +1. Definitions +-------------- + +1.1. “Contributor” + means each individual or legal entity that creates, contributes to + the creation of, or owns Covered Software. + +1.2. “Contributor Version” + means the combination of the Contributions of others (if any) used + by a Contributor and that particular Contributor’s Contribution. + +1.3. “Contribution” + means Covered Software of a particular Contributor. + +1.4. “Covered Software” + means Source Code Form to which the initial Contributor has attached + the notice in Exhibit A, the Executable Form of such Source Code + Form, and Modifications of such Source Code Form, in each case + including portions thereof. + +1.5. “Executable Form” + means any form of the work other than Source Code Form. + +1.6. “Larger Work” + means a work that combines Covered Software with other material, in + a separate file or files, that is not Covered Software. + +1.7. “License” + means this document. + +1.8. “Licensable” + means having the right to grant, to the maximum extent possible, + whether at the time of the initial grant or subsequently, any and + all of the rights conveyed by this License. + +1.9. “Modifications” + means any of the following: + + (a) any file in Source Code Form that results from an addition to, + deletion from, or modification of the contents of Covered + Software; or + (b) any new file in Source Code Form that contains any Covered + Software. + +1.10. “Research Purposes” + means use for internal research and not intended for or directed + towards commercial advantages or monetary compensation; provided, + however, that monetary compensation does not include sponsored + research of research funded by grants. + +1.11 “Secondary License” + means either the GNU General Public License, Version 2.0, the GNU + Lesser General Public License, Version 2.1, the GNU Affero General + Public License, Version 3.0, or any later versions of those + licenses. + +1.12. “Source Code Form” + means the form of the work preferred for making modifications. + +1.13. “You” (or “Your”) + means an individual or a legal entity exercising rights under this + License. For legal entities, “You” includes any entity that + controls, is controlled by, or is under common control with You. For + purposes of this definition, “control” means (a) the power, direct + or indirect, to cause the direction or management of such entity, + whether by contract or otherwise, or (b) ownership of more than + fifty percent (50%) of the outstanding shares or beneficial + ownership of such entity. + +2. License Grants and Conditions +-------------------------------- + +2.1. 
Grants + +Each Contributor hereby grants You a world-wide, royalty-free, +non-exclusive license under Contributor copyrights Licensable by such +Contributor to use, reproduce, make available, modify, display, +perform, distribute, and otherwise exploit solely for Research Purposes +its Contributions, either on an unmodified basis, with Modifications, +or as part of a Larger Work. + +2.2. Effective Date + +The licenses granted in Section 2.1 with respect to any Contribution +become effective for each Contribution on the date the Contributor +first distributes such Contribution. + +2.3. Limitations on Grant Scope + +The licenses granted in this Section 2 are the only rights granted under +this License. No additional rights or licenses will be implied from the +distribution or licensing of Covered Software under this License. The +License is incompatible with Secondary Licenses. Notwithstanding +Section 2.1 above, no copyright license is granted: + +(a) for any code that a Contributor has removed from Covered Software; + or + +(b) use of the Contributions or its Contributor Version other than for +Research Purposes only; or + +(c) for infringements caused by: (i) Your and any other third party’s +modifications of Covered Software, or (ii) the combination of its +Contributions with other software (except as part of its Contributor +Version). + +This License does not grant any rights in the patents, trademarks, +service marks, or logos of any Contributor (except as may be necessary +to comply with the notice requirements in Section 3.4). + +2.4. Subsequent Licenses + +No Contributor makes additional grants as a result of Your choice to +distribute the Covered Software under a subsequent version of this +License (see Section 10.2) or under the terms of a Secondary License +(if permitted under the terms of Section 3.3). + +2.5. Representation + +Each Contributor represents that the Contributor believes its +Contributions are its original creation(s) or it has sufficient rights +to grant the rights to its Contributions conveyed by this License. + +2.6. Fair Use + +This License is not intended to limit any rights You have under +applicable copyright doctrines of fair use, fair dealing, or other +equivalents. + +2.7. Conditions + +Sections 3.1, 3.2, 3.3, and 3.4 are conditions of the licenses granted +in Section 2.1. + +3. Responsibilities +------------------- + +3.1. Distribution of Source Form + +All distribution of Covered Software in Source Code Form, including any +Modifications that You create or to which You contribute, must be under +the terms of this License. You must inform recipients that the Source +Code Form of the Covered Software is governed by the terms of this +License, and how they can obtain a copy of this License. You may not +attempt to alter or restrict the recipients’ rights in the Source Code Form. + +3.2. Distribution of Executable Form + +If You distribute Covered Software in Executable Form then: + +(a) such Covered Software must also be made available in Source Code + Form, as described in Section 3.1, and You must inform recipients of + the Executable Form how they can obtain a copy of such Source Code + Form by reasonable means in a timely manner, at a charge no more + than the cost of distribution to the recipient; and + +(b) You may distribute such Executable Form under the terms of this + License. + +3.3. 
Distribution of a Larger Work + +You may create and distribute a Larger Work under terms of Your choice, +provided that You also comply with the requirements of this License for +the Covered Software. The Larger Work may not be a combination of Covered +Software with a work governed by one or more Secondary Licenses. + +3.4. Notices + +You may not remove or alter the substance of any license notices +(including copyright notices, patent notices, disclaimers of warranty, +or limitations of liability) contained within the Source Code Form of +the Covered Software, except that You may alter any license notices to +the extent required to remedy known factual inaccuracies. + +3.5. Application of Additional Terms + +You may not choose to offer, or charge a fee for use of the Covered +Software or a fee for, warranty, support, indemnity or liability +obligations to one or more recipients of Covered Software. You must +make it absolutely clear that any such warranty, support, indemnity, or +liability obligation is offered by You alone, and You hereby agree to +indemnify every Contributor for any liability incurred by such +Contributor as a result of warranty, support, indemnity or liability +terms You offer. You may include additional disclaimers of warranty and +limitations of liability specific to any jurisdiction. + +4. Inability to Comply Due to Statute or Regulation +--------------------------------------------------- + +If it is impossible for You to comply with any of the terms of this +License with respect to some or all of the Covered Software due to +statute, judicial order, or regulation then You must: (a) comply with +the terms of this License to the maximum extent possible; and (b) +describe the limitations and the code they affect. Such description must +be placed in a text file included with all distributions of the Covered +Software under this License. Except to the extent prohibited by statute +or regulation, such description must be sufficiently detailed for a +recipient of ordinary skill to be able to understand it. + +5. Termination +-------------- + +5.1. The rights granted under this License will terminate automatically +if You fail to comply with any of its terms. + +5.2. If You initiate litigation against any entity by asserting an +infringement claim (excluding declaratory judgment actions, +counter-claims, and cross-claims) alleging that a Contributor Version +directly or indirectly infringes, then the rights granted to +You by any and all Contributors for the Covered Software under Section +2.1 of this License shall terminate. + +5.3. In the event of termination under Sections 5.1 or 5.2 above, all +end user license agreements (excluding distributors and resellers) which +have been validly granted by You or Your distributors under this License +prior to termination shall survive termination. + +************************************************************************ +* * +* 6. Disclaimer of Warranty * +* ------------------------- * +* * +* Covered Software is provided under this License on an “as is” * +* basis, without warranty of any kind, either expressed, implied, or * +* statutory, including, without limitation, warranties that the * +* Covered Software is free of defects, merchantable, fit for a * +* particular purpose or non-infringing. The entire risk as to the * +* quality and performance of the Covered Software is with You. 
* +* Should any Covered Software prove defective in any respect, You * +* (not any Contributor) assume the cost of any necessary servicing, * +* repair, or correction. This disclaimer of warranty constitutes an * +* essential part of this License. No use of any Covered Software is * +* authorized under this License except under this disclaimer. * +* * +************************************************************************ + +************************************************************************ +* * +* 7. Limitation of Liability * +* -------------------------- * +* * +* Under no circumstances and under no legal theory, whether tort * +* (including negligence), contract, or otherwise, shall any * +* Contributor, or anyone who distributes Covered Software as * +* permitted above, be liable to You for any direct, indirect, * +* special, incidental, or consequential damages of any character * +* including, without limitation, damages for lost profits, loss of * +* goodwill, work stoppage, computer failure or malfunction, or any * +* and all other commercial damages or losses, even if such party * +* shall have been informed of the possibility of such damages. This * +* limitation of liability shall not apply to liability for death or * +* personal injury resulting from such party’s negligence to the * +* extent applicable law prohibits such limitation, but in such event, * +* and to the greatest extent permissible, damages will be limited to * +* direct damages not to exceed one hundred dollars. Some * +* jurisdictions do not allow the exclusion or limitation of * +* incidental or consequential damages, so this exclusion and * +* limitation may not apply to You. * +* * +************************************************************************ + +8. Litigation +------------- + +Any litigation relating to this License may be brought only in the +courts of a jurisdiction where the defendant maintains its principal +place of business and such litigation shall be governed by laws of that +jurisdiction, without reference to its conflict-of-law provisions. +Nothing in this Section shall prevent a party’s ability to bring +cross-claims or counter-claims. + +9. Miscellaneous +---------------- + +This License represents the complete agreement concerning the subject +matter hereof. If any provision of this License is held to be +unenforceable, such provision shall be reformed only to the extent +necessary to make it enforceable. Any law or regulation which provides +that the language of a contract shall be construed against the drafter +shall not be used to construe this License against a Contributor. + +10. Versions of the License +--------------------------- + +10.1. New Versions + +Oxford Nanopore Technologies, Ltd. is the license steward. Except as +provided in Section 10.3, no one other than the license steward has the +right to modify or publish new versions of this License. Each version +will be given a distinguishing version number. + +10.2. Effect of New Versions + +You may distribute the Covered Software under the terms of the version +of the License under which You originally received the Covered Software, +or under the terms of any subsequent version published by the license +steward. + +10.3. 
Modified Versions + +If you create software not governed by this License, and you want to +create a new license for such software, you may create and use a +modified version of this License if you rename the license and remove +any references to the name of the license steward (except to note that +such modified license differs from this License). + +Exhibit A - Source Code Form License Notice +------------------------------------------- + + This Source Code Form is subject to the terms of the Oxford Nanopore + Technologies, Ltd. Public License, v. 1.0. Full licence can be found + at + https://github.com/nanoporetech/flappie/blob/master/LICENCE.txt + +If it is not possible or desirable to put the notice in a particular +file, then You may include the notice in a location (such as a LICENSE +file in a relevant directory) where a recipient would be likely to look +for such a notice. + +You may add additional accurate notices of copyright ownership. diff --git a/MANIFEST.in b/MANIFEST.in new file mode 100644 index 0000000..55ea1d3 --- /dev/null +++ b/MANIFEST.in @@ -0,0 +1,2 @@ +include LICENSE.md + diff --git a/Makefile b/Makefile new file mode 100644 index 0000000..23c4994 --- /dev/null +++ b/Makefile @@ -0,0 +1,122 @@ +SHELL = /bin/bash +PYTHON ?= python3 + + +.PHONY: all +all: install + + +# autodetect CUDA version if possible +CUDA ?= $(shell (which nvcc && nvcc --version) | grep -oP "(?<=release )[0-9.]+") + + +# Determine correct torch package to install +TORCH_CUDA_8.0 = cu80 +TORCH_CUDA_9.0 = cu90 +TORCH_CUDA_9.1 = cu91 +TORCH_CUDA_9.2 = cu92 +TORCH_CUDA_10.0 = cu100 +TORCH_PLATFORM ?= $(if $(TORCH_CUDA_$(CUDA)),$(TORCH_CUDA_$(CUDA)),cpu) +PY3_MINOR = $(shell $(PYTHON) -c "import sys; print(sys.version_info.minor)") +TORCH_Linux = http://download.pytorch.org/whl/${TORCH_PLATFORM}/torch-1.0.0-cp3${PY3_MINOR}-cp3${PY3_MINOR}m-linux_x86_64.whl +TORCH_Darwin = torch +TORCH ?= $(TORCH_$(shell uname -s)) + + +# determine correct cupy package to install +CUPY_8.0 = cupy-cuda80 +CUPY_9.0 = cupy-cuda90 +CUPY_9.1 = cupy-cuda91 +CUPY_9.2 = cupy-cuda92 +CUPY_10.0 = cupy-cuda100 +CUPY ?= $(CUPY_$(CUDA)) + + +.PHONY: show_cuda_version +show_cuda_version: + @echo Found CUDA version: $(if $(CUDA), $(CUDA), None) + @echo Will install torch with: $(if $(TORCH), pip install $(TORCH), **not installing torch**) + @echo 'Will install cupy with: ' $(if $(CUPY), pip install $(CUPY), **not installing cupy**) + + +envDir = venv +envPrompt ?= "(taiyaki) " +pyTestArgs ?= +override pyTestArgs += --durations=20 -v + + +.PHONY: install +install: + rm -rf ${envDir} + virtualenv --python=${PYTHON} --prompt=${envPrompt} ${envDir} + source ${envDir}/bin/activate && \ + pip install pip --upgrade && \ + mkdir -p build/wheelhouse && \ + pip download --dest build/wheelhouse ${TORCH} && \ + pip install --find-links build/wheelhouse --no-index torch && \ + pip install -r requirements.txt ${CUPY} && \ + pip install -r develop_requirements.txt && \ + ${PYTHON} setup.py develop + @echo "To activate your new environment: source ${envDir}/bin/activate" + + +.PHONY: deps +deps: + apt-get update + apt-get install -y \ + python3-virtualenv python3-pip python3-setuptools git \ + libblas3 libblas-dev python3-dev lsb-release virtualenv + + +.PHONY: sdist +sdist: + ${PYTHON} setup.py sdist + + +.PHONY: bdist_wheel +bdist_wheel: + ${PYTHON} setup.py bdist_wheel + ls -l dist/*.whl + + +.PHONY: test +test: unittest + + +.PHONY: unittest +unittest: + ${PYTHON} setup.py test --addopts "${pyTestArgs}" + + +.PHONY: acctest +accset ?= 
+acctest: + mkdir -p build/acctest + pip install -r test/acceptance/requirements.txt + cd build/acctest && ${PYTHON} -m pytest ${pyTestArgs} ../../test/acceptance/${accset} + + +.PHONY: clean +clean: + rm -rf build/ dist/ deb_dist/ *.egg-info/ ${envDir}/ + rm taiyaki/ctc/ctc.c \ + taiyaki/squiggle_match/squiggle_match.c taiyaki/version.py + find . -name '*.pyc' -delete + find . -name '*.so' -delete + + +.PHONY: autopep8 pep8 +pyDirs := taiyaki test bin models misc +pyFiles := $(shell find *.py ${pyDirs} -type f -name "*.py") +autopep8: + autopep8 --ignore E203 -i --max-line-length=120 ${pyFiles} +pep8: + pep8 --ignore E203,E402 --max-line-length=120 ${pyFiles} + + +.PHONY: workflow +workflow: + ./workflow/remap_from_samrefs_then_train_test_workflow.sh + ./workflow/remap_from_samrefs_then_train_multireadf5_test_workflow.sh + ./workflow/remap_from_samrefs_then_train_squiggle_test_workflow.sh +#(The scripts each check to see if the training log file and chunk log file exist and contain data) diff --git a/ONT_logo.png b/ONT_logo.png new file mode 100644 index 0000000..a9989ef Binary files /dev/null and b/ONT_logo.png differ diff --git a/README.md b/README.md new file mode 100644 index 0000000..b09585c --- /dev/null +++ b/README.md @@ -0,0 +1,342 @@ +

+ +

+ +# Taiyaki + +Taiyaki is research software for training models for basecalling Oxford Nanopore reads. + +Oxford Nanopore's devices measure the flow of ions through a nanopore, and detect changes +in that flow as molecules pass through the pore. +These signals can be highly complex and exhibit long-range dependencies, much like spoken +or written language. Taiyaki can be used to train neural networks to understand the +complex signal from a nanopore device, using techniques inspired by state-of-the-art +language processing. + +Taiyaki is used to train the models used to basecall DNA and RNA found in Oxford Nanopore's +Guppy basecaller (version 2.2 at time of writing). This includes the flip-flop models, +which are trained using a technique inspired by Connectionist Temporal Classification +(Graves et al 2006). + +Main features: +* Prepare data for training basecallers by remapping signal to reference sequence +* Train neural networks for flip-flop basecalling and squiggle prediction +* Export basecaller models for use in Guppy + +Taiyaki is built on top of pytorch and is compatible with Python 3.5 or later. +It is aimed at advanced users, and it is an actively evolving research project, so +expect to get your hands dirty. + + +# Contents + +1. [Install system prerequisites](#install-system-prerequisites) +2. [Installation](#installation) +3. [Tests](#tests) +4. [Workflows](#workflows) +5. [Guppy compatibility](#guppy-compatibility) +6. [Environment variables](#environment-variables) +7. [CUDA](#cuda) +8. [Running on UGE](#running-on-a-uge-cluster) +9. [Diagnostics](#diagnostics) + + +# Install system prerequisites + +To install required system packages on ubuntu 16.04: + + sudo make deps + +Other linux platforms may be compatible, but are untested. + +In order to accelerate model training with a GPU you will need to install CUDA (which should install nvcc and add it to your path.) +See instructions from NVIDIA and the [CUDA](#cuda) section below. + +Taiyaki also makes use of the OpenMP extensions for multi-processing. These are supported +by the system installed compiler on most modern Linux systems but require a more modern version +of the clang/llvm compiler than that installed on MacOS machines. Support for OpenMP was +adding in clang/llvm in version 3.7 (see http://llvm.org or use brew). Alternatively you +can install GCC on MacOS using homebrew. + +Some analysis scripts require a recent version of the [BWA aligner](https://github.com/lh3/bwa). + +Windows is not supported. + +# Installation + +--- +**NOTE** +If you intend to use Taiyaki with a GPU, make sure you have installed and set up [CUDA](#cuda) before proceeding. +--- + +## Install Taiyaki in a new virtual environment + +We recommend installing Taiyaki in a self-contained [virtual environment](http://python-guide-pt-br.readthedocs.io/en/latest/dev/virtualenvs/). + +The following command creates a complete environment for developing and testing Taiyaki, in the directory **venv**: + + make install + +Taiyaki will be installed in [development mode](http://setuptools.readthedocs.io/en/latest/setuptools.html#development-mode) so that you can easily test your changes. +You will need to run `source venv/bin/activate` at the start of each session when you want to use this virtual environment. 
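+As a quick check that the new environment works (an illustrative example, not part of the documented workflow), activate it and print the installed version; the `taiyaki.version` module is generated during installation:
+
+    source venv/bin/activate
+    python3 -c "from taiyaki.version import __version__; print(__version__)"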
+ +## Install Taiyaki system-wide or into activated Python environment + +Taiyaki can be installed from source using either: + + python3 setup.py install + python3 setup.py develop #[development mode](http://setuptools.readthedocs.io/en/latest/setuptools.html#development-mode) + +Alternatively, you can use pip with either: + + pip install path/to/taiyaki/repo + pip install -e path/to/taiyaki/repo #[development mode](http://setuptools.readthedocs.io/en/latest/setuptools.html#development-mode) + +# Tests + +Tests can be run as follows: + + make workflow #runs scripts which carry out the workflow for basecall-network training and for squiggle-predictor training + make acctest #runs acceptance tests + make unittest #runs unit tests + +# Workflows + +## Using the workflow Makefile + +The file at **workflow/Makefile** can be used to direct the process of generating ingredients for training and then running the training itself. + +For example, if we have a directory **read_dir** containing fast5 files, and a fasta file **refs.fa** containing a ground-truth reference sequence for each read, we can (from the Taiyaki root directory) use the command line + + make -f workflow/Makefile MAXREADS=1000 \ + READDIR=read_dir USER_PER_READ_REFERENCE_FILE=refs.fa \ + DEVICE=3 train_remapuser_ref + +This will place the training ingredients in a directory **RESULTS/training_ingredients** and the training output (including logs and trained models) +in **RESULTS/remap_training**, using GPU 3 and only reading the first 1000 reads in the directory. The fast5 files may be single or multi-read. + +Using command line options to **make**, it is possible to change various other options, including the directory where the results go. Read the Makefile to find out about these options. +The Makefile can also be used to follow a squiggle-mapping workflow. + +The paragraph below describes the steps in the workflow in more detail. + +## Steps from fast5 files to basecalling + +The script **bin/prepare_mapped_reads.py** prepares a file containing mapped signals. This mapped signal file is then used to train a basecalling model. + +The simplest workflow looks like this. The flow runs from top to bottom and lines show the inputs required for each stage. +The scripts in the Taiyaki package are shown, as are the files they work with. + + fast5 files + / \ + / \ + / \ + / generate_per_read_params.py + | | + | | fasta with reference + | per-read-params file sequence for each read + | (tsv, contains shift, (produced with get_refs_from_sam.py + | scale, trim for each read) or some other method) + \ | / + \ | / + \ | / + \ | / + \ | / + \ | / + prepare_mapped_reads.py + (also uses remapping flip-flop + model from models/) + | + | + mapped-signal-file (hdf5) + | + | + train_flipflop.py + (also uses definition + of model to be trained) + | + | + trained flip-flop model + | + | + dump_json.py + | + | + json model definition + (suitable for use by Guppy) + +Each script in bin/ has lots of options, which you can find out about by reading the scripts. +Basic usage is as follows: + + bin/generate_per_read_params.py + + bin/get_refs_from_sam.py > + + bin/prepare_mapped_reads.py remap + + bin/train_flipflop.py --device --chunk_logging_threshold 0 + +We suggest using the **chunk_logging_threshold** 0 to begin with. This results in all chunks (including rejected chunks) being logged in a tsv file in the training directory. 
+This chunk log can be useful for diagnosing problems, but can get quite large, so may be turned off for very long training runs.
+
+Some scripts mentioned also have a useful option **--limit** which limits the number of reads to be used. This allows a quick test of a workflow.
+
+
+## Preparing a training set
+
+The `prepare_mapped_reads.py` script prepares a data set to use for training a new basecaller. Each member of this data set contains:
+
+ * The raw signal for a complete nanopore read (lifted from a fast5 file)
+ * A reference sequence that is the "ground truth" for that read
+ * An alignment between the signal and the reference
+
+As input to this script, we need a directory containing fast5 files (either single-read or multi-read) and a fasta file that contains the ground-truth reference for each read. In order to match the raw signal to the correct ground-truth sequence, the IDs in the fasta file should be the unique read ID assigned by MinKnow (these are the same IDs that Guppy uses in its fastq output). For example, a record in the fasta file might look like:
+
+    >17296436-f2f1-4713-adaf-169ed9cf6aa6
+    TATGATGTGAGCTTATATTATTAATTTTGTATCAATCTTATTTTCTAATGTATGCATTTTAATGCTATAAATTTCCTTCTAAGCACTAC...
+
+The recommended way to produce this fasta file is as follows:
+
+ 1. Align Guppy fastq basecalls to a reference genome using Guppy Aligner or Minimap. This will produce one or more SAM files.
+ 2. Use the `get_refs_from_sam.py` script to extract a snippet of the reference for each mapped read. You can filter reads by coverage.
+
+The final input required by `prepare_mapped_reads.py` is a pre-trained basecaller model, which is used to determine the alignment between raw signal and reference sequence.
+An example of such a model (for DNA sequenced with pore r9) is provided at `models/mGru256_flipflop_remapping_model_r9_DNA.checkpoint`.
+This does make the entire training process somewhat circular: you need a model to train a model.
+However, the new training set can be somewhat different from the data that the remapping model was trained on and things still work out.
+So, for example, if your samples are a bit weird and whacky, you may be able to improve basecall accuracy by retraining a model with Taiyaki.
+Internally, we use Taiyaki to train basecallers after incremental pore updates, and as a research tool into better basecalling methods.
+Taiyaki is not intended to enable training basecallers from scratch for novel nanopores.
+If it seems like remapping will not work for your data set, then you can use alternative methods
+so long as they produce data conformant with [this format](FILE_FORMATS.md).
+
+
+# Guppy compatibility
+
+In order to train a model that is compatible with Guppy (version 2.2 at time of writing), we recommend that you
+use the model defined in `models/mGru256_flipflop.py` and that you call `train_flipflop.py` with:
+
+    train_flipflop.py --stride 2 --winlen 19 mGru256_flipflop.py
+
+You should then be able to export your checkpoint to a json file (using bin/dump_json.py) that can be used to basecall with Guppy.
+
+See Guppy documentation for more information on how to do this.
+
+Key options include selecting the Guppy config file to be appropriate for your application, and passing the complete path of your .json file.
+ +For example: + + guppy_basecaller --input_path /path/to/input_reads --save_path /path/to/save_dir --config dna_r9.4.1_450bps_flipflop.cfg --model path/to/model.json --device cuda:1 + +Certain other model architectures may also be Guppy-compatible, but it is hard to give an exhaustive list +and so we recommend you contact us to get confirmation. + +We are working on adding basecalling functionality to Taiyaki itself to support a wider range of models. + + +# Environment variables + +The environment variables `OMP_NUM_THREADS` and `OPENBLAS_NUM_THREADS` can have an impact on performance. +The optimal value will depend on your system and on the jobs you are running, so experiment. +As a starting point, we recommend: + + OPENBLAS_NUM_THREADS=1 + OMP_NUM_THREADS=8 + + +# CUDA + +In order to use a GPU to accelerate model training, you will need to ensure that CUDA is installed (specifically nvcc) and that CUDA-related environment variables are set. +This should be done before running `make install` described above. If you forgot to do this, just run `make install` again once everything is set up. +The Makefile will try to detect which version of CUDA is present on your system, and install matching versions of pytorch and cupy. + +To see what version of CUDA will be detected and which torch and cupy packages will be installed you can run: + + make show_cuda_version + +Expert users can override the detected versions on the command line. For example, you might want to do this if you are building Taiyaki on one machine to run on another. + + # Force CUDA version 8.0 + CUDA=8.0 make install + + # Override torch package, and don't install cupy at all + TORCH=my-special-torch-package CUPY= make install + +Users who install Taiyaki system-wide or into an existing activated Python environment will need to make sure CUDA and a corresponding version of PyTorch have been installed. + +## Troubleshooting + +During training, if this error occurs: + + AttributeError: module 'torch._C' has no attribute '_cuda_setDevice' + +or any other error related to the device, it suggests that you are trying to use pytorch's CUDA functionality but that CUDA (specifically nvcc) is either not installed or not correctly set up. + +If: + + nvcc --version + +returns + + -bash: nvcc: command not found + +nvcc is not installed or it is not on your path. + +Ensure that you have installed CUDA (check NVIDIA's intructions) and that the CUDA compiler `nvcc` is on your path. + +To place cuda on your path enter the following: + + export PATH=$PATH:/usr/local/cuda/bin + export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64 + +Once CUDA is correctly configured and you are installing Taiyaki in a new virtual environment (as recommended), you may need to run `make install` again to ensure that you have the correct pytorch package to match your CUDA version. + + +# Running on a UGE cluster + +There are two things to get right: (a) installing with the correct CUDA version, and (b) executing with the correct choice of GPU. + +(a) It is important that **when the package is installed**, it knows which version of the CUDA compiler is available on the machine where it will be executed. +When running on a UGE cluster we might want to do installation on a different machine from execution. +There are two ways of getting around this. 
You can qlogin to a node which has the same resources +as the execution node, and then install using that machine: + + qlogin -l h= + cd + make install + +...or you can tell Taiyaki at the installation stage which version of CUDA to use. For example + + CUDA=8.0 make install + +(b) When **executing** on a UGE cluster you need to make sure you run on a node which has GPUs available, and then tell Taiyaki to use the correct GPU. + +You tell the system to wait for a node which has an available GPU by adding the option **-l gpu=1** to your qsub command. +To find out which GPU has been allocated to your job, you need to look at the environment variable **SGE_HGR_gpu**. If it has the value **cuda0**, then +use GPU number 0, and if it has the value **cuda1**, then use GPU 1. The command line option **--device** (used by **train_flipflop.py** +accepts inputs such as 'cuda0' or 'cuda1' or integers 0 or 1, so SGE_HGR_gpu can be passed straight into the **--device** option. + +The easy way to achieve this is with a Makefile like the one in the directory **workflow**. This Makefile contains comments which will help users run the package on a UGE system. + + +# Diagnostics + +The **misc** directory contains several scripts that are useful for working out where things went wrong (or understanding why they went right). + +Graphs showing the information in mapped read files can be plotted using the script **plot_mapped_signals.py** +A graph showing the progress of training can be plotted using the script **plot_training.py** + +When **train_flipflop.py** is run with the option **--chunk_logging_threshold 0** then all chunks examined are logged (including those used to set +chunk filtering parameters and those rejected for training). The script **plot_chunklog.py** plots several pictures that make use of this logged +information. + +--- + +This is a research release provided under the terms of the Oxford Nanopore Technologies' Public Licence. +Research releases are provided as technology demonstrators to provide early access to features or stimulate Community development of tools. +Support for this software will be minimal and is only provided directly by the developers. Feature requests, improvements, and discussions are welcome and can be implemented by forking and pull requests. +However much as we would like to rectify every issue and piece of feedback users may have, the developers may have limited resource for support of this software. +Research releases may be unstable and subject to rapid iteration by Oxford Nanopore Technologies. + +© 2019 Oxford Nanopore Technologies Ltd. +Taiyaki is distributed under the terms of the Oxford Nanopore Technologies' Public Licence. 
\ No newline at end of file diff --git a/bin/dump_json.py b/bin/dump_json.py new file mode 100755 index 0000000..cf74ca3 --- /dev/null +++ b/bin/dump_json.py @@ -0,0 +1,29 @@ +#!/usr/bin/env python3 +import argparse +import json + +from taiyaki.cmdargs import AutoBool, FileExists, FileAbsent +from taiyaki.helpers import load_model +from taiyaki.json import JsonEncoder + +parser = argparse.ArgumentParser(description='Dump JSON representation of model', + formatter_class=argparse.ArgumentDefaultsHelpFormatter) + +parser.add_argument('--out_file', default=None, action=FileAbsent, help='Output JSON file to this file location') +parser.add_argument('--params', default=True, action=AutoBool, help='Output parameters as well as model structure') + +parser.add_argument('model', action=FileExists, help='Model file to read from') + + +if __name__ == "__main__": + args = parser.parse_args() + model = load_model(args.model) + + json_out = model.json(args.params) + + if args.out_file is not None: + with open(args.out_file, 'w') as f: + print("Writing to file: ", args.out_file) + json.dump(json_out, f, indent=4, cls=JsonEncoder) + else: + print(json.dumps(json_out, indent=4, cls=JsonEncoder)) diff --git a/bin/generate_per_read_params.py b/bin/generate_per_read_params.py new file mode 100755 index 0000000..82eec3f --- /dev/null +++ b/bin/generate_per_read_params.py @@ -0,0 +1,79 @@ +#!/usr/bin/env python3 +import argparse +import csv +from functools import partial +import numpy as np +import os +import sys + +from ont_fast5_api import fast5_interface +from taiyaki.cmdargs import Maybe, NonNegative, Positive +import taiyaki.common_cmdargs as common_cmdargs +import taiyaki.fast5utils as fast5utils +from taiyaki.iterators import imap_mp +from taiyaki.maths import med_mad +from taiyaki.signal import Signal + +parser = argparse.ArgumentParser() + +common_cmdargs.add_common_command_args(parser, 'input_folder input_strand_list limit overwrite version jobs'.split()) + +parser.add_argument('--trim', default=(200, 50), nargs=2, type=NonNegative(int), + metavar=('beginning', 'end'), help='Number of samples to trim off start and end') + +parser.add_argument('output', help='Output .tsv file') + + +def one_read_shift_scale(read_tuple): + + read_filename, read_id = read_tuple + + try: + with fast5_interface.get_fast5_file(read_filename, 'r') as f5file: + read = f5file.get_read(read_id) + sig = Signal(read) + + except Exception as e: + sys.stderr.write('Unable to obtain signal for {} from {}.\n{}\n'.format( + read_id, read_filename, repr(e))) + return (None, None, None) + + else: + signal = sig.current + + if len(signal) > 0: + shift, scale = med_mad(signal) + else: + shift, scale = np.NaN, np.NaN + # note - if signal trimmed by ub, it could be of length zero by this point for short reads + # These are taken out later in the existing code, in the new code we'll take out ub trimming + + return (read_id, shift, scale) + + +if __name__ == '__main__': + + args = parser.parse_args() + + if not args.overwrite: + if os.path.exists(args.output): + print("Cowardly refusing to overwrite {}".format(args.output)) + sys.exit(1) + + fast5_reads = fast5utils.iterate_fast5_reads(args.input_folder, + limit=args.limit, + strand_list=args.input_strand_list) + trim_start, trim_end = args.trim + + with open(args.output, 'w') as tsvfile: + writer = csv.writer(tsvfile, delimiter='\t', lineterminator='\n') + # UUID is 32hexdigits and four dashes eg. 
'43f6a05c-0856-4edc-8cd2-4866d9d60eaa' + writer.writerow(['UUID', 'trim_start', 'trim_end', 'shift', 'scale']) + + results = imap_mp(one_read_shift_scale, fast5_reads, threads=args.jobs) + + for result in results: + if all(result): + read_id, shift, scale = result + writer.writerow([read_id, trim_start, trim_end, shift, scale]) + diff --git a/bin/get_refs_from_sam.py b/bin/get_refs_from_sam.py new file mode 100755 index 0000000..7ceab65 --- /dev/null +++ b/bin/get_refs_from_sam.py @@ -0,0 +1,73 @@ +#!/usr/bin/env python3 +import argparse +from Bio import SeqIO +from collections import OrderedDict +import os +import pysam +import sys +import traceback + +from taiyaki.helpers import fasta_file_to_dict +from taiyaki.bio import reverse_complement +from taiyaki.cmdargs import proportion, FileExists + + +parser = argparse.ArgumentParser( + description='Extract reference sequence for each read from a SAM alignment file', + formatter_class=argparse.ArgumentDefaultsHelpFormatter) +parser.add_argument('--min_coverage', metavar='proportion', default=0.6, type=proportion, + help='Ignore reads with alignments shorter than min_coverage * read length') +parser.add_argument('--pad', type=int, default=0, + help='Number of bases by which to pad reference sequence') +parser.add_argument('reference', action=FileExists, + help="Genomic references that reads were aligned against") +parser.add_argument('input', metavar='input.sam', nargs='+', + help="SAM or BAM file containing read alignments to reference") + +STRAND = {0: '+', + 16: '-'} + + +def get_refs(sam, ref_seq_dict, min_coverage=0.6, pad=0): + """Read alignments from sam file and return accuracy metrics + """ + res = [] + with pysam.Samfile(sam, 'r') as sf: + for read in sf: + if read.flag != 0 and read.flag != 16: + continue + + coverage = float(read.query_alignment_length) / read.query_length + if coverage < min_coverage: + continue + + read_ref = ref_seq_dict.get(sf.references[read.reference_id], None) + if read_ref is None: + continue + + start = max(0, read.reference_start - read.query_alignment_start - pad) + end = min(len(read_ref), read.reference_end + read.query_length - read.query_alignment_end + pad) + + strand = STRAND[read.flag] + read_ref = read_ref.decode() if isinstance(read_ref, bytes) else read_ref + + if strand == "+": + read_ref = read_ref[start:end].upper() + else: + read_ref = reverse_complement(read_ref[start:end].upper()) + + fasta = ">{}\n{}\n".format(read.qname, read_ref) + + yield (read.qname, fasta) + + +if __name__ == '__main__': + args = parser.parse_args() + + sys.stderr.write("* Loading references (this may take a while for large genomes)\n") + references = fasta_file_to_dict(args.reference, allow_N=True) + + sys.stderr.write("* Extracting read references using SAM alignment\n") + for samfile in args.input: + for (name, fasta) in get_refs(samfile, references, args.min_coverage, args.pad): + sys.stdout.write(fasta) diff --git a/bin/map_to_squiggle.py b/bin/map_to_squiggle.py new file mode 100755 index 0000000..4e7b932 --- /dev/null +++ b/bin/map_to_squiggle.py @@ -0,0 +1,57 @@ +#!/usr/bin/env python3 +import argparse + +from taiyaki import common_cmdargs, fast5utils, helpers, squiggle_match +from taiyaki.cmdargs import (display_version_and_exit, FileExists, + Maybe, NonNegative, Positive, proportion) +from taiyaki.iterators import imap_mp +from taiyaki.version import __version__ + + +parser = argparse.ArgumentParser( + description='Map sequence to current trace using squiggle predictor model', + 
formatter_class=argparse.ArgumentDefaultsHelpFormatter) + + +common_cmdargs.add_common_command_args(parser, "limit jobs version".split()) + +parser.add_argument('--back_prob', default=1e-15, metavar='probability', + type=proportion, help='Probability of backwards move') +parser.add_argument('--input_strand_list', default=None, action=FileExists, + help='Strand summary file containing subset') +parser.add_argument('--localpen', default=None, type=Maybe(NonNegative(float)), + help='Penalty for staying in start and end states, or None to disable them') +parser.add_argument('--minscore', default=None, type=Maybe(NonNegative(float)), + help='Minimum score for matching') +parser.add_argument('--trim', default=(200, 10), nargs=2, type=NonNegative(int), + metavar=('beginning', 'end'), help='Number of samples to trim off start and end') +parser.add_argument('model', action=FileExists, help='Model file') +parser.add_argument('references', action=FileExists, help='Fasta file') +parser.add_argument('read_dir', action=FileExists, help='Directory for fast5 reads') + + +if __name__ == '__main__': + args = parser.parse_args() + + worker_kwarg_names = ['back_prob', 'localpen', 'minscore', + 'trim'] + + model = helpers.load_model(args.model) + + fast5_reads = fast5utils.iterate_fast5_reads(args.read_dir, + limit=args.limit, + strand_list=args.input_strand_list) + + for res in imap_mp(squiggle_match.worker, fast5_reads, threads=args.jobs, + fix_kwargs=helpers.get_kwargs(args, worker_kwarg_names), + unordered=True, init=squiggle_match.init_worker, + initargs=[model, args.references]): + if res is None: + continue + read_id, sig, score, path, squiggle, bases = res + bases = bases.decode('ascii') + print('#{} {}'.format(read_id, score)) + for i, (s, p) in enumerate(zip(sig, path)): + print('{}\t{}\t{}\t{}\t{}\t{}\t{}'.format(read_id, i, s, p, + bases[p], squiggle[p, 0], + squiggle[p, 1], squiggle[p, 2])) diff --git a/bin/predict_squiggle.py b/bin/predict_squiggle.py new file mode 100755 index 0000000..4bfd50b --- /dev/null +++ b/bin/predict_squiggle.py @@ -0,0 +1,37 @@ +#!/usr/bin/env python3 +import argparse +from Bio import SeqIO +import numpy as np +import torch + +from taiyaki import helpers, squiggle_match +from taiyaki.cmdargs import display_version_and_exit, FileExists, Positive +from taiyaki.version import __version__ + + +parser = argparse.ArgumentParser( + description='Predict squiggle from sequence', + formatter_class=argparse.ArgumentDefaultsHelpFormatter) + +parser.add_argument('--version', nargs=0, action=display_version_and_exit, + metavar=__version__, help='Display version information') +parser.add_argument('model', action=FileExists, help='Model file') +parser.add_argument('input', action=FileExists, help='Fasta file') + + +if __name__ == '__main__': + args = parser.parse_args() + + predict_squiggle = helpers.load_model(args.model) + + for seq in SeqIO.parse(args.input, 'fasta'): + seqstr = str(seq.seq).encode('ascii') + embedded_seq_numpy = np.expand_dims(squiggle_match.embed_sequence(seqstr), axis=1) + embedded_seq_torch = torch.tensor(embedded_seq_numpy, dtype=torch.float32) + + with torch.no_grad(): + squiggle = np.squeeze(predict_squiggle(embedded_seq_torch).cpu().numpy(), axis=1) + + print('base', 'current', 'sd', 'dwell', sep='\t') + for base, (mean, logsd, dwell) in zip(seq.seq, squiggle): + print(base, mean, np.exp(logsd), np.exp(-dwell), sep='\t') diff --git a/bin/prepare_mapped_reads.py b/bin/prepare_mapped_reads.py new file mode 100755 index 0000000..361c366 --- /dev/null +++ 
b/bin/prepare_mapped_reads.py @@ -0,0 +1,61 @@ +#!/usr/bin/env python +import argparse +from taiyaki.iterators import imap_mp +import os +import sys +from taiyaki.cmdargs import FileExists +import taiyaki.common_cmdargs as common_cmdargs +from taiyaki import fast5utils, helpers, prepare_mapping_funcs, variables + + +program_description = "Prepare data for model training and save to hdf5 file by remapping with flip-flop model" +parser = argparse.ArgumentParser(description=program_description, + formatter_class=argparse.ArgumentDefaultsHelpFormatter) + + +common_cmdargs.add_common_command_args(parser, 'device input_folder input_strand_list jobs limit overwrite version'.split()) +default_alphabet_str = variables.DEFAULT_ALPHABET.decode("utf-8") +parser.add_argument('--alphabet', default=default_alphabet_str, + help='Alphabet for basecalling. Defaults to ' + default_alphabet_str) +parser.add_argument('--collapse_alphabet', default=default_alphabet_str, + help='Collapsed alphabet for basecalling. Defaults to ' + default_alphabet_str) +parser.add_argument('input_per_read_params', action=FileExists, + help='Input per read parameter .tsv file') +parser.add_argument('output', help='Output HDF5 file') +parser.add_argument('model', action=FileExists, help='Taiyaki model file') +parser.add_argument('references', action=FileExists, + help='Single fasta file containing references for each read') + + +def main(argv): + """Main function to process mapping for each read using functions in prepare_mapping_funcs""" + args = parser.parse_args() + print("Running prepare_mapping using flip-flop remapping") + + if not args.overwrite: + if os.path.exists(args.output): + print("Cowardly refusing to overwrite {}".format(args.output)) + sys.exit(1) + + # Make an iterator that yields all the reads we're interested in. 
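+    # iterate_fast5_reads yields (filename, read_id) pairs for single- or multi-read
+    # fast5 files, restricted by --input_strand_list and --limit when those are given.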
+ fast5_reads = fast5utils.iterate_fast5_reads(args.input_folder, + limit=args.limit, + strand_list=args.input_strand_list) + + # Set up arguments (kwargs) for the worker function for each read + kwargs = helpers.get_kwargs(args, ['alphabet', 'collapse_alphabet', 'device']) + kwargs['per_read_params_dict'] = prepare_mapping_funcs.get_per_read_params_dict_from_tsv(args.input_per_read_params) + kwargs['references'] = helpers.fasta_file_to_dict(args.references) + kwargs['model'] = helpers.load_model(args.model) + workerFunction = prepare_mapping_funcs.oneread_remap # remaps a single read using flip-flip network + + results = imap_mp(workerFunction, fast5_reads, threads=args.jobs, + fix_kwargs=kwargs, unordered=True) + + # results is an iterable of dicts + # each dict is a set of return values from a single read + prepare_mapping_funcs.generate_output_from_results(results, args) + + +if __name__ == '__main__': + sys.exit(main(sys.argv[:])) diff --git a/bin/train_flipflop.py b/bin/train_flipflop.py new file mode 100755 index 0000000..78646f7 --- /dev/null +++ b/bin/train_flipflop.py @@ -0,0 +1,250 @@ +#!/usr/bin/env python3 +import argparse +from collections import defaultdict +import numpy as np +import os +from shutil import copyfile +import sys +import time + +import torch +import taiyaki.common_cmdargs as common_cmdargs +from taiyaki.cmdargs import (FileExists, NonNegative, Positive, proportion) + +from taiyaki import chunk_selection, ctc, flipflopfings, helpers, mapped_signal_files, variables +from taiyaki.version import __version__ + + +# This is here, not in main to allow documentation to be built +parser = argparse.ArgumentParser( + description='Train a flip-flop neural network', + formatter_class=argparse.ArgumentDefaultsHelpFormatter) + +common_cmdargs.add_common_command_args(parser, """adam chunk_logging_threshold device filter_max_dwell filter_mean_dwell + limit lrdecay niteration overwrite quiet save_every + sample_nreads_before_filtering version weight_decay""".split()) + +parser.add_argument('--min_batch_size', default=50, metavar='chunks', type=Positive(int), + help='Number of chunks to run in parallel for chunk_len = chunk_len_max.' + + 'Actual batch size used is (min_batch_size / chunk_len) * chunk_len_max') +parser.add_argument('--chunk_len_min', default=2000, metavar='samples', type=Positive(int), + help='Min length of each chunk in samples (chunk lengths are random between min and max)') +parser.add_argument('--chunk_len_max', default=4000, metavar='samples', type=Positive(int), + help='Max length of each chunk in samples (chunk lengths are random between min and max)') + +parser.add_argument('--input_strand_list', default=None, action=FileExists, + help='Strand summary file containing column read_id. 
Filenames in file are ignored.') +parser.add_argument('--min_prob', default=1e-30, metavar='p', type=proportion, + help='Minimum probability allowed for training') +parser.add_argument('--seed', default=None, metavar='integer', type=Positive(int), + help='Set random number seed') +parser.add_argument('--sharpen', default=1.0, metavar='factor', + type=Positive(float), help='Sharpening factor') +parser.add_argument('--smooth', default=0.45, metavar='factor', type=proportion, + help='Smoothing factor for reporting progress') +parser.add_argument('--stride', default=2, metavar='samples', type=Positive(int), + help='Stride for model') +parser.add_argument('--winlen', default=19, type=Positive(int), + help='Length of window over data') +parser.add_argument('model', action=FileExists, + help='File to read python model description from') + +parser.add_argument('output', help='Prefix for output files') +parser.add_argument('input', action=FileExists, + help='file containing mapped reads') + + +def save_model(network, output, index=None): + if index is None: + basename = 'model_final' + else: + basename = 'model_checkpoint_{:05d}'.format(index) + + model_file = os.path.join(output, basename + '.checkpoint') + torch.save(network, model_file) + params_file = os.path.join(output, basename + '.params') + torch.save(network.state_dict(), params_file) + + +if __name__ == '__main__': + args = parser.parse_args() + + np.random.seed(args.seed) + + device = torch.device(args.device) + if device.type == 'cuda': + torch.cuda.set_device(device) + + if not os.path.exists(args.output): + os.mkdir(args.output) + elif not args.overwrite: + sys.stderr.write('Error: Output directory {} exists but --overwrite is false\n'.format(args.output)) + exit(1) + if not os.path.isdir(args.output): + sys.stderr.write('Error: Output location {} is not directory\n'.format(args.output)) + exit(1) + + copyfile(args.model, os.path.join(args.output, 'model.py')) + + # Create a logging file to save details of chunks. + # If args.chunk_logging_threshold is set to 0 then we log all chunks including those rejected. 
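+    # The chunk log is a tab-separated file written to the output directory;
+    # its format and contents are described in FILE_FORMATS.md.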
+ chunk_log = chunk_selection.ChunkLog(args.output) + + log = helpers.Logger(os.path.join(args.output, 'model.log'), args.quiet) + log.write('* Taiyaki version {}\n'.format(__version__)) + log.write('* Command line\n') + log.write(' '.join(sys.argv) + '\n') + log.write('* Loading data from {}\n'.format(args.input)) + log.write('* Per read file MD5 {}\n'.format(helpers.file_md5(args.input))) + + if args.input_strand_list is not None: + read_ids = list(set(helpers.get_read_ids(args.input_strand_list))) + log.write('* Will train from a subset of {} strands, determined by read_ids in input strand list\n'.format(len(read_ids))) + else: + log.write('* Will train from all strands\n') + read_ids = 'all' + + if args.limit is not None: + log.write('* Limiting number of strands to {}\n'.format(args.limit)) + + with mapped_signal_files.HDF5(args.input, "r") as per_read_file: + read_data = per_read_file.get_multiple_reads(read_ids, max_reads=args.limit) + # read_data now contains a list of reads + # (each an instance of the Read class defined in mapped_signal_files.py, based on dict) + + + log.write('* Loaded {} reads.\n'.format(len(read_data))) + + # Get parameters for filtering by sampling a subset of the reads + # Result is a tuple median mean_dwell, mad mean_dwell + # Choose a chunk length in the middle of the range for this + sampling_chunk_len = (args.chunk_len_min + args.chunk_len_max) // 2 + filter_parameters = chunk_selection.sample_filter_parameters(read_data, + args.sample_nreads_before_filtering, + sampling_chunk_len, + args, + log, + chunk_log=chunk_log) + + medmd, madmd = filter_parameters + + log.write("* Sampled {} chunks: median(mean_dwell)={:.2f}, mad(mean_dwell)={:.2f}\n".format( + args.sample_nreads_before_filtering, medmd, madmd)) + log.write('* Reading network from {}\n'.format(args.model)) + nbase = len(read_data[0]['alphabet']) + model_kwargs = { + 'stride': args.stride, + 'winlen': args.winlen, + 'insize': 1, # Number of input features to model e.g. was >1 for event-based models (level, std, dwell) + 'outsize': variables.nstate_flipflop(nbase) + } + network = helpers.load_model(args.model, **model_kwargs).to(device) + log.write('* Network has {} parameters.\n'.format(sum([p.nelement() + for p in network.parameters()]))) + + learning_rate = args.adam.rate + betas = args.adam[1:] + optimizer = torch.optim.Adam(network.parameters(), lr=learning_rate, + betas=betas, weight_decay=args.weight_decay) + lr_decay = lambda step: args.lrdecay / (args.lrdecay + step) + lr_scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_decay) + + score_smoothed = helpers.ExponentialSmoother(args.smooth) + + log.write('* Dumping initial model\n') + save_model(network, args.output, 0) + + total_bases = 0 + total_samples = 0 + total_chunks = 0 + rejection_dict = defaultdict(lambda : 0) # To count the numbers of different sorts of chunk rejection + + t0 = time.time() + log.write('* Training\n') + + + + for i in range(args.niteration): + lr_scheduler.step() + # Chunk length is chosen randomly in the range given but forced to be a multiple of the stride + batch_chunk_len = (np.random.randint(args.chunk_len_min, args.chunk_len_max + 1) // args.stride) * args.stride + # We choose the batch size so that the size of the data in the batch is about the same as + # args.min_batch_size chunks of length args.chunk_len_max + target_batch_size = int(args.min_batch_size * args.chunk_len_max / batch_chunk_len + 0.5) + # ...but it can't be more than the number of reads. 
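+        # Worked example (illustrative only, using this script's default arguments):
+        # with min_batch_size=50, chunk_len_max=4000 and stride=2, a randomly drawn
+        # batch_chunk_len of 2500 gives
+        #     target_batch_size = int(50 * 4000 / 2500 + 0.5) = 80
+        # so shorter chunks are traded for a proportionally larger batch and the
+        # number of samples per batch stays roughly constant (80 * 2500 = 50 * 4000).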
+ batch_size = min(target_batch_size, len(read_data)) + + + # If the logging threshold is 0 then we log all chunks, including those rejected, so pass the log + # object into assemble_batch + if args.chunk_logging_threshold == 0: + log_rejected_chunks = chunk_log + else: + log_rejected_chunks = None + # Chunk_batch is a list of dicts. + chunk_batch, batch_rejections = chunk_selection.assemble_batch(read_data, batch_size, batch_chunk_len, + filter_parameters, args, log, + chunk_log=log_rejected_chunks) + total_chunks += len(chunk_batch) + + # Update counts of reasons for rejection + for k, v in batch_rejections.items(): + rejection_dict[k] += v + + # Shape of input tensor must be (timesteps) x (batch size) x (input channels) + # in this case batch_chunk_len x batch_size x 1 + stacked_current = np.vstack([d['current'] for d in chunk_batch]).T + indata = torch.tensor(stacked_current, device=device, dtype=torch.float32).unsqueeze(2) + # Sequence input tensor is just a 1D vector, and so is seqlens + seqs = torch.tensor(np.concatenate([flipflopfings.flip_flop_code(d['sequence']) for d in chunk_batch]), + device=device, dtype=torch.long) + seqlens = torch.tensor([len(d['sequence']) for d in chunk_batch], dtype=torch.long, device=device) + + optimizer.zero_grad() + outputs = network(indata) + lossvector = ctc.crf_flipflop_loss(outputs, seqs, seqlens, args.sharpen) + loss = lossvector.sum() / (seqlens > 0.0).float().sum() + loss.backward() + optimizer.step() + + fval = float(loss) + score_smoothed.update(fval) + + # Check for poison chunk and save losses and chunk locations if we're poisoned + # If args.chunk_logging_threshold set to zero then we log everything + if fval / score_smoothed.value >= args.chunk_logging_threshold: + chunk_log.write_batch(i, chunk_batch, lossvector) + + total_bases += int(seqlens.sum()) + total_samples += int(indata.nelement()) + + # Doing this deletion leads to less CUDA memory usage. 
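+        # Note (a general observation about PyTorch rather than anything specific to
+        # Taiyaki): the Python references above keep the underlying CUDA tensors
+        # alive, so dropping them before torch.cuda.empty_cache() is what allows the
+        # caching allocator to actually hand those blocks back to the device.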
+ del indata, seqs, seqlens, outputs, loss, lossvector + if device.type == 'cuda': + torch.cuda.empty_cache() + + if (i + 1) % args.save_every == 0: + save_model(network, args.output, (i + 1) // args.save_every) + log.write('C') + else: + log.write('.') + + if (i + 1) % 50 == 0: + # In case of super batching, additional functionality must be + # added here + learning_rate = lr_scheduler.get_lr()[0] + tn = time.time() + dt = tn - t0 + t = ' {:5d} {:5.3f} {:5.2f}s ({:.2f} ksample/s {:.2f} kbase/s) lr={:.2e}' + log.write(t.format((i + 1) // 50, score_smoothed.value, + dt, total_samples / 1000.0 / dt, + total_bases / 1000.0 / dt, learning_rate)) + # Write summary of chunk rejection reasons + for k, v in rejection_dict.items(): + log.write(" {}:{} ".format(k, v)) + log.write("\n") + total_bases = 0 + total_samples = 0 + t0 = tn + + save_model(network, args.output) diff --git a/bin/train_squiggle.py b/bin/train_squiggle.py new file mode 100755 index 0000000..529e56d --- /dev/null +++ b/bin/train_squiggle.py @@ -0,0 +1,219 @@ +#!/usr/bin/env python3 + + +import argparse +import numpy as np +import os +import sys +import time +import torch + +from collections import defaultdict +from taiyaki import chunk_selection, helpers, mapped_signal_files +import taiyaki.common_cmdargs as common_cmdargs +from taiyaki.cmdargs import (FileExists, Maybe, NonNegative, Positive, proportion) +from taiyaki import activation, layers +#from taiyaki.optim import Adamski +from taiyaki.squiggle_match import squiggle_match_loss, embed_sequence +from taiyaki.version import __version__ + + +parser = argparse.ArgumentParser( + description='Train a model to predict ionic current levels from sequence', + formatter_class=argparse.ArgumentDefaultsHelpFormatter) + +common_cmdargs.add_common_command_args(parser, """adam chunk_logging_threshold device filter_max_dwell filter_mean_dwell + limit lrdecay niteration overwrite quiet save_every + sample_nreads_before_filtering version weight_decay""".split()) + +parser.add_argument('--batch_size', default=100, metavar='chunks', type=Positive(int), + help='Number of chunks to run in parallel') +parser.add_argument('--back_prob', default=1e-15, metavar='probability', + type=proportion, help='Probability of backwards move') +parser.add_argument('--depth', metavar='layers' , default=4, type=Positive(int), + help='Number of residual convolution layers') +parser.add_argument('--drop_slip', default=5, type=Maybe(Positive(int)), metavar='length', + help='Drop chunks with slips greater than given length (None = off)') +parser.add_argument('--input_strand_list', default=None, action=FileExists, + help='Strand summary file containing column read_id. 
Filenames in file are ignored.') +parser.add_argument('--sd', default=0.5, metavar='value', type=Positive(float), + help='Standard deviation to initialise with') +parser.add_argument('--seed', default=None, metavar='integer', type=Positive(int), + help='Set random number seed') +parser.add_argument('--size', metavar='n', default=32, type=Positive(int), + help='Size of layers in convolution network') +parser.add_argument('--smooth', default=0.45, metavar='factor', type=proportion, + help='Smoothing factor for reporting progress') +parser.add_argument('--target_len', metavar='n', default=300, type=Positive(int), + help='Target length of sequence') +parser.add_argument('--winlen', metavar='n', default=7, type=Positive(int), + help='Window for convolution network') +parser.add_argument('input', action=FileExists, help='HDF5 file containing mapped reads') +parser.add_argument('output', help='Prefix for output files') + + +def create_convolution(size, depth, winlen): + conv_actfun = activation.tanh + return layers.Serial( + [layers.Convolution(3, size, winlen, stride=1, fun=conv_actfun)] + + [layers.Residual(layers.Convolution(size, size, winlen, stride=1, fun=conv_actfun)) for _ in range(depth)] + + [layers.Convolution(size, 3, winlen, stride=1, fun=activation.linear)] + ) + + +def save_model(network, output, index=None): + if index is None: + basename = 'model_final' + else: + basename = 'model_checkpoint_{:05d}'.format(index) + + model_file = os.path.join(output, basename + '.checkpoint') + torch.save(network, model_file) + params_file = os.path.join(output, basename + '.params') + torch.save(network.state_dict(), params_file) + + +if __name__ == '__main__': + args = parser.parse_args() + np.random.seed(args.seed) + + if not os.path.exists(args.output): + os.mkdir(args.output) + elif not args.overwrite: + sys.stderr.write('Error: Output directory {} exists but --overwrite is false\n'.format(args.output)) + exit(1) + if not os.path.isdir(args.output): + sys.stderr.write('Error: Output location {} is not directory\n'.format(args.output)) + exit(1) + + log = helpers.Logger(os.path.join(args.output, 'model.log'), args.quiet) + log.write('# Taiyaki version {}\n'.format(__version__)) + log.write('# Command line\n') + log.write(' '.join(sys.argv) + '\n') + + if args.input_strand_list is not None: + read_ids = list(set(helpers.get_read_ids(args.input_strand_list))) + log.write('* Will train from a subset of {} strands\n'.format(len(read_ids))) + else: + log.write('* Will train from all strands\n') + read_ids = 'all' + + if args.limit is not None: + log.write('* Limiting number of strands to {}\n'.format(args.limit)) + + with mapped_signal_files.HDF5(args.input, "r") as per_read_file: + read_data = per_read_file.get_multiple_reads(read_ids, max_reads=args.limit) + # read_data now contains a list of reads + # (each an instance of the Read class defined in mapped_signal_files.py, based on dict) + + log.write('* Loaded {} reads.\n'.format(len(read_data))) + + # Create a logging file to save details of chunks. + # If args.chunk_logging_threshold is set to 0 then we log all chunks including those rejected. 
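+    # Illustrative note: misc/plot_chunklog.py (added later in this patch) reads the
+    # resulting chunk log with fileio.readtsv and uses the columns 'iteration',
+    # 'status', 'chunk_len_bases', 'chunk_len_samples' and 'max_dwell', so the file
+    # written by ChunkLog can be assumed to be a tsv containing at least those columns.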
+ chunk_log = chunk_selection.ChunkLog(args.output) + + # Get parameters for filtering by sampling a subset of the reads + # Result is a tuple median mean_dwell, mad mean_dwell + filter_parameters = chunk_selection.sample_filter_parameters(read_data, + args.sample_nreads_before_filtering, + args.target_len, + args, + log, + chunk_log=chunk_log) + + medmd, madmd = filter_parameters + log.write("* Sampled {} chunks: median(mean_dwell)={:.2f}, mad(mean_dwell)={:.2f}\n".format( + args.sample_nreads_before_filtering, medmd, madmd)) + + conv_net = create_convolution(args.size, args.depth, args.winlen) + nparam = sum([p.data.detach().numpy().size for p in conv_net.parameters()]) + log.write('# Created network. {} parameters\n'.format(nparam)) + log.write('# Depth {} layers ({} residual layers)\n'.format(args.depth + 2, args.depth)) + log.write('# Window width {}\n'.format(args.winlen)) + log.write('# Context +/- {} bases\n'.format((args.depth + 2) * (args.winlen // 2))) + + device = torch.device(args.device) + conv_net = conv_net.to(device) + + + + learning_rate = args.adam[0] + betas = args.adam[1:] + optimizer = torch.optim.Adam(conv_net.parameters(), lr=learning_rate, + betas=betas, weight_decay=args.weight_decay) + lr_decay = lambda step: args.lrdecay / (args.lrdecay + step) + lr_scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_decay) + + rejection_dict = defaultdict(lambda : 0) # To count the numbers of different sorts of chunk rejection + t0 = time.time() + score_smoothed = helpers.ExponentialSmoother(args.smooth) + total_chunks = 0 + + for i in range(args.niteration): + learning_rate = args.adam.rate / (1.0 + (i**1.25) / args.lrdecay) + # If the logging threshold is 0 then we log all chunks, including those rejected, so pass the log + # object into assemble_batch + if args.chunk_logging_threshold == 0: + log_rejected_chunks = chunk_log + else: + log_rejected_chunks = None + # chunk_batch is a list of dicts. 
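+    # Sketch of the chunk dict contents, inferred from how chunk_batch is used below
+    # and in taiyaki/chunk_selection.py rather than from a formal specification: each
+    # dict is expected to carry at least
+    #     d['current']   - standardised signal samples for the chunk
+    #     d['sequence']  - integer-coded reference sequence for the chunk
+    #     d['max_dwell'] - longest dwell in the chunk, used by the dwell filters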
+ chunk_batch, batch_rejections = chunk_selection.assemble_batch(read_data, args.batch_size, args.target_len, + filter_parameters, args, log, + chunk_log=log_rejected_chunks, + chunk_len_means_sequence_len=True) + + total_chunks += len(chunk_batch) + # Update counts of reasons for rejection + for k, v in batch_rejections.items(): + rejection_dict[k] += v + + # Shape of input needs to be seqlen x batchsize x embedding_dimension + embedded_matrix = [embed_sequence(d['sequence'], alphabet=None) for d in chunk_batch] + seq_embed = torch.tensor(embedded_matrix).permute(1,0,2).to(device) + # Shape of labels is a flat vector + batch_signal = torch.tensor(np.concatenate([d['current'] for d in chunk_batch])).to(device) + # Shape of lens is also a flat vector + batch_siglen = torch.tensor([len(d['current']) for d in chunk_batch]).to(device) + + #print("First 10 elements of first sequence in batch",seq_embed[:10,0,:]) + #print("First 10 elements of signal batch",batch_signal[:10]) + #print("First 10 lengths",batch_siglen[:10]) + + optimizer.zero_grad() + + predicted_squiggle = conv_net(seq_embed) + batch_loss = squiggle_match_loss(predicted_squiggle, batch_signal, batch_siglen, args.back_prob) + fval = batch_loss.sum() / float(batch_siglen.sum()) + + fval.backward() + optimizer.step() + + score_smoothed.update(float(fval)) + + # Check for poison chunk and save losses and chunk locations if we're poisoned + # If args.chunk_logging_threshold set to zero then we log everything + if fval / score_smoothed.value >= args.chunk_logging_threshold: + chunk_log.write_batch(i, chunk_batch, batch_loss) + + if (i + 1) % args.save_every == 0: + save_model(conv_net, args.output, (i + 1) // args.save_every) + log.write('C') + else: + log.write('.') + + + ROWLENGTH = 50 + if (i + 1) % ROWLENGTH == 0: + tn = time.time() + dt = tn - t0 + t = ' {:5d} {:5.3f} {:5.2f}s' + log.write(t.format((i + 1) // ROWLENGTH, score_smoothed.value, dt)) + t0 = tn + # Write summary of chunk rejection reasons + for k, v in rejection_dict.items(): + log.write(" {}:{} ".format(k, v)) + log.write("\n") + + + save_model(conv_net, args.output) diff --git a/develop_requirements.txt b/develop_requirements.txt new file mode 100644 index 0000000..65fe683 --- /dev/null +++ b/develop_requirements.txt @@ -0,0 +1,6 @@ +pep8==1.7.0 +autopep8==1.2.4 +ipython==6.1.0 +pytest==3.1.2 +parameterized==0.6.1 +pytest-xdist==1.15.0 diff --git a/misc/align.py b/misc/align.py new file mode 100755 index 0000000..e558195 --- /dev/null +++ b/misc/align.py @@ -0,0 +1,271 @@ +#!/usr/bin/env python3 +import argparse +import csv +from collections import OrderedDict +import numpy as np +import matplotlib +import os +import pysam +from scipy.stats import gaussian_kde +from scipy.optimize import minimize_scalar +import subprocess +import sys +import traceback +from taiyaki.cmdargs import proportion, AutoBool, FileExists + + +parser = argparse.ArgumentParser( + description='Align reads to reference and output accuracy statistics', + formatter_class=argparse.ArgumentDefaultsHelpFormatter) + +# TODO: add several named commonly used values for bwa_mem_args +parser.add_argument('--bwa_mem_args', metavar='args', default='-k14 -W20 -r10 -t 16 -A 1 -B 2 -O 2 -E 1', + help="Command line arguments to pass to bwa mem") +parser.add_argument('--coverage', metavar='proportion', default=0.6, type=proportion, + help='Minimum coverage') +parser.add_argument('--data_set_name', default=None, + help="Data set name. 
If not set file name is used.") +parser.add_argument('--figure_format', default="png", + help="Figure file format. Must be compatible with matplotlib backend.") +parser.add_argument('--fill', default=True, action=AutoBool, + help='Fill basecall quality histogram with color') +parser.add_argument('--show_median', default=False, action=AutoBool, + help='Show median in a histogram plot') +parser.add_argument('--mpl_backend', default="Agg", help="Matplotlib backend to use") +parser.add_argument('--reference', default=None, + help="Reference sequence to align against") + +parser.add_argument('files', metavar='input', nargs='+', + help="One or more files containing query sequences") + + +STRAND = {0 : '+', + 16 : '-'} + +QUANTILES = [5, 25, 50, 75, 95] + + +def call_bwa_mem(fin, fout, genome, clargs=''): + """Call bwa aligner using the subprocess module + + :param fin: input sequence filename + :param fout: filename for the output sam file + :param genome: path to reference to align against + :param clargs: optional command line arguments to pass to bwa as a string + + :returns: stdout of bwa command + + :raises: subprocess.CalledProcessError + """ + command_line = "bwa mem {} {} {} > {}".format(clargs, genome, fin, fout) + try: + output = subprocess.check_output(command_line, + stderr=subprocess.STDOUT, + shell=True, + universal_newlines=True) + except subprocess.CalledProcessError as e: + sys.stderr.write("Error calling bwa, exit code {}\n".format(e.returncode)) + sys.stderr.write(e.output + "\n") + raise + return output + + +def samacc(sam, min_coverage=0.6): + """Read alignments from sam file and return accuracy metrics + + :param sam: filename of input sam file + :min_coverage: alignments are filtered by coverage + + :returns: list of row dictionaries with keys: + reference: reference name + query: query name + reference_start: first base of reference match + reference_end: last base of reference match + strand: + or - + match: number of matches + mismatch: number of mismatches + insertion: number of insertions + deletion: number of deletions + coverage: query alignment length / query length + id: identity = sequence matches / alignment matches + accuracy: sequence matches / alignment length + """ + res = [] + with pysam.Samfile(sam, 'r') as sf: + ref_name = sf.references + for read in sf: + if read.flag != 0 and read.flag != 16: + continue + + coverage = float(read.query_alignment_length) / read.query_length + if coverage < min_coverage: + continue + + bins = np.zeros(9, dtype='i4') + for flag, count in read.cigar: + bins[flag] += count + + tags = dict(read.tags) + alnlen = np.sum(bins[:3]) + mismatch = tags['NM'] + correct = alnlen - mismatch + readlen = bins[0] + bins[1] + perr = min(0.75, float(mismatch) / readlen) + pmatch = 1.0 - perr + + entropy = pmatch * np.log2(pmatch) + if mismatch > 0: + entropy += perr * np.log2(perr / 3.0) + + row = OrderedDict([ + ('reference', ref_name[read.reference_id]), + ('query', read.qname), + ('strand', STRAND[read.flag]), + ('reference_start', read.reference_start), + ('reference_end', read.reference_end), + ('match', bins[0]), + ('mismatch', mismatch), + ('insertion', bins[1]), + ('deletion', bins[2]), + ('coverage', coverage), + ('id', float(correct) / float(bins[0])), + ('accuracy', float(correct) / alnlen), + ('information', bins[0] * (2.0 + entropy)) + ]) + res.append(row) + return res + + +def acc_plot(acc, mode, median, fill, title): + """Plot accuracy histogram + + :param acc_dat: list of row dictionaries of basecall accuracy data + :param 
title: plot title + + :returns: (figure handle, axes handle) + """ + f = plt.figure() + ax = f.add_subplot(111) + ax.hist(acc, bins=np.arange(0.65, 1.0, 0.01), fill=fill) + ax.set_xlim(0.65, 1) + _, ymax = ax.get_ylim() + ax.plot([mode, mode], [0, ymax], 'r--') + if median: + ax.plot([median, median], [0, ymax], 'b--') + ax.set_xlabel("Accuracy") + ax.set_ylabel("Frequency") + ax.set_title(title) + return f, ax + + +def summary(acc_dat, data_set_name, fill, show_median): + """Create summary report and plots for accuracy statistics + + :param acc_dat: list of row dictionaries of read accuracy metrics + + :returns: (report string, figure handle, axes handle) + """ + if len(acc_dat) == 0: + res = """*** Summary report for {} *** +No sequences mapped +""".format(data_set_name) + return res, None, None + + acc = np.array([r['accuracy'] for r in acc_dat]) + ciscore = np.array([r['information'] for r in acc_dat]) + mean = acc.mean() + + if len(acc) > 1: + try: + da = gaussian_kde(acc) + optimization_result = minimize_scalar(lambda x: -da(x), bounds=(0, 1), method='Bounded') + if optimization_result.success: + try: + mode = optimization_result.x[0] + except IndexError: + mode = optimization_result.x + else: + sys.stderr.write("Mode computation failed") + mode = 0 + except: + sys.stderr.write("Mode computation failed - da or opt") + mode = 0 + else: + mode = acc[0] + + qstring1 = ''.join(['{:<11}'.format('Q' + str(q)) for q in QUANTILES]).strip() + quantiles = [v for v in np.percentile(acc, QUANTILES)] + qstring2 = ' '.join(['{:.5f}'.format(v) for v in quantiles]) + + if show_median: + median = np.median(acc) + else: + median = None + + a90 = (acc > 0.9).mean() + n_gt_90 = (acc > 0.9).sum() + nmapped = len(set([r['query'] for r in acc_dat])) + + res = """*** Summary report for {} *** +Number of mapped reads: {} +Mean accuracy: {:.5f} +Mode accuracy: {:.5f} +Accuracy quantiles: + {} + {} +Proportion with accuracy >90%: {:.5f} +Number with accuracy >90%: {} +CIscore (Mbits): {:.5f} +""".format(data_set_name, nmapped, mean, mode, qstring1, qstring2, a90, n_gt_90, sum(ciscore) / 1e6) + plot_title = "{} (n = {})".format(data_set_name, nmapped) + f, ax = acc_plot(acc, mode, median, fill, plot_title) + return res, f, ax + + +if __name__ == '__main__': + args = parser.parse_args() + + # Set the mpl backend. The default, Agg, does not require an X server to be running + # Note: this must happen before matplotlib.pyplot is imported + matplotlib.use(args.mpl_backend) + import matplotlib.pyplot as plt + + exit_code = 0 + for fn in args.files: + try: + prefix, suffix = os.path.splitext(fn) + samfile = prefix + '.sam' + samaccfile = prefix + '.samacc' + summaryfile = prefix + '.summary' + graphfile = prefix + '.' 
+ args.figure_format + + # align sequences to reference + if args.reference and not suffix == '.sam': + sys.stdout.write("Aligning {}...\n".format(fn)) + bwa_output = call_bwa_mem(fn, samfile, args.reference, args.bwa_mem_args) + sys.stdout.write(bwa_output) + + # compile accuracy metrics + acc_dat = samacc(samfile, min_coverage=args.coverage) + if len(acc_dat) > 0: + with open(samaccfile, 'w') as fs: + fields = list(acc_dat[0].keys()) + writer = csv.DictWriter(fs, fieldnames=fields, delimiter=' ') + writer.writeheader() + for row in acc_dat: + writer.writerow(row) + + # write summary file and plot + data_set_name = fn if args.data_set_name is None else args.data_set_name + report, f, ax = summary(acc_dat, data_set_name, args.fill, args.show_median) + if f is not None: + f.savefig(graphfile) + sys.stdout.write('\n' + report + '\n') + with open(summaryfile, 'w') as fs: + fs.writelines(report) + except: + sys.stderr.write("{}: something went wrong, skipping\n\n".format(fn)) + sys.stderr.write("Traceback:\n\n{}\n\n".format(traceback.format_exc())) + exit_code = 1 + + sys.exit(exit_code) diff --git a/misc/check_hdf5_contents.py b/misc/check_hdf5_contents.py new file mode 100755 index 0000000..5c40205 --- /dev/null +++ b/misc/check_hdf5_contents.py @@ -0,0 +1,21 @@ +#!/usr/bin/env python3 +# Check that a HDF5 file c given keys +# Return failure condition if they are not present. + +import argparse +import h5py + +parser = argparse.ArgumentParser( + description='Check that given keys exist in an HDF5 file') + +parser.add_argument('input', help='HDF5 file') +parser.add_argument("keys", nargs="+", help="Keys to check") + + +if __name__ == "__main__": + args = parser.parse_args() + with h5py.File(args.input, 'r') as h5: + for key in args.keys: + testobject = h5[key] + print("Key ", key, "present in", args.input) + print("All keys present") diff --git a/misc/compress_hdf5.sh b/misc/compress_hdf5.sh new file mode 100755 index 0000000..f4a6a96 --- /dev/null +++ b/misc/compress_hdf5.sh @@ -0,0 +1,12 @@ +#!/usr/bin/env bash + +if [ "$1" == "" ] +then + echo "Compress data sets within a HDF5 file" + echo "Usage: compress_hdf5.sh file.hdf5" + exit 1 +fi +INFILE=$1 + +TMPFILE=`mktemp -p .` +h5repack -f SHUF -f GZIP=1 ${INFILE} ${TMPFILE} && mv ${TMPFILE} ${INFILE} diff --git a/misc/merge_mappedsignalfiles.py b/misc/merge_mappedsignalfiles.py new file mode 100755 index 0000000..e79bbf5 --- /dev/null +++ b/misc/merge_mappedsignalfiles.py @@ -0,0 +1,41 @@ +#!/usr/bin/env python3 +# Combine mapped-read files in HDF5 format into a single file + +import argparse +from taiyaki import mapped_signal_files +from taiyaki.cmdargs import Positive + +parser = argparse.ArgumentParser( + description='Combine HDF5 mapped-read files into a single file') +parser.add_argument('output',help='Output filename') +parser.add_argument('inputs', nargs='*', help='One or more input files') +parser.add_argument('--version', default=mapped_signal_files._version, type=Positive(int), + help='Version number for mapped read format') + +#To convert to any new mapped read format (e.g. mapped_signal_files.SQL) +#we should be able to just change MAPPED_READ_CLASS to equal the new class. 
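+# Example invocation (hypothetical file names, for illustration only):
+#     misc/merge_mappedsignalfiles.py merged.hdf5 mapped_part1.hdf5 mapped_part2.hdf5
+# A read_id present in more than one input is written once, from the first input in
+# which it appears.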
+MAPPED_READ_CLASS = mapped_signal_files.HDF5 + + +if __name__ == '__main__': + args = parser.parse_args() + reads_written = set() + print("Writing reads to ", args.output) + with MAPPED_READ_CLASS(args.output, "w") as hout: + hout.write_version_number(args.version) + for infile in args.inputs: + copied_from_this_file = 0 + with MAPPED_READ_CLASS(infile, "r") as hin: + in_version = hin.get_version_number() + if in_version != args.version: + raise Exception("Version number of files should be {} but version number of {} is {}".format(args.version, infile, in_version)) + for read_id in hin.get_read_ids(): + if read_id in reads_written: + print("* Read",read_id,"already present: not copying from",infile) + else: + hout.write_read(read_id, hin.get_read(read_id)) + reads_written.add(read_id) + copied_from_this_file += 1 + print("Copied",copied_from_this_file,"reads from",infile) + print("Copied",len(reads_written),"reads in total") + \ No newline at end of file diff --git a/misc/motif.py b/misc/motif.py new file mode 100755 index 0000000..9f69a96 --- /dev/null +++ b/misc/motif.py @@ -0,0 +1,70 @@ +#!/usr/bin/env python3 +import argparse +import numpy as np + +from taiyaki.cmdargs import (AutoBool, FileExists, Positive) +from taiyaki.fileio import readtsv +from taiyaki.helpers import fasta_file_to_dict + +parser = argparse.ArgumentParser() +parser.add_argument('--refbackground', default=False, action=AutoBool, + help='Get background from references') +parser.add_argument('--down', metavar='bases', type=Positive(int), + default=15, help='number of bases down stream') +parser.add_argument('--up', metavar='bases', type=Positive(int), + default=15, help='number of bases up stream') +parser.add_argument('references', action=FileExists, + help='Fasta file containing references') +parser.add_argument('coordinates', action=FileExists, + help='coordinates file') + +bases = {b: i for i, b in enumerate('ACGT')} + +if __name__ == '__main__': + args = parser.parse_args() + args.up += 1 + + refdict = fasta_file_to_dict(args.references) + coordinates = readtsv(args.coordinates) + + background_counts = np.zeros(len(bases), dtype=float) + if args.refbackground: + for ref in refdict.values(): + refstr = ref.decode('ascii') + background_counts += [refstr.count(b) for b in bases.keys()] + + frags = [] + for coord in coordinates: + readname, pos = coord['filename'], coord['pos'] + readname = readname.decode('ascii') + if pos < args.down: + continue + if not readname in refdict: + continue + ref = refdict[readname] + if pos + args.up > len(ref): + continue + + frag = ref[pos - args.down : pos + args.up].decode('ascii') + states = [bases[b] for b in frag] + frags.append([np.array(states)]) + + if len(frags) == 0: + print("No reads") + + frag_array = np.concatenate(frags).transpose() + count_array = [] + + for pos_array in frag_array: + counts = np.bincount(pos_array) + count_array.append([counts]) + if not args.refbackground: + background_counts += counts + + background_counts /= sum(background_counts) + + position_counts = np.concatenate(count_array) / len(frags) + relative_abundence = position_counts / background_counts + + for pos, logodds in zip(range(-args.down, args.up), np.log(relative_abundence)): + print(pos, logodds) diff --git a/misc/plot_accuracy_histogram_from_alignment_summary.py b/misc/plot_accuracy_histogram_from_alignment_summary.py new file mode 100755 index 0000000..cb733d7 --- /dev/null +++ b/misc/plot_accuracy_histogram_from_alignment_summary.py @@ -0,0 +1,41 @@ +#!/usr/bin/env python3 +import 
argparse +import numpy as np + +import matplotlib as mpl +mpl.use('Agg') # So we don't need an x server +import matplotlib.pyplot as plt + +from taiyaki.fileio import readtsv +from taiyaki.cmdargs import FileExists, Positive + +parser = argparse.ArgumentParser(description='Plot an accuracy histogram from a combined read file', + formatter_class=argparse.ArgumentDefaultsHelpFormatter) + +parser.add_argument('combined_read_file', action=FileExists, help='Combined read file to get data from') +parser.add_argument('--bins', default=100, type=Positive(int), help='Number of bins for histogram') +parser.add_argument('--title', default='', help='Figure title') +parser.add_argument('--output_name', default='basecaller_histogram.png', help='Output file name') + +if __name__ == "__main__": + args = parser.parse_args() + + AccVals=readtsv(args.combined_read_file)['alignment_accuracy'] + + fig, ax = plt.subplots() + + ax.set_title(args.title) + ax.set_xlabel('Accuracy') + ax.set_ylabel('Reads') + + ax.minorticks_on() + ax.grid(which='major', linestyle=':') + ax.grid(which='minor', linestyle=':') + + plt.hist(np.array(AccVals[AccVals>=0]), bins = args.bins) + + plt.tight_layout() + + plt.savefig(args.output_name) + + diff --git a/misc/plot_chunklog.py b/misc/plot_chunklog.py new file mode 100755 index 0000000..48235d8 --- /dev/null +++ b/misc/plot_chunklog.py @@ -0,0 +1,73 @@ +#!/usr/bin/env python3 +import matplotlib as mpl +mpl.use('Agg') # So we don't need an x server +import matplotlib.pyplot as plt +import numpy as np +import sys +from taiyaki import fileio + +print("Plots summary of chunk log.") +print("Usage:") +print("plot_chunk_log.py ") +if len(sys.argv) < 3: + print("ERROR: Needs command line arguments!") +else: + chunk_log_file = sys.argv[1] + plotfile = sys.argv[2] + t = fileio.readtsv(chunk_log_file) + + plt.figure(figsize=(16, 12)) + + plt.subplot(2, 2, 1) + plt.title('Mean dwells of chunks sampled to get filter params') + f = (t['iteration'] == -1) & (t['status'] == 'pass') + bases = t['chunk_len_bases'][f] + samples = t['chunk_len_samples'][f] + filter_sample_length = len(bases) + meandwells = samples / (bases + 0.0001) + plt.hist(meandwells, bins=100, log=True) + plt.grid() + + # Remove the part that refers to the sampling for filter params + t = t[filter_sample_length:] + + plt.subplot(2, 2, 2) + plt.title('Lengths of accepted and rejected chunks') + status_choices = np.unique(t['status']) + # Need to do 'pass' first - otherwise it overwhelms everything + status_choices = list(status_choices[status_choices != 'pass']) + status_choices = ['pass'] + status_choices + for status in status_choices: + filt = (t['status'] == status) + bases = t['chunk_len_bases'][filt] + samples = t['chunk_len_samples'][filt] + print("Status", status, "number of chunks=", len(bases)) + plt.scatter(bases, samples, label=status, s=4) + + plt.grid() + plt.ylabel('Length in bases') + plt.xlabel('Length in samples') + plt.legend(loc='upper left', framealpha=0.3) + + for nplot, scale in enumerate('log linear'.split()): + plt.subplot(2, 2, nplot + 3, xscale=scale, yscale=scale) + plt.title('Max and mean dwells') + status_choices = np.unique(t['status']) + # Need to do 'pass' first - otherwise it overwhelms everything + status_choices = list(status_choices[status_choices != 'pass']) + status_choices = ['pass'] + status_choices + for status in status_choices: + filt = (t['status'] == status) + bases = t['chunk_len_bases'][filt] + samples = t['chunk_len_samples'][filt] + count = len(bases) + meandwells = samples / 
(bases + 0.0001) + maxdwells = t['max_dwell'][filt] + plt.scatter(meandwells, maxdwells, label=status+' ('+str(count)+')', s=4, alpha=0.5) + + plt.grid() + plt.xlabel('Mean dwell') + plt.ylabel('Max dwell') + plt.legend(loc='lower right', framealpha=0.3) + + plt.savefig(plotfile) diff --git a/misc/plot_mapped_signals.py b/misc/plot_mapped_signals.py new file mode 100755 index 0000000..3cfe177 --- /dev/null +++ b/misc/plot_mapped_signals.py @@ -0,0 +1,51 @@ +#!/usr/bin/env python3 +import argparse +import matplotlib as mpl +mpl.use('Agg') # So we don't need an x server +import matplotlib.pyplot as plt +import numpy as np +from taiyaki.cmdargs import Positive +from taiyaki import mapped_signal_files + +parser = argparse.ArgumentParser( + description='Plot graphs of training mapped reads.', + formatter_class=argparse.ArgumentDefaultsHelpFormatter) + +parser.add_argument('mapped_read_file', help='Input: a mapped read file') +parser.add_argument('output', help='Output: a png file') +parser.add_argument('--nreads', type=Positive(int), default=10, + help='Number of reads to plot. Not used if read_ids are given') +parser.add_argument('--read_ids', nargs='+', default=[], + help='One or more read_ids. If not present, plots the first NREADS in the file') + +if __name__=="__main__": + args = parser.parse_args() + print("Opening ", args.mapped_read_file) + with mapped_signal_files.HDF5(args.mapped_read_file, "r") as h5: + all_read_ids = h5.get_read_ids() + print("First ten read_ids in file:") + for read_id in all_read_ids[:10]: + print(" ", read_id) + if len(args.read_ids) > 0: + read_ids = args.read_ids + else: + read_ids = all_read_ids[:args.nreads] + print("Plotting first ", args.nreads, "read ids in file") + plt.figure(figsize=(12, 10)) + for nread, read_id in enumerate(read_ids): + print("Opening read id ",read_id) + r = h5.get_read(read_id) + mapping = r['Ref_to_signal'] + f = mapping >= 0 + maplen = len(mapping) + label = str(nread) + ":" + read_id + " reflen:" + str(maplen - 1) + ", daclen:" + str(len(r['Dacs'])) + plt.plot(np.arange(maplen)[f], mapping[f], label=label) + + plt.grid() + plt.xlabel('Reference location') + plt.ylabel('Signal location') + if len(read_ids) < 15: + plt.legend(loc='upper left', framealpha=0.3) + plt.tight_layout() + print("Saving plot to", args.output) + plt.savefig(args.output) diff --git a/misc/plot_predict_squiggle_output.py b/misc/plot_predict_squiggle_output.py new file mode 100755 index 0000000..3e7318d --- /dev/null +++ b/misc/plot_predict_squiggle_output.py @@ -0,0 +1,29 @@ +#!/usr/bin/env python3 +import matplotlib as mpl +mpl.use('Agg') # So we don't need an x server +import matplotlib.pyplot as plt +import sys +from taiyaki import fileio + +print("Plots output of predict_squiggle.py") +print("Usage:") +print("plot_predict_squiggle_output.py ") +if len(sys.argv) < 3: + print("ERROR: Needs command line arguments!") +else: + predict_squiggle_output_file = sys.argv[1] + plotfile = sys.argv[2] + t = fileio.readtsv(predict_squiggle_output_file) + + plt.figure(figsize=(16, 5)) + tstart = 0 + for nrow in range(len(t)): + i,sd,dwell = t['current'][nrow], t['sd'][nrow], t['dwell'][nrow] + centret = tstart + dwell/2 + plt.bar(centret, sd, dwell, i-sd/2) + plt.text(centret, i, t['base'][nrow]) + tstart +=dwell + plt.xlabel('time') + plt.ylabel('current') + plt.grid() + plt.savefig(plotfile) diff --git a/misc/plot_training.py b/misc/plot_training.py new file mode 100755 index 0000000..c252e90 --- /dev/null +++ b/misc/plot_training.py @@ -0,0 +1,49 @@ +#!/usr/bin/env 
python3 +import argparse +import matplotlib as mpl +mpl.use('Agg') # So we don't need an x server +import matplotlib.pyplot as plt +import os +from taiyaki.cmdargs import Positive + +parser = argparse.ArgumentParser( + description='Plot graphs of training loss', + formatter_class=argparse.ArgumentDefaultsHelpFormatter) + +parser.add_argument('output', help='Output png file') +parser.add_argument('input_directories', nargs='+', + help='One or more directories containing files called model.log') +parser.add_argument('--upper_y_limit', default=None, + type=Positive(float), help='Upper limit of plot y(loss) axis') + +if __name__=="__main__": + args = parser.parse_args() + plt.figure() + for training_directory in args.input_directories: + blocklist = [] + losslist = [] + filepath = training_directory + "/model.log" + print("Opening", filepath) + with open(filepath, "r") as f: + for line in f: + # The * removes error messges in the log + if line.startswith('.') and not ('*' in line): + splitline = line.split() + try: + # This try...except only needed in the case where training stops after + # some dots and before the numbers are written to the file + blocklist.append(int(splitline[1])) + losslist.append(float(splitline[2])) + except: + break + #The label for the legend is the name of the directory (without its full path) + plt.plot(blocklist, losslist, label = os.path.basename(training_directory)) + plt.grid() + plt.xlabel('Iteration blocks (each block = 50 iterations)') + plt.ylabel('Loss') + if args.upper_y_limit is not None: + plt.ylim(top=args.upper_y_limit) + plt.legend(loc='upper right') + plt.tight_layout() + print("Saving plot to", args.output) + plt.savefig(args.output) diff --git a/misc/split_strandlist.py b/misc/split_strandlist.py new file mode 100755 index 0000000..2177833 --- /dev/null +++ b/misc/split_strandlist.py @@ -0,0 +1,54 @@ +#!/usr/bin/env python3 +# Take a directory or a strand list and make N separate files, +# which together list all the strands. + +import argparse +import os + +from taiyaki.cmdargs import Positive + +parser = argparse.ArgumentParser( + description='Split a strand list into a number of smaller strand lists, or alternatively do the same thing starting with a directory containing fast5s.') +parser.add_argument('--maxlistsize', default=10000, type=Positive(int), + help='Maximum size for a strand list') + +parser.add_argument('--outputbase', default=10000, + help='Strand lists will be saved as _000.txt etc. 
If outputbase not present then the input will be used as the base name.') + + +parser.add_argument('input', help='either a strand list file or a directory name') + +strandlist_header = "filename" + +if __name__ == '__main__': + args = parser.parse_args() + # If we can read strands from it, then it's a strand list + try: + strands = [] + with open(args.input, "r") as f: + for nline, line in enumerate(f): + cleanedline = line.rstrip() + if nline < 10: + print(cleanedline) + if cleanedline.endswith('fast5'): # First line is often 'filename' + strands.append(cleanedline) + print("Read", len(strands), "files from strand list") + except: + strands = os.listdir(args.input) + print("Read", len(strands), "files from directory") + for fi in strands: + if not (fi.endswith('fast5')): + raise Exception("Not all files in directory are fast5 files") + + filebase = args.outputbase + if filebase is None: + filebase = args.input + nfiles = (len(strands) + args.maxlistsize - 1) // args.maxlistsize + for filenumber in range(nfiles): + fname = filebase + str(filenumber).zfill(3) + with open(fname, "w") as f: + f.write(strandlist_header + "\n") + startnum = filenumber * args.maxlistsize + endnum = min(len(strands), (filenumber + 1) * args.maxlistsize) + for nstrand in range(startnum, endnum): + f.write(strands[nstrand] + "\n") diff --git a/models/mGru256_flipflop.py b/models/mGru256_flipflop.py new file mode 100644 index 0000000..a903a59 --- /dev/null +++ b/models/mGru256_flipflop.py @@ -0,0 +1,21 @@ +import numpy as np + +from taiyaki.activation import tanh +from taiyaki.layers import Convolution, GruMod, Reverse, Serial, GlobalNormFlipFlop + + +def network(insize=1, size=256, winlen=19, stride=2, outsize=40): + nbase = int(np.sqrt(outsize / 2)) + + assert 2 * nbase * (nbase + 1) == outsize,\ + "Invalid size for a flipflop model: nbase = {}, size = {}".format(nbase, outsize) + + return Serial([ + Convolution(insize, size, winlen, stride=stride, fun=tanh), + Reverse(GruMod(size, size)), + GruMod(size, size), + Reverse(GruMod(size, size)), + GruMod(size, size), + Reverse(GruMod(size, size)), + GlobalNormFlipFlop(size, nbase), + ]) diff --git a/models/mGru256_flipflop_remapping_model_r9_DNA.checkpoint b/models/mGru256_flipflop_remapping_model_r9_DNA.checkpoint new file mode 100644 index 0000000..06ca875 Binary files /dev/null and b/models/mGru256_flipflop_remapping_model_r9_DNA.checkpoint differ diff --git a/models/mGru85.py b/models/mGru85.py new file mode 100644 index 0000000..c4db7b9 --- /dev/null +++ b/models/mGru85.py @@ -0,0 +1,14 @@ +from taiyaki.activation import tanh +from taiyaki.layers import Convolution, GruMod, Reverse, Serial, Softmax + + +def network(insize=1, size=85, winlen=19, stride=5, outsize=1025): + return Serial([ + Convolution(insize, size, winlen, stride=stride, fun=tanh), + Reverse(GruMod(size, size)), + GruMod(size, size), + Reverse(GruMod(size, size)), + GruMod(size, size), + Reverse(GruMod(size, size)), + Softmax(size, outsize), + ]) diff --git a/models/mGru96.py b/models/mGru96.py new file mode 100644 index 0000000..5de0877 --- /dev/null +++ b/models/mGru96.py @@ -0,0 +1,14 @@ +from taiyaki.activation import tanh +from taiyaki.layers import Convolution, GruMod, Reverse, Serial, Softmax + + +def network(insize=1, size=96, winlen=19, stride=5, outsize=1025): + return Serial([ + Convolution(insize, size, winlen, stride=stride, fun=tanh), + Reverse(GruMod(size, size)), + GruMod(size, size), + Reverse(GruMod(size, size)), + GruMod(size, size), + Reverse(GruMod(size, size)), + 
Softmax(size, outsize), + ]) diff --git a/models/tiny_raw_gru.py b/models/tiny_raw_gru.py new file mode 100644 index 0000000..b4b1b3f --- /dev/null +++ b/models/tiny_raw_gru.py @@ -0,0 +1,14 @@ +from taiyaki.activation import tanh +from taiyaki.layers import Convolution, GruMod, Reverse, Serial, Softmax + + +def network(insize=1, size=16, winlen=19, stride=8, outsize=1025): + return Serial([ + Convolution(insize, size, winlen, stride=stride, fun=tanh), + Reverse(GruMod(size, size)), + GruMod(size, size), + Reverse(GruMod(size, size)), + GruMod(size, size), + Reverse(GruMod(size, size)), + Softmax(size, outsize), + ]) diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000..258c77e --- /dev/null +++ b/requirements.txt @@ -0,0 +1,11 @@ +h5py >= 2.2.1,<=2.6.0 +numpy >= 1.9.0 +biopython >= 1.63 +Cython >= 0.25.2 +wheel >= 0.29.0 +ont_fast5_api == 1.2.0 +pysam >= 0.10.0 +matplotlib >= 2.0.0 +scipy >= 1 +torch >= 1 + diff --git a/setup.cfg b/setup.cfg new file mode 100644 index 0000000..28f0cbf --- /dev/null +++ b/setup.cfg @@ -0,0 +1,5 @@ +[aliases] +test=pytest + +[tool:pytest] +testpaths = test/unit diff --git a/setup.py b/setup.py new file mode 100644 index 0000000..052b2a8 --- /dev/null +++ b/setup.py @@ -0,0 +1,135 @@ +from glob import glob +import imp +import os +import subprocess +from setuptools import setup, find_packages +from setuptools.extension import Extension +import sys +import time + + +MAJOR = 3 +MINOR = 0 +REVISION = 0 + + +def git_hash(): + commit = subprocess.check_output(['git', 'rev-parse', '--short', 'HEAD']).decode().strip() + return commit + + +def git_revision(): + revision = subprocess.check_output(['git', 'rev-list', '--count', 'HEAD']).decode().strip() + return int(revision) + + +version_module = '''""" +taiyaki/version.py +This file was generated by setup.py at: {time} +""" + +__version__ = "{version}" +major = {major} +minor = {minor} +revision = {revision} +git_revision = {git_revision} +git_hash = "{git_hash}" +''' + + +def write_version(fn="taiyaki/version.py"): + # The git revision may be incorrect if the source tree is not fully checked out + GIT_REVISION = git_revision() + GIT_HASH = git_hash() + + version = "{}.{}.{}+{}".format(MAJOR, MINOR, REVISION, GIT_HASH) + + with open(fn, 'w') as f: + f.write(version_module.format(time=time.strftime("%a, %d %b %Y %H:%M:%S GMT%z", time.localtime()), + version=version, major=MAJOR, minor=MINOR, + revision=REVISION, git_revision=GIT_REVISION, git_hash=GIT_HASH)) + + +THIS_DIR = os.path.dirname(os.path.abspath(__file__)) +if os.path.exists(os.path.join(THIS_DIR, ".git")): + # If this is a git repo, write the version. Otherwise assume we are installing from a wheel + write_version() + + +taiyaki_version = imp.load_source("version", "taiyaki/version.py") +version = taiyaki_version.__version__ + + +try: + root_dir = os.environ['ROOT_DIR'] +except KeyError: + root_dir = '.' 
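+# Illustrative example of the resulting version string (the short hash here is made
+# up): write_version() above formats it as "{MAJOR}.{MINOR}.{REVISION}+{git hash}",
+# e.g. "3.0.0+0abc123" for this release.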
+ + +install_requires = [ + "h5py >= 2.2.1", + "numpy >= 1.9.0", + "biopython >= 1.63", + "Cython >= 0.25.2", +] + + +# Build extensions +try: + import numpy as np + from Cython.Build import cythonize + extensions = cythonize([ + Extension("taiyaki.squiggle_match.squiggle_match", + [os.path.join("taiyaki/squiggle_match", "squiggle_match.pyx"), + os.path.join("taiyaki/squiggle_match", "c_squiggle_match.c")], + include_dirs=[np.get_include()], + extra_compile_args=["-O3", "-fopenmp", "-std=c99", "-march=native"], + extra_link_args=["-fopenmp"]), + Extension("taiyaki.ctc.ctc", [os.path.join("taiyaki/ctc", "ctc.pyx"), + os.path.join("taiyaki/ctc", "c_crf_flipflop.c")], + include_dirs=[np.get_include()], + extra_compile_args=["-O3", "-fopenmp", "-std=c99", "-march=native"], + extra_link_args=["-fopenmp"]) + ]) +except ImportError: + extensions = [] + sys.stderr.write("WARNING: Numpy and Cython are required to build taiyaki extensions\n") + if any([cmd in sys.argv for cmd in ["install", "build", "build_clib", "build_ext", "bdist_wheel"]]): + raise + + +setup( + name='taiyaki', + version=version, + description='Neural network model training for Nanopore base calling', + maintainer='Tim Massingham', + maintainer_email='tim.massingham@nanoporetech.com', + url='http://www.nanoporetech.com', + long_description="""Taiyaki is a library to support training and developing new base calling models +for Oxford Nanopore Technologies' sequencing platforms.""", + + classifiers=[ + 'Development Status :: 3 - Alpha', + 'Environment :: Console', + 'Intended Audience :: Developers', + 'Intended Audience :: Science/Research', + 'Natural Language :: English', + 'Operating System :: Unix', + 'Programming Language :: Python :: 3 :: Only', + 'Topic :: Scientific/Engineering :: Artificial Intelligence', + 'Topic :: Scientific/Engineering :: Bio-Informatics', + 'Topic :: Scientific/Engineering :: Mathematics' + ], + + packages=find_packages(exclude=["*.test", "*.test.*", "test.*", "test", "bin"]), + package_data={'configs': 'data/configs/*'}, + exclude_package_data={'': ['*.hdf', '*.c', '*.h']}, + ext_modules=extensions, + setup_requires=["pytest-runner", "pytest-xdist"], + tests_require=["parameterized", "pytest"], + install_requires=install_requires, + dependency_links=[], + zip_safe=False, + scripts=[x for x in glob('bin/*.py')], + +) diff --git a/taiyaki/__init__.py b/taiyaki/__init__.py new file mode 100644 index 0000000..ed8c6be --- /dev/null +++ b/taiyaki/__init__.py @@ -0,0 +1 @@ +"""Custard owns my heart!""" diff --git a/taiyaki/activation.py b/taiyaki/activation.py new file mode 100644 index 0000000..e2ce918 --- /dev/null +++ b/taiyaki/activation.py @@ -0,0 +1,135 @@ +import torch +# Some activation functions +# Many based on M-estimations functions, see +# http://research.microsoft.com/en-us/um/people/zhang/INRIA/Publis/Tutorial-Estim/node24.html + + +# Unbounded +def sqr(x): + # See https://github.com/pytorch/pytorch/issues/2618 + return torch.pow(x, 2) + + +def linear(x): + return x + + +def relu(x): + return torch.relu(x) + + +def relu_smooth(x): + y = torch.clamp(x, 0.0, 1.0) + return sqr(y) - 2.0 * y + x + abs(x) + + +def softplus(x): + """ Softplus function log(1 + exp(x)) + + Calculated in a way stable to large and small values of x. The version + of this routine in theano.tensor.nnet clips the range of x, potential + causing NaN's to occur in the softmax (all inputs clipped to zero). 
+ + x >=0 --> x + log1p(exp(-x)) + x < 0 --> log1p(exp(x)) + + This is equivalent to relu(x) + log1p(exp(-|x|)) + """ + absx = abs(x) + softplus_neg = torch.log1p(torch.exp(-absx)) + return relu(x) + softplus_neg + + +def elu(x, alpha=1.0): + """ Exponential Linear Unit + See "Fast and Accuracte Deep Network Learning By Exponential Linear + Units" Clevert, Unterthiner and Hochreiter. + https://arxiv.org/pdf/1511.07289.pdf + + :param alpha: Exponential scaling parameter, see paper for details. + """ + return selu(x, alpha, 1.0) + + +def selu(x, alpha=1.6733, lam=1.0507): + """ Scaled Exponential Linear Unit + See "Self-Normalizing Neural Networks" Klambauer, Unterthiner, Mayr + and Hocreiter. https://arxiv.org/pdf/1706.02515.pdf + + :param alpha: Exponential scaling parameter, see paper for details. + :param lam: Scaling parameter, see paper for details. + """ + return lam * torch.where(x > 0, x, alpha * torch.expm1(x)) + + +def exp(x): + return torch.exp(x) + + +# Bounded and monotonic + + +def tanh(x): + return torch.tanh(x) + + +def sigmoid(x): + return torch.sigmoid(x) + + +def erf(x): + return torch.erf(x) + + +def L1mL2(x): + return x / torch.sqrt(1.0 + 0.5 * x * x) + + +def fair(x): + return x / (1.0 + abs(x) / 1.3998) + + +def retu(x): + """ Rectifying activation followed by Tanh + + Inspired by more biological neural activation, see figure 1 + http://jmlr.org/proceedings/papers/v15/glorot11a/glorot11a.pdf + """ + return tanh(relu(x)) + + +def tanh_pm(x): + """ Poor man's tanh + Linear approximation by tangent at x=0. Clip into valid range. + """ + return torch.clamp(x, -1.0, 1.0) + + +def sigmoid_pm(x): + """ Poor man's sigmoid + Linear approximation by tangent at x=0. Clip into valid range. + """ + return torch.clamp(0.5 + 0.25 * x, 0.0, 1.0) + + +def bounded_linear(x): + """ Linear activation clipped into -1, 1 + """ + return torch.clamp(x, -1.0, 1.0) + + +# Bounded and redescending +def sin(x): + return torch.sin(x) + + +def cauchy(x): + return x / (1.0 + sqr(x / 2.3849)) + + +def geman_mcclure(x): + return x / sqr(1.0 + sqr(x)) + + +def welsh(x): + return x * exp(-sqr(x / 2.9846)) diff --git a/taiyaki/bio.py b/taiyaki/bio.py new file mode 100644 index 0000000..4e09114 --- /dev/null +++ b/taiyaki/bio.py @@ -0,0 +1,21 @@ +""" Module containing collection of functions for operating on sequences +represented as strings, and lists thereof. +""" +from taiyaki.iterators import product, window + +# Base complements +_COMPLEMENT = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C', 'X': 'X', 'N': 'N', + 'a': 't', 't': 'a', 'c': 'g', 'g': 'c', 'x': 'x', 'n': 'n', + '-': '-'} + + +def reverse_complement(seq, compdict=_COMPLEMENT): + """ Return reverse complement of a base sequence. + + :param seq: A string of bases. + :param compdict: A dictionary containing base complements + + :returns: A string of bases. 
+ + """ + return ''.join(compdict[b] for b in seq)[::-1] diff --git a/taiyaki/c_decoding.c b/taiyaki/c_decoding.c new file mode 100644 index 0000000..6f9470b --- /dev/null +++ b/taiyaki/c_decoding.c @@ -0,0 +1,288 @@ +#include +#include +#include "c_decoding.h" + +#define BIG_FLOAT 1.e30f + + +/** + * + * @param x[nr*nc] Array containing matrix stored column-major + * @param nr Number of rows + * @param nc Number of columns + * @param idx[nc] Array[out] to write indices + * + * @returns Indices found stored in array `idx` + **/ +void colmaxf(float * x, size_t nr, size_t nc, int * idx){ + assert(nr > 0); + assert(nc > 0); + assert(NULL != x); + assert(NULL != idx); + + for(int r=0 ; r < nr ; r++){ + // Initialise + idx[r] = 0; + } + + for(int c=1 ; c < nc ; c++){ + const size_t offset2 = c * nr; + for(int r=0 ; r x[idx[r] * nr + r]){ + idx[r] = c; + } + } + } +} + + +/** Find location of maximum element of array + * + * @param x Array + * @param n Length of array + * + * @returns Index of maximum element or -1 on error + **/ +int argmaxf(const float *x, size_t n) { + assert(n > 0); + if (NULL == x) { + return -1; + } + int imax = 0; + float vmax = x[0]; + for (int i = 1; i < n; i++) { + if (x[i] > vmax) { + vmax = x[i]; + imax = i; + } + } + return imax; +} + + +/** Backtrace to determine best Viterbi path + * + * @param curr_score[nstate] Array containing forward scores of last position + * @param nstate Number of states including start and end states + * @param nblock Number of block + * @param traceback[nstate*nblock] Array containing traceback matrix stored + * column-major format, or NULL + * @param path[nblock] Array[out] to write path, or NULL + * + * @returns Score of best path. Best scoring path is writen to the array path if both + * traceback and path are non-NULL + **/ +float viterbi_local_backtrace(float const *curr_score, size_t nstate, size_t nblock, int const * traceback, int32_t * path){ + assert(NULL != curr_score); + + const int32_t START_STATE = nstate - 2; + const int32_t END_STATE = nstate - 1; + + int32_t last_state = argmaxf(curr_score, nstate); + float logscore = curr_score[last_state]; + + if(NULL != path && NULL != traceback){ + // Decode + for(size_t i=0 ; i<=nblock ; i++){ + // Initialise entries to stay + path[i] = -1; + } + + for(int i=0 ; i < nblock ; i++){ + const int ri = nblock - i - 1; + const int32_t state = traceback[ri * nstate + last_state]; + if(state >= 0){ + path[ri + 1] = last_state; + last_state = state; + } + } + path[0] = last_state; + + // Transcode start to stay + for(int i=0 ; i < nblock ; i++){ + if(path[i] == START_STATE){ + path[i] = -1; + } else { + break; + } + } + // Transcode end to stay + for(int i=nblock ; i >= 0 ; i--){ + if(path[i] == END_STATE){ + path[i] = -1; + } else { + break; + } + } + } + + return logscore; +} + + +/** Forwards sweep of Viterbi algorithm, + * + * @param logpost Array containing weights for each block. Strided matrix with + * column-major storage. + * @param nblock Number of blocks in each chunk + * @param nparam Number of parameters (weights) output per block + * @param nbase Number of bases + * @param stride Stride of matrix logpost + * @param stay_pen Penalty to suppress stays (positive == more suppression) + * @param skip_pen Penalty to suppress skips (positive == more suppression) + * @param local_pen Local matching penalty (positive == less clipping) + * @param path[nblock] Array[out] to write path, or NULL + * + * @returns Score of best path. 
Best scoring path is writen to the array path if both + * traceback and path are non-NULL + **/ +float fast_viterbi(float const * logpost, size_t nblock, size_t nparam, size_t nbase, size_t stride, + float stay_pen, float skip_pen, float local_pen, + int32_t *path){ + float logscore = NAN; + assert(NULL != logpost); + + const int nstep = nbase; + const int nskip = nbase * nbase; + + const size_t nhst = nparam - 1; + assert(nhst % nstep == 0); + assert(nhst % nskip == 0); + const int step_rem = nhst / nstep; + const int skip_rem = nhst / nskip; + + const size_t nstate = nhst + 2; // States including start and end for local matching + const size_t START_STATE = nhst; + const size_t END_STATE = nhst + 1; + const size_t STAY = 0; + + float * cscore = calloc(nstate, sizeof(float)); + float * pscore = calloc(nstate, sizeof(float)); + int * step_idx = calloc(step_rem, sizeof(int)); + int * skip_idx = calloc(skip_rem, sizeof(int)); + int * traceback = calloc(nstate * nblock, sizeof(int)); + + if(NULL != cscore && NULL != pscore && NULL != step_idx && NULL != skip_idx && NULL != traceback){ + // Initialise -- must begin in start state + for(size_t i=0 ; i < nstate ; i++){ + cscore[i] = -BIG_FLOAT; + } + cscore[START_STATE] = 0.0f; + + // Forwards Viterbi + for(int i=0 ; i < nblock ; i++){ + const size_t lpoffset = i * stride; + const size_t toffset = i * nstate; + { // Swap vectors + float * tmp = pscore; + pscore = cscore; + cscore = tmp; + } + + // Step indices + colmaxf(pscore, step_rem, nstep, step_idx); + // Skip indices + colmaxf(pscore, skip_rem, nskip, skip_idx); + + // Update score for step and skip + for(int hst=0 ; hst < nhst ; hst++){ + int step_prefix = hst / nstep; + int skip_prefix = hst / nskip; + int step_hst = step_prefix + step_idx[step_prefix] * step_rem; + int skip_hst = skip_prefix + skip_idx[skip_prefix] * skip_rem; + + float step_score = pscore[step_hst]; + float skip_score = pscore[skip_hst] - skip_pen; + if(step_score > skip_score){ + // Arbitrary assumption here! Should be >= ? + cscore[hst] = step_score; + traceback[toffset + hst] = step_hst; + } else { + cscore[hst] = skip_score; + traceback[toffset + hst] = skip_hst; + } + cscore[hst] += logpost[lpoffset + hst + 1]; + } + + // Stay + for(int hst=0 ; hst < nhst ; hst++){ + const float score = pscore[hst] + logpost[lpoffset + STAY] - stay_pen; + if(score > cscore[hst]){ + // Arbitrary assumption here! Should be >= ? 
+ cscore[hst] = score; + traceback[toffset + hst] = -1; + } + } + + // Remain in start state -- local penalty or stay + cscore[START_STATE] = pscore[START_STATE] + fmaxf(-local_pen, logpost[lpoffset + STAY] - stay_pen); + traceback[toffset + START_STATE] = START_STATE; + // Exit start state + for(int hst=0 ; hst < nhst ; hst++){ + const float score = pscore[START_STATE] + logpost[lpoffset + hst + 1]; + if(score > cscore[hst]){ + cscore[hst] = score; + traceback[toffset + hst] = START_STATE; + } + } + + // Remain in end state -- local penalty or stay + cscore[END_STATE] = pscore[END_STATE] + fmaxf(-local_pen, logpost[lpoffset + STAY] - stay_pen); + traceback[toffset + END_STATE] = END_STATE; + // Enter end state + for(int hst=0 ; hst < nhst ; hst++){ + const float score = pscore[hst] - local_pen; + if(score > cscore[END_STATE]){ + cscore[END_STATE] = score; + traceback[toffset + END_STATE] = hst; + } + } + } + + logscore = viterbi_local_backtrace(cscore, nstate, nblock, traceback, path); + } + + free(traceback); + free(skip_idx); + free(step_idx); + free(pscore); + free(cscore); + + return logscore; +} + + +/** + * + * @param weights[nparam*nbatch*nblock] Array containing weight tensor + * @param nblock Number of blocks in each chunk + * @param nbatch Batch size (number of chunks) + * @param nparam Number of parameters (weights) output per block + * @param nbase Number of bases + * @param stay_pen Penalty to suppress stays (positive == more suppression) + * @param skip_pen Penalty to suppress skips (positive == more suppression) + * @param local_pen Local matching penalty (positive == less clipping) + * @param score[nbatch] Array[out] to contain score of best path for each chunk + * @param path[nblock*nbatch] Array[out] to contain best path for each chunk, stored as + * column in a matrix (column-major format) + * + * @returns Scores of best paths are written to score array, path + **/ +void fast_viterbi_blocks(float const * weights, size_t nblock, size_t nbatch, size_t nparam, size_t nbase, + float stay_pen, float skip_pen, float local_pen, float * score, int32_t * path){ + assert(NULL != weights); // weights [nblock x nbatch x nparam] + assert(NULL != score); // score [nbatch] + assert(NULL != path); // path [nbatch x (nblock + 1)] + + #pragma omp parallel for + for(size_t batch=0 ; batch < nbatch ; batch++){ + const size_t path_offset = batch * nblock; + const size_t w_offset = batch * nparam; + const size_t w_stride = nbatch * nparam; + + score[batch] = fast_viterbi(weights + w_offset, nblock, nparam, nbase, w_stride, + stay_pen, skip_pen, local_pen, + path + path_offset); + } +} + diff --git a/taiyaki/c_decoding.h b/taiyaki/c_decoding.h new file mode 100644 index 0000000..16f6e19 --- /dev/null +++ b/taiyaki/c_decoding.h @@ -0,0 +1,5 @@ +#include +#include + +void fast_viterbi_blocks(float const * weights, size_t nblock, size_t nbatch, size_t nparam, size_t nbase, + float stay_pen, float skip_pen, float local_pen, float * score, int32_t * seq); diff --git a/taiyaki/chunk_selection.py b/taiyaki/chunk_selection.py new file mode 100644 index 0000000..d17ca06 --- /dev/null +++ b/taiyaki/chunk_selection.py @@ -0,0 +1,224 @@ +# Functions to select and filter chunks for training. 
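+# A typical training-loop usage sketch (the sample count, batch size and log object
+# below are illustrative):
+#     filter_params = sample_filter_parameters(read_data, 1000, chunk_len, args)
+#     batch, rejects = assemble_batch(read_data, 50, chunk_len, filter_params, args, log)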
+# Data structures are based on the read dictionary defined in mapped_signal_files.py +from collections import defaultdict +import os +import numpy as np +from taiyaki.maths import med_mad + + +def get_mean_dwell(chunkdict, TINY=0.00000001): + """Calculate mean dwell from the dict of data returned by + mapped_signal_files.Read.get_chunk_with_sample_length() + or by mapped_signal_files.Read.get_chunk_with_sequence_length(). + TINY is added to the denominator to avoid overflow in the case + of zero sequence length""" + if not ('current' in chunkdict and 'sequence' in chunkdict): + return None + return len(chunkdict['current']) / (len(chunkdict['sequence']) + TINY) + + +def chunk_filter(chunkdict, args, filter_parameters): + """Given a chunk dict as returned by mapped_signal_files.Read._get_chunk(), + apply filtering conditions, returning "pass" if everything OK + or a string describing reason for failure if not. + + param : chunkdict : a dictionary as returned by mapped_signal_files.get_chunk() + param : args : command-line args object used to determine filter limits + param : filter_parameters: tuple median(mean_dwell), mad(mean_dwell) from sampled + reads, used to determine filter centre values + + If filter_parameters is None, then don't filter according to dwell, + but still reject reads which haven't produced a chunk at all because + they're not long enough or end in a slip. + """ + if chunkdict is None: # Not possible to get a chunk + return "nochunk" + if 'rejected' in chunkdict: + # The chunkdict contains a reason why it should be rejected. Return this. + return chunkdict['rejected'] + + if filter_parameters is not None: + mean_dwell = get_mean_dwell(chunkdict) + if mean_dwell is None: + return 'data_missing' + median_meandwell, mad_meandwell = filter_parameters + mean_dwell_dev_from_median = abs(mean_dwell - median_meandwell) + if mean_dwell_dev_from_median > args.filter_mean_dwell * mad_meandwell: + return 'meandwell' + if chunkdict['max_dwell'] > args.filter_max_dwell * median_meandwell: + return 'maxdwell' + return 'pass' + + +def sample_chunks(read_data, number_to_sample, chunk_len, args, chunkfunc, + fraction_of_fails_allowed=0.5, + filter_parameters=None, log=None, + chunk_log=None, + log_accepted_chunks=False, + chunk_len_means_sequence_len=False): + """Sample chunks from a list of read_data, returning + a tuple (chunklist, rejection_dict), where chunklist contains the + results of applying to each chunk that is not rejected. + + rejection_dict is a dictionary with keys describing the reasons for + rejection and values being the number rejected for that reason. E.g. + {'pass':3,'meandwell':3, 'maxdwell':4}. + + param: read_data : a list of Read objects as defined in mapped_signal_files.py + param: number_to_sample : target number of data elements to return, each from + a sampled chunk. If number_to_sample is 0 or None + then get the same number of chunks as the number + of read_data items supplied. + param: chunk_len : desired length of chunk in samples, or length + of sequence in bases if chunk_len_means_sequence_len + param: args : command-line args object from argparse. Used to + pass the filter command-line arguments. + param: chunkfunc : the function to be applied to the chunkdict to get a single + data item in the list returned + param: fraction_of_fails_allowed : Visit a maximum of + (number_to_sample / fraction_of_fails_allowed) reads + before stopping. + param: filter_parameters: a tuple (median_meandwell, mad_meandwell) which + determines the filter used. 
If None, then no filtering. + param: log : log object used to report if not enough chunks + passing the tests can be found. + param: chunk_log : ChunkLog object used to record rejected chunks + (accepted chunks will be recorded along with their + loss scores after the training step) + param: log_accepted_chunks : If this is True, then we log all chunks. + If it's false, then we log only rejected ones. + During training we use log_accepted_chunks=False + because the accepted chunks are logged later + when their loss has been calculated. + param: chunk_len_means_sequence_len : if this is False (the default) then + chunk_len determines the length in samples of the + chunk, and we use mapped_signal_files.get_chunk_with_sample_length(). + If this is True, then chunk_len determines the length + in bases of the sequence in the chunk, and we use + mapped_signal_files.get_chunk_with_sequence_length() + """ + nreads = len(read_data) + if number_to_sample is None or number_to_sample == 0: + number_to_sample_used = nreads + else: + number_to_sample_used = number_to_sample + maximum_attempts_allowed = int(number_to_sample_used / fraction_of_fails_allowed) + chunklist = [] + count_dict = defaultdict(lambda: 0) # Will contain counts of numbers of rejects and passes + attempts = 0 + while(count_dict['pass'] < number_to_sample_used and attempts < maximum_attempts_allowed): + attempts += 1 + read_number = np.random.randint(nreads) + read = read_data[read_number] + if chunk_len_means_sequence_len: + chunkdict = read.get_chunk_with_sequence_length(chunk_len) + else: + chunkdict = read.get_chunk_with_sample_length(chunk_len) + passfail_str = chunk_filter(chunkdict, args, filter_parameters) + count_dict[passfail_str] += 1 + if passfail_str == 'pass': + chunklist.append(chunkfunc(chunkdict)) + if log_accepted_chunks or passfail_str != 'pass': + if chunk_log is not None: + chunk_log.write_chunk(-1, chunkdict, passfail_str) + + if count_dict['pass'] < number_to_sample_used and log is not None: + log.write('* Warning: only {} chunks passed tests after {} attempts.\n'.format(count_dict['pass'], attempts)) + log.write('* Summary:') + for k, v in count_dict.items(): + log.write(' {}:{}'.format(k, v)) + log.write('\n') + + return chunklist, count_dict + + +def sample_filter_parameters(read_data, number_to_sample, chunk_len, args, + log=None, chunk_log=None, + chunk_len_means_sequence_len = False): + """Sample number_to_sample reads from read_data, calculate median and MAD + of mean dwell. Note the MAD has an adjustment factor so that it would give the + same result as the std for a normal distribution. + + See docstring for sample_chunks() for the parameters. + """ + meandwells, _ = sample_chunks(read_data, number_to_sample, chunk_len, args, get_mean_dwell, + log=log, chunk_log=chunk_log, log_accepted_chunks=True, + chunk_len_means_sequence_len=chunk_len_means_sequence_len) + return med_mad(meandwells) + + +def assemble_batch(read_data, batch_size, chunk_len, filter_parameters, args, log, + chunk_log=None, chunk_len_means_sequence_len=False): + """Assemble a batch of data by repeatedly choosing a random read and location + in that read, continuing until we have found batch_size chunks that pass the + tests. + + Returns tuple (chunklist, rejection_dict) + + where chunklist is a list of dicts, each with entries + (signal_chunk, sequence_chunk, start_sample, read_id). + signal_chunks and sequence_chunks are np arrays. 
+ and rejection_dict is a dictionary with keys describing the reasons for + rejection and values being the number rejected for that reason. E.g. + {'pass':3,'meandwell':3, 'maxdwell':4}. + + If we can't find enough chunks after the allowed number of attempts ,then + return the short batch, but output a message to the log. + + See docstring for sample_chunks for parameters. + """ + return sample_chunks(read_data, batch_size, chunk_len, chunkfunc=lambda x: x, + filter_parameters=filter_parameters, log=log, + chunk_log=chunk_log, args=args, + chunk_len_means_sequence_len=chunk_len_means_sequence_len) + + +class ChunkLog: + """Handles saving of chunk metadata to file""" + + def __init__(self, outputdirectory, outputfilename="chunklog.tsv"): + """Open and write header line""" + filepath = os.path.join(outputdirectory, outputfilename) + self.dumpfile = open(filepath, "w") + self.dumpfile.write( + "iteration\t read_id\t start_sample\t chunk_len_samples\t chunk_len_bases\t max_dwell\t status\t loss\n") + + def write_chunk(self, iteration, chunk_dict, status, lossvalue=None, loss_not_calculated=-1.0): + """Write a single line of data to the chunk log, using -1 to indicate missing data. + param iteration : the training iteration (measured in batches, or -1 if not used in training) + param chunk_dict : chunk dictionary + param status : string for reject/accept status (e.g. 'pass', 'meandwell') + param lossvalue : loss if available (not calculated for rejected chunks) + param loss_not_calculated : value to store in the log file in the loss column + for chunks where loss has not been calculated + """ + format_string = ("{}\t" * 6) + "{}\n" + if lossvalue is None: + lossvalue_written = loss_not_calculated + else: + lossvalue_written = lossvalue + if chunk_dict is None: + self.dumpfile.write(format_string.format(iteration, '--------', -1, -1, -1, status, lossvalue_written)) + else: + # Some elements of dict may be missing if chunk construction failed + self.dumpfile.write('{}\t{}\t'.format(iteration, chunk_dict['read_id'])) + if 'start_sample' in chunk_dict: + self.dumpfile.write('{}\t'.format(chunk_dict['start_sample'])) + else: + self.dumpfile.write('-1\t') + for k in ['current', 'sequence']: + if k in chunk_dict: + self.dumpfile.write('{}\t'.format(len(chunk_dict[k]))) + else: + self.dumpfile.write('-1\t') + if 'max_dwell' in chunk_dict: + self.dumpfile.write('{}\t'.format(chunk_dict['max_dwell'])) + else: + self.dumpfile.write('-1\t') + self.dumpfile.write("{}\t{}\n".format(status, lossvalue_written)) + + def write_batch(self, iteration, chunk_batch, lossvector): + """Write information about a single batch to the log. + All these chunks will have been accepted, so their status is 'pass'""" + for chunk_dict, lossvalue in zip(chunk_batch, lossvector): + self.write_chunk(iteration, chunk_dict, "pass", lossvalue) diff --git a/taiyaki/cmdargs.py b/taiyaki/cmdargs.py new file mode 100644 index 0000000..4497e29 --- /dev/null +++ b/taiyaki/cmdargs.py @@ -0,0 +1,352 @@ +import argparse +from collections import namedtuple +import multiprocessing +import numpy as np +import os +import re +import warnings + +"""ArgParse extensions. + +Contains many actions for parsing arguments into explicit types and +checking of values are within explicit sets. 
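+
+A minimal usage sketch (the option names below are illustrative):
+
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--limit', default=None, type=Maybe(Positive(int)),
+                        help='Limit number of reads to process, or None for no limit')
+    parser.add_argument('--overwrite', default=False, action=AutoBool,
+                        help='Whether to overwrite any output files')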
+ +""" + + +class ByteString(argparse.Action): + + def __call__(self, parser, namespace, values, option_string=None): + setattr(namespace, self.dest, values.encode('ascii')) + + +def checkProbabilities(probabilities): + try: + for probability in iter(probabilities): + assert 0.0 <= probability <= 1.0, 'Probability {} not in [0,1]'.format(probability) + except TypeError: + assert 0.0 <= probabilities <= 1.0, 'Probability not in [0,1]' + + +class display_version_and_exit(argparse.Action): + """Ronseal.""" + + def __init__(self, **kwdargs): + self.__version__ = kwdargs['metavar'] + super(display_version_and_exit, self).__init__(**kwdargs) + + def __call__(self, parser, namespace, values, option_string=None): + print(self.__version__) + exit(0) + + +class FileExists(argparse.Action): + """Check if the input file exist.""" + + def __call__(self, parser, namespace, values, option_string=None): + if not os.path.exists(values): + raise RuntimeError("File/path for '{}' does not exist, {}".format(self.dest, values)) + setattr(namespace, self.dest, values) + + +class FileExist(FileExists): + + def __init__(self, **kwdargs): + warnings.warn("FileExist is deprecated. Use FileExists instead.", DeprecationWarning) + super(FileExist, self).__init__(**kwdargs) + + +class FileAbsent(argparse.Action): + """Check that input file doesn't exist.""" + + def __call__(self, parser, namespace, values, option_string=None): + if os.path.exists(values): + raise RuntimeError("File/path for '{}' exists, {}".format(self.dest, values)) + setattr(namespace, self.dest, values) + + +class CheckCPU(argparse.Action): + """Make sure people do not overload the machine""" + + def __call__(self, parser, namespace, values, option_string=None): + num_cpu = multiprocessing.cpu_count() + if int(values) <= 0 or int(values) > num_cpu: + raise RuntimeError('Number of jobs can only be in the range of {} and {}'.format(1, num_cpu)) + setattr(namespace, self.dest, values) + + +class ParseToNamedTuple(argparse.Action): + """Parse to a namedtuple + """ + + def __init__(self, **kwdargs): + assert 'metavar' in kwdargs, "Argument 'metavar' must be defined" + assert 'type' in kwdargs, "Argument 'type' must be defined" + assert len(kwdargs['metavar']) == kwdargs['nargs'], 'Number of arguments and descriptions inconstistent' + assert len(kwdargs['type']) == kwdargs['nargs'], 'Number of arguments and types inconstistent' + self._types = kwdargs['type'] + kwdargs['type'] = str + self.Values = namedtuple('Values', ' '.join(kwdargs['metavar'])) + super(ParseToNamedTuple, self).__init__(**kwdargs) + self.default = self.Values(*self.default) if self.default is not None else None + + def __call__(self, parser, namespace, values, option_string=None): + value_dict = self.Values(*[f(v) for f, v in zip(self._types, values)]) + setattr(namespace, self.dest, value_dict) + + @staticmethod + def value_as_string(value): + return ' '.join(str(x) for x in value) + + +class NegBound(argparse.Action): + """Create a negative list bound suitable for trimming arrays.""" + + def __call__(self, parser, namespace, values, option_string=None): + if values == 0: + setattr(namespace, self.dest, None) + else: + try: + setattr(namespace, self.dest, -int(values)) + except: + raise ValueError('Illegal value for {} ({}), should be castable to int') + + +class ExpandRanges(argparse.Action): + """Translate a str like 1,2,3:5,40 to [1,2,3,4,5,40]""" + + def __call__(self, parser, namespace, values, option_string=None): + elts = [] + for item in values.replace(' ', '').split(','): + mo 
= re.search(r'(\d+):(\d+)', item) + if mo is not None: + rng = [int(x) for x in mo.groups()] + elts.extend(list(range(rng[0], rng[1] + 1))) + else: + elts.append(int(item)) + setattr(namespace, self.dest, elts) + + +class ChannelList(ExpandRanges): + + def __init__(self, **kwdargs): + warnings.warn("ChannelList is deprecated. Use ExpandRanges instead.", DeprecationWarning) + super(ChannelList, self).__init__(**kwdargs) + + +class AutoBool(argparse.Action): + + def __init__(self, option_strings, dest, default=None, required=False, help=None): + """Automagically create --foo / --no-foo argument pairs""" + + if default is None: + raise ValueError('You must provide a default with AutoBool action') + if len(option_strings) != 1: + raise ValueError('Only single argument is allowed with AutoBool action') + opt = option_strings[0] + if not opt.startswith('--'): + raise ValueError('AutoBool arguments must be prefixed with --') + + opt = opt[2:] + opts = ['--' + opt, '--no-' + opt] + if default: + default_opt = opts[0] + else: + default_opt = opts[1] + super(AutoBool, self).__init__(opts, dest, nargs=0, const=None, + default=default, required=required, + help='{} (Default: {})'.format(help, default_opt)) + + def __call__(self, parser, namespace, values, option_strings=None): + if option_strings.startswith('--no-'): + setattr(namespace, self.dest, False) + else: + setattr(namespace, self.dest, True) + + @staticmethod + def filter_option_strings(strings): + for s in strings: + s = s.strip('-') + if s[:3] != 'no-': + yield s + + +class Maybe(object): + """Create an argparse argument type that accepts either given type or 'None' + + :param mytype: Type function for type to accept, e.g. `int` or `float` + """ + + def __init__(self, mytype): + self.mytype = mytype + + def __repr__(self): + return "None or {}".format(self.mytype) + + def __call__(self, y): + try: + if y == 'None': + res = None + else: + res = self.mytype(y) + except: + raise argparse.ArgumentTypeError('Argument must be {}'.format(self)) + return res + + +def TypeOrNone(mytype): + warnings.warn("TypeOrNone is deprecated. Use Maybe instead.", DeprecationWarning) + return Maybe(mytype) + + +class Bounded(object): + """Create an argparse argument type that accepts values in [lower, upper] + + :param mytype: Type function for type to accept, e.g. `int` or `float` + """ + + def __init__(self, mytype, lower=None, upper=None): + self.mytype = mytype + + assert lower is not None or upper is not None + + if lower is not None and upper is not None: + assert lower <= upper + + self.lower = lower + self.upper = upper + + def __repr__(self): + if self.lower is not None and self.upper is not None: + return "{} in range [{}, {}]".format(self.mytype, self.lower, self.upper) + else: + if self.lower is not None: + return "{} in range [{}, inf]".format(self.mytype, self.lower) + else: + assert self.upper is not None + return "{} in range [-inf, {}]".format(self.mytype, self.upper) + + def __call__(self, y): + yt = self.mytype(y) + + if self.lower is not None and yt < self.lower: + raise argparse.ArgumentTypeError('Argument must be {}'.format(self)) + + if self.upper is not None and yt > self.upper: + raise argparse.ArgumentTypeError('Argument must be {}'.format(self)) + + return yt + + +def NonNegative(mytype): + """Create an argparse argument type that accepts only non-negative values + + :param mytype: Type function for type to accept, e.g. 
`int` or `float` + """ + return Bounded(mytype, lower=mytype(0)) + + +class Positive(object): + """Create an argparse argument type that accepts only positive values + + :param mytype: Type function for type to accept, e.g. `int` or `float` + """ + + def __init__(self, mytype): + self.mytype = mytype + + def __repr__(self): + return "positive {}".format(self.mytype) + + def __call__(self, y): + yt = self.mytype(y) + if yt <= 0: + raise argparse.ArgumentTypeError('Argument must be {}'.format(self)) + return yt + + +def proportion(p): + """Type function for proportion""" + return Bounded(float, 0.0, 1.0)(p) + + +def probability(p): + warnings.warn("probability is deprecated. Use proportion instead.", DeprecationWarning) + return proportion(p) + + +def Vector(mytype): + """Return an argparse.Action that will convert a list of values into a numpy + array of given type + """ + + class MyNumpyAction(argparse.Action): + """Parse a list of values into numpy array""" + + def __call__(self, parser, namespace, values, option_string=None): + try: + setattr(namespace, self.dest, np.array(values, dtype=mytype)) + except: + raise argparse.ArgumentTypeError('Cannot convert {} to array of {}'.format(values, mytype)) + + @staticmethod + def value_as_string(value): + return ' '.join(str(x) for x in value) + return MyNumpyAction + + +def str_to_numeric(x): + """Up-type a str to either int or float, or leave alone.""" + if not isinstance(x, str): + return x + try: + return int(x) + except: + try: + return float(x) + except: + return x + + +class DeviceAction(argparse.Action): + """Parse string specifying a device (either CPU or GPU) and return a normalised version + + Converts None to 'cpu' + Converts a string like '2' to int 2 + Converts a string like 'cuda2' to int 2 (for UGE compatibility) + All other inputs are left as they are + """ + + def __call__(self, parser, namespace, value, option_string=None): + setattr(namespace, self.dest, self._convert(value)) + + def _convert(self, value): + if value is None: + return 'cpu' + + # if value is (a string representation of) a positive integer, convert to int + int_match = re.match('[0-9]+', value) + if int_match: + return int(int_match.group()) + + # for UGE: convert string of form 'cudaN' to int N + uge_match = re.match('cuda(?P[0-9]+)', value) + if uge_match: + return int(uge_match.group('id')) + + # in all other cases, do nothing, and let torch.device decide + return value + + +str_to_type = { + 'None': None, + 'True': True, 'False': False, + 'true': True, 'false': False, + 'TRUE': True, 'FALSE': False +} + +bool_actions = { + AutoBool, + argparse._StoreTrueAction, + argparse._StoreFalseAction +} diff --git a/taiyaki/common_cmdargs.py b/taiyaki/common_cmdargs.py new file mode 100644 index 0000000..df66022 --- /dev/null +++ b/taiyaki/common_cmdargs.py @@ -0,0 +1,109 @@ +# Command-line args used in more than one script defined here + +from taiyaki.cmdargs import (AutoBool, DeviceAction, FileExists, Maybe, NonNegative, + ParseToNamedTuple, Positive, display_version_and_exit) +from taiyaki.version import __version__ + + +def add_common_command_args(parser, arglist): + """Given an argparse parser object and a list of keys such as + ['input_strand_list', 'jobs'], add these command line args + to the parser. + + Note that not all command line args used in the package are + included in this func: only those that are used by more than + one script and which have the same defaults. + + Also note that some args are positional and some are optional. 
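+
+    A minimal sketch (assumes argparse is imported; the key list is illustrative):
+
+        parser = argparse.ArgumentParser()
+        add_common_command_args(parser, ['device', 'jobs', 'limit', 'overwrite'])
+        args = parser.parse_args()
+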
+ The optional ones are listed first below.""" + + ############################################################################ + # + # Optional arguments + # + ############################################################################ + + if 'adam' in arglist: + parser.add_argument('--adam', nargs=3, metavar=('rate', 'decay1', 'decay2'), + default=(1e-3, 0.9, 0.999), type=(NonNegative(float), NonNegative(float), + NonNegative(float)), action=ParseToNamedTuple, + help='Parameters for Exponential Decay Adaptive Momementum') + + if 'chunk_logging_threshold' in arglist: + parser.add_argument('--chunk_logging_threshold', default=10.0, metavar='multiple', + type=NonNegative(float), + help='If loss > (threshold * smoothed loss) for a batch, then log chunks to ' + + 'output/chunklog.tsv. Set to zero to log all, including rejected chunks') + + if 'device' in arglist: + parser.add_argument('--device', default='cpu', action=DeviceAction, + help='Integer specifying which GPU to use, or "cpu" to use CPU only. ' + 'Other accepted formats: "cuda" (use default GPU), "cuda:2" ' + 'or "cuda2" (use GPU 2).') + + if 'filter_max_dwell' in arglist: + parser.add_argument('--filter_max_dwell', default=10.0, metavar='multiple', + type=Maybe(Positive(float)), + help='Drop chunks with max dwell more than multiple of median (over chunks)') + + if 'filter_mean_dwell' in arglist: + parser.add_argument('--filter_mean_dwell', default=3.0, metavar='radius', + type=Maybe(Positive(float)), + help='Drop chunks with mean dwell more than radius deviations from the median (over chunks)') + + if 'input_strand_list' in arglist: + parser.add_argument('--input_strand_list', default=None, action=FileExists, + help='Strand summary file containing subset') + + if 'jobs' in arglist: + parser.add_argument('--jobs', default=1, metavar='n', type=Positive(int), + help='Number of threads to use when processing data') + + if 'limit' in arglist: + parser.add_argument('--limit', default=None, type=Maybe(Positive(int)), + help='Limit number of reads to process') + + if 'lrdecay' in arglist: + parser.add_argument('--lrdecay', default=5000, metavar='n', type=Positive(float), + help='Learning rate for batch i is adam.rate / (1.0 + i / n)') + + if 'niteration' in arglist: + parser.add_argument('--niteration', metavar='batches', type=Positive(int), + default=50000, help='Maximum number of batches to train for') + + if 'overwrite' in arglist: + parser.add_argument('--overwrite', default=False, action=AutoBool, + help='Whether to overwrite any output files') + + if 'quiet' in arglist: + parser.add_argument('--quiet', default=False, action=AutoBool, + help="Don't print progress information to stdout") + + if 'sample_nreads_before_filtering' in arglist: + parser.add_argument('--sample_nreads_before_filtering', metavar='n', type=NonNegative(int), default=1000, + help='Sample n reads to decide on bounds for filtering before training. 
Set to 0 to do all.') + + if 'save_every' in arglist: + parser.add_argument('--save_every', metavar='x', type=Positive(int), default=5000, + help='Save model every x batches') + + if 'version' in arglist: + parser.add_argument('--version', nargs=0, action=display_version_and_exit, metavar=__version__, + help='Display version information.') + + if 'weight_decay' in arglist:parser.add_argument('--weight_decay', default=0.0, metavar='penalty', + type=NonNegative(float), + help='Adam weight decay (L2 normalisation penalty)') + + + + + ############################################################################ + # + # Positional arguments + # + ############################################################################ + + if 'input_folder' in arglist: + parser.add_argument('input_folder', action=FileExists, + help='Directory containing single-read fast5 files') diff --git a/taiyaki/config.py b/taiyaki/config.py new file mode 100644 index 0000000..a707c51 --- /dev/null +++ b/taiyaki/config.py @@ -0,0 +1,6 @@ +import numpy as np +import torch + +taiyaki_dtype = np.float32 +numpy_dtype = np.float32 +torch_dtype = torch.float32 diff --git a/taiyaki/ctc/__init__.py b/taiyaki/ctc/__init__.py new file mode 100644 index 0000000..55f2847 --- /dev/null +++ b/taiyaki/ctc/__init__.py @@ -0,0 +1 @@ +from .ctc import * diff --git a/taiyaki/ctc/c_crf_flipflop.c b/taiyaki/ctc/c_crf_flipflop.c new file mode 100644 index 0000000..386ea6b --- /dev/null +++ b/taiyaki/ctc/c_crf_flipflop.c @@ -0,0 +1,436 @@ +#include +#include +#include +#include +#include + +#include "c_crf_flipflop.h" + +#define _OFFSET_STAY 32 +#define _NSTATE 8 +#define _NBASE 4 +#define LARGE_VAL 1e30f + + +static inline float logsumexpf(float x, float y, float a){ + return fmaxf(x, y) + log1pf(expf(-a * fabsf(x-y))) / a; +} + +void crf_flipflop_forward_step(float const * logpost, float const * fwdprev, int32_t const * seq, + size_t nseqpos, float * fwdcurr, float sharpfact){ + assert(nseqpos > 0); + assert(NULL != logpost); + assert(NULL != fwdprev); + assert(NULL != seq); + assert(NULL != fwdcurr); + + + for(size_t pos=0 ; pos < nseqpos ; pos++){ + // Stay in current position + const size_t base = seq[pos]; + fwdcurr[pos] = (base < _NBASE) ? logpost[base * _NSTATE + base]: + logpost[_OFFSET_STAY + base]; + fwdcurr[pos] += fwdprev[pos]; + } + for(size_t pos=1 ; pos < nseqpos ; pos++){ + // Move to new position + const size_t base_to = seq[pos]; + const size_t base_from = seq[pos - 1]; + + assert(base_to != base_from); // Can't have repeated bases + assert(base_to < _NBASE || base_from + _NBASE == base_to); + const float score = (base_to < _NBASE) ? 
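/* moving into a flip state uses the 8-entry transition block for base_to; a flop
   state is only reachable from its own flip, scored in the _OFFSET_STAY block */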
logpost[base_to * _NSTATE + base_from] : + logpost[_OFFSET_STAY + base_from]; + fwdcurr[pos] = logsumexpf(fwdcurr[pos], fwdprev[pos - 1] + score, sharpfact); + } +} + + +float crf_flipflop_forward(float const * logpost, size_t nblk, size_t ldp, int32_t const * seq, + size_t nseqpos, float sharpfact, float * fwd){ + assert(nseqpos > 0); + assert(NULL != logpost); + assert(NULL != seq); + assert(NULL != fwd); + + // Point prior -- must start in stay at beginning of sequence + for(size_t pos=0 ; pos < nseqpos ; pos++){ + fwd[pos] = -LARGE_VAL; + } + fwd[0] = 0.0; + + for(size_t blk=0 ; blk < nblk ; blk++){ + float const * fwdprev = fwd + blk * nseqpos; + float * fwdcurr = fwd + (blk + 1) * nseqpos; + float const * logpostcurr = logpost + blk * ldp; + + crf_flipflop_forward_step(logpostcurr, fwdprev, seq, nseqpos, fwdcurr, sharpfact); + } + + // Final score is sum of final state + its stay + float score = fwd[nblk * nseqpos + nseqpos - 1]; + return score; +} + + +void crf_flipflop_backward_step(float const * logpost, float const * bwdprev, int32_t const * seq, + size_t nseqpos, float * bwdcurr, float sharpfact){ + assert(nseqpos > 0); + assert(NULL != logpost); + assert(NULL != bwdprev); + assert(NULL != seq); + assert(NULL != bwdcurr); + + + for(size_t pos=0 ; pos < nseqpos ; pos++){ + // Stay in current position + const size_t base = seq[pos]; + bwdcurr[pos] = (base < _NBASE) ? logpost[base * _NSTATE + base]: + logpost[_OFFSET_STAY + base]; + bwdcurr[pos] += bwdprev[pos]; + + } + for(size_t pos=1 ; pos < nseqpos ; pos++){ + // Move to new position + const size_t base_to = seq[pos]; + const size_t base_from = seq[pos - 1]; + + assert(base_to != base_from); // Can't have repeated bases + assert(base_to < _NBASE || base_from + _NBASE == base_to); + const float score = (base_to < _NBASE) ? 
logpost[base_to * _NSTATE + base_from] : + logpost[_OFFSET_STAY + base_from]; + bwdcurr[pos - 1] = logsumexpf(bwdcurr[pos - 1], bwdprev[pos] + score, sharpfact); + } +} + + +float crf_flipflop_backward(float const * logpost, size_t nblk, size_t ldp, int32_t const * seq, + size_t nseqpos, float sharpfact, float * bwd){ + assert(nseqpos > 0); + assert(NULL != logpost); + assert(NULL != seq); + assert(NULL != bwd); + + + // Point prior -- must have ended in either final stay or state + for(size_t pos=0 ; pos < nseqpos ; pos++){ + bwd[nblk * nseqpos + pos] = -LARGE_VAL; + } + // Final stay + bwd[nblk * nseqpos + nseqpos - 1] = 0.0; + + for(size_t blk=nblk ; blk > 0 ; blk--){ + float const * bwdprev = bwd + blk * nseqpos; + float * bwdcurr = bwd + (blk - 1) * nseqpos; + float const * logpostcurr = logpost + (blk - 1) * ldp; + + crf_flipflop_backward_step(logpostcurr, bwdprev, seq, nseqpos, bwdcurr, sharpfact); + } + + return bwd[0]; +} + + +void crf_flipflop_cost(float const * logprob, size_t nstate, size_t nblk , size_t nbatch, + int32_t const * seqs, int32_t const * seqlen, float sharpfact, float * score){ + size_t ldp = nbatch * nstate; + size_t seqidx[nbatch]; + seqidx[0] = 0; + for(size_t idx=1 ; idx < nbatch ; idx++){ + seqidx[idx] = seqidx[idx - 1] + seqlen[idx - 1]; + } + +#pragma omp parallel for + for(size_t batch=0 ; batch < nbatch ; batch++){ + if(0 == seqlen[batch]){ + score[batch] = 0.0; + continue; + } + + const size_t offset = batch * nstate; + float * fwd = calloc((1 + nblk) * seqlen[batch], sizeof(float)); + score[batch] = crf_flipflop_forward(logprob + offset, nblk, ldp, seqs + seqidx[batch], + seqlen[batch], sharpfact, fwd); + free(fwd); + } +} + + +void crf_flipflop_scores_fwd(float const * logprob, size_t nstate, size_t nblk , size_t nbatch, + int32_t const * seqs, int32_t const * seqlen, float sharpfact, + float * score){ + crf_flipflop_cost(logprob, nstate, nblk, nbatch, seqs, seqlen, sharpfact, score); +} + + +void crf_flipflop_scores_bwd(float const * logprob, size_t nstate, size_t nblk , size_t nbatch, + int32_t const * seqs, int32_t const * seqlen, float sharpfact, + float * score){ + size_t ldp = nbatch * nstate; + size_t seqidx[nbatch]; + seqidx[0] = 0; + for(size_t idx=1 ; idx < nbatch ; idx++){ + seqidx[idx] = seqidx[idx - 1] + seqlen[idx - 1]; + } + +#pragma omp parallel for + for(size_t batch=0 ; batch < nbatch ; batch++){ + if(0 == seqlen[batch]){ + score[batch] = 0.0; + continue; + } + const size_t offset = batch * nstate; + float * bwd = calloc((1 + nblk) * seqlen[batch], sizeof(float)); + score[batch] = crf_flipflop_backward(logprob + offset, nblk, ldp, seqs + seqidx[batch], + seqlen[batch], sharpfact, bwd); + free(bwd); + } +} + + +void crf_flipflop_grad_step(float const * fwdcurr, float const * bwdnext, float const * logprob, + int32_t const * seq, int32_t nseqpos, float * grad, size_t nstate, + float fact, float sharpfact){ + + // Make sure gradient calc is zero'd + memset(grad, 0, nstate * sizeof(float)); + + for(size_t pos=0 ; pos < nseqpos ; pos++){ + // State state + const size_t base = seq[pos]; + const size_t idx = (base < _NBASE) ? (base * _NSTATE + base) + : (_OFFSET_STAY + base); + grad[idx] += expf(sharpfact * (fwdcurr[pos] + bwdnext[pos] + logprob[idx] - fact)); + } + for(size_t pos=1 ; pos < nseqpos ; pos++){ + const size_t base_to = seq[pos]; + const size_t base_from = seq[pos - 1]; + const size_t idx = (base_to < _NBASE) ? 
(base_to * _NSTATE + base_from) + : (_OFFSET_STAY + base_from); + + assert(base_to != base_from); // Can't have repeated bases + assert(base_to < _NBASE || base_from + _NBASE == base_to); + grad[idx] += expf(sharpfact * (fwdcurr[pos - 1] + bwdnext[pos] + logprob[idx] - fact)); + } +} + + +void crf_flipflop_grad(float const * logprob, size_t nstate, size_t nblk , size_t nbatch, + int32_t const * seqs, int32_t const * seqlen, float sharpfact, + float * score, float * grad){ + const size_t ldp = nbatch * nstate; + + size_t seqidx[nbatch]; + seqidx[0] = 0; + for(size_t idx=1 ; idx < nbatch ; idx++){ + seqidx[idx] = seqidx[idx - 1] + seqlen[idx - 1]; + } + +#pragma omp parallel for + for(size_t batch=0 ; batch < nbatch ; batch++){ + const size_t batch_offset = batch * nstate; + if(0 == seqlen[batch]){ + for(size_t blk=0 ; blk < nblk ; blk++){ + memset(grad + batch_offset + blk * nbatch * nstate, 0, nstate * sizeof(float)); + } + continue; + } + const int32_t nseqpos = seqlen[batch]; + int32_t const * seq = seqs + seqidx[batch]; + float * fwd = calloc((nblk + 1) * nseqpos, sizeof(float)); + float * bwd = calloc((nblk + 1) * nseqpos, sizeof(float)); + score[batch] = crf_flipflop_forward(logprob + batch_offset, nblk, ldp, seq, nseqpos, sharpfact, fwd); + crf_flipflop_backward(logprob + batch_offset, nblk, ldp, seq, nseqpos, sharpfact, bwd); + + // Normalised transition matrix + for(size_t blk=0 ; blk < nblk ; blk++){ + float const * fwdcurr = fwd + blk * nseqpos; + float const * bwdcurr = bwd + blk * nseqpos; + float const * bwdnext = bwd + blk * nseqpos + nseqpos; + float const * logprobcurr = logprob + batch_offset + blk * nbatch * nstate; + float * gradcurr = grad + batch_offset + blk * nbatch * nstate; + + // Recalculate close to position to reduce numerical error + float fact = fwdcurr[0] + bwdcurr[0]; + for(size_t pos=1; pos < nseqpos ; pos++){ + fact = logsumexpf(fact, fwdcurr[pos] + bwdcurr[pos], sharpfact); + } + + crf_flipflop_grad_step(fwdcurr, bwdnext, logprobcurr, seq, nseqpos, gradcurr, nstate, fact, sharpfact); + } + + free(bwd); + free(fwd); + } +} + + + +#ifdef CRF_TWOSTATE_TEST + +const int32_t test_seq1[12] = {0, 1, 5, 1, 3, 2, + 0, 1, 5, 1, 3, 2}; + +const int32_t test_seqlen1[2] = {6, 6}; + +float test_logprob1[560] = { + // t = 0, blk = 0 -- stay in 0 + 0.7137395145, 0.0058640570, 0.0043273252, 0.0057024065, 0.0001304555, 0.0167860687, 0.0014591201, 0.0039324691, + 0.0117071924, 0.0045297625, 0.0105104226, 0.0018303745, 0.0004133878, 0.0121020079, 0.0179132788, 0.0008446391, + 0.0003954364, 0.0046109826, 0.0061280611, 0.0037487558, 0.0002867797, 0.0021094619, 0.0090478168, 0.0088021810, + 0.0166425156, 0.0008985700, 0.0030807985, 0.0150129722, 0.0033072104, 0.0225965258, 0.0017120223, 0.0080003635, + 0.0086164755, 0.0085638228, 0.0090326148, 0.0184277679, 0.0128914220, 0.0024000880, 0.0143853339, 0.0075095396, + // t = 0, blk = 1 + 0.7137395145, 0.0058640570, 0.0043273252, 0.0057024065, 0.0001304555, 0.0167860687, 0.0014591201, 0.0039324691, + 0.0117071924, 0.0045297625, 0.0105104226, 0.0018303745, 0.0004133878, 0.0121020079, 0.0179132788, 0.0008446391, + 0.0003954364, 0.0046109826, 0.0061280611, 0.0037487558, 0.0002867797, 0.0021094619, 0.0090478168, 0.0088021810, + 0.0166425156, 0.0008985700, 0.0030807985, 0.0150129722, 0.0033072104, 0.0225965258, 0.0017120223, 0.0080003635, + 0.0086164755, 0.0085638228, 0.0090326148, 0.0184277679, 0.0128914220, 0.0024000880, 0.0143853339, 0.0075095396, + + // t = 1, blk = 0 -- move 0 to 1 + 0.0138651518, 0.0068715546, 0.0137762669, 
0.0142378858, 0.0038887475, 0.0002837213, 0.0009213002, 0.0046096374, + 0.7005158726, 0.0041189393, 0.0057012358, 0.0196555714, 0.0034917922, 0.0031160895, 0.0027309383, 0.0068903076, + 0.0016565445, 0.0013069584, 0.0067694923, 0.0071836470, 0.0012639324, 0.0110877851, 0.0064367276, 0.0085079412, + 0.0003521574, 0.0035635810, 0.0043749238, 0.0027222466, 0.0139259729, 0.0152291942, 0.0044505049, 0.0039157630, + 0.0096219943, 0.0208794052, 0.0031593320, 0.0516253381, 0.0051720879, 0.0038762056, 0.0067444477, 0.0014988048, + // t = 1, blk = 1 + 0.0138651518, 0.0068715546, 0.0137762669, 0.0142378858, 0.0038887475, 0.0002837213, 0.0009213002, 0.0046096374, + 0.7005158726, 0.0041189393, 0.0057012358, 0.0196555714, 0.0034917922, 0.0031160895, 0.0027309383, 0.0068903076, + 0.0016565445, 0.0013069584, 0.0067694923, 0.0071836470, 0.0012639324, 0.0110877851, 0.0064367276, 0.0085079412, + 0.0003521574, 0.0035635810, 0.0043749238, 0.0027222466, 0.0139259729, 0.0152291942, 0.0044505049, 0.0039157630, + 0.0096219943, 0.0208794052, 0.0031593320, 0.0516253381, 0.0051720879, 0.0038762056, 0.0067444477, 0.0014988048, + + // t = 2, blk = 0 -- move 1 to 5 + 0.0104973116, 0.0278749046, 0.0016333734, 0.0132478834, 0.0108985734, 0.0326813004, 0.0104401808, 0.0281931252, + 0.0002602418, 0.0004849826, 0.0069461090, 0.0337142774, 0.0066522165, 0.0002687968, 0.0081917502, 0.0014596191, + 0.0033038509, 0.0071742025, 0.0079209436, 0.0027446117, 0.0001922884, 0.0002173728, 0.0022822792, 0.0063767010, + 0.0062269709, 0.0008360773, 0.0009815072, 0.0138239322, 0.0006819603, 0.0004184386, 0.0005169712, 0.0038701156, + 0.0018582183, 0.7184016070, 0.0038719050, 0.0057834926, 0.0016248741, 0.0121355831, 0.0023164603, 0.0029949899, + // t = 2, blk = 1 + 0.0104973116, 0.0278749046, 0.0016333734, 0.0132478834, 0.0108985734, 0.0326813004, 0.0104401808, 0.0281931252, + 0.0002602418, 0.0004849826, 0.0069461090, 0.0337142774, 0.0066522165, 0.0002687968, 0.0081917502, 0.0014596191, + 0.0033038509, 0.0071742025, 0.0079209436, 0.0027446117, 0.0001922884, 0.0002173728, 0.0022822792, 0.0063767010, + 0.0062269709, 0.0008360773, 0.0009815072, 0.0138239322, 0.0006819603, 0.0004184386, 0.0005169712, 0.0038701156, + 0.0018582183, 0.7184016070, 0.0038719050, 0.0057834926, 0.0016248741, 0.0121355831, 0.0023164603, 0.0029949899, + + // t = 3, blk = 0 -- stay in 5 + 0.0132238486, 0.0067462421, 0.0065735995, 0.0002313058, 0.0350482900, 0.0038167453, 0.0013436872, 0.0047910351, + 0.0005511208, 0.0152455357, 0.0002505248, 0.0009566527, 0.0016608534, 0.0036526310, 0.0038930839, 0.0102019269, + 0.0040538124, 0.0121608248, 0.0026858640, 0.0024698387, 0.0077258147, 0.0063036375, 0.0015254714, 0.0015248249, + 0.0008483379, 0.0194108435, 0.0065140833, 0.0189690442, 0.0005446999, 0.0072716624, 0.0002782992, 0.0124768655, + 0.0239038132, 0.0108786276, 0.0208670656, 0.0076679875, 0.0086667116, 0.7072362974, 0.0038886950, 0.0039397951, + // t = 3, blk = 1 + 0.0132238486, 0.0067462421, 0.0065735995, 0.0002313058, 0.0350482900, 0.0038167453, 0.0013436872, 0.0047910351, + 0.0005511208, 0.0152455357, 0.0002505248, 0.0009566527, 0.0016608534, 0.0036526310, 0.0038930839, 0.0102019269, + 0.0040538124, 0.0121608248, 0.0026858640, 0.0024698387, 0.0077258147, 0.0063036375, 0.0015254714, 0.0015248249, + 0.0008483379, 0.0194108435, 0.0065140833, 0.0189690442, 0.0005446999, 0.0072716624, 0.0002782992, 0.0124768655, + 0.0239038132, 0.0108786276, 0.0208670656, 0.0076679875, 0.0086667116, 0.7072362974, 0.0038886950, 0.0039397951, + + // t = 4, blk = 0 -- move 5 to 1 + 
0.0162499295, 0.0042696969, 0.0190051755, 0.0162959320, 0.0038385851, 0.0010900080, 0.0051636429, 0.0088802400, + 0.0035193397, 0.0100004109, 0.0182444400, 0.0002015949, 0.0051056114, 0.7237303612, 0.0135142243, 0.0065390854, + 0.0029951279, 0.0029123437, 0.0010848643, 0.0320041842, 0.0029855054, 0.0001557548, 0.0043323211, 0.0161734933, + 0.0051668898, 0.0007899601, 0.0024293827, 0.0107437912, 0.0005963283, 0.0004204642, 0.0008271684, 0.0036831630, + 0.0058302092, 0.0044612666, 0.0090699795, 0.0135366090, 0.0087714458, 0.0033968323, 0.0002088134, 0.0117758241, + // t = 4, blk = 1 + 0.0162499295, 0.0042696969, 0.0190051755, 0.0162959320, 0.0038385851, 0.0010900080, 0.0051636429, 0.0088802400, + 0.0035193397, 0.0100004109, 0.0182444400, 0.0002015949, 0.0051056114, 0.7237303612, 0.0135142243, 0.0065390854, + 0.0029951279, 0.0029123437, 0.0010848643, 0.0320041842, 0.0029855054, 0.0001557548, 0.0043323211, 0.0161734933, + 0.0051668898, 0.0007899601, 0.0024293827, 0.0107437912, 0.0005963283, 0.0004204642, 0.0008271684, 0.0036831630, + 0.0058302092, 0.0044612666, 0.0090699795, 0.0135366090, 0.0087714458, 0.0033968323, 0.0002088134, 0.0117758241, + + // t = 5, blk = 0 -- move 1 to 3 + 0.0054995373, 0.0003135968, 0.0036685129, 0.0239510419, 0.0039243790, 0.0019827996, 0.0129521071, 0.0066243852, + 0.0072536818, 0.0159209645, 0.0116239255, 0.0211135167, 0.0071678950, 0.0168522449, 0.0034948831, 0.0148879133, + 0.0084620257, 0.0075577618, 0.0042788046, 0.0007793942, 0.0038023124, 0.0116145280, 0.0025982395, 0.0022352670, + 0.0019744321, 0.7117781744, 0.0044554214, 0.0010030397, 0.0047838417, 0.0005540779, 0.0085588124, 0.0001078087, + 0.0019562465, 0.0097635189, 0.0012854310, 0.0076597643, 0.0032004197, 0.0354927128, 0.0017610103, 0.0071055704, + // t = 5, blk = 1 + 0.0054995373, 0.0003135968, 0.0036685129, 0.0239510419, 0.0039243790, 0.0019827996, 0.0129521071, 0.0066243852, + 0.0072536818, 0.0159209645, 0.0116239255, 0.0211135167, 0.0071678950, 0.0168522449, 0.0034948831, 0.0148879133, + 0.0084620257, 0.0075577618, 0.0042788046, 0.0007793942, 0.0038023124, 0.0116145280, 0.0025982395, 0.0022352670, + 0.0019744321, 0.7117781744, 0.0044554214, 0.0010030397, 0.0047838417, 0.0005540779, 0.0085588124, 0.0001078087, + 0.0019562465, 0.0097635189, 0.0012854310, 0.0076597643, 0.0032004197, 0.0354927128, 0.0017610103, 0.0071055704, + + // t = 6, blk = 0 -- move 3 to 2 + 0.0027753489, 0.0042800652, 0.0131082339, 0.0027542745, 0.0073969560, 0.0022332778, 0.0063905429, 0.0225312653, + 0.0083716146, 0.0018647020, 0.0080511935, 0.0062377027, 0.0096483698, 0.0050934491, 0.0002518356, 0.0089501860, + 0.0019424988, 0.0028867039, 0.0362414220, 0.7084635261, 0.0012042079, 0.0016243873, 0.0089677837, 0.0001407093, + 0.0007788545, 0.0061531496, 0.0116723082, 0.0160689361, 0.0045947877, 0.0025051798, 0.0016243552, 0.0025087153, + 0.0037103848, 0.0021407879, 0.0141961964, 0.0206362499, 0.0234809816, 0.0151728742, 0.0018537195, 0.0014922626, + // t = 6, blk = 1 + 0.0027753489, 0.0042800652, 0.0131082339, 0.0027542745, 0.0073969560, 0.0022332778, 0.0063905429, 0.0225312653, + 0.0083716146, 0.0018647020, 0.0080511935, 0.0062377027, 0.0096483698, 0.0050934491, 0.0002518356, 0.0089501860, + 0.0019424988, 0.0028867039, 0.0362414220, 0.7084635261, 0.0012042079, 0.0016243873, 0.0089677837, 0.0001407093, + 0.0007788545, 0.0061531496, 0.0116723082, 0.0160689361, 0.0045947877, 0.0025051798, 0.0016243552, 0.0025087153, + 0.0037103848, 0.0021407879, 0.0141961964, 0.0206362499, 0.0234809816, 0.0151728742, 0.0018537195, 0.0014922626 +}; 
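+/* The self-test below is only built when CRF_TWOSTATE_TEST is defined. It prints
+ * forward and backward scores for the two identical test batches and compares the
+ * analytic gradient from crf_flipflop_grad() against central finite differences of
+ * the forward score, (f(x + d) - f(x - d)) / (2 d). An illustrative build line
+ * (exact flags may differ) is:
+ *     gcc -DCRF_TWOSTATE_TEST -fopenmp -O2 c_crf_flipflop.c -lm -o crf_test
+ */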
+ + + +#include + +int main(int argc, char * argv[]){ + + const size_t nblk = 7; + const size_t nstate = 40; + const size_t nbatch = 2; + float score[2] = {0.0f}; + float score2[2] = {0.0f}; + const float DELTA = 1e-2f; + const float sharpfact = (argc > 1) ? atof(argv[1]) : 1.0f; + const size_t msize = nblk * nstate * nbatch; + + for(size_t i=0 ; i < msize ; i++){ + test_logprob1[i] = logf(test_logprob1[i]); + } + + // + // F / B calculations + // + crf_flipflop_scores_fwd(test_logprob1, nstate, nblk, nbatch, test_seq1, test_seqlen1, + sharpfact, score); + printf("Forwards scores: %f %f\n", score[0], score[1]); + + crf_flipflop_scores_bwd(test_logprob1, nstate, nblk, nbatch, test_seq1, test_seqlen1, + sharpfact, score); + printf("Backwards scores: %f %f\n", score[0], score[1]); + + float * grad = calloc(msize, sizeof(float)); + crf_flipflop_grad(test_logprob1, nstate, nblk, nbatch, test_seq1, test_seqlen1, sharpfact, score2, grad); + float maxdelta = 0.0; + for(size_t blk=0 ; blk < nblk ; blk++){ + const size_t offset = blk * nbatch * nstate; + for(size_t st=0 ; st < nstate ; st++){ + maxdelta = fmaxf(maxdelta, fabsf(grad[offset + st] - grad[offset + nstate + st])); + } + } + printf("Max grad delta = %f\n", maxdelta); + + printf("Derviatives:\n"); + float fscore[2] = {0.0f}; + for(size_t blk=0 ; blk < nblk ; blk++){ + printf(" Block %zu\n", blk); + const size_t offset = blk * nbatch * nstate; + for(size_t st=0 ; st < nstate ; st++){ + // Positive difference + const float orig = test_logprob1[offset + st]; + test_logprob1[offset + st] = orig + DELTA; + crf_flipflop_scores_fwd(test_logprob1, nstate, nblk, nbatch, test_seq1, test_seqlen1, + sharpfact, score); + fscore[0] = score[0]; + fscore[1] = score[1]; + // Negative difference + test_logprob1[offset + st] = orig - DELTA; + crf_flipflop_scores_fwd(test_logprob1, nstate, nblk, nbatch, test_seq1, test_seqlen1, + sharpfact, score); + fscore[0] = (fscore[0] - score[0]) / (2.0f * DELTA); + fscore[1] = (fscore[1] - score[1]) / (2.0f * DELTA); + // Report and reset + test_logprob1[offset + st] = orig; + printf(" %f d=%f r=%f [%f %f]\n", grad[offset + st], fabsf(grad[offset + st] - fscore[0]), grad[offset + st] / fscore[0], fscore[0], fscore[1]); + } + } + +} +#endif /* CRF_TWOSTATE_TEST */ diff --git a/taiyaki/ctc/c_crf_flipflop.h b/taiyaki/ctc/c_crf_flipflop.h new file mode 100644 index 0000000..1e26224 --- /dev/null +++ b/taiyaki/ctc/c_crf_flipflop.h @@ -0,0 +1,8 @@ +#include + +void crf_flipflop_grad(float const * logprob, size_t nstate, size_t nblk , size_t nbatch, + int32_t const * seqs, int32_t const * seqlen, float sharpfact, float * score, + float * grad); + +void crf_flipflop_cost(float const * logprob, size_t nstate, size_t nblk , size_t nbatch, + int32_t const * seqs, int32_t const * seqlen, float sharpfact, float * score); diff --git a/taiyaki/ctc/ctc.pyx b/taiyaki/ctc/ctc.pyx new file mode 100644 index 0000000..6d005ee --- /dev/null +++ b/taiyaki/ctc/ctc.pyx @@ -0,0 +1,71 @@ +cimport libctc +import cython +import numpy as np +cimport numpy as np + +import torch + + +@cython.boundscheck(False) +@cython.wraparound(False) +def crf_flipflop_cost(np.ndarray[np.float32_t, ndim=3, mode="c"] logprob, + np.ndarray[np.int32_t, ndim=1, mode="c"] seqs, + np.ndarray[np.int32_t, ndim=1, mode="c"] seqlen, + sharpfact): + """ + :param logprob: Tensor containing log probabilities + :param seqs: Vector containing flip-flop coded sequences (see flipflopfings.flip_flop_code()), concatenated + :param seqlen: Length of each sequence + """ + cdef size_t 
nblk, nbatch, nstate + nblk, nbatch, nstate = logprob.shape[0], logprob.shape[1], logprob.shape[2] + assert nstate == 40, "Number of states is {} not 40 as expected".format(nstate) + + cdef np.ndarray[np.float32_t, ndim=1, mode="c"] costs = np.zeros((nbatch,), dtype=np.float32) + libctc.crf_flipflop_cost(&logprob[0, 0, 0], nstate, nblk, nbatch, &seqs[0], + &seqlen[0], sharpfact, &costs[0]) + assert np.all(costs <= 0.), "Error -- costs must be negative, got {}".format(costs) + return -costs / nblk + + +@cython.boundscheck(False) +@cython.wraparound(False) +def crf_flipflop_grad(np.ndarray[np.float32_t, ndim=3, mode="c"] logprob, + np.ndarray[np.int32_t, ndim=1, mode="c"] seqs, + np.ndarray[np.int32_t, ndim=1, mode="c"] seqlen, + sharpfact): + """ + :param logprob: Tensor containing log probabilities + :param seqs: Vector containing flip-flop coded sequences (see flipflopfings.flip_flop_code()), concatenated + :param seqlen: Length of each sequence + """ + cdef size_t nblk, nbatch, nstate + nblk, nbatch, nstate = logprob.shape[0], logprob.shape[1], logprob.shape[2] + assert nstate == 40, "Number of states is {} not 40 as expected".format(nstate) + + cdef np.ndarray[np.float32_t, ndim=1, mode="c"] costs = np.zeros((nbatch,), dtype=np.float32) + cdef np.ndarray[np.float32_t, ndim=3, mode="c"] grads = np.zeros_like(logprob, dtype=np.float32) + libctc.crf_flipflop_grad(&logprob[0, 0, 0], nstate, nblk, nbatch, &seqs[0], + &seqlen[0], sharpfact, &costs[0], &grads[0, 0, 0]) + return -costs / nblk, -grads / nblk + + +class FlipFlopCRF(torch.autograd.Function): + @staticmethod + def forward(ctx, logprob, seqs, seqlen, sharpfact): + lp = logprob.detach().cpu().numpy().astype(np.float32) + seqs = seqs.cpu().numpy().astype(np.int32) + seqlen = seqlen.cpu().numpy().astype(np.int32) + sharpfact = float(sharpfact) + cost, grads = crf_flipflop_grad(lp, seqs, seqlen, sharpfact) + ctx.save_for_backward(torch.tensor(grads, device=logprob.device)) + return torch.tensor(cost, device=logprob.device) + + @staticmethod + def backward(ctx, output_grads): + grads, = ctx.saved_tensors + output_grads = output_grads.unsqueeze(1) + return grads * output_grads, None, None, None + + +crf_flipflop_loss = FlipFlopCRF.apply diff --git a/taiyaki/ctc/libctc.pxd b/taiyaki/ctc/libctc.pxd new file mode 100644 index 0000000..afc80e9 --- /dev/null +++ b/taiyaki/ctc/libctc.pxd @@ -0,0 +1,9 @@ +from libc.stdint cimport int32_t + +cdef extern from "c_crf_flipflop.h": + void crf_flipflop_grad(const float * logprob, size_t nstate, size_t nblk , size_t nbatch, + const int32_t * seqs, const int32_t * seqlen, float sharpfact, + float * score, float * grad); + + void crf_flipflop_cost(const float * logprob, size_t nstate, size_t nblk , size_t nbatch, + const int32_t * seqs, const int32_t * seqlen, float sharpfact, float * score); diff --git a/taiyaki/cupy_extensions/__init__.py b/taiyaki/cupy_extensions/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/taiyaki/cupy_extensions/flipflop.py b/taiyaki/cupy_extensions/flipflop.py new file mode 100644 index 0000000..5b2d7a6 --- /dev/null +++ b/taiyaki/cupy_extensions/flipflop.py @@ -0,0 +1,288 @@ +import cupy as cp +import numpy as np +import torch +from torch.autograd import Function + + +_flipflop_fwd = cp.RawKernel(r''' +extern "C" __global__ +void flipflop_fwd( + const float* all_scores, + float* all_fwd, + float* all_fact, + long long T, + long long N, + long long nbase +) { + // all_scores is a (T, N, S) tensor where S = 2 * nbase * (nbase + 1) + // all_fwd is the output 
tensor of shape (T + 1, N, 2 * nbase) + // all_fwd should be filled with zeros + // all_fact is a (T + 1, N) matrix of normalisation factors + // all_scores and all_fwd should both be contiguous + // a 1D grid parallelises over elements in the batch (N dimension) + // a 1D threadpool parallelises over the fwd calculations for each base + // should be launched with blockDim = (N, 1, 1) and threadDim = (nbase, 1, 1) + + int S = 2 * nbase * (nbase + 1); + + const float* scores = all_scores + S * blockIdx.x; + int scores_stride = S * N; + + float* fwd = all_fwd + 2 * nbase * blockIdx.x; + int fwd_stride = 2 * nbase * N; + + float* fact = all_fact + blockIdx.x; + int fact_stride = N; + + int to_base = threadIdx.x; + + // t = 0 + fwd[to_base] = -log(2.0 * nbase); + fwd[to_base + nbase] = -log(2.0 * nbase); + fwd += fwd_stride; + fact[0] = log(2.0 * nbase); + fact += N; + __syncthreads(); + + float u, v; + for (int t = 0; t < T; t++) { + // to flip + u = fwd[-fwd_stride] + scores[2 * nbase * to_base]; + for (int from_base = 1; from_base < 2 * nbase; from_base++) { + v = fwd[from_base - fwd_stride] + scores[2 * nbase * to_base + from_base]; + u = max(u, v) + log1p(exp(-abs(u - v))); + } + fwd[to_base] = u; + + // to flop + u = fwd[to_base - fwd_stride] + scores[2 * nbase * nbase + to_base]; + v = fwd[to_base + nbase - fwd_stride] + scores[(2 * nbase + 1) * nbase + to_base]; + fwd[to_base + nbase] = max(u, v) + log1p(exp(-abs(u - v))); + __syncthreads(); + + // calculate normalisation factor + if (to_base == 0) { + u = fwd[0]; + for (int from_base = 1; from_base < 2 * nbase; from_base++) { + v = fwd[from_base]; + u = max(u, v) + log1p(exp(-abs(u - v))); + } + fact[0] = u; + } + __syncthreads(); + + // normalise + fwd[to_base] -= fact[0]; + fwd[to_base + nbase] -= fact[0]; + __syncthreads(); + + //if (to_base == 0) { + //fact[0] += fact[-N]; + //} + + scores += scores_stride; + fwd += fwd_stride; + fact += fact_stride; + } +} +''', 'flipflop_fwd') + + +def flipflop_fwd(scores): + index = scores.device.index + T, N, S = scores.shape + nbase = int(np.sqrt(S / 2)) + + fwd = torch.zeros((T + 1, N, 2 * nbase), dtype=scores.dtype, device=scores.device) + fact = torch.zeros((T + 1, N, 1), dtype=scores.dtype, device=scores.device) + with cp.cuda.Device(index): + _flipflop_fwd(grid=(N, 1, 1), block=(nbase, 1, 1), args=( + scores.data_ptr(), fwd.data_ptr(), fact.data_ptr(), T, N, nbase)) + return fwd, fact + + +_flipflop_bwd = cp.RawKernel(r''' +extern "C" __global__ +void flipflop_bwd( + const float* all_scores, + float* all_bwd, + float* all_fact, + long long T, + long long N, + long long nbase +) { + // all_scores is a (T, N, S) tensor where S = 2 * nbase * (nbase + 1) + // all_bwd is the output tensor of shape (T + 1, N, 2 * nbase) + // all_bwd should be filled with zeros + // all_fact is a (T + 1, N) matrix of normalisation factors + // all_scores and all_bwd should both be contiguous + // a 1D grid parallelises over elements in the batch (N dimension) + // a 1D threadpool parallelises over the bwd calculations for each state + // should be launched with blockDim = (N, 1, 1) and threadDim = (2 * nbase, 1, 1) + + int S = 2 * nbase * (nbase + 1); + + const float* scores = all_scores + S * ((T - 1) * N + blockIdx.x); + int scores_stride = S * N; + + float* bwd = all_bwd + 2 * nbase * (T * N + blockIdx.x); + int bwd_stride = 2 * nbase * N; + + float* fact = all_fact + T * N + blockIdx.x; + int fact_stride = N; + + int from_base = threadIdx.x; + int to_base; + + // t = T + bwd[from_base] = -log(2.0 * nbase); 
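+    // this kernel runs one thread per state (2 * nbase threads), so a single write
+    // initialises every state; the forward kernel uses nbase threads and writes two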
+ bwd -= bwd_stride; + fact[0] = log(2.0 * nbase); + fact -= N; + __syncthreads(); + + float u, v; + for (int t = 0; t < T; t++) { + // to flip + u = bwd[bwd_stride] + scores[from_base]; + for (int to_base = 1; to_base < nbase; to_base++) { + v = bwd[to_base + bwd_stride] + scores[2 * nbase * to_base + from_base]; + u = max(u, v) + log1p(exp(-abs(u - v))); + } + // to flop + to_base = (from_base < nbase) ? from_base + nbase : from_base; + v = bwd[to_base + bwd_stride] + scores[2 * nbase * nbase + from_base]; + u = max(u, v) + log1p(exp(-abs(u - v))); + + bwd[from_base] = u; + + // calculate normalisation factor + if (from_base == 0) { + u = bwd[0]; + for (int to_base = 1; to_base < 2 * nbase; to_base++) { + v = bwd[to_base]; + u = max(u, v) + log1p(exp(-abs(u - v))); + } + fact[0] = u; + } + __syncthreads(); + + // normalise + bwd[from_base] -= fact[0]; + __syncthreads(); + + //if (from_base == 0) { + //fact[0] += fact[N]; + //} + + scores -= scores_stride; + bwd -= bwd_stride; + fact -= fact_stride; + } +} +''', 'flipflop_bwd') + + +def flipflop_bwd(scores): + index = scores.device.index + T, N, S = scores.shape + nbase = int(np.sqrt(S / 2)) + + bwd = torch.zeros((T + 1, N, 2 * nbase), dtype=scores.dtype, device=scores.device) + fact = torch.zeros((T + 1, N, 1), dtype=scores.dtype, device=scores.device) + with cp.cuda.Device(index): + _flipflop_bwd(grid=(N, 1, 1), block=(2 * nbase, 1, 1), + args=(scores.data_ptr(), bwd.data_ptr(), fact.data_ptr(), T, N, nbase)) + return bwd, fact + + +_flipflop_make_trans = cp.RawKernel(r''' +extern "C" __global__ +void flipflop_make_trans( + const float* scores, + const float* fwd, + const float* bwd, + float* trans, + long long T, + long long N, + long long nbase +) { + // scores is a (T, N, S) tensor where S = 2 * nbase * (nbase + 1) + // fwd is (T + 1, N, 2 * nbase) matrix of forward scores + // bwd is (T + 1, N, 2 * nbase) matrix of backward scores + // trans is of the same shape as scores + // trans should be filled with zeros + // all tensors should be contiguous + // a 1D grid parallelises over elements in the batch (N dimension) + // a 1D threadpool parallelises over the calculations for each base + // should be launched with blockDim = (N, 1, 1) and threadDim = (2 * nbase, 1, 1) + int S = 2 * nbase * (nbase + 1); + + int scores_offset = S * blockIdx.x; + int scores_stride = S * N; + int fwd_offset = 2 * nbase * blockIdx.x; + int fwd_stride = 2 * nbase * N; + + int from_base = threadIdx.x; + int to_base; + + float f, s, b; + for (int t = 0; t < T; t++) { + f = fwd[fwd_offset + from_base]; + for (int to_base = 0; to_base < nbase; to_base++) { + b = bwd[fwd_offset + fwd_stride + to_base]; + s = scores[scores_offset + from_base + 2 * nbase * to_base]; + trans[scores_offset + from_base + 2 * nbase * to_base] = f + s + b; + } + to_base = (from_base < nbase) ? 
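/* flop term: a flip state moves to its flop partner, a flop state remains in place */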
from_base + nbase : from_base; + b = bwd[fwd_offset + fwd_stride + to_base]; + s = scores[scores_offset + 2 * nbase * nbase + from_base]; + trans[scores_offset + 2 * nbase * nbase + from_base] = f + s + b; + scores_offset += scores_stride; + fwd_offset += fwd_stride; + __syncthreads(); + } +} +''', 'flipflop_make_trans') + + +def flipflop_make_trans(scores): + index = scores.device.index + T, N, S = scores.shape + nbase = int(np.sqrt(S / 2)) + fwd, fwd_fact = flipflop_fwd(scores) + bwd, bwd_fact = flipflop_bwd(scores) + trans = torch.zeros_like(scores) + kernel_args = ( + scores.data_ptr(), + fwd.data_ptr(), + bwd.data_ptr(), + trans.data_ptr(), + T, N, nbase, + ) + with cp.cuda.Device(index): + _flipflop_make_trans(grid=(N,), block=(2 * nbase,), args=kernel_args) + return trans, fwd_fact, bwd_fact + + +class LogZ(Function): + + @staticmethod + def forward(ctx, scores): + T, N, S = scores.shape + trans, fwd_fact, bwd_fact = flipflop_make_trans(scores) + ctx.save_for_backward(trans) + return bwd_fact.sum(0)[:, 0] + + @staticmethod + def backward(ctx, g): + trans, = ctx.saved_tensors + return trans.softmax(2) * g[:, None] + + +def logz(scores): + return LogZ.apply(scores) + + +def global_norm(scores): + return scores - logz(scores)[:, None] / len(scores) diff --git a/taiyaki/fast5utils.py b/taiyaki/fast5utils.py new file mode 100644 index 0000000..43ea7c2 --- /dev/null +++ b/taiyaki/fast5utils.py @@ -0,0 +1,222 @@ +# Utilities to read and write information from HDF5 files, +# including ONT fast5 files. +# ONT fast5 access is built on top of the ont_fast5_api +import os +import sys +import ont_fast5_api.conversion_tools.conversion_utils +import ont_fast5_api.fast5_interface +from taiyaki.fileio import readtsv + +########################################################## +# +# +# FUNCTIONS TO ITERATE OVER READS IN FAST5 FILES +# +# +########################################################## + + +def iterate_file_read_pairs(filepaths, read_ids, limit=None, verbose=0): + """Iterate over pairs of (filepath, read_id), + yielding a tuple (filepath, read_id) at each step. + Yield a maximum of limit tuples in total.""" + nyielded = 0 + for filepath, read_id in zip(filepaths, read_ids): + if not os.path.exists(filepath): + sys.stderr.write('File {} does not exist, skipping\n'.format(filepath)) + continue + with ont_fast5_api.fast5_interface.get_fast5_file(filepath, 'r') as f5file: + if read_id not in f5file.get_read_ids(): + continue + if verbose > 0: + print("Reading", read_id, "from", filepath) + yield filepath, read_id + nyielded += 1 + if limit is not None and nyielded >= limit: + return # ends iterator + return + + +def iterate_files_reads_unpaired(filepaths, read_ids, limit=None, verbose=0): + """Iterate over lists of filepaths and read_ids, looking in all the files + given and returning only those read_ids in the read_ids list. + read_ids may be None: in that case get all the reads in the files. + + yields a tuple (filepath, read_id) at each step. 
+ yields a maximum of limit tuples in total.""" + nyielded = 0 + for filepath in filepaths: + if not os.path.exists(filepath): + sys.stderr.write('File {} does not exist, skipping\n'.format(filepath)) + continue + with ont_fast5_api.fast5_interface.get_fast5_file(filepath, 'r') as f5file: + for read_id in f5file.get_read_ids(): + if read_ids is None or read_id in read_ids: + if verbose > 0: + print("Reading", read_id, "from", filepath) + yield filepath, read_id + nyielded += 1 + else: + if verbose > 0: + print("Skipping", read_id, "from", filepath, ":not in read_id list") + if limit is not None and nyielded >= limit: + return # ends iterator + + +def iterate_fast5_reads(path, + strand_list=None, limit=None, verbose=0): + """Return iterator yielding reads in a directory of fast5 files or a single fast5 file. + + Each read is specified by a tuple (filepath, read_id) + Files may be single or multi-read fast5s + + You may say, "why not yield an ont_fast_api object instead of this nasty tuple?" + I would then say. "yes, I did try that, but it led to unfathomable nastiness when + I fed these objects in as arguments to multiple processes." + + If strand_list is given, then only return the reads spcified, according to + the following rules: + + (A) If the strand list file has a column 'read_id' and no column 'filename' or 'filename_fast5' + then look through all fast5 files in the path and return all reads with read_ids + in that column. + (B) If the strand list file has a column 'filename' or 'filename_fast5' and no column 'read_id' + then look through all filenames specified and return all reads in them. + (C) If the strand list has a column 'filename' or 'filename_fast5' _and_ a column 'read_id' + then loop through the rows in the strand list, returning the appropriate tuple + for each row. We check that each file exists and contains the read_id. + + :param path: Directory ( or filename for a single file) + :param strand_list: Path to file containing list of files and/or read ids to iterate over. + :param limit: Limit number of reads to consider + :param verbose : an integer. verbose=0 prints no progress messages, verbose=1 + prints a message for every file read. Verbose =2 prints the + list of files before starting as well. + + Example usage: + + read_iterator = iterate_fast5_reads('directory') + for read_tuple in read_iterator: + fname,read_id = read_tuple + print("Filename=",fname,", read id = ",read_id) + with fast5_interface.get_fast5_file(fname, 'r') as f5file: + read = f5file.get_read(read_id) + dacs = read.get_raw_data() + print("Length of rawget_file_names data:",len(dacs)) + """ + filepaths, read_ids = None, None + + if strand_list is not None: + strand_table = readtsv(strand_list) + if verbose >= 2: + print("Columns in strand list file:") + print(strand_table.dtype.names) + if 'filename' in strand_table.dtype.names: + filepaths = strand_table['filename'] + elif 'filename_fast5' in strand_table.dtype.names: + filepaths = strand_table['filename_fast5'] + if 'read_id' in strand_table.dtype.names: + read_ids = strand_table['read_id'] + # The strand list supplies filenames, not paths, so we supply the rest + if filepaths is not None: + filepaths = [os.path.join(path, x) for x in filepaths] + + if (filepaths is not None) and (read_ids is not None): + # This is the case (C) above. 
Both filenames and read_ids come from the strandlist + # and we therefore know which read_id goes with which file + for y in iterate_file_read_pairs(filepaths, read_ids, limit, verbose): + yield y + return + + if filepaths is None: + # Filenames not supplied by strand list, so we get them from the path + if os.path.isdir(path): + filepaths = ont_fast5_api.conversion_tools.conversion_utils.get_fast5_file_list(path, recursive=False) + else: + filepaths = [path] + + for y in iterate_files_reads_unpaired(filepaths, read_ids, limit, verbose): + yield y + + +########################################################## +# +# +# FUNCTIONS TO READ INFORMATION FROM ONT FAST5 FILES +# +# +########################################################## +# +# These functions start with a read object generated by +# the ONT fast5 api. For example +# +# SINGLE READ +# +# from ont_fast5_api import fast5_interface +# s5 = ont_fast5_api.fast5_interface.get_fast5_file(singleReadFile, 'r') +# read_id = s5.get_read_ids()[0] +# read = s5.get_read(read_id) +# read_summary(read) +# +# MULTI-READ +# +# m5 = ont_fast5_api.fast5_interface.get_fast5_file(multiReadFile, 'r') +# for nread,read_id in enumerate(m5.get_read_ids()): +# read = m5.get_read(read_id) +# read_summary(read) + + +def get_filename(read): + """Get filename""" + return read.handle[read.global_key + 'context_tags'].attrs['filename'] + + +def get_channel_info(read): + """Get channel info for read. This is a dict including + digitisation, range, offset, sampling_rate. + + param read: an ont_fast5_api read object + + returns : dict-like object containing channel info + """ + # This is how it is done in _load_raw() in AbstractFast5File in ont_fast5_api.fast5_file.py + return read.handle[read.global_key + 'channel_id'].attrs + + +def get_read_attributes(read): + """Get read attributes for read. This is a dict including + start_time, read_id, duration, etc + + param read: an ont_fast5_api read object + + returns : dict-like object containing attributes + """ + # In a multi-read file, they should be here... + r = read.handle['Raw'].attrs + if len(r) > 0: + return r + # In a single-read file, they are here... 
+ # We want the highest numbered read (latest) + # in the tree 'Raw/Reads/Read_XXXX' + # where XXXX is a number like 0021 or 0001 + numbered_reads = list(read.handle['Raw/Reads'].keys()) + last_numbered_read = sorted(numbered_reads)[-1] + return read.handle['Raw/Reads/' + last_numbered_read].attrs + + +def read_summary(read): + """Print summary of information available in fast5 file on a particular read + + param read: an ont_fast5_api read object + """ + print("ONT interface: read information") + dacs = read.get_raw_data() + channel_info = get_channel_info(read) + read_attributes = get_read_attributes(read) + print(" signal data =", dacs[:10], '...') + print(" signal metadata: channel info") + for k, v in channel_info.items(): + print(" ", k, v) + print(" signal metadata: read attributes") + for k, v in read_attributes.items(): + print(" ", k, v) diff --git a/taiyaki/fileio.py b/taiyaki/fileio.py new file mode 100644 index 0000000..8623815 --- /dev/null +++ b/taiyaki/fileio.py @@ -0,0 +1,150 @@ +from copy import deepcopy +from itertools import islice +import numpy as np +import os +import warnings + +from gzip import open as gzopen +from bz2 import BZ2File as bzopen + +from taiyaki.iterators import empty_iterator + + +_fval = {k: k for k in ['i', 'f', 'd', 's']} +_fval['b'] = 'i' + + +def _numpyfmt(a): + """Return a list of formats with which to output a numpy array + + :param a: :class:`ndrecarray` + """ + fmt = (np.dtype(s[1]).kind.lower() for s in a.dtype.descr) + return ['%' + _fval.get(f, f) for f in fmt] + + +def file_has_fields(fname, fields=None): + """Check that a tsv file has given fields + + :param fname: filename to read. If the filename extension is + gz or bz2, the file is first decompressed. + :param fields: list of required fields. + + :returns: boolean + """ + + # Allow a quick return + req_fields = deepcopy(fields) + if isinstance(req_fields, str): + req_fields = [fields] + if req_fields is None or len(req_fields) == 0: + return True + req_fields = set(req_fields) + + inspector = open + ext = os.path.splitext(fname)[1] + if ext == '.gz': + inspector = gzopen + elif ext == '.bz2': + inspector = bzopen + + has_fields = None + with inspector(fname, 'r') as fh: + present_fields = set(fh.readline().rstrip('\n').split('\t')) + has_fields = req_fields.issubset(present_fields) + return has_fields + + +def read_chunks(fname, n_lines, n_chunks=None, header=True): + """Yield successive chunks of a file + + :param fname: file to read + :param n_lines: number of lines per chunk + :param n_chunks: number of chunks to read + :param header: if True one line is added to first chunk + """ + with open(fname) as fh: + first = True + yielded = 0 + while True: + n = n_lines + if first and header: + n += 1 + sl = islice(fh, n) + is_empty, sl = empty_iterator(sl) + if is_empty: + break + else: + yield sl + yielded += 1 + if n_chunks is not None and yielded == n_chunks: + break + + +def take_a_peak(fname, n_lines=4): + """Read the head of a file + + :param fname: file to read + :param n_lines: number of lines to read + """ + with open(fname, 'r') as fh: + for l in islice(fh, n_lines): + yield l + + +def savetsv(fname, X, header=True): + """Save a structured array to a .tsv file + + :param fname: filename or file handle + If the filename ends in ``.gz``, the file is automatically saved in + compressed gzip format. `loadtxt` understands gzipped files + transparently. + :param X: array_like, Data to be saved to a text file. 
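+    :param header: if True, write a tab-separated header line of X's field names
+
+    Example (illustrative): savetsv('out.tsv', table) writes the structured array
+    `table` as tab-separated rows, preceded by a header line of field names.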
+ """ + if header: + header = '\t'.join(X.dtype.names) + else: + header = '' + fmt = '\t'.join(_numpyfmt(X)) + np.savetxt(fname, X, fmt=fmt, header=header, comments='', delimiter='\t') + + +def readtsv(fname, fields=None, **kwargs): + """Read a tsv file into a numpy array with required field checking + + :param fname: filename to read. If the filename extension is + gz or bz2, the file is first decompressed. + :param fields: list of required fields. + """ + + if not file_has_fields(fname, fields): + raise KeyError('File {} does not contain requested required fields {}'.format(fname, fields)) + + for k in ['names', 'delimiter', 'dtype']: + kwargs.pop(k, None) + table = np.genfromtxt(fname, names=True, delimiter='\t', dtype=None, encoding=None, **kwargs) + # Numpy tricks to force single element to be array of one row + return table.reshape(-1) + + +def readchunkedtsv(fname, chunk_size=100, **kwargs): + """Read chunks of a .tsv file at a time. + + :param fname: file to read + :param chunk_size: length of resultant chunks + :param **kwargs: kwargs of np.genfromtxt + """ + for k in ['names', 'delimiter', 'dtype']: + kwargs.pop(k, None) + + prototype = readtsv(take_a_peak(fname, chunk_size)) + dtype = prototype.dtype + + with warnings.catch_warnings(): + warnings.filterwarnings('error') + for i, chunk in enumerate(read_chunks(fname, chunk_size)): + names = True if i == 0 else None + try: + yield np.genfromtxt(chunk, names=names, delimiter='\t', dtype=dtype, **kwargs) + except: + break diff --git a/taiyaki/flipflop_remap.py b/taiyaki/flipflop_remap.py new file mode 100644 index 0000000..5a40583 --- /dev/null +++ b/taiyaki/flipflop_remap.py @@ -0,0 +1,122 @@ +import numpy as np +from taiyaki import flipflopfings + + +_LARGE_VAL = 1e30 + + +def map_to_crf_viterbi(scores, step_index, stay_index, localpen=_LARGE_VAL): + """Find highest scoring path corresponding to a given label sequence + + :param scores: a 2D array of CRF transition scores (log-space) + :param step_index: index of scores to use to step to the next sequence position, + corresponding to diagonal moves in the alignment matrix, e.g. for a flipflop + CRF and the sequence ATTC the step_index would correspond to the moves + flipA->flipT, flipT->flopT, flopT->flipC + :param stay_index: index of scores to use to stay at the same sequence position, + which should be length 1 longer than the step_index (we assume start at position 0) + e.g. 
in a flipflop CRF these would be the flipN->flipN and flopN->flopN states + :param localpen: score for skipping over signal at the start or end of the alignment + + :returns: score of best path, best path + """ + N, M = len(scores), len(stay_index) + assert len(step_index) == len(stay_index) - 1 + + pscore = np.full(M, -_LARGE_VAL) + cscore = np.full(M, -_LARGE_VAL) + cscore[0] = 0 + + start_score = 0.0 + end_score = -_LARGE_VAL + alignment_end = 0 + + traceback = np.zeros((N + 1, M), dtype='i1') + + for n in range(N): + step_scores = scores[n, step_index] + stay_scores = scores[n, stay_index] + + pscore, cscore = cscore, pscore + + # stay + cstay = pscore + stay_scores + + # step + cstep = pscore[:-1] + step_scores + + # start + leave_start_score = start_score - localpen + start_score = start_score + max(stay_scores[0], -localpen) + + # update cscore + cscore[:] = cstay[:] + cscore[1:] = np.maximum(cscore[1:], cstep) + cscore[0] = max(cscore[0], start_score) + traceback[n + 1, 1:] = cstay[1:] < cstep + traceback[n + 1, 0] = 1 if leave_start_score > cstay[0] else 0 + + # end + remain_in_end_score = end_score + max(stay_scores[-1], -localpen) + step_into_end_score = pscore[-1] - localpen + end_score = max(remain_in_end_score, step_into_end_score) + if step_into_end_score > remain_in_end_score: + alignment_end = n + + path = np.full(N + 1, -1, dtype=int) + if cscore[-1] > end_score: + # traceback starts at end of sequence + n, m = N, M - 1 + else: + # traceback starts in "end" state + n, m = alignment_end, M - 1 + + while n >= 0 and m >= 0: + path[n] = m + move = traceback[n, m] + m -= move + n -= 1 + + return max(cscore[-1], end_score), path + + +def flipflop_remap(transition_scores, sequence, alphabet="ACGT", localpen=_LARGE_VAL): + """Finds the best alignment between a matrix of flipflip transition scores and a sequence + + Returns the score calculated for the best path, and an array of sequence positions + that correspond to that path. The positions array has length 1 more than the scores + matrix; this is because the scores matrix contains scores for transitions that will + either move us to the next position, or stay at the same position. + + The entire sequence must be used in the alignment, but the scores might be clipped, + depending on the value of localpen. This is acheived by introducing "start" and "end" + states. The alignment must start in the "start" state, move out of "start" into the + first position in the sequence, traverse the entire sequence, and then enter the "end" + state. The alignment can stay in the "start" or "end" states by paying a cost of + localpen while ignoring the next row of transition scores. Therefore, a large value of + localpen will force the entire scores matrix to be used in the alignment ("global mapping"), + while smaller values will lead to more clipping ("glocal mapping"). The time spent in + the "start" and "end" states will be marked with -1s. + + The output positions array will have 3 sections: + 1. zero or more -1s for time spent in the "start" state + 2. a monotonic sequence of positions starting with 0 and ending with len(sequence) - 1 + 3. 
zero or more -1s for time spend in the "end" state + + :param scores: an array of network outputs of shape (T, K) where K = 2 * nbase * (nbase + 1) + :param sequence: reference sequence to map to + :param alphabet: alphabet of length nbase from which the sequence in drawn + :param localpen: score for staying in the start or end states + + :returns: alignment score, array of sequence positions of length T + 1 + """ + nbase = len(alphabet) + bases = np.array([alphabet.find(b) for b in sequence]) + flops = flipflopfings.flopmask(bases) + + stay_index = np.where(flops, bases + (2 * nbase + 1) * nbase, bases + 2 * nbase * bases) + from_base = (bases + flops * nbase)[:-1] + to_base = np.maximum(bases, nbase * flops)[1:] + step_index = from_base + 2 * nbase * to_base + + return map_to_crf_viterbi(transition_scores, step_index, stay_index, localpen=localpen) diff --git a/taiyaki/flipflopfings.py b/taiyaki/flipflopfings.py new file mode 100644 index 0000000..33a9267 --- /dev/null +++ b/taiyaki/flipflopfings.py @@ -0,0 +1,40 @@ +# Utilities for flip-flop coding +import numpy as np + + +def flopmask(labels): + """Determine which labels are in even positions within runs of identical labels + + param labels : np array of digits representing bases (usually 0-3 for ACGT) + or of bases (bytes) + returns: bool array fm such that fm[n] is True if labels[n] is in + an even position in a run of identical symbols + + E.g. + >> x=np.array([1, 3, 2, 3, 3, 3, 3, 1, 1]) + >> flopmask(x) + array([False, False, False, False, True, False, True, False, True]) + """ + move = np.ediff1d(labels, to_begin=1) != 0 + cumulative_flipflops = (1 - move).cumsum() + offsets = np.maximum.accumulate(move * cumulative_flipflops) + return (cumulative_flipflops - offsets) % 2 == 1 + + +def flip_flop_code(labels, alphabet_length=4): + """Given a list of digits representing bases, add offset to those in even + positions within runs of indentical bases. + param labels : np array of digits representing bases (usually 0-3 for ACGT) + param alphabet_length : number of symbols in alphabet + returns: np array c such that c[n] = labels[n] + alphabet_length where labels[n] is in + an even position in a run of identical symbols, or c[n] = labels[n] + otherwise + + E.g. 
+ >> x=np.array([1, 3, 2, 3, 3, 3, 3, 1, 1]) + >> flip_flop_code(x) + array([1, 3, 2, 3, 7, 3, 7, 1, 5]) + """ + x = labels.copy() + x[flopmask(x)] += alphabet_length + return x diff --git a/taiyaki/helpers.py b/taiyaki/helpers.py new file mode 100644 index 0000000..356e2a1 --- /dev/null +++ b/taiyaki/helpers.py @@ -0,0 +1,253 @@ +from Bio import SeqIO +from collections import Mapping, Sequence +import hashlib +import imp +import numpy as np +import os +import re +import sys +import torch + +from taiyaki import maths +from taiyaki.fileio import readtsv +from taiyaki.variables import DEFAULT_ALPHABET + + + + +def _load_python_model(model_file, **model_kwargs): + netmodule = imp.load_source('netmodule', model_file) + network = netmodule.network(**model_kwargs) + return network + + +def load_model(model_file, params_file=None, **model_kwargs): + _, extension = os.path.splitext(model_file) + + if extension == '.py': + network = _load_python_model(model_file, **model_kwargs) + else: + network = torch.load(model_file, map_location='cpu') + + if params_file is not None: + param_dict = torch.load(params_file, map_location='cpu') + network.load_state_dict(param_dict) + + return network + + +def guess_model_stride(net, input_shape=(720, 1, 1), device='cpu'): + """ Infer the stride of a pytorch network by running it on some test input. + Assume that net is already able to accept input on device specified""" + out = net(torch.zeros(input_shape).to(device)) + return int(round(input_shape[0] / out.size()[0])) + + +def objwalk(obj, types=(object,), path=(), memo=None): + """Recursively walk a python object yielding instances of specified types + + Ripped off from: https://goo.gl/brwMce + """ + if memo is None: + memo = set() + + if isinstance(obj, Mapping): + children = obj.items() + elif isinstance(obj, Sequence) and not isinstance(obj, (bytes, str)): + children = enumerate(obj) + elif hasattr(obj, "__dict__"): + children = vars(obj).items() + else: + children = [] + + if id(obj) not in memo: + memo.add(id(obj)) + + if isinstance(obj, types): + yield path, obj + + for (path_component, value) in children: + for res in objwalk(value, types, path + (path_component,), memo): + yield res + + +def set_at_path(obj, path, val): + """Set value on object following a path as built by objwalk""" + if len(path) == 1: + if isinstance(obj, Mapping): + obj[path[0]] = val + elif isinstance(obj, Sequence) and not isinstance(obj, (bytes, str)): + obj[path[0]] = val + else: + setattr(obj, path[0], val) + elif len(path) > 1: + if isinstance(obj, Mapping): + set_at_path(obj[path[0]], path[1:], val) + elif isinstance(obj, Sequence) and not isinstance(obj, (bytes, str)): + set_at_path(obj[path[0]], path[1:], val) + else: + set_at_path(obj.__dict__[path[0]], path[1:], val) + + +def get_kwargs(args, names): + kwargs = {} + for name in names: + kwargs[name] = getattr(args, name) + return kwargs + + +def trim_array(x, from_start, from_end): + assert from_start >= 0 + assert from_end >= 0 + + from_end = None if from_end == 0 else -from_end + return x[from_start:from_end] + + +def subsample_array(x, length): + if length is None: + return x + assert len(x) > length + startpos = np.random.randint(0, len(x) - length + 1) + return x[startpos : startpos + length] + + +def fasta_file_to_dict(fasta_file_name, allow_N=False, alphabet=DEFAULT_ALPHABET): + """Load records from fasta file as a dictionary""" + has_nonalphabet = re.compile('[^{}]'.format(alphabet)) + + references = {} + with open(fasta_file_name, 'r') as fh: + for ref in 
SeqIO.parse(fh, 'fasta'): + refseq = str(ref.seq) + if len(refseq) == 0: + continue + if not allow_N and re.search(has_nonalphabet, refseq) is not None: + continue + references[ref.id] = refseq.encode('utf-8') + + return references + + +class ReadIndex(dict): + """dict subclass mapping from read names (str or bytes) to index (int)""" + @staticmethod + def _force_str(s): + return s.decode('utf-8') if isinstance(s, bytes) else s + + def __init__(self, *args, **kwargs): + data = dict(*args, **kwargs) + items = ((ReadIndex._force_str(basename), read_id) for basename, read_id in data.items()) + super().__init__(items) + + def to_numpy(self): + """Convert ReadIndex instance into a numpy structured array""" + max_key_len = max(map(len, self.keys())) + key_dtype = 'S{}'.format(max_key_len) + dtype = [('read_name', key_dtype), ('read_id', ' 0 + self._count = 0 + self.every = every + self._line_len = maxlen + self.fh = fh + + def step(self): + self._count += 1 + if self.count % self.every == 0: + dotcount = self.count // self.every + self.fh.write('\033[1;{}m.\033[m'.format(COLOURS[dotcount % len(COLOURS)])) + if dotcount % self.line_len == 0: + self.fh.write('{:8d}\n'.format(self.count)) + self.fh.flush() + + @property + def line_len(self): + return self._line_len + + @property + def count(self): + return self._count + + @property + def nline(self): + return (self.count // self.every) // self.line_len + + @property + def is_newline(self): + return self.count % (self.dotcount * self.line_len) == 0 diff --git a/taiyaki/iterators.py b/taiyaki/iterators.py new file mode 100644 index 0000000..d334a42 --- /dev/null +++ b/taiyaki/iterators.py @@ -0,0 +1,364 @@ +""" +Mostly shamelessly borrowed from: +https://docs.python.org/2/library/itertools.html#recipes + +because its all so useful! + +""" +from collections import deque +from itertools import * +from functools import partial +import operator +import numpy as np +import random +from multiprocessing import Pool +import sys +import traceback + + +def try_except_pass(func, *args, **kwargs): + """Try function: if error occurs, print traceback and return None + + When wrapping a function we would ordinarily form a closure over a (sub)set of + the inputs. Such closures cannot be pickled however since the wrapper name is not + importable. We get around this by using functools.partial (which is + pickleable). The result is that we can decorate a function to mask + exceptions thrown by it. + """ + try: + return func(*args, **kwargs) + except: + exc_info = sys.exc_info() + traceback.print_tb(exc_info[2]) + return None + + +def empty_iterator(it): + """Check if an iterator is empty and prepare a fresh one for use + + :param it: iterator to test + + :returns: bool, iterator + """ + it, any_check = tee(it) + try: + next(any_check) + except StopIteration: + return True, it + else: + return False, it + + +def take(n, iterable): + "Return first n items of the iterable as a list" + return list(islice(iterable, n)) + + +def tabulate(function, start=0): + "Return function(0), function(1), ..." + return map(function, count(start)) + + +def consume(iterator, n): + "Advance the iterator n-steps ahead. If n is none, consume entirely." + # Use functions that consume iterators at C speed. 
+ if n is None: + # feed the entire iterator into a zero-length deque + deque(iterator, maxlen=0) + else: + # advance to the empty slice starting at position n + next(islice(iterator, n, n), None) + + +def nth(iterable, n, default=None): + "Returns the nth item or a default value" + return next(islice(iterable, n, None), default) + + +def quantify(iterable, pred=bool): + "Count how many times the predicate is true" + return sum(map(pred, iterable)) + + +def padnone(iterable): + """Returns the sequence elements and then returns None indefinitely. + + Useful for emulating the behavior of the built-in map() function. + """ + return chain(iterable, repeat(None)) + + +def ncycles(iterable, n): + "Returns the sequence elements n times" + return chain.from_iterable(repeat(tuple(iterable), n)) + + +def dotproduct(vec1, vec2): + return sum(map(operator.mul, vec1, vec2)) + + +def flatten(listOfLists): + "Flatten one level of nesting" + return chain.from_iterable(listOfLists) + + +def repeatfunc(func, times=None, *args): + """Repeat calls to func with specified arguments. + + Example: repeatfunc(random.random) + """ + if times is None: + return starmap(func, repeat(args)) + return starmap(func, repeat(args, times)) + + +def pairwise(iterable): + "s -> (s0,s1), (s1,s2), (s2, s3), ..." + a, b = tee(iterable) + next(b, None) + return zip(a, b) + + +def grouper(iterable, n, fillvalue=None): + "Collect data into fixed-length chunks or blocks" + # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx + args = [iter(iterable)] * n + return zip_longest(fillvalue=fillvalue, *args) + + +def grouper_it(iterable, n): + "As grouper but doesn't pad final chunk" + it = iter(iterable) + while True: + chunk_it = islice(it, n) + try: + first_el = next(chunk_it) + except StopIteration: + return + yield chain((first_el,), chunk_it) + + +def blocker(iterable, n): + """Yield successive n-sized blocks from iterable + as numpy array. Doesn't pad final block. + """ + for i in range(0, len(iterable), n): + yield np.array(iterable[i:i + n]) + + +def roundrobin(*iterables): + "roundrobin('ABC', 'D', 'EF') --> A D E B F C" + # Recipe credited to George Sakkis + pending = len(iterables) + nexts = cycle(iter(it).__next__ for it in iterables) + while pending: + try: + for callable_next in nexts: + yield callable_next() + except StopIteration: + pending -= 1 + nexts = cycle(islice(nexts, pending)) + + +def powerset(iterable): + "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)" + s = list(iterable) + return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1)) + + +def unique_everseen(iterable, key=None): + "List unique elements, preserving order. Remember all elements ever seen." + # unique_everseen('AAAABBBCCDAABBB') --> A B C D + # unique_everseen('ABBCcAD', str.lower) --> A B C D + seen = set() + seen_add = seen.add + if key is None: + for element in filterfalse(seen.__contains__, iterable): + seen_add(element) + yield element + else: + for element in iterable: + k = key(element) + if k not in seen: + seen_add(k) + yield element + + +def unique_justseen(iterable, key=None): + "List unique elements, preserving order. Remember only the element just seen." + # unique_justseen('AAAABBBCCDAABBB') --> A B C D A B + # unique_justseen('ABBCcAD', str.lower) --> A B C A D + return map(next, map(itemgetter(1), groupby(iterable, key))) + + +def iter_except(func, exception, first=None): + """ Call a function repeatedly until an exception is raised. + + Converts a call-until-exception interface to an iterator interface. 
+ Like __builtin__.iter(func, sentinel) but uses an exception instead + of a sentinel to end the loop. + + Examples: + bsddbiter = iter_except(db.next, bsddb.error, db.first) + heapiter = iter_except(functools.partial(heappop, h), IndexError) + dictiter = iter_except(d.popitem, KeyError) + dequeiter = iter_except(d.popleft, IndexError) + queueiter = iter_except(q.get_nowait, Queue.Empty) + setiter = iter_except(s.pop, KeyError) + + """ + try: + if first is not None: + yield first() + while 1: + yield func() + except exception: + pass + + +def random_product(*args, **kwds): + "Random selection from itertools.product()" + pools = list(map(tuple, args)) * kwds.get('repeat', 1) + return tuple(random.choice(pool) for pool in pools) + + +def random_permutation(iterable, r=None): + "Random selection from itertools.permutations(iterable, r)" + pool = tuple(iterable) + r = len(pool) if r is None else r + return tuple(random.sample(pool, r)) + + +def random_combination(iterable, r): + "Random selection from itertools.combinations(iterable, r)" + pool = tuple(iterable) + n = len(pool) + indices = sorted(random.sample(range(n), r)) + return tuple(pool[i] for i in indices) + + +def random_combination_with_replacement(iterable, r): + "Random selection from itertools.combinations_with_replacement(iterable, r)" + pool = tuple(iterable) + n = len(pool) + indices = sorted(random.randrange(n) for i in range(r)) + return tuple(pool[i] for i in indices) + + +def tee_lookahead(t, i): + """Inspect the i-th upcomping value from a tee object + while leaving the tee object at its current position. + + Raise an IndexError if the underlying iterator doesn't + have enough values. + + """ + for value in islice(t.__copy__(), i, None): + return value + raise IndexError(i) + + +def window(iterable, size): + """Create an iterator returning a sliding window from another iterator + + :param iterable: Iterator + :param size: Size of window + + :returns: an iterator returning a tuple containing the data in the window + + """ + assert size > 0, "Window size for iterator should be strictly positive, got {0}".format(size) + iters = tee(iterable, size) + for i in range(1, size): + for each in iters[i:]: + next(each, None) + return zip(*iters) + + +def centered_truncated_window(iterable, size): + """A sliding window generator padded with shorter windows at edges, + output is the same length as the input. Will pad on the right more. + [1,2,3,4,5] -> (1,2), (1,2,3), (2,3,4), (3,4,5), (4,5) + + :param iterable: Iterator + :param size: Size of window + """ + edge, bulk = tee(iterable, 2) + edge = take(size + 1, edge) + for i in range(size // 2 + 1, size): + yield tuple(edge[:i]) + + # bulk can be handled by window() + count = 0 + for win in window(bulk, size): + yield win + count += 1 + + edge = list(win)[1:] + for i in range(size // 2): + yield tuple(edge[i:]) + + +class __NotGiven(object): + + def __init__(self): + """Some horrible voodoo""" + pass + + +def imap_mp( + function, args, fix_args=__NotGiven(), fix_kwargs=__NotGiven(), + pass_exception=False, threads=1, unordered=False, chunksize=1, init=None, initargs=() +): + """Map a function using multiple processes + + :param function: the function to apply, must be pickalable for multiprocess + mapping (problems will results if the function is not at the top level + of scope). 
+ :param args: iterable of argument values of function to map over + :param fix_args: arguments to hold fixed + :param fix_kwargs: keyword arguments to hold fixed + :param threads: number of subprocesses + :param unordered: use unordered multiprocessing map + :param chunksize: multiprocessing job chunksize + :param pass_exception: ignore exceptions thrown by function? + :param init: function to each thread to call when it is created. + :param initargs: list of arguments for init + + .. note:: + This function is a generator, the caller will need to consume this. + Not all options of all mapping functions are supported (why have a + wrapper in such cases?). If there is a compelling need for more + flexibility, it can be added. + + If fix_args or fix_kwargs are given, these are first used to create a + partially evaluated version of function. + + The special :class:`__NotGiven` is used here to flag when optional arguments + are to be used. + """ + + my_function = function + if not isinstance(fix_args, __NotGiven): + my_function = partial(my_function, *fix_args) + if not isinstance(fix_kwargs, __NotGiven): + my_function = partial(my_function, **fix_kwargs) + + if pass_exception: + my_function = partial(try_except_pass, my_function) + + if threads == 1: + if init is not None: + init(*initargs) + for r in map(my_function, args): + yield r + else: + pool = Pool(threads, init, initargs) + if unordered: + mapper = pool.imap_unordered + else: + mapper = pool.imap + for r in mapper(my_function, args, chunksize=chunksize): + yield r + pool.close() + pool.join() diff --git a/taiyaki/json.py b/taiyaki/json.py new file mode 100644 index 0000000..075aef6 --- /dev/null +++ b/taiyaki/json.py @@ -0,0 +1,25 @@ +import json +import numpy as np +import torch + + +# +# Some numpy types are not serializable to JSON out-of-the-box in Python3 -- need coersion. 
See +# http://stackoverflow.com/questions/27050108/convert-numpy-type-to-python/27050186#27050186 +# + +class JsonEncoder(json.JSONEncoder): + + def default(self, obj): + if isinstance(obj, np.integer): + return int(obj) + elif isinstance(obj, np.floating): + return float(obj) + elif isinstance(obj, np.ndarray): + return obj.tolist() + elif isinstance(obj, torch.nn.Parameter): + return obj.data + elif isinstance(obj, torch.Tensor): + return obj.detach_().numpy() + else: + return super(JsonEncoder, self).default(obj) diff --git a/taiyaki/layers.py b/taiyaki/layers.py new file mode 100644 index 0000000..d3cd941 --- /dev/null +++ b/taiyaki/layers.py @@ -0,0 +1,634 @@ +from collections import OrderedDict +import numpy as np + +import torch +from torch import nn +from torch.nn import Parameter +from scipy.stats import truncnorm + +from taiyaki import activation +from taiyaki.config import taiyaki_dtype + + +""" Convention: inMat row major (C ordering) as (time, batch, state) +""" +_FORGET_BIAS = 2.0 + + +def truncated_normal(size, sd): + """Truncated normal for Xavier style initiation""" + res = sd * truncnorm.rvs(-2, 2, size=size) + return res.astype('f4') + + +def init_(param, value): + """Set parameter value (inplace) from tensor, numpy array, list or tuple""" + value_as_tensor = torch.tensor(value, dtype=param.data.dtype) + param.data.detach_().set_(value_as_tensor) + + +def reverse(x): + """Reverse input on the first axis""" + inv_idx = torch.arange(x.size(0) - 1, -1, -1).long() + if x.is_cuda: + inv_idx = inv_idx.cuda(x.get_device()) + return x.index_select(0, inv_idx) + + +class Reverse(nn.Module): + + def __init__(self, layer): + super().__init__() + self.layer = nn.ModuleList([layer])[0] + + def forward(self, x): + return reverse(self.layer(reverse(x))) + + def json(self, params=False): + return OrderedDict([('type', "reverse"), + ('sublayers', self.layer.json(params))]) + + +class Residual(nn.Module): + + def __init__(self, layer): + super().__init__() + self.layer = nn.ModuleList([layer])[0] + + def forward(self, x): + return x + self.layer(x) + + def json(self, params=False): + return OrderedDict([('type', "Residual"), + ('sublayers', self.layer.json(params))]) + + +class GatedResidual(nn.Module): + + def __init__(self, layer, gate_init=0.0): + super().__init__() + self.layer = nn.ModuleList([layer])[0] + self.alpha = Parameter(torch.tensor([gate_init])) + + def forward(self, x): + gate = activation.sigmoid(self.alpha) + y = self.layer(x) + return gate * x + (1 - gate) * y + + def json(self, params=False): + res = OrderedDict([('type', "GatedResidual"), + ('sublayers', self.layer.json(params))]) + if params: + res['params'] = OrderedDict([('alpha', float(self.alpha.detach_().numpy()[0]))]) + return res + + +class FeedForward(nn.Module): + """ Basic feedforward layer + out = f( inMat W + b ) + + :param insize: Size of input to layer + :param size: Layer size + :param has_bias: Whether layer has bias + :param fun: The activation function. 
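+
+    Example (illustrative): FeedForward(64, 128, fun=activation.tanh) maps an
+    input of shape (time, batch, 64) to an output of shape (time, batch, 128).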
+ """ + + def __init__(self, insize, size, has_bias=True, fun=activation.linear): + super().__init__() + self.insize = insize + self.size = size + self.has_bias = has_bias + self.linear = nn.ModuleList([nn.Linear(insize, size, bias=has_bias)])[0] + self.activation = fun + self.reset_parameters() + + def reset_parameters(self): + winit = truncated_normal(list(self.linear.weight.shape), sd=0.5) + init_(self.linear.weight, winit / np.sqrt(self.insize + self.size)) + if self.has_bias: + binit = truncated_normal(list(self.linear.bias.shape), sd=0.5) + init_(self.linear.bias, binit) + + def forward(self, x): + return self.activation(self.linear(x)) + + def json(self, params=False): + res = OrderedDict([('type', "feed-forward"), + ('activation', self.activation.__name__), + ('size', self.size), + ('insize', self.insize), + ('bias', self.has_bias)]) + if params: + res['params'] = OrderedDict([('W', self.linear.weight)] + + [('b', self.linear.bias)] if self.has_bias else []) + return res + + +class Softmax(nn.Module): + """ Softmax layer + tmp = exp( inmat W + b ) + out = log( row_normalise( tmp ) ) + + :param insize: Size of input to layer + :param size: Layer size + :param has_bias: Whether layer has bias + """ + + def __init__(self, insize, size, has_bias=True): + super().__init__() + self.insize = insize + self.size = size + self.has_bias = has_bias + self.linear = nn.ModuleList([nn.Linear(insize, size, bias=has_bias)])[0] + self.activation = nn.LogSoftmax(2) + self.reset_parameters() + + def reset_parameters(self): + winit = truncated_normal(list(self.linear.weight.shape), sd=0.5) + init_(self.linear.weight, winit / np.sqrt(self.insize + self.size)) + if self.has_bias: + binit = truncated_normal(list(self.linear.bias.shape), sd=0.5) + init_(self.linear.bias, binit) + + def forward(self, x): + return self.activation(self.linear(x)) + + def json(self, params=False): + res = OrderedDict([('type', "softmax"), + ('size', self.size), + ('insize', self.insize), + ('bias', self.has_bias)]) + if params: + res['params'] = OrderedDict([('W', self.linear.weight)] + + [('b', self.linear.bias)] if self.has_bias else []) + return res + + +class CudnnGru(nn.Module): + """ Gated Recurrent Unit compatable with cudnn + + :param insize: Size of input to layer + :param size: Layer size + :param has_bias: Whether layer has bias + """ + + def __init__(self, insize, size, bias=True): + super().__init__() + self.cudnn_gru = nn.GRU(insize, size, bias=bias) + self.insize = insize + self.size = size + self.has_bias = bias + self.reset_parameters() + + def reset_parameters(self): + for name, param in self.named_parameters(): + shape = list(param.shape) + if 'weight_hh' in name: + init_(param, truncated_normal(shape, sd=0.5) / np.sqrt(2 * self.size)) + elif 'weight_ih' in name: + init_(param, truncated_normal(shape, sd=0.5) / np.sqrt(self.insize + self.size)) + else: + init_(param, truncated_normal(shape, sd=0.5)) + + def forward(self, x): + y, hy = self.cudnn_gru.forward(x) + return y + + def json(self, params=False): + res = OrderedDict([('type', "CudnnGru"), + ('activation', "tanh"), + ('gate', "sigmoid"), + ('size', self.size), + ('insize', self.insize), + ('bias', self.has_bias), + ('state0', False)]) + if params: + iW = _cudnn_to_guppy_gru(self.cudnn_gru.weight_ih_l0) + sW = _cudnn_to_guppy_gru(self.cudnn_gru.weight_hh_l0) + ib = _cudnn_to_guppy_gru(self.cudnn_gru.bias_ih_l0) + sb = _cudnn_to_guppy_gru(self.cudnn_gru.bias_hh_l0) + res['params'] = OrderedDict([('iW', _reshape(iW, (3, self.size, self.insize))), + 
('sW', _reshape(sW, (3, self.size, self.size))), + ('ib', _reshape(ib, (3, self.size))), + ('sb', _reshape(sb, (3, self.size)))]) + return res + + +class Lstm(nn.Module): + """ LSTM layer wrapper around the cudnn LSTM kernel + See http://colah.github.io/posts/2015-08-Understanding-LSTMs/ for a good + introduction to LSTMs. + + Step: + v = [ input_new, output_old ] + Pforget = gatefun( v W2 + b2 + state * p1) + Pupdate = gatefun( v W1 + b1 + state * p0) + Update = fun( v W0 + b0 ) + state_new = state_old * Pforget + Update * Pupdate + Poutput = gatefun( v W3 + b3 + state * p2) + output_new = fun(state) * Poutput + + :param insize: Size of input to layer + :param size: Layer size + :param has_bias: Whether layer has bias + """ + + def __init__(self, insize, size, has_bias=True): + super().__init__() + self.lstm = nn.LSTM(insize, size, bias=has_bias) + self.insize = insize + self.size = size + self.has_bias = has_bias + self._disable_state_bias() + self.reset_parameters() + + def _disable_state_bias(self): + for name, param in self.lstm.named_parameters(): + if 'bias_hh' in name: + param.requires_grad = False + param.zero_() + + def reset_parameters(self): + for name, param in self.named_parameters(): + shape = list(param.shape) + if 'weight_hh' in name: + init_(param, truncated_normal(shape, sd=0.5) / np.sqrt(2 * self.size)) + elif 'weight_ih' in name: + init_(param, truncated_normal(shape, sd=0.5) / np.sqrt(self.insize + self.size)) + else: + # TODO: initialise forget gate bias to positive value + init_(param, truncated_normal(shape, sd=0.5)) + + def named_parameters(self, prefix='', recurse=True): + for name, param in self.lstm.named_parameters(prefix=prefix, recurse=recurse): + if 'bias_hh' not in name: + yield name, param + + def forward(self, x): + y, hy = self.lstm.forward(x) + return y + + def json(self, params=False): + res = OrderedDict([('type', "LSTM"), + ('activation', "tanh"), + ('gate', "sigmoid"), + ('size', self.size), + ('insize', self.insize), + ('bias', self.has_bias)]) + if params: + res['params'] = OrderedDict([('iW', _reshape(self.lstm.weight_ih_l0, (4, self.size, self.insize))), + ('sW', _reshape(self.lstm.weight_hh_l0, (4, self.size, self.size))), + ('b', _reshape(self.lstm.bias_ih_l0, (4, self.size)))]) + return res + + +class GruMod(nn.Module): + """ Gated Recurrent Unit compatable with guppy + + This version of the Gru should be compatable with guppy. It differs from the + CudnnGru in that the CudnnGru has an additional bias parameter. + + :param insize: Size of input to layer + :param size: Layer size + :param has_bias: Whether layer has bias + """ + + def __init__(self, insize, size, has_bias=True): + super().__init__() + self.cudnn_gru = nn.GRU(insize, size, bias=has_bias) + self.insize = insize + self.size = size + self.has_bias = has_bias + self._disable_state_bias() + self.reset_parameters() + + def reset_parameters(self): + for name, param in self.named_parameters(): + shape = list(param.shape) + if 'weight_hh' in name: + init_(param, truncated_normal(shape, sd=0.5) / np.sqrt(2 * self.size)) + elif 'weight_ih' in name: + init_(param, truncated_normal(shape, sd=0.5) / np.sqrt(self.insize + self.size)) + else: + init_(param, truncated_normal(shape, sd=0.5)) + + def _disable_state_bias(self): + for name, param in self.cudnn_gru.named_parameters(): + if 'bias_hh' in name: + param.requires_grad = False + param.zero_() + + def named_parameters(self, prefix='', recurse=True): + prefix = prefix + ('.' 
if prefix else '') + for name, param in self.cudnn_gru.named_parameters(recurse=recurse): + if not 'bias_hh' in name: + yield prefix + name, param + + def forward(self, x): + y, hy = self.cudnn_gru.forward(x) + return y + + def json(self, params=False): + res = OrderedDict([('type', "GruMod"), + ('activation', "tanh"), + ('gate', "sigmoid"), + ('size', self.size), + ('insize', self.insize), + ('bias', self.has_bias)]) + if params: + iW = _cudnn_to_guppy_gru(self.cudnn_gru.weight_ih_l0) + sW = _cudnn_to_guppy_gru(self.cudnn_gru.weight_hh_l0) + b = _cudnn_to_guppy_gru(self.cudnn_gru.bias_ih_l0) + res['params'] = OrderedDict([('iW', _reshape(iW, (3, self.size, self.insize))), + ('sW', _reshape(sW, (3, self.size, self.size))), + ('b', _reshape(b, (3, self.size)))]) + return res + + +def _cudnn_to_guppy_gru(p): + """Reorder GRU params from order expected by CUDNN to that required by guppy""" + x, y, z = torch.chunk(p, 3) + return torch.cat([y, x, z], 0) + + +class Convolution(nn.Module): + """1D convolution over the first dimension + + Takes input of shape [time, batch, features] and produces output of shape + [ceil((time + padding) / stride), batch, features] + + :param insize: number of features on input + :param size: number of output features + :param winlen: size of window over input + :param stride: step size between successive windows + :param has_bias: whether layer has bias + :param fun: the activation function + :param pad: (int, int) of padding applied to start and end, or None in which + case the padding used is (winlen // 2, (winlen - 1) // 2) which ensures + that the output length does not depend on winlen + """ + + def __init__(self, insize, size, winlen, stride=1, pad=None, fun=activation.tanh, has_bias=True): + super().__init__() + self.insize = insize + self.size = size + self.stride = stride + self.winlen = winlen + if pad is None: + pad = (winlen // 2, (winlen - 1) // 2) + self.padding = pad + self.pad = nn.ConstantPad1d(pad, 0) + self.conv = nn.Conv1d(kernel_size=winlen, in_channels=insize, out_channels=size, stride=stride, bias=has_bias) + self.activation = fun + self.reset_parameters() + + def reset_parameters(self): + fanin = self.insize * self.winlen + fanout = self.size * self.winlen / self.stride + winit = truncated_normal(list(self.conv.weight.shape), sd=0.5) + init_(self.conv.weight, winit / np.sqrt(fanin + fanout)) + binit = truncated_normal(list(self.conv.bias.shape), sd=0.5) + init_(self.conv.bias, binit) + + def forward(self, x): + x = x.permute(1, 2, 0) + out = self.activation(self.conv(self.pad(x))) + return out.permute(2, 0, 1) + + def json(self, params=False): + res = OrderedDict([("type", "convolution"), + ("insize", self.insize), + ("size", self.size), + ("winlen", self.conv.kernel_size[0]), + ("stride", self.conv.stride[0]), + ("padding", self.padding), + ("activation", self.activation.__name__)]) + if params: + res['params'] = OrderedDict([("W", self.conv.weight), + ("b", self.conv.bias)]) + return res + + +class Parallel(nn.Module): + + def __init__(self, layers): + super().__init__() + self.sublayers = nn.ModuleList(layers) + + def forward(self, x): + ys = [layer(x) for layer in self.sublayers] + return torch.cat(ys, 2) + + def json(self, params=False): + return OrderedDict([('type', "parallel"), + ('sublayers', [layer.json(params) for layer in self.sublayers])]) + + +class Serial(nn.Module): + + def __init__(self, layers): + super().__init__() + self.sublayers = nn.ModuleList(layers) + + def forward(self, x): + for layer in self.sublayers: + x = 
layer(x) + return x + + def json(self, params=False): + return OrderedDict([('type', "serial"), + ('sublayers', [layer.json(params) for layer in self.sublayers])]) + + +class SoftChoice(nn.Module): + + def __init__(self, layers): + super().__init__() + self.sublayers = nn.ModuleList(layers) + self.alpha = Parameter(torch.zeros(len(layers))) + + def forward(self, x): + ps = torch.nn.Softmax(0)(self.alpha) + ys = [p * layer(x) for p, layer in zip(ps, self.sublayers)] + return torch.stack(ys).sum(0) + + def json(self, params=False): + res = OrderedDict([('type', "softchoice"), + ('sublayers', [layer.json(params) for layer in self.sublayers])]) + if params: + res['params'] = OrderedDict([('alpha', self.alpha)]) + return res + + +def zeros(size): + return np.zeros(size, dtype=taiyaki_dtype) + + +def _reshape(x, shape): + return x.detach_().numpy().reshape(shape) + + +class Identity(nn.Module): + """The identity transform""" + + def json(self, params=False): + return OrderedDict([('type', 'Identity')]) + + def forward(self, x): + return x + + +class Studentise(nn.Module): + """ Normal all features in batch + + :param epsilon: Stabilsation layer + """ + + def __init__(self, epsilon=1e-4): + super().__init__() + self.epsilon = epsilon + + def json(self, params=False): + return {'type' : "studentise"} + + def forward(self, x): + features = x.shape[-1] + m = x.view(-1, features).mean(0) + v = x.view(-1, features).var(0, unbiased=False) + return (x - m) / torch.sqrt(v + self.epsilon) + + +class DeltaSample(nn.Module): + """ Returns difference between neighbouring features + + Right is padded with zero + """ + + def json(self, params=False): + return OrderedDict([('type', "DeltaSample")]) + + def forward(self, x): + output = x[1:] - x[:-1] + padding = torch.zeros_like(x[:1]) + return torch.cat((output, padding), dim=0) + + +class Window(nn.Module): + """ Create a sliding window over input + + :param w: Size of window + """ + + def __init__(self, w): + super().__init__() + assert w > 0, "Window size must be positive" + assert w % 2 == 1, 'Window size should be odd' + self.w = w + + def json(self, params=False): + res = OrderedDict([('type', "window")]) + if params: + res['params'] = OrderedDict([('w', self.w)]) + return res + + def forward(self, x): + length = x.shape[0] + pad = self.w // 2 + zeros = x.new_zeros((pad,) + x.shape[1:]) + padded_x = torch.cat([zeros, x, zeros], 0) + + xs = [padded_x[i:length + i] for i in range(self.w)] + return torch.cat(xs, x.ndimension() - 1) + + +def birnn(forward, backward): + """ Creates a bidirectional RNN from two RNNs + + :param forward: A layer to run forwards + :param backward: A layer to run backwards + """ + return Parallel([forward, Reverse(backward)]) + + +def logaddexp(x, y): + return torch.max(x, y) + torch.log1p(torch.exp(-abs(x - y))) + + +def global_norm_flipflop(scores): + T, N, C = scores.shape + nbase = int(np.sqrt(C / 2)) + assert 2 * nbase * (nbase + 1) == C,\ + "Unexpected shape for flipflop scores: nbase = {}, shape = {}".format(nbase, (T, N, C)) + + def step(in_vec, in_state): + in_vec_reshape = in_vec.reshape((-1, nbase + 1, 2 * nbase)) + in_state_reshape = in_state.unsqueeze(1) + scores = in_state_reshape + in_vec_reshape + base1_state = scores[:, :nbase].logsumexp(2) + base2_state = logaddexp(scores[:, nbase, :nbase], scores[:, nbase, nbase:]) + new_state = torch.cat([base1_state, base2_state], dim=1) + factors = new_state.logsumexp(1, keepdim=True) + new_state = new_state - factors + return factors, new_state + + fwd = scores.new_zeros((N, 
2 * nbase)) + logZ = fwd.logsumexp(1, keepdim=True) + fwd = fwd - logZ + for scores_t in scores: + factors, fwd = step(scores_t, fwd) + logZ = logZ + factors + return scores - logZ / T + + +class GlobalNormFlipFlop(nn.Module): + + def __init__(self, insize, nbase, has_bias=True, _never_use_cupy=False): + super().__init__() + self.insize = insize + self.nbase = nbase + self.size = 2 * nbase * (nbase + 1) + self.has_bias = has_bias + self.linear = nn.Linear(insize, self.size, bias=has_bias) + self.reset_parameters() + self._never_use_cupy = _never_use_cupy + + def json(self, params=False): + res = OrderedDict([ + ('type', 'GlobalNormTwoState'), + ('size', self.size), + ('insize', self.insize), + ('bias', self.has_bias)]) + if params: + res['params'] = OrderedDict([('W', self.linear.weight)] + + [('b', self.linear.bias)] if self.has_bias else []) + return res + + def reset_parameters(self): + winit = truncated_normal(list(self.linear.weight.shape), sd=0.5) + init_(self.linear.weight, winit / np.sqrt(self.insize + self.size)) + if self.has_bias: + binit = truncated_normal(list(self.linear.bias.shape), sd=0.5) + init_(self.linear.bias, binit) + + def _use_cupy(self, x): + # getattr in stead of simple look-up for backwards compatibility + if getattr(self, '_never_use_cupy', False): + return False + + if not x.is_cuda: + return False + + try: + from .cupy_extensions import flipflop + return True + except ImportError: + return False + + def forward(self, x): + y = 5.0 * activation.tanh(self.linear(x)) + + if self._use_cupy(x): + from .cupy_extensions import flipflop + return flipflop.global_norm(y) + else: + return global_norm_flipflop(y) diff --git a/taiyaki/libdecoding.pxd b/taiyaki/libdecoding.pxd new file mode 100644 index 0000000..066c5f7 --- /dev/null +++ b/taiyaki/libdecoding.pxd @@ -0,0 +1,4 @@ +from libc.stdint cimport int32_t +cdef extern from "c_decoding.h": + void fast_viterbi_blocks(const float * weights, size_t nblock, size_t nbatch, size_t nparam, size_t nbase, + float stay_pen, float skip_pen, float local_pen, float * score, int32_t * path) diff --git a/taiyaki/mapped_signal_files.py b/taiyaki/mapped_signal_files.py new file mode 100644 index 0000000..68b3b67 --- /dev/null +++ b/taiyaki/mapped_signal_files.py @@ -0,0 +1,516 @@ +# Defines an abstract class used to read and write per-read "chunk" files +# and a derived class using HDF5 in the simplest way possible +# The base class provides a prototype for other file formats. +# If the class interface is fixed, we can swap to other classes +# (for example, Per_read_Fast5 or Per_read_SQLite) +# + +from abc import ABC, abstractmethod +import h5py +import numpy as np + +_version = 7 + +class Read(dict): + """Class to represent the information about a read that is stored in + a per-read file. Includes lots of checking methods, and methods + to output chunk data. + + Much of the code in this class definition is the checking functions, which + check that the data is consistent with the 'Chunkify version 7 discussion' + https://wiki/display/RES/Chunkify+version+7+discussion + + range,offset and digitisation describe the mapping from Dacs to current in pA: + + current = (dacs + offset ) * range / digitisation + + scale_frompA and shift_frompA describe the mapping from current in pA to + standardised numbers for training: + + standardised_current = ( current - shift ) / scale + + Ref_to_signal[n] is the location in Dacs corresponding to base n in Reference. 
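+
+    For example (illustrative values): with alphabet 'ACGT', Reference = [0, 2, 1]
+    (i.e. A, G, C) and Ref_to_signal = [5, 9, 14, 20], the signal mapped to the
+    first base is Dacs[5:9], to the second Dacs[9:14] and to the third Dacs[14:20].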
+ + """ + # The data to be stored, with types + # In cases where int or float is specified, numpy types like np.int32 or np.float64 are allowed + # for the scalar. + # If the data type is a numpy one (e.g. np_int32) we interpret that as meaning an ndarray of that + # dtype. + # Also we use upper case for numpy arrays (or dataset in HDF5), lower case for scalar + # in these dictionaries, although that is just an aid to reading and not checked in the code + read_data = {'alphabet': 'str', + 'collapse_alphabet': 'str', + 'shift_frompA': 'float', + 'scale_frompA': 'float', + 'range': 'float', + 'offset': 'float', + 'digitisation': 'float', + 'Dacs': 'np_int16', + 'Ref_to_signal': 'np_int32', + 'Reference': 'np_int16'} + + optional_read_data = {'mapping_score': 'float', + 'mapping_method': 'str', + 'read_id': 'str'} + + def __init__(self, d): + self.update(d) + + @staticmethod + def _typecheck(name, x, target_type): + """Returns empty string or error string depending on whether type matches""" + if target_type == 'int': + # Allow any integer type including numpy integer types + # See + # https://stackoverflow.com/questions/37726830/how-to-determine-if-a-number-is-any-type-of-int-core-or-numpy-signed-or-not + if not np.issubdtype(type(x), np.integer): + return "Type of attribute " + name + " is " + str(type(x)) + ": should be an integer type.\n" + elif target_type == 'float': + # Allow any float type including numpy float types + if not np.issubdtype(type(x), np.floating): + return "Type of attribute " + name + " is " + str(type(x)) + ": should be a float type.\n" + elif target_type == 'bool': + # Allow any boolean type including numpy bool type + if not np.issubdtype(type(x), np.dtype(bool).type): + return "Type of attribute " + name + " is " + str(type(x)) + ": should be a float type.\n" + elif target_type == 'str': + if not isinstance(x, str): + return "Type of attribute " + name + " is " + str(type(x)) + ": should be a string.\n" + elif target_type.startswith('np'): + if type(x) != np.ndarray: + return "Type of attribute " + name + " is not np.ndarray\n" + target_dtype = target_type.split('_')[1] + if str(x.dtype) != target_dtype: + return "Data type of items in numpy array " + name + " is not " + target_dtype + "\n" + else: + if str(type(x)) != target_type: + return "Type of attribute " + name + " is " + str(type(x)) + ": should be" + target_type + ".\n" + return "" + + def check(self): + """Return string "pass" if read info passes some integrity tests. 
+ Return failure information as string otherwise.""" + return_string = "" + + for k, target_type in self.read_data.items(): + try: + x = self[k] + except: + return_string += "Failed to find element " + k + "\n" + continue + return_string += self._typecheck(k, x, target_type) + + try: + maplen = len(self['Ref_to_signal']) + reflen = len(self['Reference']) + if reflen + 1 != maplen: + return_string += "Length of Ref_to_signal (" + str(maplen) + \ + ") should be 1+ length of Reference (" + str(reflen) + ")\n" + except: + return_string += "Not able to check len(Ref_to_signal)=len(Reference)\n" + + try: + r = self['Ref_to_signal'] + s = self['Dacs'] + mapmin, mapmax = np.min(r), np.max(r) + if mapmin < -1 or mapmax > len(s): # -1 and len(s) are used as end markers + return_string += "Range of locations in mapping exceeds length of Dacs\n" + except: + return_string += "Not able to check range of Ref_to_signal values is inside the signal vector (Dacs)\n" + + try: + r = self['Ref_to_signal'] + if np.any(np.diff(r) < 0): + return_string += "Mapping does not increase monotonically\n" + except: + return_string += "Not able to check that mapping increases monotonically\n" + + # Are there any items in the dictionary that should't be there? + alldatakeys = set(self.read_data).union(self.optional_read_data) + for k in self: + if not (k in alldatakeys): + return_string += "Data item " + k + " in read data shouldn't be there\n" + + if len(return_string) == 0: + return "pass" + return return_string + + def get_mapped_reference_region(self): + """Return tuple (start,end_exc) so that + read_dict['Reference'][start:end_exc] is the mapped region + of the reference""" + daclen = len(self['Dacs']) + r = self['Ref_to_signal'] + mappedlocations = np.where((r >= 0) & (r < daclen))[0] # Locations in the ref that are mapped + return np.min(mappedlocations), np.max(mappedlocations) + 1 # +1 to make it exclusive + + def get_mapped_dacs_region(self): + """Return tuple (start,end_exc) so that + read_dict['Dacs'][start:end_exc] is the mapped region + of the signal""" + r = self['Ref_to_signal'] + daclen = len(self['Dacs']) + r = r[(r >= 0) & (r < daclen)] # Locations in the signal (not the end points -1, daclen ) that are mapped to + return np.min(r), np.max(r) + 1 # +1 to make it exclusive + + def get_reference_locations(self, signal_location_vector): + """Return reference locations that go with given signal locations. + signal_location_vector should be a numpy integer vector of signal locations. + The return value is a numpy integer vector of reference locations. + (feeding in a tuple works too but the result is still a vector) + + In the output,-1 and len(self['Dacs']) are used as markers for + unmapped points at the start and end. + + If we have a numpy-style range in a tuple + t = (signal_start_inclusive, signal_end_exclusive) + then f(t) is + reference_start_inclusive, reference_end_exclusive + """ + + if isinstance(signal_location_vector, tuple): + signal_location_vector = np.array(signal_location_vector) + reflen = len(self['Reference']) + + first_mapped_sigloc, first_unmapped_sigloc = self.get_mapped_dacs_region() + result = np.searchsorted(self['Ref_to_signal'], signal_location_vector) + result[signal_location_vector < first_mapped_sigloc] = -1 + result[signal_location_vector >= first_unmapped_sigloc] = reflen + + return result + + def get_standardised_current(self, region=None): + """Get standardised current vector. 
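As a concrete illustration, a minimal read dictionary that satisfies the checks above might look like the sketch below (all values are invented for the example; a real read would contain far more samples):

    import numpy as np
    from taiyaki.mapped_signal_files import Read

    read = Read({
        'alphabet': 'ACGT', 'collapse_alphabet': 'ACGT',
        'shift_frompA': 85.0, 'scale_frompA': 15.0,
        'range': 1437.98, 'offset': 6.0, 'digitisation': 8192.0,
        'Dacs': np.zeros(10, dtype=np.int16),
        'Ref_to_signal': np.array([0, 3, 6, 10], dtype=np.int32),  # len(Reference) + 1 entries
        'Reference': np.array([0, 1, 2], dtype=np.int16),          # encodes 'ACG'
    })
    print(read.check())   # "pass"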
If region is not None, then + treat region as a tuple: region = (start_inclusive, end_exclusive) + returning current[start_inclusive:end_exclusive]. + """ + if region is None: + dacs = self['Dacs'] + else: + a, b = region + dacs = self['Dacs'][a:b] + + current = (dacs + self['offset']) * self['range'] / self['digitisation'] + return (current - self['shift_frompA']) / self['scale_frompA'] + + def check_for_slip_at_refloc(self, refloc): + """Return True if there is a slip at reference location refloc. + This means that the signal location at refloc is the same as + the signal location that goes with either the previous base + or the next one.""" + r = self['Ref_to_signal'] + sigloc = r[refloc] + # print("reftosig[",refloc-1,"-",refloc+1,"]=",r[refloc-1:refloc+2]) + if refloc < len(r) - 1: + if r[refloc + 1] == sigloc: + return True + if refloc > 1: + if r[refloc - 1] == sigloc: + return True + return False + + def _get_chunk(self, dacs_region, ref_region, verbose=False): + """ + Get a chunk, returning a dictionary with entries: + + current, sequence, max_dwell, start_sample, read_id + + where current is standardised (i.e. scaled so roughly mean=0 std=1) and + reference is a np array of ints. + + The function will return a dict containing at least keys 'read_id' and + 'rejected' (giving a reason for rejection) if either the reference region or the + signal region is empty or if a boundary of the proposed chunk is in a slip. + + Note there is no checking in this function that the dacs_region and ref_region + are associated with one another. That must be done before calling this function. + + If the optional data item read_id is not present in the read dictionary + then the 'read_id' item will be missing in the dictionary returned. + + The mean dwell is len(reference_sequence) / len(standardised_current), + so can be calculated from the data returned by this function. + """ + if ref_region[1] == ref_region[0]: + if verbose: + print("Rejecting read because of zero-length sequence chunk") + returndict = {'rejected': 'emptysequence'} + elif dacs_region[1] == dacs_region[0]: + if verbose: + print("Rejecting read because of zero-length signal chunk") + returndict = {'rejected': 'emptysignal'} + else: + current = self.get_standardised_current(dacs_region) + reference = self['Reference'][ref_region[0]:ref_region[1]] + dwells = np.diff(self['Ref_to_signal'][ref_region[0]:ref_region[1]]) + # If the ref_region has length 1, then the diff has length zero and the + # line to get maxdwell fails. So we need to check length + if len(dwells) > 0: + maxdwell = np.max(dwells) + else: + maxdwell = 1 + returndict = {'current': current, + 'sequence': reference, + 'max_dwell': maxdwell, + 'start_sample': dacs_region[0]} + if self.check_for_slip_at_refloc(ref_region[0]) or self.check_for_slip_at_refloc(ref_region[1]): + if verbose: + print("Rejecting read because of slip:", self.check_for_slip_at_refloc( + ref_region[0]), self.check_for_slip_at_refloc(ref_region[1])) + returndict['rejected'] = 'slip' + + if 'read_id' in self: + returndict['read_id'] = self['read_id'] + return returndict + + def get_chunk_with_sample_length(self, chunk_len, start_sample=None, verbose=False): + """ + Get a chunk, with chunk_len samples, returning a dictionary as in the docstring for get_chunk() + + If start_sample is None, then choose the start point randomly over the possible start points + that lead to a chunk of the right size. 
+ If start_sample is specified as an int, then use a start point start_sample samples into + the mapped region. + + The chunk should have length chunk_len samples, with the number of bases determined by the mapping. + """ + mapped_dacs_region = self.get_mapped_dacs_region() + spare_length = mapped_dacs_region[1] - mapped_dacs_region[0] - chunk_len + if spare_length < 0: + if verbose: + print("Rejecting read because spare_length=", spare_length, + ". mapped_dacs_region = ", mapped_dacs_region) + return {'rejected':'tooshort','read_id':self.get('read_id')} + + if start_sample is None: + dacstart = np.random.randint(spare_length) + mapped_dacs_region[0] + else: + if start_sample >= spare_length: + if verbose: + print("Rejecting read because start_sample >= spare_length=", spare_length) + return {'rejected':'tooshort','read_id':self.get('read_id')} + dacstart = start_sample + mapped_dacs_region[0] + + dacs_region = dacstart, chunk_len + dacstart + ref_region = self.get_reference_locations(dacs_region) + return self._get_chunk(dacs_region, ref_region, verbose) + + def get_chunk_with_sequence_length(self, chunk_bases, start_base=None): + """Get a chunk containing a sequence of length chunk_bases, + returning a dictionary as in the docstring for get_chunk() + + If start_base is None, then choose the start point randomly over the possible start points + that lead to a chunk of the right size. + If start_base is specified as an int, then use a start point start_base bases into + the mapped region. + + The chunk should have length chunk_bases bases, with the number of samples determined by the mapping. + """ + mapped_reference_region = self.get_mapped_reference_region() + spare_length = (mapped_reference_region[1] - mapped_reference_region[0]) - chunk_bases + if spare_length <= 0: #<= rather than < because we want to be able to look up the end in the mapping + return {'rejected':'tooshort','read_id':self.get('read_id')} + if start_base is None: + refstart = np.random.randint(spare_length) + mapped_reference_region[0] + else: + if start_base >= spare_length: + return {'rejected':'tooshort','read_id':self.get('read_id')} + refstart = start_base + mapped_reference_region[0] + refend_exc = refstart + chunk_bases + dacstart = self['Ref_to_signal'][refstart] + dacsend_exc = self['Ref_to_signal'][refend_exc] + #print("get_chunk_with_sequence_length(): ref region",refstart,refend_exc) + #print(" sig region",dacstart,dacsend_exc) + + return self._get_chunk((dacstart, dacsend_exc), (refstart, refend_exc)) + + +class AbstractMappedSignalFile(ABC): + """Abstract base class for files containing mapped reads. + Methods specified as abstractnethod must be overridden + in derived classes. + Note that the methods to check reads and to check the metadata + are not abstract and should not be overridden. + + The class has __enter__ and __exit__ so can be used as a context + manager (i.e. with the 'with' statement). + + In all derived classes, the input and output from the file is done with + the read_dict defined above. + + Derived classes should use the read class variables read_data and + optional_read_data + as much as possible so that changes made there will be propagated to the + derived classes. + """ + + def __enter__(self): + """Called when 'with' is used to create an object. + Since we always return the instance, no need to override this.""" + return self + + def __exit__(self, *args): + """No need to override this - just override the close() function. 
+ Called when 'with' finishes.""" + self.close() + + ######################################### + # Abstract methods in alphabetical order + ######################################### + + @abstractmethod + def __init__(self, filename, mode="a"): + """Open file in read-only mode (mode="r") or allowing + writing of additional stuff and creating if empty + (default mode "a") """ + pass + + @abstractmethod + def close(self): + """Close file""" + pass + + @abstractmethod + def get_read(self, read_id): + """Return a read object containing all elements of the read.""" + pass + + @abstractmethod + def get_read_ids(self): + """Return list of read ids, or empty list if none present""" + pass + + @abstractmethod + def get_version_number(self): + """Return integer version number""" + pass + + @abstractmethod + def write_read(self, read_id, read): + """Write a read to the appropriate place in the file, starting from a read object""" + pass + + @abstractmethod + def write_version_number(self, version_number=_version): + """Get version number of file format""" + pass + + # This function is not abstract because it can be left as-is. + # But it may be overridden if there are speed gains to be had + def get_multiple_reads(self, read_id_list, return_list=True, max_reads=None): + """Get dictionary where keys are read ids from the list + and values are the read objects. If read_id_list=="all" then get + them all. + If return_list, then return a list of read objects where the read_ids + are incorporated in the dicts. + If not, then a dict of dicts where the keys are the read_ids. + If a read_id in the list is not present in the file, then just skip. + Don't raise an exception.""" + read_ids_in_file = self.get_read_ids() + if read_id_list == "all": + read_ids_used = read_ids_in_file + else: + read_ids_used = set(read_id_list).intersection(read_ids_in_file) + if max_reads is not None and max_reads < len(read_ids_used): + read_ids_used = list(read_ids_used)[:max_reads] + if return_list: + # Make a new read object containing the read_id as well as other items, for each read id in the list + return [Read({**self.get_read(read_id), 'read_id': read_id}) for read_id in read_ids_used] + else: + return {read_id: self.get_read(read_id) for read_id in read_ids_used} + + def check_read(self, read_id): + """Check a read in the currently open file, returning "pass" or a report + on errors.""" + try: + read = self.get_read(read_id) + except: + return "Unable to get read " + read_id + " from file" + return read.check() + + def check(self, limit_report_lines=100): + """Check the whole file, returning report in the form of a string""" + return_string = "" + try: + version_number = self.get_version_number() + return_string += Read._typecheck('version', version_number, 'int') + except: + return_string += "Can't get version number\n" + + read_ids = self.get_read_ids() + if len(read_ids) == 0: + return_string += "No reads in file\n" + for read_id in read_ids: + if return_string.count('\n') >= limit_report_lines: + return_string += "----------Number of lines in error report limited to " + \ + str(limit_report_lines) + "\n" + else: + read_check = self.check_read(read_id) + if read_check != "pass": + return_string += "Read " + read_id + ":\n" + read_check + if len(return_string) == 0: + return "pass" + else: + return return_string + + +class HDF5(AbstractMappedSignalFile): + """A file storing mapped data in an HDF5 in the simplest + possible way. + NOT using a derivative of the fast5 format. 
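A sketch of how the pieces above might fit together, assuming a mapped signal file called mapped_reads.hdf5 already exists (the file name and chunk length are arbitrary):

    from taiyaki import mapped_signal_files

    with mapped_signal_files.HDF5('mapped_reads.hdf5', 'r') as msf:
        for read in msf.get_multiple_reads('all', max_reads=10):
            chunk = read.get_chunk_with_sample_length(2000)
            if 'rejected' not in chunk:   # skip chunks that were filtered out
                print(chunk['read_id'], len(chunk['current']), len(chunk['sequence']))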
+ This is an HDF5 file with structure below + There can be as many read_ids as you like. + version is an attr, and the read data are stored + as Datasets or attributes as appropriate. + + file--|---Reads --|------ { + \ | { (All the read data for read 0) + version | + |------ { + | { (All the read data for read 1) + + + """ + + def __init__(self, filename, mode): + self.hdf5 = h5py.File(filename, mode) + + def close(self): + self.hdf5.close() + + def _get_read_path(self, read_id): + """Returns string giving path within HDF5 file to data for a read""" + return 'Reads/' + read_id + + def get_read(self, read_id): + """Return a read object (see class definition above).""" + h = self.hdf5[self._get_read_path(read_id)] + d = {} + for k, v in h.items(): # Iterate over datasets (the read group should have no subgroups) + d[k] = v[()] + for k, v in h.attrs.items(): # iterate over attributes + d[k] = v + return Read(d) + + def get_read_ids(self): + """Return list of read ids, or empty list if none present""" + try: + return list(self.hdf5['Reads'].keys()) + except: + return [] + + def get_version_number(self): + return self.hdf5.attrs['version'] + + def write_read(self, read_id, read): + """Write a read to the appropriate place in the file, starting from a read object""" + g = self.hdf5.create_group(self._get_read_path(read_id)) + for k, v in read.items(): + if isinstance(v, np.ndarray): + g.create_dataset(k, data=v) + else: + g.attrs[k] = v + + def write_version_number(self, version_number=_version): + self.hdf5.attrs['version'] = version_number diff --git a/taiyaki/mapping.py b/taiyaki/mapping.py new file mode 100644 index 0000000..62cc80a --- /dev/null +++ b/taiyaki/mapping.py @@ -0,0 +1,245 @@ +# Defines class to represent a read - used in chunkifying +# class also includes data and methods to deal with mapping table +# Also defines iterator giving all f5 files in directory, optionally using a read list + +import numpy as np +import sys +from taiyaki import mapped_signal_files + + +class Mapping: + """ + Represents a mapping between a signal and a reference, with attributes + including the signal and the reference. + + We use the trimming parameters from the signal object to set limits on + outputs (including chunks). + """ + + def __init__(self, signal, signalpos_to_refpos, reference, verbose=False): + """ + param: signal : a Signal object + param: signalpos_to_refpos : a numpy integer array of the same length as signal.untrimmed_dacs + where signalpos_to_refpos[n] is the location in the reference that + goes with location n in signal.untrimmed_dacs. A (-1) in the + vector indicates no association with that signal location. + param: reference : a bytes array or str containing the reference. 
+ (Note that this is converted into a str for use in the class) + param: verbose : Print information about newly constructed mapping object to stdout + (useful when writing new factory functions) + """ + + if len(signal.untrimmed_dacs) != len(signalpos_to_refpos): + raise Exception('Mapping: mapping vector is different length from untrimmed signal') + self.signal = signal + self.signalpos_to_refpos = signalpos_to_refpos + if isinstance(reference, str): + self.reference = reference + if verbose: + print("Created reference from str") + else: + try: + self.reference = reference.decode("utf-8") + if verbose: + print("Created reference from bytes") + except: + if verbose: + print("REFERENCE NOT SET") + raise Exception('Mapping: reference cannot be decoded as string or bytes array') + if verbose: + print("Signal constructor finished.") + print("Signal (trimmed) length:", self.signal.trimmed_length) + print("Mapping vector length:", len(self.signalpos_to_refpos)) + print("reference length", len(self.reference)) + + @property + def trimmed_length(self): + """Trimmed length of the signal in samples. Convenience function, + same as signal.trimmed_length""" + return self.signal.trimmed_length + + def mapping_limits(self, mapping_margin=0): + """Calculate start and (exclusive) endpoints for the signal + so that only the mapped portion of the signal is included. + + After finding endpoints for the mapped region, trim off another + mapping_margin samples from both ends. + + If resulting region is empty, then return start and end points + so that nothing is left. + + Take no notice at all of the signal's trimming parameters + (but see the function mapping_limits_with_signal_trim()) + + param: mapping_margin : extra number of samples to trim off both ends + + returns: (startsample, endsample_exc) + where self.signal.untrimmed_dacs[startsample:endsample_exc] + is the region that is included in the mapping. + """ + firstmapped, lastmapped = -1, -1 + for untrimmed_dacloc, refloc in enumerate(self.signalpos_to_refpos): + if refloc >= 0: + if firstmapped < 0: + firstmapped = untrimmed_dacloc + lastmapped = untrimmed_dacloc + if firstmapped >= 0: # If we have found any mapped locations + startloc = firstmapped + mapping_margin + endloc = lastmapped + 1 - mapping_margin + if startloc <= endloc - 1: + return startloc, endloc + # Trim to leave nothing + return 0, 0 + + def mapping_limits_with_signal_trim(self, mapping_margin=0): + """Calculate mapping limits as in the method mapping_limits() + and then find start and end points for the intersection of + the mapped region with the trimmed region of the signal. + Note mapping_margin is applied to the mapped region before + the signal trim is taken into account. + """ + mapstart, mapend_exc = self.mapping_limits(mapping_margin) + start = max(mapstart, self.signal.signalstart) + end_exc = min(mapend_exc, self.signal.signalend_exc) + + if start < end_exc: + return start, end_exc + else: + return 0, 0 + + @classmethod + def from_remapping_path(_class, signal, sigtoref_downsampled, reference, stride=1, signalstart=None): + """ + Construct Mapping object based on downsampled mapping information + (rather than just copying sigtoref). + Inputs: + sigtoref = a numpy int vector where sigtoref_downsampled[k] is the + location in the reference of the base starting at + untrimmed_dacs[k*stride-1+signalstart] + reference = a string containing the reference + By default, we assume that signalstart is self.signalstart, the trim start + stored within the signal object. 
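To make mapping_limits() concrete, a toy example using the array-only Signal constructor (defined in taiyaki/signal.py below); the numbers are invented:

    import numpy as np
    from taiyaki import mapping, signal

    sig = signal.Signal(dacs=np.arange(8, dtype=np.int16))
    sigtoref = np.array([-1, -1, 0, 1, 1, 2, -1, -1], dtype=np.int32)
    m = mapping.Mapping(sig, sigtoref, "ACG")
    print(m.mapping_limits())                   # (2, 6): samples 2..5 are mapped
    print(m.mapping_limits(mapping_margin=1))   # (3, 5): one extra sample trimmed from each end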
+ + There is a bit of freedom in where to put the signal locations + because of transition weights + and downsampling. We use a picture like this, shown for the case + stride = 2 + + + signal [0] [1] [2] [3] [4] [5] + blocks --------- --------- --------- + trans weights [0] [1] [2] + sigtoref [0] [1] [2] [3] + mapping to signal / / / + signal [0] [1] [2] [3] [4] [5] + + in other words sigtoref[n] maps to signal[stride*n-1] + Note that the very first element of the sigtoref vector that comes + from remapping is ignored with this choice + + """ + + if signalstart is None: + signalstart_used = signal.signalstart + else: + signalstart_used = signalstart + + # Create null dacstoref ranging of the full (not downsampled) range of locations + # and fill in the locations determined by the mapping + # -1 means nothing associated with this location + fullsigtoref = np.full(len(signal.untrimmed_dacs), -1, dtype=np.int32) + # Calculate locations in signal associated with each location in the input sigtoref + # There is a bit of freedom here: see the docstring + siglocs = np.arange(len(sigtoref_downsampled), dtype=np.int32) * stride - 1 + signalstart_used + # We keep only signal locations that are between 0 and len(untrimmed_dacs) + f = (siglocs >= 0) & (siglocs < len(fullsigtoref)) + # Put numbers in + + # print("Len(fullsigtoref)=",len(fullsigtoref)) + # print("Max(siglocs[f])=",np.max(siglocs[f])) + + fullsigtoref[siglocs[f]] = sigtoref_downsampled[f] + + return _class(signal, fullsigtoref, reference) + + def get_reftosignal(self): + """Return integer vector reftosig, mapping reference to signal. + + length of reftosig returned is (1+reflen), where reflen = len(self.reference) + + reftosig[n] is the location in the untrimmed dacs where the base at + self.reference[n] starts. + + The last element, reftosig[reflen] is consistent with this scheme: it is + (1 + (last location in untrimmed dacs)) + + if the start of the reference is not mapped, then reftosig will begin + with a sequence of (-1)s + + if the end of the reference is not mapped, then reftosig will end with + ... f, f, f, f] where f is the last mapped location in the signal. + """ + siglen = len(self.signal.untrimmed_dacs) + reflen = len(self.reference) + sig_to_ref_non_zero_idxs = np.nonzero(self.signalpos_to_refpos != -1)[0].astype(np.int32) + sig_to_ref_non_zeros = self.signalpos_to_refpos[sig_to_ref_non_zero_idxs] + copy_rights = np.diff(sig_to_ref_non_zeros) + + putative_ref_to_sig = np.repeat(sig_to_ref_non_zero_idxs[:-1], copy_rights) + putative_ref_to_sig = np.append(-1 * np.ones(sig_to_ref_non_zeros[0], dtype=np.int32), putative_ref_to_sig) + + ref_to_sig = np.append(putative_ref_to_sig, siglen * np.ones(reflen + + 1 - len(putative_ref_to_sig), dtype=np.int32)) + + if len(ref_to_sig) != (reflen + 1): + with open('/media/groups_cs2/res_algo/active/aevans/taiyakiExperiments/integrate/DUMP.txt', "w") as f: + f.write('[' + (','.join([str(i) for i in self.signalpos_to_refpos])) + ']\n') + raise Exception("Length of constructed reftosignal ({}) != reflen ({}) + 1".format(len(ref_to_sig), reflen)) + + return ref_to_sig + + def get_read_dictionary(self, shift, scale, read_id, check=True, alphabet="ACGT", collapse_alphabet=None): + """Return a read dictionary of the sort specified in mapped_signal_files.Read. + Note that we return the dictionary, not the object itself. + That's because this method is used inside worker processes which + need to pass their results out through the pickling mechanism in imap_mp. 
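To make the downsampling convention used by from_remapping_path() concrete: with stride 2 and signalstart 0, the first few block outputs are anchored at signal samples -1, 1, 3, 5; the sample at -1 falls outside the signal and is dropped, which is why the very first element of the remapped sigtoref vector is ignored. A minimal sketch (the stride, signalstart and vector length are hypothetical values):

    import numpy as np

    stride, signalstart = 2, 0
    siglocs = np.arange(4) * stride - 1 + signalstart
    print(siglocs)                 # [-1  1  3  5]
    print(siglocs[siglocs >= 0])   # [1 3 5]: locations actually written into fullsigtoref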
+ Apply error checking if check = True, and raise an Exception if it fails""" + readDict = { + 'alphabet': alphabet, + 'collapse_alphabet': alphabet if collapse_alphabet is None else collapse_alphabet, + 'shift_frompA': float(shift), + 'scale_frompA': float(scale), + 'range': float(self.signal.range), + 'offset': float(self.signal.offset), + 'digitisation': float(self.signal.digitisation), + 'Dacs': self.signal.untrimmed_dacs.astype(np.int16), + 'Ref_to_signal': self.get_reftosignal(), + 'Reference': np.array([alphabet.index(i) for i in self.reference], dtype=np.int16), + 'read_id': read_id + } + if check: + readObject = mapped_signal_files.Read(readDict) + checkstring = readObject.check() + if checkstring != "pass": + print("Channel info:") + for k, v in self.signal.channel_info.items(): + print(" ", k, v) + print("Read attributes:") + for k, v in self.signal.read_attributes.items(): + print(" ", k, v) + sys.stderr.write( + "Read object for {} to place in file doesn't pass tests:\n {}\n".format(read_id, checkstring)) + raise Exception("Read object failed error checking in mapping.get_read_dictionary()") + return readDict + + def to_ssv(self, filename, appendRef=True): + """Saves untrimmed dac, signal and mapping to a + space-separated file. If appendRef then add + the reference on the end, starting with a #""" + with open(filename, "w") as f: + f.write("dac signal dactoref\n") + for dac, refpos in zip(self.signal.untrimmed_dacs, self.signalpos_to_refpos): + sig = (dac + self.signal.offset) * self.signal.range / self.signal.digitisation + f.write(str(dac) + " " + str(sig) + " " + str(refpos) + "\n") + if appendRef: + f.write("#" + self.reference) diff --git a/taiyaki/maths.py b/taiyaki/maths.py new file mode 100644 index 0000000..e6d9637 --- /dev/null +++ b/taiyaki/maths.py @@ -0,0 +1,120 @@ +import numpy as np + + +def med_mad(data, factor=None, axis=None, keepdims=False): + """Compute the Median Absolute Deviation, i.e., the median + of the absolute deviations from the median, and the median + + :param data: A :class:`ndarray` object + :param factor: Factor to scale MAD by. Default (None) is to be consistent + with the standard deviation of a normal distribution + (i.e. mad( N(0,\sigma^2) ) = \sigma). + :param axis: For multidimensional arrays, which axis to calculate over + :param keepdims: If True, axis is kept as dimension of length 1 + + :returns: a tuple containing the median and MAD of the data + """ + if factor is None: + factor = 1.4826 + dmed = np.median(data, axis=axis, keepdims=True) + dmad = factor * np.median(abs(data - dmed), axis=axis, keepdims=True) + if axis is None: + dmed = dmed.flatten()[0] + dmad = dmad.flatten()[0] + elif not keepdims: + dmed = dmed.squeeze(axis) + dmad = dmad.squeeze(axis) + return dmed, dmad + + +def mad(data, factor=None, axis=None, keepdims=False): + """Compute the Median Absolute Deviation, i.e., the median + of the absolute deviations from the median, and (by default) + adjust by a factor for asymptotically normal consistency. + + :param data: A :class:`ndarray` object + :param factor: Factor to scale MAD by. Default (None) is to be consistent + with the standard deviation of a normal distribution + (i.e. mad( N(0,\sigma^2) ) = \sigma). + :param axis: For multidimensional arrays, which axis to calculate the median over. 
+ :param keepdims: If True, axis is kept as dimension of length 1 + + :returns: the (scaled) MAD + """ + _ , dmad = med_mad(data, factor=factor, axis=axis, keepdims=keepdims) + return dmad + + +def studentise(x, axis=None): + """ Studentise a numpy array along a given axis + :param x: A :class:`ndaray` + :param axis: axis over which to studentise + + :returns: A :class:`nd.array` with same shape as x + """ + m = np.mean(x, axis=axis, keepdims=True) + s = np.std(x, axis=axis, keepdims=True) + s = np.where(s > 0.0, s, 1.0) + return np.divide(x - m, s) + + +def geometric_prior(n, m, rev=False): + """ Log probabilities for random start time with geoemetric distribution + + :param n: length of output vector + :param m: mean of distribution + :param rev: reverse distribution + + :returns: A 1D :class:`ndarray` containing log probabilities + """ + p = 1.0 / (1.0 + m) + prior = np.repeat(np.log(p), n) + prior[1:] += np.arange(1, n) * np.log1p(-p) + if rev: + prior = prior[::-1] + return prior + + +def logsumexp(x, axis=None, keepdims=False): + """ Calculate log-sum-exp of an array in a stable manner + + log-sum-exp = log( sum_i exp x_i ) + + Calculation is stablised against under- and over-flow in the exponential by + finding the maximum value of the array x_M and calculating: + + log-sump-exp = x_M + log( sum_i exp(x_i - x_M) ) + + :param x: Array containing numbers whose log-sum-exp is desired + :param axis: Axis or axes along which the log-sum-exp are computed. The default + is to compute the log-sum-exp of the flattened array. + :param keepdims: If this is set to True, the axes which are reduced are left + in the result as dimensions with size one + + :returns: Array containing log-sum-exp + """ + maxX = np.amax(x, axis=axis, keepdims=True) + rem = np.log(np.sum(np.exp(x - maxX), axis=axis, keepdims=keepdims)) + maxX_out = maxX.reshape(np.shape(rem)) + return maxX_out + rem + + +def rle(x, tol=0): + """ Run length encoding of array x + + Note: where matching is done with some tolerance, the first element + of the run is chosen as representative. 
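A short illustration of med_mad() and logsumexp() as defined above (the input arrays are arbitrary):

    import numpy as np
    from taiyaki import maths

    x = np.array([1.0, 2.0, 2.5, 3.0, 100.0])            # one gross outlier
    print(maths.med_mad(x))                               # (2.5, 0.7413): robust location and scale
    print(maths.logsumexp(np.array([1000.0, 1000.0])))    # ~1000.693; a naive exp() would overflow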
+ + :param x: array + :param tol: tolerance of match (for continuous arrays) + + :returns: tuple of array containing elements of x and array containing + length of run + """ + + delta_x = np.ediff1d(x, to_begin=1) + starts = np.where(np.absolute(delta_x) > tol)[0] + last_runlength = len(x) - starts[-1] + runlength = np.ediff1d(starts, to_end=last_runlength) + + return (x[starts], runlength) diff --git a/taiyaki/prepare_mapping_funcs.py b/taiyaki/prepare_mapping_funcs.py new file mode 100644 index 0000000..c47709c --- /dev/null +++ b/taiyaki/prepare_mapping_funcs.py @@ -0,0 +1,116 @@ +import numpy as np +import sys +from ont_fast5_api import fast5_interface +import torch +from taiyaki import flipflop_remap, helpers, mapping, mapped_signal_files, signal +from taiyaki.config import taiyaki_dtype +from taiyaki.fileio import readtsv + + +def oneread_remap(read_tuple, references, model, device, per_read_params_dict, + alphabet, collapse_alphabet): + """ Worker function for remapping reads using flip-flop model on raw signal + :param read_tuple : read, identified by a tuple (filepath, read_id) + :param references :dict mapping fast5 filenames to reference strings + :param model :pytorch model (the torch data structure, not a filename) + :param device :integer specifying which GPU to use for remapping, or 'cpu' to use CPU + :param per_read_params_dict :dictionary where keys are UUIDs, values are dicts containing keys + trim_start trim_end shift scale + :param alphabet : alphabet for basecalling (passed on to mapped-read file) + :param collapse_alphabet : collapsed alphabet for basecalling (passed on to mapped-read file) + + :returns: dictionary as specified in mapped_signal_files.Read class + """ + filename, read_id = read_tuple + try: + with fast5_interface.get_fast5_file(filename, 'r') as f5file: + read = f5file.get_read(read_id) + sig = signal.Signal(read) + except Exception as e: + # We want any single failure in the batch of reads to not disrupt other reads being processed. + sys.stderr.write('No read information on read {} found in file {}.\n{}\n'.format(read_id, filename, repr(e))) + return None + + if read_id in references: + read_ref = references[read_id].decode("utf-8") + else: + sys.stderr.write('No fasta reference found for {}.\n'.format(read_id)) + return None + + if read_id in per_read_params_dict: + read_params_dict = per_read_params_dict[read_id] + else: + return None + + sig.set_trim_absolute(read_params_dict['trim_start'], read_params_dict['trim_end']) + + try: + torch.set_num_threads(1) # Prevents torch doing its own parallelisation on top of our imap_map + # Standardise (i.e. 
shift/scale so that approximately mean =0, std=1) + signalArray = (sig.current - read_params_dict['shift']) / read_params_dict['scale'] + # Make signal into 3D tensor with shape [siglength,1,1] and move to appropriate device (GPU number or CPU) + signalTensor = torch.tensor(signalArray[:, np.newaxis, np.newaxis].astype(taiyaki_dtype) , device=device) + # The model must live on the same device + modelOnDevice = model.to(device) + # Apply the network to the signal, generating transition weight matrix, and put it back into a numpy array + with torch.no_grad(): + transweights = modelOnDevice(signalTensor).cpu().numpy() + except Exception as e: + sys.stderr.write("Failure applying basecall network to remap read {}.\n{}\n".format(sig.read_id, repr(e))) + return None + + # Extra dimensions introduced by np.newaxis above removed by np.squeeze + remappingscore, path = flipflop_remap.flipflop_remap( + np.squeeze(transweights), read_ref, localpen=0.0) + # read_ref comes out as a bytes object, so we need to convert to str + # localpen=0.0 does local alignment + + # flipflop_remap() establishes a mapping between the network outputs and the reference. + # What we need is a mapping between the signal and the reference. + # To resolve this we need to know the stride of the model (how many samples for each network output) + model_stride = helpers.guess_model_stride(model, device=device) + remapping = mapping.Mapping.from_remapping_path(sig, path, read_ref, model_stride) + + return remapping.get_read_dictionary(read_params_dict['shift'], read_params_dict['scale'], read_id, + alphabet=alphabet, collapse_alphabet=collapse_alphabet) + + + +def generate_output_from_results(results, args): + """ + Given an iterable of dictionaries, each representing the results of mapping + a single read, output a mapped-read file. 
+ This version outputs to the V7 'chunk' file format (actually containing mapped reads, not chunks) + + param: results : an iterable of read dictionaries + (with mappings) + param: args : command line args object + """ + progress = helpers.Progress() + + # filter removes None and False and 0; filter(None, is same as filter(o:o, + read_ids = [] + with mapped_signal_files.HDF5(args.output, "w") as f: + f.write_version_number() + for readnumber, resultdict in enumerate(filter(None, results)): + progress.step() + read_id = resultdict['read_id'] + read_ids.append(read_id) + f.write_read(read_id, mapped_signal_files.Read(resultdict)) + + +def get_per_read_params_dict_from_tsv(input_file): + """Load per read parameter .tsv into a np array and parse into a dictionary + :param input_file : filename including path for the tsv file + :returns: dictionary with keys being UUIDs, values being named + tuple('per_read_params', 'trim_start trim_end shift scale')""" + try: + per_read_params_array = readtsv(input_file, ['UUID', 'trim_start', 'trim_end', 'shift', 'scale']) + except Exception as e: + sys.stderr.write('Failed to get per-read parameters from {}.\n{}\n'.format(input_file, repr(e))) + return None + + per_read_params_dict = {} + for row in per_read_params_array: + per_read_params_dict[row[0]] = {'trim_start': row[1], 'trim_end': row[2], 'shift': row[3], 'scale': row[4]} + return per_read_params_dict diff --git a/taiyaki/signal.py b/taiyaki/signal.py new file mode 100644 index 0000000..4ae21cf --- /dev/null +++ b/taiyaki/signal.py @@ -0,0 +1,99 @@ +# Defines class to represent a signal - used in chunkifying + +from taiyaki import fast5utils + + +class Signal: + """ + Represents a read, with constructor + obtaining signal data from a fast5 file + or from a numpy array for testing. + + The only fiddly bit is that .untrimmed_dacs contains + all the (integer) current numbers available, while + .dacs and .current are trimmed according to the trimming + parameters. + """ + + def __init__(self, read=None, dacs=None): + """Loads data from read in fast5 file. + If read is None + and dacs is a np array then initialse the untrimmed_dacs to this array. + (this allows testing with non-fast5 data) + + param read : an ont_fast5_api read object + param dacs : np int array (only used if first param is None) + """ + if read is None: + try: + self.untrimmed_dacs = dacs.copy() + except: + raise Exception("Cannot initialise SignalWithMap object") + self.offset = 0 + self.range = 1 + self.digitisation = 1 + else: + self.channel_info = {k: v for k, v in fast5utils.get_channel_info(read).items()} + # channel_info contains attributes of the channel such as calibration parameters and sample rate + self.read_attributes = {k: v for k, v in fast5utils.get_read_attributes(read).items()} + # read_attributes includes read id, start time, and active mux + #print("Channel info:",[(k,v) for k,v in self.channel_info.items()]) + #print("Read attributes:",[(k,v) for k,v in self.read_attributes.items()]) + # the sample number (counted from when the device was switched on) when the signal starts + self.start_sample = self.read_attributes['start_time'] + self.sample_rate = self.channel_info['sampling_rate'] + # a unique key corresponding to this read + self.read_id = self.read_attributes['read_id'].decode("utf-8") + # digitised current levels. + # this function returns a copy, not a reference. 
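A toy use of the Signal class with the array-only constructor described above, together with the trimming methods defined further down in this file (the numbers are arbitrary):

    import numpy as np
    from taiyaki.signal import Signal

    sig = Signal(dacs=np.arange(100, dtype=np.int16))
    sig.set_trim_absolute(10, 5)     # drop 10 samples from the start, 5 from the end
    print(sig.trimmed_length)        # 85
    print(len(sig.dacs))             # 85; untrimmed_dacs is left untouched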
+ self.untrimmed_dacs = read.get_raw_data() + # parameters to convert between DACs and picoamps + self.range = self.channel_info['range'] + self.offset = self.channel_info['offset'] + self.digitisation = self.channel_info['digitisation'] + + # We want to allow trimming without mucking about with the original data + # To start with, set trimming parameters to trim nothing + self.signalstart = 0 + # end is defined exclusively so that self.dacs[signalstart:signalend_exc] is the bit we want. + self.signalend_exc = len(self.untrimmed_dacs) + + def set_trim_absolute(self, trimstart, trimend): + """trim trimstart samples from the start and trimend samples from the end, starting + with the whole stored data set (not starting with the existing trimmed ends) + """ + untrimmed_len = len(self.untrimmed_dacs) + if trimstart < 0 or trimend < 0: + raise Exception("Can't trim a negative amount off the end of a signal vector.") + if trimstart + trimend >= untrimmed_len: # Nothing left! + trimstart = 0 + trimend = 0 + self.signalstart = trimstart + self.signalend_exc = untrimmed_len - trimend + + def set_trim_relative(self, trimstart, trimend): + """trim trimstart samples from the start and trimend samples from the end, starting + with the existing trimmed ends + """ + untrimmed_len = len(self.untrimmed_dacs) + self.set_trim_absolute(self.signalstart + trimstart, (untrimmed_len - self.signalend_exc) + trimend) + + @property + def dacs(self): + """dac numbers, trimmed according to trimming parameters""" + return self.untrimmed_dacs[self.signalstart:self.signalend_exc].copy() + + @property + def untrimmed_current(self): + """Signal measured in pA, untrimmed""" + return (self.untrimmed_dacs + self.offset) * self.range / self.digitisation + + @property + def current(self): + """Signal measured in pA, trimmed according to trimming parameters""" + return (self.dacs + self.offset) * self.range / self.digitisation + + @property + def trimmed_length(self): + """Trimmed length of the signal in samples""" + return self.signalend_exc - self.signalstart diff --git a/taiyaki/squiggle_match/__init__.py b/taiyaki/squiggle_match/__init__.py new file mode 100644 index 0000000..13fc93c --- /dev/null +++ b/taiyaki/squiggle_match/__init__.py @@ -0,0 +1 @@ +from .squiggle_match import * diff --git a/taiyaki/squiggle_match/c_squiggle_match.c b/taiyaki/squiggle_match/c_squiggle_match.c new file mode 100644 index 0000000..c25f164 --- /dev/null +++ b/taiyaki/squiggle_match/c_squiggle_match.c @@ -0,0 +1,749 @@ +#define _BSD_SOURCE 1 +#include +#include +#include +#include + +static float LARGE_VAL = 1e30f; +static size_t nparam = 3; + +static inline float loglaplace(float x, float loc, float sc, float logsc){ + return -fabsf(x - loc) / sc - logsc - M_LN2; +} + +static inline float laplace(float x, float loc, float sc, float logsc){ + return expf(loglaplace(x, loc, sc, logsc)); +} + +static inline float dloglaplace_loc(float x, float loc, float sc, float logsc){ + return ((x > loc) - (x < loc)) / sc; +} + +static inline float dloglaplace_scale(float x, float loc, float sc, float logsc){ + return (fabsf(x - loc) / sc - 1.0) / sc; +} + +static inline float dloglaplace_logscale(float x, float loc, float sc, float logsc){ + return fabsf(x - loc) / sc - 1.0; +} + +static inline float dlaplace_loc(float x, float loc, float sc, float logsc){ + return laplace(x, loc, sc, logsc) * dloglaplace_loc(x, loc, sc, logsc); +} + +static inline float dlaplace_scale(float x, float loc, float sc, float logsc){ + return laplace(x, loc, sc, logsc) * 
dloglaplace_scale(x, loc, sc, logsc); +} + +static inline float dlaplace_logscale(float x, float loc, float sc, float logsc){ + return laplace(x, loc, sc, logsc) * dloglaplace_logscale(x, loc, sc, logsc); +} + +static inline float plogisticf(float x){ + return 0.5f * (1.0f + tanhf(x / 2.0f)); +} + +static inline float logplogisticf(float x){ + return -log1pf(expf(-x)); +} + +static inline float qlogisticf(float p){ + return 2.0f * atanhf(2.0f * p - 1.0f); +} + +static inline float dlogisticf(float x){ + const float p = plogisticf(x); + return p * (1.0f - p); +} + + + +static inline float logsumexp(float x, float y){ + return fmaxf(x, y) + log1pf(expf(-fabsf(x-y))); +} + + +static inline float max_array(const float * x, size_t n){ + float max = x[0]; + for(size_t i=1 ; i < n ; i++){ + if(x[i] > max){ + max = x[i]; + } + } + return max; +} + +static inline float sum_array(const float * x, size_t n){ + float sum = x[0]; + for(size_t i=1 ; i < n ; i++){ + sum += x[i]; + } + return sum; +} + +static inline float logsum_array(const float * x, size_t n){ + float sum = x[0]; + for(size_t i=1 ; i < n ; i++){ + sum = logsumexp(sum, x[i]); + } + return sum; +} + +static inline void softmax_inplace(float * x, size_t n){ + const float xmax = max_array(x, n); + + for(size_t i=0 ; i < n ; i++){ + x[i] = expf(x[i] - xmax); + } + + const float sum = sum_array(x, n); + for(size_t i=0 ; i < n ; i++){ + x[i] /= sum; + } +} + + + +float squiggle_match_forward(float const * signal, size_t nsample, float const * params, size_t ldp, + float const * scale, size_t npos, float prob_back, float * fwd){ + assert(nsample > 0); + assert(npos > 0); + assert(NULL != signal); + assert(NULL != params); + assert(NULL != scale); + assert(NULL != fwd); + const size_t nstate = 2 * npos; + + const float move_back_pen = logf(prob_back); + const float stay_in_back_pen = logf(0.5f); + const float move_from_back_pen = logf(0.5f); + + float * move_pen = calloc(npos, sizeof(float)); + float * stay_pen = calloc(npos, sizeof(float)); + for(size_t pos=0 ; pos < npos ; pos++){ + const float mp = (1.0f - prob_back) * plogisticf(params[pos * ldp + 2]); + move_pen[pos] = logf(mp); + stay_pen[pos] = log1pf(-mp - prob_back); + } + + // Point prior -- must start at beginning of sequence + for(size_t pos=0 ; pos < nstate ; pos++){ + fwd[pos] = -LARGE_VAL; + } + fwd[0] = 0.0; + + for(size_t sample=0 ; sample < nsample ; sample++){ + const size_t fwd_prev_off = sample * nstate; + const size_t fwd_curr_off = sample * nstate + nstate; + + for(size_t pos=0 ; pos < npos ; pos++){ + // Stay in same position + fwd[fwd_curr_off + pos] = fwd[fwd_prev_off + pos] + stay_pen[pos]; + } + for(size_t pos=0 ; pos < npos ; pos++){ + // Stay in backwards state + fwd[fwd_curr_off + npos + pos] = fwd[fwd_prev_off + npos + pos] + stay_in_back_pen; + } + for(size_t pos=1 ; pos < npos ; pos++){ + // Move to next position + fwd[fwd_curr_off + pos] = logsumexp(fwd[fwd_curr_off + pos], + fwd[fwd_prev_off + pos - 1] + move_pen[pos]); + } + for(size_t pos=1 ; pos < npos ; pos++){ + // Move backwards + fwd[fwd_curr_off + npos + pos - 1] = logsumexp(fwd[fwd_curr_off + npos + pos - 1], + fwd[fwd_prev_off + pos] + move_back_pen); + } + for(size_t pos=1 ; pos< npos ; pos++){ + // Move from back state + fwd[fwd_curr_off + pos] = logsumexp(fwd[fwd_curr_off + pos], + fwd[fwd_prev_off + npos + pos - 1] + move_from_back_pen); + } + + for(size_t pos=0 ; pos < npos ; pos++){ + // Add on emission + const float location = params[pos * ldp + 0]; + const float logscale = params[pos * ldp + 
1]; + const float logscore = loglaplace(signal[sample], location, scale[pos], logscale); + fwd[fwd_curr_off + pos] += logscore; + fwd[fwd_curr_off + npos + pos] += logscore; + } + } + + free(move_pen); + free(stay_pen); + + // Must finish in final position + return fwd[nsample * nstate + npos - 1]; +} + + +float squiggle_match_backward(float const * signal, size_t nsample, float const * params, size_t ldp, + float const * scale, size_t npos, float prob_back, float * bwd){ + assert(nsample > 0); + assert(npos > 0); + assert(NULL != signal); + assert(NULL != params); + assert(NULL != scale); + assert(NULL != bwd); + const size_t nstate = 2 * npos; + + const float move_back_pen = logf(prob_back); + const float stay_in_back_pen = logf(0.5f); + const float move_from_back_pen = logf(0.5f); + + float * move_pen = calloc(npos, sizeof(float)); + float * stay_pen = calloc(npos, sizeof(float)); + for(size_t pos=0 ; pos < npos ; pos++){ + const float mp = (1.0f - prob_back) * plogisticf(params[pos * ldp + 2]); + move_pen[pos] = logf(mp); + stay_pen[pos] = log1pf(-mp - prob_back); + } + + + float * tmp = calloc(nstate, sizeof(float)); + + // Point prior -- must start at end of sequence + for(size_t pos=0 ; pos < nstate ; pos++){ + bwd[nstate * nsample + pos] = -LARGE_VAL; + } + bwd[nstate * nsample + npos - 1] = 0.0; + + for(size_t sample=nsample ; sample > 0 ; sample--){ + const size_t bwd_prev_off = sample * nstate; + const size_t bwd_curr_off = sample * nstate - nstate; + for(size_t pos=0 ; pos < npos ; pos++){ + const float location = params[pos * ldp + 0]; + const float logscale = params[pos * ldp + 1]; + const float logscore = loglaplace(signal[sample - 1], location, scale[pos], logscale); + tmp[pos] = bwd[bwd_prev_off + pos] + logscore; + tmp[npos + pos] = bwd[bwd_prev_off + npos + pos] + logscore; + } + for(size_t pos=0 ; pos < npos ; pos++){ + // Stay in position + bwd[bwd_curr_off + pos] = tmp[pos] + stay_pen[pos]; + } + for(size_t pos=1 ; pos < npos ; pos++){ + // Move to next position + bwd[bwd_curr_off + pos - 1] = logsumexp(bwd[bwd_curr_off + pos - 1], + tmp[pos] + move_pen[pos]); + } + for(size_t pos=0 ; pos < npos ; pos++){ + // Stay in back state + bwd[bwd_curr_off + npos + pos] = tmp[npos + pos] + stay_in_back_pen; + } + for(size_t pos=1 ; pos < npos ; pos++){ + // Move out of back state + bwd[bwd_curr_off + npos + pos - 1] = logsumexp(bwd[bwd_curr_off + npos + pos - 1], + tmp[pos] + move_from_back_pen); + } + for(size_t pos=1 ; pos < npos ; pos++){ + // Move into back state + bwd[bwd_curr_off + pos] = logsumexp(bwd[bwd_curr_off + pos], + tmp[npos + pos - 1] + move_back_pen); + } + } + + free(move_pen); + free(stay_pen); + + free(tmp); + + // Must start in first position + return bwd[0]; +} + + +float squiggle_match_viterbi(float const * signal, size_t nsample, float const * params, size_t ldp, + float const * scale, size_t npos, float prob_back, float localpen, + float minscore, int32_t * path, float * fwd){ + assert(nsample > 0); + assert(npos > 0); + assert(NULL != signal); + assert(NULL != params); + assert(NULL != scale); + assert(NULL != path); + assert(NULL != fwd); + const size_t nfstate = npos + 2; + const size_t nstate = npos + nfstate; + + const float move_back_pen = logf(prob_back); + const float stay_in_back_pen = logf(0.5f); + const float move_from_back_pen = logf(0.5f); + + float * move_pen = calloc(nfstate, sizeof(float)); + float * stay_pen = calloc(nfstate, sizeof(float)); + { + float mean_move_pen = 0.0f; + float mean_stay_pen = 0.0f; + for(size_t pos=0 ; pos < npos 
; pos++){ + const float mp = (1.0f - prob_back) * plogisticf(params[pos * ldp + 2]); + move_pen[pos + 1] = logf(mp); + stay_pen[pos + 1] = log1pf(-mp - prob_back); + mean_move_pen += move_pen[pos + 1]; + mean_stay_pen += stay_pen[pos + 1]; + } + mean_move_pen /= npos; + mean_stay_pen /= npos; + + move_pen[0] = mean_move_pen; + move_pen[nfstate - 1] = mean_move_pen; + stay_pen[0] = mean_stay_pen; + stay_pen[nfstate - 1] = mean_stay_pen; + } + + for(size_t st=0 ; st < nstate ; st++){ + // States are start .. positions .. end + fwd[st] = -LARGE_VAL; + } + // Must begin in start state + fwd[0] = 0.0; + + int32_t * traceback = calloc(nsample * nstate, sizeof(int32_t)); + + for(size_t sample=0 ; sample < nsample ; sample++){ + const size_t fwd_prev_off = (sample % 2) * nstate; + const size_t fwd_curr_off = ((sample + 1) % 2) * nstate; + const size_t tr_off = sample * nstate; + + for(size_t st=0 ; st < nfstate ; st++){ + // Stay in start, end or normal position + fwd[fwd_curr_off + st] = fwd[fwd_prev_off + st] + stay_pen[st]; + traceback[tr_off + st] = st; + } + for(size_t st=0 ; st < npos ; st++){ + // Stay in back position + const size_t idx = nfstate + st; + fwd[fwd_curr_off + idx] = fwd[fwd_prev_off + idx] + stay_in_back_pen; + traceback[tr_off + idx] = idx; + } + for(size_t st=1 ; st < nfstate ; st++){ + // Move to next position + const float step_score = fwd[fwd_prev_off + st - 1] + move_pen[st - 1]; + if(step_score > fwd[fwd_curr_off + st]){ + fwd[fwd_curr_off + st] = step_score; + traceback[tr_off + st] = st - 1; + } + } + for(size_t destpos=1 ; destpos < npos ; destpos++){ + const size_t destst = destpos + 1; + // Move from start into sequence + const float score = fwd[fwd_prev_off] + move_pen[0] - localpen * destpos; + if(score > fwd[fwd_curr_off + destst]){ + fwd[fwd_curr_off + destst] = score; + traceback[tr_off + destst] = 0; + } + } + for(size_t origpos=0 ; origpos < (npos - 1) ; origpos++){ + const size_t destst = nfstate - 1; + const size_t origst = origpos + 1; + const size_t deltapos = npos - 1 - origpos; + // Move from sequence into end + const float score = fwd[fwd_prev_off + origst] + move_pen[origst] - localpen * deltapos; + if(score > fwd[fwd_curr_off + destst]){ + fwd[fwd_curr_off + destst] = score; + traceback[tr_off + destst] = origst; + } + } + for(size_t st=1 ; st < npos ; st++){ + // Move to back + const float back_score = fwd[fwd_prev_off + st + 1] + move_back_pen; + if(back_score > fwd[fwd_curr_off + nfstate + st - 1]){ + fwd[fwd_curr_off + nfstate + st - 1] = back_score; + traceback[tr_off + nfstate + st - 1] = st + 1; + } + } + for(size_t st=1 ; st < npos ; st++){ + // Move from back + const float back_score = fwd[fwd_prev_off + nfstate + st - 1] + move_from_back_pen; + if(back_score > fwd[fwd_curr_off + st + 1]){ + fwd[fwd_curr_off + st + 1] = back_score; + traceback[tr_off + st + 1] = nfstate + st - 1; + } + } + + + for(size_t pos=0 ; pos < npos ; pos++){ + // Add on score for samples + const float location = params[pos * ldp + 0]; + const float logscale = params[pos * ldp + 1]; + const float logscore = fmaxf(-minscore, loglaplace(signal[sample], location, scale[pos], logscale)); + // State to add to is offset by one because of start state + fwd[fwd_curr_off + pos + 1] += logscore; + fwd[fwd_curr_off + nfstate + pos] += logscore; + } + + // Score for start and end states + fwd[fwd_curr_off + 0] -= localpen; + fwd[fwd_curr_off + nfstate - 1] -= localpen; + + } + + // Score of best path and final states. 
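For readers who find Python easier to scan than C, a minimal sketch of the Laplace emission term that squiggle_match_forward, squiggle_match_backward and squiggle_match_viterbi each add per sample (here scale is the linear scale, i.e. the exponential of the logscale parameter):

    import numpy as np

    def loglaplace(x, loc, scale):
        # log density of a Laplace distribution, matching loglaplace() in this file
        return -np.abs(x - loc) / scale - np.log(scale) - np.log(2.0)

    print(loglaplace(10.1, loc=10.0, scale=1.0))   # -0.1 - log(2), about -0.793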
Could be either last position or end state + const size_t fwd_offset = (nsample % 2) * nstate; + const float score = fmaxf(fwd[fwd_offset + nfstate - 2], fwd[fwd_offset + nfstate - 1]); + if(fwd[fwd_offset + nfstate - 2] > fwd[fwd_offset + nfstate - 1]){ + path[nsample - 1] = nfstate - 2; + } else { + path[nsample - 1] = nfstate - 1; + } + + for(size_t sample=1 ; sample < nsample ; sample++){ + const size_t rs = nsample - sample; + const size_t tr_off = rs * nstate; + path[rs - 1] = traceback[tr_off + path[rs]]; + } + free(traceback); + + // Correct path so start and end states are encoded as -1, other states as positions + { + size_t sample_min = 0; + size_t sample_max = nsample; + for(; sample_min < nsample ; sample_min++){ + if(0 != path[sample_min]){ + break; + } + path[sample_min] = -1; + } + for(; sample_max > 0 ; sample_max--){ + if(nfstate - 1 != path[sample_max - 1]){ + break; + } + path[sample_max - 1] = -1; + } + for(size_t sample=sample_min ; sample < sample_max ; sample++){ + assert(path[sample] > 0); + if(path[sample] >= nfstate){ + path[sample] -= nfstate; + } else { + path[sample] -= 1; + } + } + } + + free(move_pen); + free(stay_pen); + + return score; +} + + +void squiggle_match_cost(float const * signal, int32_t const * siglen, size_t nbatch, + float const * params, size_t npos, float prob_back, float * score){ + size_t sigidx[nbatch]; + sigidx[0] = 0; + for(size_t idx=1 ; idx < nbatch ; idx++){ + sigidx[idx] = sigidx[idx - 1] + siglen[idx - 1]; + } + +#pragma omp parallel for + for(size_t batch=0 ; batch < nbatch ; batch++){ + const size_t nsample = siglen[batch]; + const size_t signal_offset = sigidx[batch]; + const size_t param_offset = batch * nparam; + const size_t ldp = nbatch * nparam; + + float * fwd = calloc(2 * npos * (nsample + 1), sizeof(float)); + float * scale = calloc(npos, sizeof(float)); + for(size_t pos=0 ; pos < npos ; pos++){ + scale[pos] = expf(params[param_offset + pos * ldp + 1]); + } + score[batch] = squiggle_match_forward(signal + signal_offset, nsample, params + param_offset, + ldp, scale, npos, prob_back, fwd); + free(scale); + free(fwd); + } +} + + +void squiggle_match_scores_fwd(float const * signal, int32_t const * siglen, size_t nbatch, + float const * params, size_t npos, float prob_back, float * score){ + squiggle_match_cost(signal, siglen, nbatch, params, npos, prob_back, score); +} + + +void squiggle_match_scores_bwd(float const * signal, int32_t const * siglen, size_t nbatch, + float const * params, size_t npos, float prob_back, float * score){ + size_t sigidx[nbatch]; + sigidx[0] = 0; + for(size_t idx=1 ; idx < nbatch ; idx++){ + sigidx[idx] = sigidx[idx - 1] + siglen[idx - 1]; + } + +#pragma omp parallel for + for(size_t batch=0 ; batch < nbatch ; batch++){ + const size_t nsample = siglen[batch]; + const size_t signal_offset = sigidx[batch]; + const size_t param_offset = batch * nparam; + const size_t ldp = nbatch * nparam; + float * bwd = calloc(2 * npos * (nsample + 1), sizeof(float)); + float * scale = calloc(npos, sizeof(float)); + for(size_t pos=0 ; pos < npos ; pos++){ + scale[pos] = expf(params[param_offset + pos * ldp + 1]); + } + score[batch] = squiggle_match_backward(signal + signal_offset, nsample, params + param_offset, + ldp, scale, npos, prob_back, bwd); + free(scale); + free(bwd); + } +} + + +void squiggle_match_viterbi_path(float const * signal, int32_t const * siglen, size_t nbatch, + float const * params, size_t npos, float prob_back, float localpen, + float minscore, int32_t * path, float * score){ + size_t 
sigidx[nbatch]; + sigidx[0] = 0; + for(size_t idx=1 ; idx < nbatch ; idx++){ + sigidx[idx] = sigidx[idx - 1] + siglen[idx - 1]; + } + +#pragma omp parallel for + for(size_t batch=0 ; batch < nbatch ; batch++){ + const size_t nsample = siglen[batch]; + const size_t signal_offset = sigidx[batch]; + const size_t param_offset = batch * nparam; + const size_t ldp = nbatch * nparam; + const size_t nstate = 2 * npos + 2; + float * fwd = calloc(2 * nstate, sizeof(float)); + float * scale = calloc(npos, sizeof(float)); + for(size_t pos=0 ; pos < npos ; pos++){ + scale[pos] = expf(params[param_offset + pos * ldp + 1]); + } + score[batch] = squiggle_match_viterbi(signal + signal_offset, nsample, params + param_offset, + ldp, scale, npos, prob_back, localpen, minscore, + path + signal_offset, fwd); + free(scale); + free(fwd); + } +} + + + +float squiggle_match_posterior(float const * signal, size_t nsample, float const * params, size_t ldp, + float const * scale, size_t npos, float prob_back, float * post){ + const size_t nstate = 2 * npos; + float * fwd = post; + float * bwd = calloc(nstate * (nsample + 1), sizeof(float)); + float score = squiggle_match_forward(signal, nsample, params, ldp, scale, npos, prob_back, fwd); + squiggle_match_backward(signal, nsample, params, ldp, scale, npos, prob_back, bwd); + + for(size_t sample=1 ; sample <= nsample ; sample++){ + const size_t offset = sample * nstate; + + // Normalised to form posteriors + { + for(size_t pos=0 ; pos < nstate ; pos++){ + fwd[offset + pos] += bwd[offset + pos]; + } + + softmax_inplace(fwd + offset, nstate); + } + } + free(bwd); + + return score; +} + + + +void squiggle_match_grad(float const * signal, int32_t const * siglen, size_t nbatch, + float const * params, size_t npos, float prob_back, float * grad){ + size_t sigidx[nbatch]; + sigidx[0] = 0; + for(size_t idx=1 ; idx < nbatch ; idx++){ + sigidx[idx] = sigidx[idx - 1] + siglen[idx - 1]; + } + +#pragma omp parallel for + for(size_t batch=0 ; batch < nbatch ; batch++){ + const size_t nsample = siglen[batch]; + const size_t signal_offset = sigidx[batch]; + const size_t param_offset = batch * nparam; + const size_t ldp = nbatch * nparam; + const size_t nstate = 2 * npos; + float * fwd = calloc(nstate * (nsample + 1), sizeof(float)); + float * bwd = calloc(nstate * (nsample + 1), sizeof(float)); + float * scale = calloc(npos, sizeof(float)); + for(size_t pos=0 ; pos < npos ; pos++){ + scale[pos] = expf(params[param_offset + pos * ldp + 1]); + } + squiggle_match_forward(signal + signal_offset, nsample, params + param_offset, + ldp, scale, npos, prob_back, fwd); + squiggle_match_backward(signal + signal_offset, nsample, params + param_offset, + ldp, scale, npos, prob_back, bwd); + + + for(size_t pos=0 ; pos < npos ; pos++){ + grad[param_offset + pos * ldp + 0] = 0.0f; + grad[param_offset + pos * ldp + 1] = 0.0f; + grad[param_offset + pos * ldp + 2] = 0.0f; + } + + for(size_t sample=1 ; sample <= nsample ; sample++){ + const size_t offset = sample * nstate; + const float sig = signal[signal_offset + sample - 1]; + + // Normalised to form posteriors + float fact = fwd[offset] + bwd[offset]; + { + for(size_t st=1 ; st < nstate ; st++){ + fact = logsumexp(fact, fwd[offset + st] + bwd[offset + st]); + } + } + + for(size_t pos=0 ; pos < npos ; pos++){ + const float loc = params[param_offset + pos * ldp + 0]; + const float logsc = params[param_offset + pos * ldp + 1]; + const float prob_pos = expf(fwd[offset + pos] + bwd[offset + pos] - fact); + const float prob_posnpos = expf(fwd[offset + npos 
+ pos] + bwd[offset + npos + pos] - fact); + grad[param_offset + pos * ldp + 0] += (prob_pos + prob_posnpos) * dloglaplace_loc(sig, loc, scale[pos], logsc); + grad[param_offset + pos * ldp + 1] += (prob_pos + prob_posnpos) * dloglaplace_logscale(sig, loc, scale[pos], logsc); + } + + for(size_t pos=0 ; pos < npos ; pos++){ + const float loc = params[param_offset + pos * ldp + 0]; + const float logsc = params[param_offset + pos * ldp + 1]; + const float logem = loglaplace(sig, loc, scale[pos], logsc); + const float pprob_pos = expf(fwd[offset - nstate + pos] + bwd[offset + pos] + logem - fact); + const float move_pen = plogisticf(params[param_offset + pos * ldp + 2]); + const float dlogisticf_move_pen = (1.0f - prob_back) * move_pen * (1.0f - move_pen); + grad[param_offset + pos * ldp + 2] -= pprob_pos * dlogisticf_move_pen; + } + + + for(size_t pos=1 ; pos < npos ; pos++){ + const float loc = params[param_offset + pos * ldp + 0]; + const float logsc = params[param_offset + pos * ldp + 1]; + const float logem = loglaplace(sig, loc, scale[pos], logsc); + const float pprob_pos = expf(fwd[offset - nstate + pos - 1] + bwd[offset + pos] + logem - fact); + const float move_pen = plogisticf(params[param_offset + pos * ldp + 2]); + const float dlogisticf_move_pen = (1.0f - prob_back) * move_pen * (1.0f - move_pen); + grad[param_offset + pos * ldp + 2] += pprob_pos * dlogisticf_move_pen; + } + } + + + free(scale); + free(bwd); + free(fwd); + } +} + +#ifdef SQUIGGLE_TEST +const float test_signal[] = { + 1.0120153f, 1.0553021f, 10.0172595f, 10.0962240f, 10.0271495f, + 1.0117957f, 4.6153470f, 5.4212851f, 3.0914187f, 1.2078583f, + 1.5120153f, 1.4553021f, 3.6172595f, 3.8962240f, 3.9271495f, + 0.5117957f, 4.6153470f, 5.4212851f, 2.5914187f, 3.2078583f}; +const int32_t test_siglen[2] = {10, 10}; +float test_param[30] = { + // t = 0, b = 0 + 1.0f, 0.0f, -1.0f, + // t = 0, b = 1 + 1.0f, 0.0f, -1.0f, + // t = 1, b = 0 + 10.0f, 0.0f, -2.0f, + // t = 1, b = 1 + 3.0f, 0.0f, -2.0f, + // t = 2, b = 0 + 1.0f, 0.0f, -1.5f, + // t = 2, b = 1 + 1.0f, 0.0f, -1.5f, + // t = 3, b = 0 + 5.0f, 0.0f, -0.5f, + // t = 3, b = 1 + 5.0f, 0.0f, -0.5f, + // t = 4, b = 0 + 3.0f, 0.0f, -1.0f, + // t = 4, b = 1 + 3.0f, 0.0f, -1.0f +}; + +#include <stdio.h> + +int main(void){ + const size_t npos = 5; + const size_t nbatch = 2; + float score[2] = {0.0f}; + int32_t path[20] = {0}; + const float DELTA = 1e-3f; + const float prob_back = 0.3f; + const float localpen = 2000.0f; + const float minscore = 12.0f; + const size_t msize = npos * nbatch * nparam; + + + squiggle_match_scores_fwd(test_signal, test_siglen, nbatch, test_param, npos, prob_back, score); + printf("Forwards scores: %f %f\n", score[0], score[1]); + + squiggle_match_scores_bwd(test_signal, test_siglen, nbatch, test_param, npos, prob_back, score); + printf("Backwards scores: %f %f\n", score[0], score[1]); + + squiggle_match_viterbi_path(test_signal, test_siglen, nbatch, test_param, npos, prob_back, localpen, + minscore, path, score); + printf("Viterbi scores: %f %f\n", score[0], score[1]); + size_t offset = 0; + for(size_t batch=0 ; batch < nbatch ; batch++){ + const size_t nsample = test_siglen[batch]; + for(size_t sample=0 ; sample < nsample ; sample++){ + printf(" %d", path[offset + sample]); + } + fputc('\n', stdout); + offset += nsample; + } + + + float * grad = calloc(msize, sizeof(float)); + squiggle_match_grad(test_signal, test_siglen, nbatch, test_param, npos, prob_back, grad); + float maxdelta = 0.0f; + for(size_t pos=0 ; pos < npos ; pos++){ + const size_t offset = pos * nbatch * nparam;
+ for(size_t st=0 ; st < nparam ; st++){ + maxdelta = fmaxf(maxdelta, fabsf(grad[offset + st] - grad[offset + nparam + st])); + } + } + printf("Max grad delta = %f\n", maxdelta); + + printf("Derivatives:\n"); + float fscore[2] = {0.0f}; + for(size_t pos=0 ; pos < npos ; pos++){ + printf(" Pos %zu\n", pos); + const size_t offset = pos * nbatch * nparam; + for(size_t st=0 ; st < nparam ; st++){ + // Positive difference + const float orig = test_param[offset + st]; + test_param[offset + st] = orig + DELTA; + squiggle_match_scores_fwd(test_signal, test_siglen, nbatch, test_param, npos, prob_back, score); + fscore[0] = score[0]; + fscore[1] = score[1]; + // Negative difference + test_param[offset + st] = orig - DELTA; + squiggle_match_scores_fwd(test_signal, test_siglen, nbatch, test_param, npos, prob_back, score); + fscore[0] = (fscore[0] - score[0]) / (2.0f * DELTA); + fscore[1] = (fscore[1] - score[1]) / (2.0f * DELTA); + // Report and reset + test_param[offset + st] = orig; + squiggle_match_scores_fwd(test_signal, test_siglen, nbatch, test_param, npos, prob_back, score); + printf(" %f d=%f [%f %f] (%f %f)\n", grad[offset + st], fabsf(grad[offset + st] - fscore[0]), fscore[0], fscore[1], score[0], score[1]); + + } + } + free(grad); + + + for(size_t pos=0 ; pos < npos ; pos++){ + const size_t offset = pos * nbatch * nparam; + for(size_t sample=0 ; sample < test_siglen[0] ; sample++){ + const float loc = test_param[offset + 0]; + const float logsc = test_param[offset + 1]; + + const float df = dloglaplace_logscale(test_signal[sample], loc, expf(logsc), logsc); + const float dplus = loglaplace(test_signal[sample], loc, expf(logsc + DELTA), logsc + DELTA); + const float dminus = loglaplace(test_signal[sample], loc, expf(logsc - DELTA), logsc - DELTA); + const float approxdf = (dplus - dminus) / (2.0f * DELTA); + printf("dlog/dlogscale = %f\t%f\t%f\n", df, approxdf, fabsf(df - approxdf)); + } + } +} +#endif diff --git a/taiyaki/squiggle_match/c_squiggle_match.h b/taiyaki/squiggle_match/c_squiggle_match.h new file mode 100644 index 0000000..921b76d --- /dev/null +++ b/taiyaki/squiggle_match/c_squiggle_match.h @@ -0,0 +1,9 @@ +#include <stddef.h> +#include <stdint.h> +void squiggle_match_cost(float const * signal, int32_t const * siglen, size_t nbatch, + float const * params, size_t npos, float prob_back, float * score); +void squiggle_match_grad(float const * signal, int32_t const * siglen, size_t nbatch, + float const * params, size_t npos, float prob_back, float * grad); +void squiggle_match_viterbi_path(float const * signal, int32_t const * siglen, size_t nbatch, + float const * params, size_t npos, float prob_back, float localpen, + float minscore, int32_t * path, float * score); diff --git a/taiyaki/squiggle_match/libsquiggle_match.pxd b/taiyaki/squiggle_match/libsquiggle_match.pxd new file mode 100644 index 0000000..16732af --- /dev/null +++ b/taiyaki/squiggle_match/libsquiggle_match.pxd @@ -0,0 +1,10 @@ +from libc.stdint cimport int32_t +cdef extern from "c_squiggle_match.h": + void squiggle_match_cost(const float * signal, const int32_t * siglen, size_t nbatch, + const float * params, size_t npos, float prob_back, float * score) + void squiggle_match_grad(const float * signal, const int32_t * siglen, size_t nbatch, + const float * params, size_t npos, float prob_back, float * grad) + void squiggle_match_viterbi_path(const float * signal, const int32_t * siglen, + size_t nbatch, const float * params, size_t npos, + float prob_back, float localpen, float minscore, + int32_t * path, float * score)
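The Cython module in the next hunk wraps these C entry points for use from Python. A minimal calling sketch, assuming the built extension is importable as **taiyaki.squiggle_match.squiggle_match** and using made-up sizes; the wrappers require C-contiguous float32/int32 arrays, with the signals for all reads concatenated into a single vector:

    import numpy as np
    from taiyaki.squiggle_match import squiggle_match

    npos, nbatch = 8, 2                                   # made-up sizes
    # One (level, spread, movement) triple per reference position and read
    params = np.ascontiguousarray(np.random.randn(npos, nbatch, 3), dtype=np.float32)
    siglen = np.array([100, 120], dtype=np.int32)         # samples in each read
    signal = np.ascontiguousarray(np.random.randn(int(siglen.sum())), dtype=np.float32)

    costs = squiggle_match.squiggle_match_cost(params, signal, siglen, 0.1)
    print(costs.shape)                                    # one (negated) forward score per read

The last argument is the back probability (prob_back in the C code); the value 0.1 here is purely illustrative.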
diff --git a/taiyaki/squiggle_match/squiggle_match.pyx b/taiyaki/squiggle_match/squiggle_match.pyx new file mode 100644 index 0000000..c1178d8 --- /dev/null +++ b/taiyaki/squiggle_match/squiggle_match.pyx @@ -0,0 +1,196 @@ +cimport libsquiggle_match +import cython +from Bio import SeqIO +import numpy as np +cimport numpy as np +import os +import sys + +from ont_fast5_api import fast5_interface +from taiyaki import config, helpers +from taiyaki.maths import mad +from taiyaki.variables import DEFAULT_ALPHABET, LARGE_LOG_VAL + +import torch + + +_base_mapping = {k : i for i, k in enumerate(DEFAULT_ALPHABET)} +_cartesian_tetrahedron = np.array([[1.0, 0.0, -1.0 / np.sqrt(2.0)], + [-1.0, 0.0, -1.0 / np.sqrt(2.0)], + [0.0, 1.0, 1.0 / np.sqrt(2.0)], + [0.0, -1.0, 1.0 / np.sqrt(2.0)]], + dtype=config.taiyaki_dtype) + + +@cython.boundscheck(False) +@cython.wraparound(False) +def squiggle_match_cost(np.ndarray[np.float32_t, ndim=3, mode="c"] params, + np.ndarray[np.float32_t, ndim=1, mode="c"] signal, + np.ndarray[np.int32_t, ndim=1, mode="c"] siglen, + back_prob): + """Forward scores of matching observed signals to predicted squiggles + + :param params: A [length, batch, 3] numpy array of predicted squiggle parameters. + The 3 features are predicted level, spread and movement rate + :param signal: A vector containing observed signals, concatenated + :param siglen: Length of each signal + :param back_prob: Probability of entering the backsampling state + """ + cdef size_t npos, nbatch + npos, nbatch = params.shape[0], params.shape[1] + + cdef np.ndarray[np.float32_t, ndim=1, mode="c"] costs = np.zeros((nbatch,), dtype=np.float32) + libsquiggle_match.squiggle_match_cost(&signal[0], &siglen[0], nbatch, + &params[0, 0, 0], npos, back_prob, + &costs[0]) + + return -costs + + +@cython.boundscheck(False) +@cython.wraparound(False) +def squiggle_match_grad(np.ndarray[np.float32_t, ndim=3, mode="c"] params, + np.ndarray[np.float32_t, ndim=1, mode="c"] signal, + np.ndarray[np.int32_t, ndim=1, mode="c"] siglen, + back_prob): + """Gradient of forward scores of matching observed signals to predicted squiggles + + :param params: A [length, batch, 3] numpy array of predicted squiggle parameters. + The 3 features are predicted level, spread and movement rate + :param signal: A vector containing observed signals, concatenated + :param siglen: Length of each signal + :param back_prob: Probability of entering the backsampling state + """ + cdef size_t npos, nbatch + npos, nbatch = params.shape[0], params.shape[1] + + cdef np.ndarray[np.float32_t, ndim=3, mode="c"] grads = np.zeros_like(params, dtype=np.float32) + libsquiggle_match.squiggle_match_grad(&signal[0], &siglen[0], nbatch, + &params[0, 0, 0], npos, back_prob, + &grads[0, 0, 0]) + + return -grads
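The cost and gradient wrappers above are paired into a PyTorch autograd function further down in this module (**SquiggleMatch**, exposed as **squiggle_match_loss**), so the forward score can be used directly as a training loss. A minimal usage sketch with made-up shapes, again assuming the compiled extension is importable as **taiyaki.squiggle_match.squiggle_match**:

    import torch
    from taiyaki.squiggle_match import squiggle_match

    npos, nbatch = 8, 2                                  # made-up sizes
    params = torch.randn(npos, nbatch, 3, requires_grad=True)
    siglen = torch.tensor([100, 120], dtype=torch.int32)
    signal = torch.randn(int(siglen.sum()))

    # forward() calls squiggle_match_cost; backward() calls squiggle_match_grad
    loss = squiggle_match.squiggle_match_loss(params, signal, siglen, 0.1).sum()
    loss.backward()
    print(params.grad.shape)                             # torch.Size([8, 2, 3])

Summing the per-read scores gives a scalar loss; backward() then scales the analytic gradient from squiggle_match_grad by the incoming per-read gradients.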
+ +@cython.boundscheck(False) +@cython.wraparound(False) +def squiggle_match_path(np.ndarray[np.float32_t, ndim=3, mode="c"] params, + np.ndarray[np.float32_t, ndim=1, mode="c"] signal, + np.ndarray[np.int32_t, ndim=1, mode="c"] siglen, + back_prob, localpen, minscore): + """Viterbi scores and paths of matching observed signals to predicted squiggles + + :param params: A [length, batch, 3] numpy array of predicted squiggle parameters. + The 3 features are predicted level, spread and movement rate + :param signal: A vector containing observed signals, concatenated + :param siglen: Length of each signal + :param back_prob: Probability of entering the backsampling state + """ + cdef size_t npos, nbatch + npos, nbatch = params.shape[0], params.shape[1] + localpen = localpen if localpen is not None else LARGE_LOG_VAL + minscore = minscore if minscore is not None else LARGE_LOG_VAL + + cdef np.ndarray[np.float32_t, ndim=1, mode="c"] costs = np.zeros((nbatch,), dtype=np.float32) + cdef np.ndarray[np.int32_t, ndim=1, mode="c"] paths = np.zeros_like(signal, dtype=np.int32) + libsquiggle_match.squiggle_match_viterbi_path(&signal[0], &siglen[0], nbatch, + &params[0, 0, 0], npos, + back_prob, localpen, minscore, + &paths[0], &costs[0]) + + return -costs, paths + + +def load_references(filename): + references = dict() + for seq in SeqIO.parse(filename, 'fasta'): + seqstr = str(seq.seq).encode('ascii') + references[seq.id] = seqstr + + return references + + +def embed_sequence(seq, alphabet=DEFAULT_ALPHABET): + """Embed sequence of bases (bytes) using points of a tetrahedron""" + if alphabet == DEFAULT_ALPHABET: + seq_index = np.array([_base_mapping[b] for b in seq]) + elif alphabet is None: + seq_index = seq + else: + raise Exception('Alphabet not recognised in squiggle_match.pyx embed_sequence()') + return _cartesian_tetrahedron[seq_index] + + +def init_worker(model, reference_file): + torch.set_num_threads(1) + + global predict_squiggle + predict_squiggle = model + + global references + references = load_references(reference_file) + + +def worker(fast5_read_tuple, trim, back_prob, localpen, minscore): + fast5_name, read_id = fast5_read_tuple + if read_id in references: + refseq = references[read_id] + else: + sys.stderr.write('Reference not found for {}\n'.format(read_id)) + return None + + try: + with fast5_interface.get_fast5_file(fast5_name, 'r') as f5file: + read = f5file.get_read(read_id) + signal = read.get_raw_data() + except: + sys.stderr.write('Error reading {}\n'.format(read_id)) + return None + + signal = helpers.trim_array(signal, *trim) + assert len(signal) > 0 + + norm_sig = (signal - np.median(signal)) / mad(signal) + norm_sig = np.ascontiguousarray(norm_sig, dtype=config.taiyaki_dtype) + + embedded_seq = np.expand_dims(embed_sequence(refseq), axis=1) + with torch.no_grad(): + squiggle_params = predict_squiggle(torch.tensor(embedded_seq, dtype=torch.float32)).cpu().numpy() + sig_len = np.array([len(norm_sig)], dtype=np.int32) + + squiggle_params = np.ascontiguousarray(squiggle_params, dtype=np.float32) + cost, path = squiggle_match_path(squiggle_params, norm_sig, sig_len, + back_prob, localpen, minscore) + + return (read_id, norm_sig, cost[0], path, + np.squeeze(squiggle_params, axis=1), refseq) + + +class SquiggleMatch(torch.autograd.Function): + """PyTorch autograd function wrapping squiggle_match_cost""" + @staticmethod + def forward(ctx, params, signal, siglen, back_prob): + ctx.save_for_backward(params, signal, siglen, torch.tensor(back_prob)) + params = np.ascontiguousarray(params.detach().cpu().numpy().astype(np.float32)) + signal = np.ascontiguousarray(signal.detach().cpu().numpy().astype(np.float32)) + siglen = np.ascontiguousarray(siglen.detach().cpu().numpy().astype(np.int32)) + back_prob = float(back_prob) + cost = squiggle_match_cost(params, signal, siglen, back_prob) + return torch.tensor(cost) + + @staticmethod + def backward(ctx, output_grads): + params, signal, siglen, back_prob = ctx.saved_tensors
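+        # The tensors saved in forward() are converted back to contiguous +        # float32/int32 numpy arrays so the C gradient routine can be called; +        # the per-read output gradients are broadcast over the [npos, 3] +        # parameter axes before being returned.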
+ device = params.device + dtype = params.dtype + params = np.ascontiguousarray(params.detach().cpu().numpy().astype(np.float32)) + signal = np.ascontiguousarray(signal.detach().cpu().numpy().astype(np.float32)) + siglen = np.ascontiguousarray(siglen.detach().cpu().numpy().astype(np.int32)) + back_prob = float(back_prob) + grad = squiggle_match_grad(params, signal, siglen, back_prob) + grad = torch.tensor(grad, dtype=dtype, device=device) + output_grads = output_grads.unsqueeze(1).to(device) + return grad * output_grads, None, None, None + + +squiggle_match_loss = SquiggleMatch.apply diff --git a/taiyaki/variables.py b/taiyaki/variables.py new file mode 100644 index 0000000..10008bb --- /dev/null +++ b/taiyaki/variables.py @@ -0,0 +1,15 @@ +DEFAULT_ALPHABET = b'ACGT' +DEFAULT_NBASE = len(DEFAULT_ALPHABET) + +LARGE_LOG_VAL = 50000.0 +SMALL_VAL = 1e-10 + + +def nstate_flipflop(nbase): + """ Number of states in output of flipflop network + + :param nbase: Number of letters in alphabet + + :returns: Number of states + """ + return 2 * nbase * (nbase + 1) diff --git a/test/acceptance/requirements.txt b/test/acceptance/requirements.txt new file mode 100644 index 0000000..dfecb8c --- /dev/null +++ b/test/acceptance/requirements.txt @@ -0,0 +1,4 @@ +h5py >= 2.2.1,<=2.6.0 +parameterized == 0.6.1 +pytest-xdist == 1.15.0 +future==0.17.1 diff --git a/test/acceptance/test_dump_json.py b/test/acceptance/test_dump_json.py new file mode 100644 index 0000000..4e05492 --- /dev/null +++ b/test/acceptance/test_dump_json.py @@ -0,0 +1,67 @@ +import json +from parameterized import parameterized +import os +import sys +import unittest + +import util + + +def is_valid_json(s): + try: + json.loads(s) + return True + except ValueError: + return False + + +class AcceptanceTest(unittest.TestCase): + + @classmethod + def setUpClass(self): + test_directory = os.path.splitext(__file__)[0] + self.testset_work_dir = os.path.basename(test_directory) + self.script = os.path.join(util.BIN_DIR, "dump_json.py") + self.model_file = os.path.join(util.MODELS_DIR, "mGru256_flipflop_remapping_model_r9_DNA.checkpoint") + + def work_dir(self, test_name): + directory = os.path.join(self.testset_work_dir, test_name) + util.maybe_create_dir(directory) + return directory + + def test_usage(self): + cmd = [self.script] + util.run_cmd(self, cmd).expect_exit_code(2).expect_stderr(util.any_line_starts_with(u"usage")) + + @parameterized.expand([ + [["--params"]], + [["--no-params"]], + ]) + def test_dump_to_stdout(self, options): + self.assertTrue(os.path.exists(self.model_file)) + cmd = [self.script, self.model_file] + options + util.run_cmd(self, cmd).expect_exit_code(0).expect_stdout(lambda o: is_valid_json('\n'.join(o))) + + @parameterized.expand([ + [["--no-params"], "2"], + ]) + def test_dump_to_a_file(self, options, subdir): + self.assertTrue(os.path.exists(self.model_file)) + test_work_dir = self.work_dir(os.path.join("test_dump_to_a_file", subdir)) + + output_file = os.path.join(test_work_dir, "output.json") + open(output_file, "w").close() + + cmd = [self.script, self.model_file, "--out_file", output_file] + options + error_message = "RuntimeError: File/path for 'out_file' exists, {}".format(output_file) + util.run_cmd(self, cmd).expect_exit_code(1).expect_stderr(util.any_line_starts_with(error_message)) + + os.remove(output_file) + + info_message = "Writing to file: {}".format(output_file) + util.run_cmd(self, cmd).expect_exit_code(0).expect_stdout(lambda o: o == [info_message]) + + self.assertTrue(os.path.exists(output_file)) + 
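+        # Read the dumped model file back in and check that it parses as JSON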
dump = open(output_file, 'r').read() + + self.assertTrue(is_valid_json(dump)) diff --git a/test/acceptance/test_prepare_remap.py b/test/acceptance/test_prepare_remap.py new file mode 100644 index 0000000..bf54d86 --- /dev/null +++ b/test/acceptance/test_prepare_remap.py @@ -0,0 +1,61 @@ +import os +import subprocess +import unittest + + +from taiyaki import mapped_signal_files + +class AcceptanceTest(unittest.TestCase): + """Test which runs, on a single fast5 file, the first part of the workflow + Makefile: make the per-read-params file and reference file, + and then do remapping""" + + @classmethod + def setUpClass(self): + """Make all paths absolute so that when we run the Makefile in another dir it works OK""" + testset_directory_rel, _ = os.path.splitext(__file__) + self.testset_name = os.path.basename(testset_directory_rel) + self.taiyakidir = os.path.abspath(os.path.join(testset_directory_rel,'../../..')) + self.testset_work_dir = os.path.join(self.taiyakidir,'build/acctest/'+self.testset_name) + os.makedirs(self.testset_work_dir, exist_ok=True) + self.datadir = os.path.join(self.taiyakidir,'test/data') + self.read_dir = os.path.join(self.datadir,'reads') + self.per_read_refs = os.path.join(self.datadir,'per_read_references.fasta') + self.per_read_params = os.path.join(self.datadir,'readparams.tsv') + self.output_mapped_signal_file = os.path.join(self.testset_work_dir,'mapped_signals.hdf5') + self.remapping_model = os.path.join(self.taiyakidir,"models/mGru256_flipflop_remapping_model_r9_DNA.checkpoint") + self.script = os.path.join(self.taiyakidir,"bin/prepare_mapped_reads.py") + + def test_prepare_remap(self): + print("Current directory is", os.getcwd()) + print("Taiyaki dir is", self.taiyakidir) + print("Data dir is ", self.datadir) + cmd = [self.script, + self.read_dir, + self.per_read_params, + self.output_mapped_signal_file, + self.remapping_model, + self.per_read_refs, + "--device", "cpu"] + r = subprocess.run(cmd, stdout=subprocess.PIPE, + stderr=subprocess.PIPE) + print("Result of running prepare_mapped_reads command in shell:") + print("Stdout=", r.stdout.decode('utf-8')) + print("Stderr=", r.stderr.decode('utf-8')) + + # Open mapped read file and run checks to see if it complies with the file format + # Also get a chunk and check that the mean dwell is within reasonable bounds + with mapped_signal_files.HDF5(self.output_mapped_signal_file, "r") as f: + testreport = f.check() + print("Test report from checking mapped read file:") + print(testreport) + self.assertEqual(testreport, "pass") + read0 = f.get_multiple_reads("all")[0] + chunk = read0.get_chunk_with_sample_length(1000, start_sample=10) + # Defined start_sample to make it reproducible - otherwise a randomly + # located chunk is returned.
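+            # Mean dwell is the number of samples per base in the chunk; the +            # assertion below allows 7 to 13 for this test data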
+ chunk_meandwell = len(chunk['current']) / (len(chunk['sequence']) + 0.0001) + print("chunk mean dwell time in samples = ", chunk_meandwell) + assert 7 < chunk_meandwell < 13, "Chunk mean dwell time outside allowed range 7 to 13" + + return diff --git a/test/acceptance/test_train_squiggle.py b/test/acceptance/test_train_squiggle.py new file mode 100644 index 0000000..998cd50 --- /dev/null +++ b/test/acceptance/test_train_squiggle.py @@ -0,0 +1,49 @@ +import os +import shutil +import unittest + +import util + +class AcceptanceTest(unittest.TestCase): + + @classmethod + def setUpClass(self): + test_directory = os.path.splitext(__file__)[0] + testset_name = os.path.basename(test_directory) + + self.testset_work_dir = testset_name + + self.script = os.path.join(util.BIN_DIR, "train_squiggle.py") + + def work_dir(self, test_name): + directory = os.path.join(self.testset_work_dir, test_name) + util.maybe_create_dir(directory) + return directory + + def test_usage(self): + cmd = [self.script] + util.run_cmd(self, cmd).expect_exit_code(2) + + def test_squiggle_training(self): + test_work_dir = self.work_dir(os.path.join("test_squiggle_training")) + + output_directory = os.path.join(test_work_dir, "training_output") + if os.path.exists(output_directory): + shutil.rmtree(output_directory) + + + hdf5_file = os.path.join(util.BIN_DIR, "../test/data/mapped_signal_file/mapped_remap_samref.hdf5") + print("Trying to find ", hdf5_file) + self.assertTrue(os.path.exists(hdf5_file)) + + train_cmd = [self.script, "--batch_size", "50", + "--niteration", "1", "--save_every", "1", + "--seed","1", # Seed random numbers so test is reproducible + hdf5_file, output_directory] + + util.run_cmd(self, train_cmd).expect_exit_code(0) + + self.assertTrue(os.path.exists(output_directory)) + self.assertTrue(os.path.exists(os.path.join(output_directory, "model_final.checkpoint"))) + self.assertTrue(os.path.exists(os.path.join(output_directory, "model_final.params"))) + self.assertTrue(os.path.exists(os.path.join(output_directory, "model.log"))) diff --git a/test/acceptance/util.py b/test/acceptance/util.py new file mode 100644 index 0000000..f1fa31b --- /dev/null +++ b/test/acceptance/util.py @@ -0,0 +1,123 @@ +import numpy as np +import os +from subprocess import Popen, PIPE + +# Data and script paths relative to working directory build/acctest +BIN_DIR = "../../bin" +DATA_DIR = "../../data" +MODELS_DIR = "../../models" + +class Result(object): + + def __init__(self, test_case, cmd, cwd, exit_code, stdout, stderr, max_lines=100): + self.test_case = test_case + self.cmd = cmd + self.cwd = cwd + self._exit_code = exit_code + self._stdout = stdout.strip('\n').split('\n') + self._stderr = stderr.strip('\n').split('\n') + self.max_lines = max_lines + + def __repr__(self): + L = ['\n\tCommand: {}'.format(' '.join(self.cmd))] + if self.cwd: + L.append('\n\tCwd: {}'.format(self.cwd)) + + if self._exit_code: + L.append('\tCommand exit code: %s' % self._exit_code) + + if self._stdout: + L.append('\n\tFirst {} lines of stdout:'.format(self.max_lines)) + for line in self._stdout[:self.max_lines]: + L.append("\t\t{}".format(line)) + + if self._stderr: + L.append('\n\tFirst {} lines of stderr:'.format(self.max_lines)) + for line in self._stderr[:self.max_lines]: + L.append("\t\t{}".format(line)) + + return '\n'.join(L) + + def expect_exit_code(self, expected_exit_code): + msg = "expected return code %s but got %s in: %s" % (expected_exit_code, self._exit_code, self) + self.test_case.assertEqual(expected_exit_code, self._exit_code, msg) + 
return self + + def expect_stdout(self, f): + msg = "expectation on stdout failed for: %s" % self + self.test_case.assertTrue(f(self._stdout), msg) + return self + + def expect_stdout_equals(self, referenceStdout): + self.test_case.assertEqual(self._stdout, referenceStdout) + return self + + def expect_stderr(self, f): + msg = "expectation on stderr failed for: %s" % self + self.test_case.assertTrue(f(self._stderr), msg) + return self + + def expect_stderr_equals(self, referenceStderr): + self.test_case.assertEqual(self._stderr, referenceStderr) + return self + + def get_exit_code(self): + return self._exit_code + + def get_stdout(self): + return self._stdout + + def get_stderr(self): + return self._stderr + + +def run_cmd(test_case, cmd, cwd=None): + proc = Popen(cmd, stdout=PIPE, stderr=PIPE, cwd=cwd) + stdout, stderr = proc.communicate(None) + + exit_code = proc.returncode + stdout = stdout.decode('UTF-8') + stderr = stderr.decode('UTF-8') + + return Result(test_case, cmd, cwd, exit_code, stdout, stderr) + + +def maybe_create_dir(directory_name): + ''' + Create a directory if it does not exist already. + In Python 2.7 OSError is thrown if directory does not exist or permissions are insufficient. + In Python 3 more specific exceptions are thrown. + ''' + + try: + os.makedirs(directory_name) + except OSError: + if os.path.exists(directory_name) and os.path.isdir(directory_name): + pass + else: + raise + + +def zeroth_line_starts_with(prefix): + return lambda lines: len(lines) > 0 and lines[0].startswith(prefix) + + +def last_line_starts_with(prefix): + return lambda lines: len(lines) > 0 and lines[-1].startswith(prefix) + + +def any_line_starts_with(prefix): + return lambda lines: any(l.startswith(prefix) for l in lines) + + +def assertArrayEqual(test_case, a, b): + test_case.assertEqual(a.shape, b.shape, + msg='Array shape mismatch: {} != {}\na = {}\nb = {}'.format(a.shape, b.shape, a, b)) + test_case.assertTrue(np.array_equal(a, b), + msg='Array element mismatch: {} != {}\nshape = {}'.format(a, b, a.shape)) + + +if __name__ == '__main__': + assert not zeroth_line_starts_with('a')([]) + assert zeroth_line_starts_with('a')(['a']) + assert zeroth_line_starts_with('a')(['a', 'a']) + assert zeroth_line_starts_with('a')(['a', 'b']) + assert not zeroth_line_starts_with('a')(['b', 'a']) + + assert not last_line_starts_with('a')([]) + assert last_line_starts_with('a')(['a']) + assert last_line_starts_with('a')(['a', 'a']) + assert not last_line_starts_with('a')(['a', 'b']) + assert last_line_starts_with('a')(['b', 'a']) diff --git a/test/data/aligner_output/alignment_summary.txt b/test/data/aligner_output/alignment_summary.txt new file mode 100644 index 0000000..43759b4 --- /dev/null +++ b/test/data/aligner_output/alignment_summary.txt @@ -0,0 +1,6 @@ +read_id alignment_genome alignment_genome_start alignment_genome_end alignment_strand_start alignment_strand_end alignment_num_insertions alignment_num_deletions alignment_num_aligned alignment_num_correct alignment_identity alignment_accuracy alignment_score +0f776a08-1101-41d4-8097-89136494a46e Salmonella_enterica_snippet 11 2106 821 2827 72 161 1934 1858 0.960703 0.857407 2478 +de1508c4-755b-489e-9ffb-51af35c9a7e6 Salmonella_enterica_snippet_rc 2933 5002 2 1996 38 113 1956 1904 0.973415 0.903654 2962 +db6b45aa-5d21-45cf-a435-05fb8f12e839 Salmonella_enterica_snippet 4548 7605 89 3118 58 86 2971 2901 0.976439 0.9313 4854 +1f1a0f33-e2ac-431a-8f48-c3c687a7a7dc * -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 +b7096acd-b528-474e-a863-51295d18d3de * -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 diff --git a/test/data/aligner_output/fastq_runid_9a076f39fd3254aeacc15a915c736105296275f3_0.sam b/test/data/aligner_output/fastq_runid_9a076f39fd3254aeacc15a915c736105296275f3_0.sam new file mode
100644 index 0000000..f66e85d --- /dev/null +++ b/test/data/aligner_output/fastq_runid_9a076f39fd3254aeacc15a915c736105296275f3_0.sam @@ -0,0 +1,2 @@ +@SQ SN:Salmonella_enterica_snippet LN:10000 +db6b45aa-5d21-45cf-a435-05fb8f12e839 0 Salmonella_enterica_snippet 4548 60 89S17M1D5M1I34M1D50M1D1M1D1M1D12M1I93M1D7M1D124M1I19M8D3M1D6M2D5M1I2M1D19M2I55M2I3M2I14M1D17M4D5M2D31M1I12M1D7M1D4M1D39M1I18M2D21M1D6M2I20M2D15M1D24M2D5M2I8M1I15M1D11M1D30M1D49M1D171M2D13M1D22M1I43M1D7M1D18M1I70M1I48M1I21M1D25M1I21M1D41M2D71M1D36M2I34M1I16M1D42M2D49M1I20M3D29M1D2M2D47M2I5M2I58M4I16M2D13M1I310M1D10M1I13M1D29M4D4M1I30M1I59M2D23M2D16M3I1M1I51M1I68M2D75M1I20M1I33M1D1M1D12M2I82M2I11M1D2M1D44M1D95M2D13M1I40M2I47M1I6M3D3M1I5M1I1M2I5M1I100M2D17M1S * 0 0 TCGGTATCTTCGTTCGGTTTCGGAGGTGGGTGTTTAACAGATGCTTGCCTGTCGCTCTATCTTCGGCATCTTTCGCCGGGTGTTCCAACGCCAGGCGGCATGACGTCATATCAACGCCAGCGCTTCCAGACGCGTGAGCCGCTCGCGGCGGGTTTACGCACGGAACCGCCTGACGCTATACCAGTGATATGATGTTGTGCGTGCCAGACACCCGGCAATGCCAATACGCGTCGAATCGGTGCCGCCAGCTACGACAAGCCCTTTATCCAGCGCCAGTCGAACCGGCGGCGGGTAATCCGCCGCGCGGGTGGTTGGCGTCACGGATGGCGGCCCCCTCAAAGTAAGGCCCCATTTGCACAGTGTATGCCAGACCCAGCTTACGCATTCGCTCGAGCGTCTGTGGCGAACCGGTATTCAGATGCGCAATAGACCAGCTGGAGAGGGCGCAGATCGTACTACTGCTCAAATCAGTAATATCGCGTCGGCGCTGTCGCATCGGTATAGGCATGGATCTCAAGCGGTATTCCCCGTTTTTGCGCAAACGATGCCTGACCACTGGCGAAGCGCGGTTTGTCCTGCTCTGAGGAAACCAGGGCCATCCGCACGCCGTCATTCATTCCGGCCCGCCAGGCTTTCCCCAGGTAAGAAAGCCAGTTGCCCGTCATCGGCACGCGCCGGGCAAAAGAGCCATCAGGTTGCTGAACCTGCGCTTCATGACCTTTCGCGCCGGCTTTGTACCGGAATGCGATACCCCGCGCAGCGGTAAATCCCCTGGTTTCGCATTGCAAATGAGCTCACCCAAGCGGCGGGCAGGCCCGGCAGAGGGTCAATGATTAGGTCACGCCGCGAACGTTCATATCAGCTAAAATTGCCGCAGACCACCCTCGCGATCGGCGTTACGACTTATGCTGGCAAAAGCTGGTTAAACGCGGCGATGTCACCAAATAATTTCCCCGTGGCGCTACCTTTTGCGTCGCGCTCTGCATGGATTCCCGCTAAATCAGGAGGAGAGGTGTCGTTAAGGCCAAGTACGTCTATACCGCGCTGATTCACTAACGCATAGTCGTAAAGATACTGAATATAATGGATGATCGGGAGGGCGTGGCTCAGTTCGACTACAAATCGGCGCCCGGTTTTCTGCAAATTGCGCCGGTATCCACGATCCACTACATTACCCATTGATCGTGGAGGACGACGGTTAGCGTCGGCGCGTAATTTATCCAGCGCGTCTTTAAGCGAAGGGCTGTCGTACCAGTAGGTTTTCGAATGTCCAGGTTTGTCCGCCGCGAATGGCATGAATGTGGGTATCCGGTCAGGCCGGGTATCACGGTTTACCCTGTAAATCGATGGTACGGAGTATGGTTGCCGCGCCATTCATTGTCGCCGTATCATCGCCAATCGCCACAATCCGCGAGCCGAATCGCCAGCGCGCTGGCCTGCGGCTGGGCATCATTCAGAGTGATGATATTACCGTTATGCAAAATAATATCGCTGTCGGGCGGGCGGCGGTGGCTGCGCCTGTGGCTACGTGAGCAGAAAGAAAAGCGCAGTTCTGGAAAGCAAGAGACGGGACGAGACGATCATTAGCATGTTCCTTTTGTCGGGCGATAAGCAAACCATAGAGCGAACAAGAATTGTTCAATCTGTGCTTGGGGAACAGTGCGTTCTGTTTTTTAGAACGGTGGTGATGAAAGAAAAGCCCGCCGAAACGGCGGGCTTAGGAATTAATGAATAATTAGAACTGGTAAACCATACCCAGCGCGACGATGCATCATCCCGGTGCTGATACCGGCATCTTTATAGAACTGGTCGTCATCATCCAGCAGGTTGATTTTACGCATAATCAACATAGGTGGAAGTTTTTGTTGTTAATAGTAGGTGGCACCTACATCGGCATATTTAACCAGGTCTTTATCATCGCCATTAACGTTGTTATAGGTCAAATCTTTACCTTTAGACATCAGGAAAGAAACCGCCGGGCGCAGACCGAAATCAAATTGATATTGGGCTGTAACTTCGAAGTTTTGAGTTTTGTTTGCTACGCATGAGCGACAATCGTCATTGCCGTAAGGCGTCATATTACCGGTTTCTGAATACATTGTGGCCGAATAGATATTATTAGCATCATATTTAGCGCCAACGGTCCATGCATCCGCTTTATCACCGCCAGCGGTAGAGTATTAACCTGACCCATTAGTACGATGGAAGTGGTATATGCCGCGCCGAAACTTACATATACCAGATCATAAGTTGATGAGATACCGAAGCCCGTCACCGTTAGAGTTCTTCACGTTACGATCGCCGCCGTTATTAGTGCCTTCCTGCTCGCGAAACCTGAACTTCATTGGCACTGATATTGCAACGCGCGAAGAGTTCAGACCGTCTACCAGACCGAAGAAATCAGTGTTACGGTAGGTGGCCACCGCCGTTAGCGCGACCGGTCATGAAGTTGTCAGCGTAAGTGTAGGAGTCGCCACCGAACTCAGGCAGCATCGGTCCAGCCTTCTACGTCGTACAGGACGCCGTAGTTACGACCATAGTCGAATGAGCCGTAGTCGCCGAATTTCCGGGCCGGCAGATGCCGAACGGAGATCAGGAGTTAGCACCTTCGCCTTCGGTAGGTAGCCTGAACGTTATATATTCCCACTGGCCGTAACCGGTCAGTTGGTCGTTAATCTGCGTTTCGCCTTTAAAGCCGACACGCATATAGGTCTGATCATACCATCTTTAGAGGTCATCAGAGAAATAGTGCAGGCCGTCTACTTTCCCGTAAAGGTCAGTTTGTTGCCGTCTTTGTTATAAATTTCAGCGGCGTGT
GCTGCGCCAGCAGCCAGCAGAGCCGGGATGACAAGTGCCAATACTTTTCTTTTCATTTTTATCCTTAAGAAACTTAACTTATTTGCAAAAGATTGAACTTCTACAGATTCACGTTGAATCAAAGGCATCCTAATCTGAATAATATTATTTCAACGAGTAGCTAACGCTGTATATCGTTTTGTTGATTTAATACAAAAGTTACTATCGGAAACGCATATATTTATGGTGAATATATTTGTTATTATGTATTCATGGCTGTGATTTGTTTTATTTCACAATTTGCGAAAAGATGGCAATATAGA * NM:i:214 ms:i:4854 AS:i:4854 nn:i:0 tp:A:P cm:i:291 s1:i:2082 s2:i:0 dv:f:0.0442 MD:Z:17^G0T9T28^T36G10A2^C1^A1^A91A0A1T0G9^C1A5^A17A0A0G60A62^CTGCTGCG3^C6^AA7^C60G0C8T0C19^T17^AAGA5^CT0G32A9^C7^C0C3^A34G22^CA21^T0T25^CA15^C22A1^AG5T22^G11^G0C0C13G14^A49^A69A1G0C24G73^GC13^G17G4C0G41^C7^G0G0C155^T46^T41^GA71^C57G28^T42^AG29C0C0A37^AAA29^G2^AA53A72^GA13A10A11A190G22A0G33A37^A23^C29^CGCC4C42C45^AT1A8G0T0T10^CC78A41G15^AT76A7G0A5A0G5T0C28^T1^T94G10^A2^A44^C95^TT94G1A1A7^TTT0T10A3A98^AA17 diff --git a/test/data/aligner_output/fastq_runid_9a076f39fd3254aeacc15a915c736105296275f3_1.sam b/test/data/aligner_output/fastq_runid_9a076f39fd3254aeacc15a915c736105296275f3_1.sam new file mode 100644 index 0000000..ec2778d --- /dev/null +++ b/test/data/aligner_output/fastq_runid_9a076f39fd3254aeacc15a915c736105296275f3_1.sam @@ -0,0 +1,3 @@ +@SQ SN:Salmonella_enterica_snippet LN:10000 +0f776a08-1101-41d4-8097-89136494a46e 0 Salmonella_enterica_snippet 11 60 821S15M1I1M2I28M3D8M2I13M1I18M1I21M1D4M1D15M1D1M2D13M1D81M1D41M3I55M1D28M2D22M1I18M3I41M1I3M1D6M2D4M1D76M2I14M4I2M2I13M2D37M4I7M1D30M1I5M2D9M2D15M1D38M2D14M1D37M1D10M2I3M1I12M2I24M7D2M1D11M1D5M3I3M5I5M2D1M2D18M3D6M1I9M3D26M1D15M1D1M2I11M2I13M1D8M1D23M5I9M2I15M1D17M2D1M1I27M2D9M1D11M1I16M1I3M2D19M1D9M1D16M1I12M1D17M2D1M1D13M2D28M1I1M2D2M2I13M1D8M1D12M2D10M1D35M4D13M1D6M1D39M1D31M1D22M6D21M8D11M4D31M1I62M3D3M1D16M2I2M3D10M4D60M1D2M1D21M1D22M3D13M1D4M7D1M4D3M1D5M5D14M2D1M5D11M1I17M1D13M1D5M6D38M2D2M1I13M2D56M1D36M1D10M4I31M1D7M1D16M2I7M1I7M2D9M1S * 0 0 
TCGGTAGTGCGCTTCGTTCCAACTCGATTTGGGTGTTTAACTTGGTCTTGCCTGTCGCTCTATCTTCGGCGTCTAGGTGTTTAACCTCCTCTGCTGTCCCGCTAAAGGCGGCGGATCAGCGAATAAAACCTACCTCTATCAGGAGACAAAGCGCTTCTGACGCCAGATACAAGCTCAAATTATCTCGTCAGAGAAAGGATCTTGCGTACGCTGGCACCACCGCCTGTCCCGCCGTATCATCACATTGCGTTGTCATTATGGCGGCACCTCGGCGGAAGCGAACTGAAGACCGTTAAACTGGCGCGGCAAAATACTATGATGCGCTACCAGCCGAAGGCAATGAACACGGGCGAGCCTTCGCGATGAGCTGGAAAAAGGGTATTTAGCAGGCGCAAAATCTTGGCTTAGGCGCGCAGTTCGGCGGCAAATATTTCCGCTCACGATATTCGTGTCATCCGATTGCTGCCGCGCCACGGGCGGCATCGTGTCCGATAGGTATGGGCGTGTCCTGCTCCGCGGATCGAAACGCATTAAAGCGAAGATCAACCGGGAAAGGATCTGGATCGAAGCTGAGCCTCCCGGTAAATATATTCCAGAGGCGCTGCGCCAGGCGGGAAGGGCGAGGCGGTGCGCGTTGATAACCGTCCATGAGCGAGATACTGCAACAACTGTCGCGGTATCGACATCAACGCGCCTGTCGCTGAACGGCACGATTATTGTGGGCTTGCGACATCGCTTTGCTAAACTGAAAGAACGGATGGACAGAGGCGAAAGGTTTGCACGTTACATCAAAAATACATCTATCTATTATGCAGGGCCTTAAAAACGCCGGAAGGAGTATATGCCTCCGGGTCGCTTGGGCCGACTACAGGACGAATATGGACTCTTATGTTTGATCAGCTCCAGTCACTGGGGCGGCAGTATGATCATGCGGCGAAGGCAACCGCAGCCGGGTGACCGATGCCGCAAGAAACATGGCGGCTTCTACCTGGGCAGTATCGGCGGCCCGGCAGCCGTTCTGGCGCAGGGTAGTATCAAGCGCCTGGGTGCGTGGAATATCCTGAACTGGGTATGGAGGCTATCTGGAAGGAAATTGAAGTGGAGGATTTTCCGGCCTTCATTCTGGTGGATGACCAAGGCAACGATTCTTCCAGCAGATTCGGTCATCACAGTGCACGCTGCGTTAAGTAATGTCATACGCCCTGCGGGGTGGTAGCAACATGATTCGCTATTTTGCCGGAGAACGGCGCAAGCGCATATATCGGTCTACCGTGTGTTATTATCGTTAATACTCATAAGACCGCAAACATGTGAGCAATGTCGACACGCAGTCTCATCAACTTGAGTAGGAGGAGCAGGTAATGGCGTGTACAACGGTACGCCGCGAAAGATTCAATGGGCGCGATTGAAGTCCCGGCAGATAAACAAACTGTGGGGGCGCAGACTCAACGTTCGCTGGAGCATTTTTCAGTTCCACGGAAAATGCCCGTCTCCTCATTCACGCTCTGGCGTTGGCCAAGCGCGCCGCTGCAGGTCAACCAGGACTAGGGTTGTTGGCGGCGGAAAAAGCCAGCGCGATTATCAGGCGGCTGATATGAAATTACTGGCGGAAGAAAACATGCTGATGAGTTTCCGCTTCCAGACCAGGTCGGTACCTCGCTCTTTACCGCCGAATATGAATGAAGTGATGAACCGCCTCCAGTGATACAGGCGGCGTTCGCGGTATGGAACCTGAGTGCATCCCAAGCAATGACGTCAATAAAAGAGTCGAAAATAAACGATGTTTCCCAACCGCCATGCACATGGCGGCACGGCGTTACTATAACGTTACGCGGTCTTTTATCCCACAGGTATGTTGTTAACGGATGCGCTTCGCGATAAATCACGCTTTCTCGATATTGTCAAAAATTGGCCGTACTCTACCAAGGACGCGACGCCGCTCACATAAGCCAGAGATTTCCGGTTGGGCTAGCCATGCTGGAGTTAACCTCAGACACAGGCACAGTTTACCGCGTCGCGGAACTGGCGCTCGGCGGAACCTGGTAAAGGGACAGGGCTTATCACCATCGGAATATGCCCAGCGTGGCCGGGAACTGGCGACGATTACCACGGCGCCGTTTGTTACACCAATAAATTCAAGCGCGGCGACCTGTGACGCGTTGGTACAGGCGCATGGCGCATTAAAGGACTGGCATACTCGCTGATGAAAATCGTAACGATGTTCGCTGGCTGGCGACGCGCTGCGGCATTGGCGGGCCCGGAGAATGAGGCAGTTCCATTATGCCTGGTAAAGTGAATCCCGACCCAGTGTGAAGCGGTAACGATGCTATGTTGCCAGGTGATGGGTAACGATGTGGCCATTATGGGGGCGCATCGGGCAGCACAATCTCAACGTCGTCCGATGGTTATTCATAATTTTCTGCAAACGGTGCGCCTGCTGGCCGATGGCATGGAAGTTTAATAAACACTGTGCGTCAGAATCGAGCCAAACCGCGAGCGTACGCAGTTGCTGATGAAGGTACTGCATACGCACATCGGTCAGCGGCGGAGATTTGCGAAGAAGGCGCATAAGAAGGGCTGACATAAAAGCCGTGGCATTAGGATACCTTAGCGACGAGGAGTTCGGCACTGGGTACGTCAGGTTGATGGTTGGCAGTATGACGCCGGGACGTTAATCCGCCACATACAGGTGCAATCATGAATAATCAGTCGCAGCAGCTGCGCCTGTGGTTTAAACGGTGCTACAAGAATATTATCGGCGTCATAGTTTTCCAGCCCGCAATATCGGCACGCCAGCGCCGTTTTGTCCGATGTTGGTGTGCGATAAC * NM:i:309 ms:i:2478 AS:i:2478 nn:i:0 tp:A:P cm:i:104 s1:i:811 s2:i:0 dv:f:0.0826 MD:Z:44^CCG37G1A20^T4^A15^A1^CA13^T1T79^A84T0A10^T16A6G4^GC22G0C12C42C4^C6^AC4^T6G48A0T0G0A46^GA44^G35^GG1T7^GA15^G21A16^GA14^T37^C15G1G31^TGGCGAT2^G0G6G3^C9A0A2^AA1^AT16T1^AGC7G7^AAT1T0T0G22^G1A0A0G11^T19A2G0C1^C8^T19G13G0G9A0A1^A0C0C7G2A0T3^CC1C0A9A15^CC9^G0C29^TG0C18^C1T2G4^G28^A1C0A14^TT1^A13^CA29^CA15^A2A0C4^C12^GG0C9^A19G15^CCGC1C11^G6^T39^A11G0G0C17^C0C21^TCCGGC0C18A1^ATTGCTAT11^AGCC93^TAA3^G18^TTT0G1G7^TCTA60^G2^T21^G22^TAT13^A4^GTCGCTG1^TGCT3^C5^GCTCA14^TA1^GATAA28^A13^G0C4^AGCCTC38^AC13C1^GA53G2^G0C0G17G16^A41^A0C6^C28A1^TA0A8 +de1508c4-755b-489e-9ffb-51af35c9a7e6 16 Salmonella_enterica_snippet 2933 60 
2S11M2D2M1D81M1D50M1I7M1D50M1D13M2D6M1D26M4D20M1D54M5D35M2D3M2D51M3D7M1D13M1I17M1D2M2D33M1I141M1D28M1D2M8D66M1D36M1D7M3D2M1I9M1D18M3I11M2D32M5D11M1I57M1D36M3D4M1I21M1D44M2D58M1D54M1D43M2D45M1I6M1I34M1D6M1D12M1D12M3I5M3D2M1D28M1D6M2D9M1D57M1D3M3D35M2D36M1D38M1I45M1D6M2D63M1D28M1I5M1I8M1D7M1D5M4D35M2D13M1D22M3I9M1I31M1I5M8D1M3D14M2I23M1I4M1I30M3D2M2I3M1D3M1I11M1I4M1I1M2D4M2I5M4I4M1D21M1I49M89S * 0 0 ATGCTCAACCAGCGCAAATACGTCAGGCAGCGGTCACGTTGGACATCGCGTGCCAGACCGGCCAGTGAAAGCTAAATTTTGCGCCGCCCAGTTCGCCTCGTCACAGACGACGCTACCGCCCATCGCCAGGGCGATAGAGTGTACGATTGGCGAGCCCAGCCTGCAGCCGCCTGTCGCCCGGTCTCTACTGGGGTCGAGTTCGACAAAGGTTCAAATACGCTCAGCGCATCCGCTTCAATGCCGGGGTTGCTTCAACTATCAGCGTTGTCGACTGCCGTCGAGTAACAGGCTGACTTGAACGGTGGCACGACTGTAGCGCAACGTTCAGCAGGTTATCCAGCACTCGCGACATTAACCGTATACCGCGCCATAATCGCCGACTACACATGTGACAAGGTTTACGGCGCGTTCCGTCACGCTTGCACATCCTGCAAAATGCATGAGCAACCAGCGGAGGTCGGGTCTGGTCAACATCAGTTCATTTTGGTGGTCGGTCAAGGCGCGCATAGGTGAGCAGTTCTTCAATCAGCGCCTCCAGCTGACCAATATCGCGATTGAGCGCCTGTGATTCCGGCGGTGTCAGATTTTCACTCATTTCCTGCCGATAGCGTAAACGTACCAGCGGCGGCGAAGCTCATGGGCGATACCGTCAATCGCACTGGCGATCAGGGCATTAATGTTATCCGCCATCTGGTTGAACGCGACGCCCAGACGTTCGAAACTGAACCATTATCGAAATGCAGGCGTTCAGTCAAATGACCTCGCCGCAGCTGCGCGGTGATTCGAGCCTGAGCATTTCTTCTTGCCAGTGGGCGCATCCAGATAAACACAGGAAAGGCGAGAGGCAATGAGCTGCCATTAAGGCGATATCCAGCAGCCGCATTTGATGCAGGAAATAGAGATAGGGAACCAGCCAACCGCCAGGACGTAATGGCTGCGCGGGATACGAATAAAAGGTGTATTGATCGTCAAGGCGACGATATCGCCTTCACGCAGTCGCTGCGTGGTGGCGGTATCGCTTATAGTGATTTAACGGTTCAACGCGTAAATCGAAAGAGAGGTTCAGGTCCATCTCTTTAACGTTTTTCCCCATTCACGCGGCGGAATTTCCCGCAGTTCGCTGCGCATCGATAGAGCGAACTTTTCATCAGATCGTCGAGCGATTGCCTGCCCGCGTTCGGCAGTGAATTTGTACACCAGTCCGACCAATAAGGTCAGCGACCAGGAAAACAGACGAACAGCGTAAGGTAAAACTGTACAACAGCTTTTCATCAACGCCACCGGATCTCGGGAATACAATTTATCGCGGCGCTGACGGAAAAGATTTCCGTCAGCCGCTGGTGGTTATAAAAGCAGGCTTTGCGTCAACTGACAACCTCATTGTCCGGCGTTAAGGTCCGACCGTGAACAATGCGTCCATCGACAAGTGTCAAAACAGCGAATGGTATCAATTCTGTCTTCCGGCATCGTCATAATGGCTGATTGAGTACCGCCAGATCGGCCTGTTTTCCGGACGCTAAGCTGGCCCCGGTGTTGTTCGGCAAAGGCGAGCCAGGCGTATGATGTATATAACGCCAGCGCTTCCAGATGCGCGAGCCGCTCGCTGGCGGGTTTACGCACGGAAACCCTGACGCTATACCGGTGATATGATATTCCGATATGCGCGCCATTACCGGTACGCACGCGTCGAATCGGTGCCGCCAGCTACGACAAGCCTTATCCAGCGCCATCGAACTGGCGGTGAATTGTCCGCTGCCGCGCCAGGGTGGATTGGCGTCACGGATGGCAAGCCTCTGCAAAGCGGCACAGTGTATGCTGTAGACCCAGCTTACGCATTTTCTGCCTGCCGTCTGTGGCGAACCGGTATTCAGATGCGCAGCCACCGCGTGTGAGGGCGCAGGATCAGCCGCCGTGCGCCATTTTCCGCTCAAAAATCGTCAATATCGCCGTCGGCGCTGTCATCGGTATAGGTATGGATCTCGGGCGGTATTCCCCGAGGTTAAACACTCCAAGCAGACGCCGAAGATATTTGACAGGTAAAGTAACTGCTAAACACCACCTCCGAACTGAACGGGGGCGCCACCA * NM:i:203 ms:i:2962 AS:i:2962 nn:i:0 tp:A:P cm:i:171 s1:i:1182 s2:i:0 dv:f:0.0532 MD:Z:11^GT2^A81^T57^T2A4C37C0T3^A13^CG6^C23C0C1^TCAT18C1^T36T17^CATTA35^CA3^CA1G49^GGG7^T19G0C9^G2^GG11T0C21C44T5T59A0A27^T28^A2^TGTTTTTT66^G36^C7^AAA11^C0G28^GC32^AGAAA34T33^G0G35^GCT25^C40C3^AA58^T54^A43^CG45T22A0A12C2^A6^T12^C17^ATT2^C1C26^C6^GC9^G57^G3^TTT35^GC36^A83^G1C4^CG12T12C3T31C1^G36T4^G0A0C5^C0A1T2^CAAT35^CT13^G6C5C46C7^TAAGGCCC1^ATT0T13C18C0G3A0A31^AAT5^A4A14^TA1T8A2^T45C9A0A13 diff --git a/test/data/aligner_output/fastq_runid_9a076f39fd3254aeacc15a915c736105296275f3_2.sam b/test/data/aligner_output/fastq_runid_9a076f39fd3254aeacc15a915c736105296275f3_2.sam new file mode 100644 index 0000000..d03306e --- /dev/null +++ b/test/data/aligner_output/fastq_runid_9a076f39fd3254aeacc15a915c736105296275f3_2.sam @@ -0,0 +1,2 @@ +@SQ SN:Salmonella_enterica_snippet LN:10000 +b7096acd-b528-474e-a863-51295d18d3de 4 * 0 0 * * 0 0 * * diff --git a/test/data/aligner_output/fastq_runid_9a076f39fd3254aeacc15a915c736105296275f3_3.sam 
b/test/data/aligner_output/fastq_runid_9a076f39fd3254aeacc15a915c736105296275f3_3.sam new file mode 100644 index 0000000..5660bd9 --- /dev/null +++ b/test/data/aligner_output/fastq_runid_9a076f39fd3254aeacc15a915c736105296275f3_3.sam @@ -0,0 +1,2 @@ +@SQ SN:Salmonella_enterica_snippet LN:10000 +1f1a0f33-e2ac-431a-8f48-c3c687a7a7dc 4 * 0 0 * * 0 0 * * diff --git a/test/data/basecaller_output/fastq_runid_9a076f39fd3254aeacc15a915c736105296275f3_0.fastq b/test/data/basecaller_output/fastq_runid_9a076f39fd3254aeacc15a915c736105296275f3_0.fastq new file mode 100644 index 0000000..d13c25f --- /dev/null +++ b/test/data/basecaller_output/fastq_runid_9a076f39fd3254aeacc15a915c736105296275f3_0.fastq @@ -0,0 +1,4 @@ +@db6b45aa-5d21-45cf-a435-05fb8f12e839 runid=9a076f39fd3254aeacc15a915c736105296275f3 sampleid=AMW_4b4 read=90479 ch=69 start_time=2019-02-08T09:13:46Z +TCGGTATCTTCGTTCGGTTTCGGAGGTGGGTGTTTAACAGATGCTTGCCTGTCGCTCTATCTTCGGCATCTTTCGCCGGGTGTTCCAACGCCAGGCGGCATGACGTCATATCAACGCCAGCGCTTCCAGACGCGTGAGCCGCTCGCGGCGGGTTTACGCACGGAACCGCCTGACGCTATACCAGTGATATGATGTTGTGCGTGCCAGACACCCGGCAATGCCAATACGCGTCGAATCGGTGCCGCCAGCTACGACAAGCCCTTTATCCAGCGCCAGTCGAACCGGCGGCGGGTAATCCGCCGCGCGGGTGGTTGGCGTCACGGATGGCGGCCCCCTCAAAGTAAGGCCCCATTTGCACAGTGTATGCCAGACCCAGCTTACGCATTCGCTCGAGCGTCTGTGGCGAACCGGTATTCAGATGCGCAATAGACCAGCTGGAGAGGGCGCAGATCGTACTACTGCTCAAATCAGTAATATCGCGTCGGCGCTGTCGCATCGGTATAGGCATGGATCTCAAGCGGTATTCCCCGTTTTTGCGCAAACGATGCCTGACCACTGGCGAAGCGCGGTTTGTCCTGCTCTGAGGAAACCAGGGCCATCCGCACGCCGTCATTCATTCCGGCCCGCCAGGCTTTCCCCAGGTAAGAAAGCCAGTTGCCCGTCATCGGCACGCGCCGGGCAAAAGAGCCATCAGGTTGCTGAACCTGCGCTTCATGACCTTTCGCGCCGGCTTTGTACCGGAATGCGATACCCCGCGCAGCGGTAAATCCCCTGGTTTCGCATTGCAAATGAGCTCACCCAAGCGGCGGGCAGGCCCGGCAGAGGGTCAATGATTAGGTCACGCCGCGAACGTTCATATCAGCTAAAATTGCCGCAGACCACCCTCGCGATCGGCGTTACGACTTATGCTGGCAAAAGCTGGTTAAACGCGGCGATGTCACCAAATAATTTCCCCGTGGCGCTACCTTTTGCGTCGCGCTCTGCATGGATTCCCGCTAAATCAGGAGGAGAGGTGTCGTTAAGGCCAAGTACGTCTATACCGCGCTGATTCACTAACGCATAGTCGTAAAGATACTGAATATAATGGATGATCGGGAGGGCGTGGCTCAGTTCGACTACAAATCGGCGCCCGGTTTTCTGCAAATTGCGCCGGTATCCACGATCCACTACATTACCCATTGATCGTGGAGGACGACGGTTAGCGTCGGCGCGTAATTTATCCAGCGCGTCTTTAAGCGAAGGGCTGTCGTACCAGTAGGTTTTCGAATGTCCAGGTTTGTCCGCCGCGAATGGCATGAATGTGGGTATCCGGTCAGGCCGGGTATCACGGTTTACCCTGTAAATCGATGGTACGGAGTATGGTTGCCGCGCCATTCATTGTCGCCGTATCATCGCCAATCGCCACAATCCGCGAGCCGAATCGCCAGCGCGCTGGCCTGCGGCTGGGCATCATTCAGAGTGATGATATTACCGTTATGCAAAATAATATCGCTGTCGGGCGGGCGGCGGTGGCTGCGCCTGTGGCTACGTGAGCAGAAAGAAAAGCGCAGTTCTGGAAAGCAAGAGACGGGACGAGACGATCATTAGCATGTTCCTTTTGTCGGGCGATAAGCAAACCATAGAGCGAACAAGAATTGTTCAATCTGTGCTTGGGGAACAGTGCGTTCTGTTTTTTAGAACGGTGGTGATGAAAGAAAAGCCCGCCGAAACGGCGGGCTTAGGAATTAATGAATAATTAGAACTGGTAAACCATACCCAGCGCGACGATGCATCATCCCGGTGCTGATACCGGCATCTTTATAGAACTGGTCGTCATCATCCAGCAGGTTGATTTTACGCATAATCAACATAGGTGGAAGTTTTTGTTGTTAATAGTAGGTGGCACCTACATCGGCATATTTAACCAGGTCTTTATCATCGCCATTAACGTTGTTATAGGTCAAATCTTTACCTTTAGACATCAGGAAAGAAACCGCCGGGCGCAGACCGAAATCAAATTGATATTGGGCTGTAACTTCGAAGTTTTGAGTTTTGTTTGCTACGCATGAGCGACAATCGTCATTGCCGTAAGGCGTCATATTACCGGTTTCTGAATACATTGTGGCCGAATAGATATTATTAGCATCATATTTAGCGCCAACGGTCCATGCATCCGCTTTATCACCGCCAGCGGTAGAGTATTAACCTGACCCATTAGTACGATGGAAGTGGTATATGCCGCGCCGAAACTTACATATACCAGATCATAAGTTGATGAGATACCGAAGCCCGTCACCGTTAGAGTTCTTCACGTTACGATCGCCGCCGTTATTAGTGCCTTCCTGCTCGCGAAACCTGAACTTCATTGGCACTGATATTGCAACGCGCGAAGAGTTCAGACCGTCTACCAGACCGAAGAAATCAGTGTTACGGTAGGTGGCCACCGCCGTTAGCGCGACCGGTCATGAAGTTGTCAGCGTAAGTGTAGGAGTCGCCACCGAACTCAGGCAGCATCGGTCCAGCCTTCTACGTCGTACAGGACGCCGTAGTTACGACCATAGTCGAATGAGCCGTAGTCGCCGAATTTCCGGGCCGGCAGATGCCGAACGGAGATCAGGAGTTAGCACCTTCGCCTTCGGTAGGTAGCCTGAACGTTATATATTCCCACTGGCCGTAACCGGTCAGTTGGTCGTTAATCTGCGTTTCGC
CTTTAAAGCCGACACGCATATAGGTCTGATCATACCATCTTTAGAGGTCATCAGAGAAATAGTGCAGGCCGTCTACTTTCCCGTAAAGGTCAGTTTGTTGCCGTCTTTGTTATAAATTTCAGCGGCGTGTGCTGCGCCAGCAGCCAGCAGAGCCGGGATGACAAGTGCCAATACTTTTCTTTTCATTTTTATCCTTAAGAAACTTAACTTATTTGCAAAAGATTGAACTTCTACAGATTCACGTTGAATCAAAGGCATCCTAATCTGAATAATATTATTTCAACGAGTAGCTAACGCTGTATATCGTTTTGTTGATTTAATACAAAAGTTACTATCGGAAACGCATATATTTATGGTGAATATATTTGTTATTATGTATTCATGGCTGTGATTTGTTTTATTTCACAATTTGCGAAAAGATGGCAATATAGA ++ +,,%%,$.$"(*0-2.%'),-%#%#$%$(%$)'&'%%$##("+%#$$*.+.+64(41:.9:<==:1-%%2*$&'&$""'(+,&"#$#$$$#-/&*467+&&''$(&-#"##%$((*''('-&''',=<458=:;=<537()6;)421&/%%+3663)324392>64/0.-,*4598<=64333%(*031,+337,+.%%%./;2.2$&()),,-'-8>:=1323&$'04249=//076-/;=C-;90/34:57/')56+52%%16<=AB<@>??>;<>>,+.<:9911;=@=>6/37142;>>1,))))(%%26;;5965&&-::;;<.086***-#$+5<>>?@22(9?=@CG@?34:@=54**6:8.&28811443;;;-:*-+%#%,/0/,,,0%#%,$%169<>77=>:6../79:6'&&5?%.9?>3+4797C::>==@A?;5%%%&++)/+$%####''&)#"'**-.3329-$-<:@B97;'625:6260(*./%$(&-)-2216=9)(/42><;>>39;8.03)'%$$$#%$$$*+///1<:94#()&''$47-%>;3<9<==;<>B@9<;>8345854C<57,''&''&$,,,/12,*4/0%#%)&&$%(('&)798$*=<././6/$##"%,)##(1594352844:*67>;.2//&%&&.-+)))7-099@72/47<<@?7124.86@B@C932=:>7.-%#'&)*,*##(&#$'(0/0-256=73300-..2,*(-')09@>?D@<-++*&0/)6809<:;6.99I<02&;)67/&&&077+877<;2-0>@>:864:?;98?9:;682>9283887<<8<$$+3474110=>MIM86@<:859<<::<<8:476855')()$&#=?<;838:<<47E>6213%)(+))6&/&&&+:@?38<=&=.36=855<9:88:%9.;5,&,98>551;7((/4///3;=?ED>?6>?@46<&65<=16>2<8>*6/.-.1,1=9<4658=><3987:9988;>>;=;0%$%$4==9;843244++4-6:6@:504011,$+&,345777*/0<4,896:<;=?@?B377.32555702677===@8=6899@:<;;;>@*,8=<49;926-&+,14?18>=89?9557689><.)+#%%-#(&&3**.49A?&45;:2;>;().466455(;;=>3:<:757229010&*34332'*)7.45(&&*$$'-8548.3.8B9448;001-/-&;9:=:69;971&:8<))58669:;===A=:524D<1583/29;:+&@@<>2>=++2946;66;:5.))*0150---.+%##$&0111,,*--.)986/84/((477;-620>?-,,,&%'*,*;;97$%3*%&5??2262+11<7;A77;=A?<=<,8927B?G<5:=<(70100.).01127==3432/%'-)*,3'#$.&###$9<4A?.,,<966:00303<622.AA7:5(($$$$$$$,$'#3112:988.-+2)0*.&&/3:661=;5:*3;<7..,$$#&%$(#$(-7:8::9+97/-/008-98=4$,;>@C<9;A?FB??75:5;;<8>?<=@?A.,,,;;<74@@@8,(((:<=9;<>;:;66;1==:6):=A>6<844'-/565$$(&(00-/57::>.048;,2>@<==<=E=><)3;<:7AC<;9(10::;::@B>A;<:3:512*609446:7>ED>210,,,*++-7:628+,0;?0<:<;;=>,:;67=@>A52263200=;H@369@96<:9('+,)*,//32:9<9;=5&6=<<:+611107885#-''.&$&,8;<=89>=<<:64262%/(##''(845?1598*''57)-++**+578?>8(''6=<9(4&&+5<82+,8B>9969:BBBB;,,%.+(%*46-.<88:541.--//3.87:74-8,,('$/%&+'68:9@@*%*+,67934440($$*)$&&)*=?@67C:>99><+=5-:80,1(&(+2118A=?979@=6<6789:?@745=>C>==A>==9;=DC?EA//5+.:188=@;7:@=@=730+./0,#($()124##'&.-/2)6:<33-/67,,+)#$&$$&&79:8-*$56568.#'566?AC>9=4%::>@,5=>@?<=>85574;:903%#$2569=4$*##&/;8*0288/69:;>;???B0?=C@N@;;>:2/&+896*1')./545/1,&3+&&#$#18:5:==78<;;A=86;5?<6'((97::=,39;---3625659?488=>?<+/,9:;;;$+&*+3*##$$9434$#$((($-#;>):9+/366544;=BC@>=?==;(46/3;7.6:64444=85=9;5*787221)+/42/0:@JA;>9=;?2@=>@=75--:@??AA>55=ED:;2::???1,-,-0-+161---,,)+9:80.,1()*&''5;=DCA+(&(+,/4?D<08587082BB8>?E==89/-150(+:=@=6=9:8.A9<<89=90:2+0+6475:66.317('()&)75&8547<427-28023A=:722@CD<=@HE>>@@@8:>=>:58H:*9>?>A<;:((./$,,&/BD723116<=2-111158<=C?<1$(&3+B@:;1<>;>7;?AA=24;':9=.*,/0/8?;8794<3437/--;;;34&$# diff --git a/test/data/basecaller_output/fastq_runid_9a076f39fd3254aeacc15a915c736105296275f3_1.fastq b/test/data/basecaller_output/fastq_runid_9a076f39fd3254aeacc15a915c736105296275f3_1.fastq new file mode 100644 index 0000000..fcc04b2 --- /dev/null +++ b/test/data/basecaller_output/fastq_runid_9a076f39fd3254aeacc15a915c736105296275f3_1.fastq @@ -0,0 +1,8 @@ +@0f776a08-1101-41d4-8097-89136494a46e runid=9a076f39fd3254aeacc15a915c736105296275f3 
sampleid=AMW_4b4 read=117011 ch=9 start_time=2019-02-08T05:43:48Z +TCGGTAGTGCGCTTCGTTCCAACTCGATTTGGGTGTTTAACTTGGTCTTGCCTGTCGCTCTATCTTCGGCGTCTAGGTGTTTAACCTCCTCTGCTGTCCCGCTAAAGGCGGCGGATCAGCGAATAAAACCTACCTCTATCAGGAGACAAAGCGCTTCTGACGCCAGATACAAGCTCAAATTATCTCGTCAGAGAAAGGATCTTGCGTACGCTGGCACCACCGCCTGTCCCGCCGTATCATCACATTGCGTTGTCATTATGGCGGCACCTCGGCGGAAGCGAACTGAAGACCGTTAAACTGGCGCGGCAAAATACTATGATGCGCTACCAGCCGAAGGCAATGAACACGGGCGAGCCTTCGCGATGAGCTGGAAAAAGGGTATTTAGCAGGCGCAAAATCTTGGCTTAGGCGCGCAGTTCGGCGGCAAATATTTCCGCTCACGATATTCGTGTCATCCGATTGCTGCCGCGCCACGGGCGGCATCGTGTCCGATAGGTATGGGCGTGTCCTGCTCCGCGGATCGAAACGCATTAAAGCGAAGATCAACCGGGAAAGGATCTGGATCGAAGCTGAGCCTCCCGGTAAATATATTCCAGAGGCGCTGCGCCAGGCGGGAAGGGCGAGGCGGTGCGCGTTGATAACCGTCCATGAGCGAGATACTGCAACAACTGTCGCGGTATCGACATCAACGCGCCTGTCGCTGAACGGCACGATTATTGTGGGCTTGCGACATCGCTTTGCTAAACTGAAAGAACGGATGGACAGAGGCGAAAGGTTTGCACGTTACATCAAAAATACATCTATCTATTATGCAGGGCCTTAAAAACGCCGGAAGGAGTATATGCCTCCGGGTCGCTTGGGCCGACTACAGGACGAATATGGACTCTTATGTTTGATCAGCTCCAGTCACTGGGGCGGCAGTATGATCATGCGGCGAAGGCAACCGCAGCCGGGTGACCGATGCCGCAAGAAACATGGCGGCTTCTACCTGGGCAGTATCGGCGGCCCGGCAGCCGTTCTGGCGCAGGGTAGTATCAAGCGCCTGGGTGCGTGGAATATCCTGAACTGGGTATGGAGGCTATCTGGAAGGAAATTGAAGTGGAGGATTTTCCGGCCTTCATTCTGGTGGATGACCAAGGCAACGATTCTTCCAGCAGATTCGGTCATCACAGTGCACGCTGCGTTAAGTAATGTCATACGCCCTGCGGGGTGGTAGCAACATGATTCGCTATTTTGCCGGAGAACGGCGCAAGCGCATATATCGGTCTACCGTGTGTTATTATCGTTAATACTCATAAGACCGCAAACATGTGAGCAATGTCGACACGCAGTCTCATCAACTTGAGTAGGAGGAGCAGGTAATGGCGTGTACAACGGTACGCCGCGAAAGATTCAATGGGCGCGATTGAAGTCCCGGCAGATAAACAAACTGTGGGGGCGCAGACTCAACGTTCGCTGGAGCATTTTTCAGTTCCACGGAAAATGCCCGTCTCCTCATTCACGCTCTGGCGTTGGCCAAGCGCGCCGCTGCAGGTCAACCAGGACTAGGGTTGTTGGCGGCGGAAAAAGCCAGCGCGATTATCAGGCGGCTGATATGAAATTACTGGCGGAAGAAAACATGCTGATGAGTTTCCGCTTCCAGACCAGGTCGGTACCTCGCTCTTTACCGCCGAATATGAATGAAGTGATGAACCGCCTCCAGTGATACAGGCGGCGTTCGCGGTATGGAACCTGAGTGCATCCCAAGCAATGACGTCAATAAAAGAGTCGAAAATAAACGATGTTTCCCAACCGCCATGCACATGGCGGCACGGCGTTACTATAACGTTACGCGGTCTTTTATCCCACAGGTATGTTGTTAACGGATGCGCTTCGCGATAAATCACGCTTTCTCGATATTGTCAAAAATTGGCCGTACTCTACCAAGGACGCGACGCCGCTCACATAAGCCAGAGATTTCCGGTTGGGCTAGCCATGCTGGAGTTAACCTCAGACACAGGCACAGTTTACCGCGTCGCGGAACTGGCGCTCGGCGGAACCTGGTAAAGGGACAGGGCTTATCACCATCGGAATATGCCCAGCGTGGCCGGGAACTGGCGACGATTACCACGGCGCCGTTTGTTACACCAATAAATTCAAGCGCGGCGACCTGTGACGCGTTGGTACAGGCGCATGGCGCATTAAAGGACTGGCATACTCGCTGATGAAAATCGTAACGATGTTCGCTGGCTGGCGACGCGCTGCGGCATTGGCGGGCCCGGAGAATGAGGCAGTTCCATTATGCCTGGTAAAGTGAATCCCGACCCAGTGTGAAGCGGTAACGATGCTATGTTGCCAGGTGATGGGTAACGATGTGGCCATTATGGGGGCGCATCGGGCAGCACAATCTCAACGTCGTCCGATGGTTATTCATAATTTTCTGCAAACGGTGCGCCTGCTGGCCGATGGCATGGAAGTTTAATAAACACTGTGCGTCAGAATCGAGCCAAACCGCGAGCGTACGCAGTTGCTGATGAAGGTACTGCATACGCACATCGGTCAGCGGCGGAGATTTGCGAAGAAGGCGCATAAGAAGGGCTGACATAAAAGCCGTGGCATTAGGATACCTTAGCGACGAGGAGTTCGGCACTGGGTACGTCAGGTTGATGGTTGGCAGTATGACGCCGGGACGTTAATCCGCCACATACAGGTGCAATCATGAATAATCAGTCGCAGCAGCTGCGCCTGTGGTTTAAACGGTGCTACAAGAATATTATCGGCGTCATAGTTTTCCAGCCCGCAATATCGGCACGCCAGCGCCGTTTTGTCCGATGTTGGTGTGCGATAAC ++ 
+++%$2##%###)*+1566-,$#"%$$#$%1)(&00-.++)###)$/$#&/545266.24/''66981$&599/-$$)1//00:95775&0-,$+.'13/.#$6-//-2%%(56:588#,')135?=@;;310116906460064..&%*+0320020/0470/0*)$)&"*)+472/-'#1.41)+53-#&&++++$&$&%'%%,04**%$'%%"####+,/33614733)++%*(/+&*#$(.080'*9<;<43&$$.'022%&%%3:&&21...4-*)21*)(),-),1+128;A<46--4=E<<4=?>::20352$''9>:?>;2948;/;=6668<1'&&10*+'**&*''&#$%$'78+5-55(%#'$,3945019/.50::.0969:*(--632.#"$$$$%(-0A?87551226;:76D@39853(<::>A;A;C?B-000.()&)&')8:;15=<9129;1;<@A./*+888<>0+,55..<&&83.-&,*&&+2>89@?C5<&3C?*(')(%'46864443585/0<:/-4+++,-/*'''71*)2/3+*'#$#'/1943..///-.9=28:9665&)*++*.&0&&(/.,(+$$$#+.0+,60'##$%#$&1213/1//35865%5788;:766+++%%))),<;=?,'&##$*++343=712-+&+31&377:933736.++,)('%#$,-5430+//*513''$()$$''$&&.87*.)3.//0/4))(+:853(*$8765)),8$$;>@<=:8<8<6445;B:?2<@==7',4-),--7<;770((8=4/1068,0(#%+(+'''(&&%%&)''(@@1-:7;<)(64>:31#$$䗉:=2.3?8:;..)))02,/((,1/21931,)+))+&&*#/+&+-6.%%''01+/637930/75,8<<*594433&*((3<87%+**.766:;*886:0122227:/+.+)$$%&#$+*(+'4422-6/9;1,1%&')9,'&(+420.1,50-$8114.:3$$&/.40'+1)&%(%''&0$$#&&(26764<900:.444212.1>1078;/2100417,,*--0+78;1842%%''*8348<=,242:*+0.3,4-045870.2&&,,*6878.*'-*(%#$##$79:;;<:51%%%&$35&((&%%$&$$*+1**3-+"*%$""#+',.?8;=523)-*&)03.-0&11;:=;/1558749:;;4..,1)$*$&&$%%'$%%%%)*'%#%%&--%(+))/3%**+*4$&(*'($&$$$.6//032*)'&$$䁲@9666.*+378-4*/,($).-%%($$"#&//&*&###'&&2200(+#$#&&$'''(,.75.)&&((210''(&',*(%,1.4$-)%),+//0@7:9876,/%#$&*/$)/-()').4,4421$/35566+40/1(#$$$$*4676*),84350..74,*$*&%'.01+*+4<=;;8$0&3$###&###&',#'((*&&%,%%$%+)''3,+%,-754-+:<@>:;3,%(940.18.+2061))#,0#$&&,%%%&$$&2*&$,:32)&.,-0*,+0%#%'283$*#&1*4(&.3$7/31,.342-7.8<:>5-/29:;:@379%&%(-1688@<8'-0$$#''...3/*475&&-.*&%&$&#--,./')2//44;0,80+*+;9977:<<:=A>>>G=90775631++&##47321,&2--/430%%'1030+95033%%)(&'/0+,%$4'&),.,''+666:00%&($&&'-6-3134)$$(%%49<:7((.897??>:151/$'',+,86++0/#%%%(86+3++$034:9,13.=;<7.-.-/'%)));;:?;%172B;8979&$%%11$$%%&&.+,3++/5<7,3+)(###$$##%-374:9:36=A=64=+5>;7866--2-+1AB;)$//1/2-82*.+668204:.++6;952//3*,-+36:500$-,*-,0==8>;=>?))&(87+)716899:=3.4993'(&$#(&-+0'#&'$4*%%$%7'37/--+'+&%**4:8&),(,+$$&-.4(/'(+5+%%%&%(*,+&&&68?=800+-+/,+&$$$-144.$%1,<=95:9()/':/(*(0103484(&.&27<><5($$&-$488/04:9(+=:;:9<::85+*()288:>6:6<5/48&$0*:;;884.00.-+0(*,.*-%$%'-4&&$$&96545%1;;.++.**20/3-0/.,$+)0*-,,0(*'14123&$$$##%4<58558*-0597*+)6/:@?A=;.=<<'6B@;6+;&(/.-/77&+))(&#',++-9:4*$7$/,*$7$0-269<*'# +@de1508c4-755b-489e-9ffb-51af35c9a7e6 runid=9a076f39fd3254aeacc15a915c736105296275f3 sampleid=AMW_4b4 read=362464 ch=248 start_time=2019-02-08T15:48:18Z 
+TGGTGGCGCCCCCGTTCAGTTCGGAGGTGGTGTTTAGCAGTTACTTTACCTGTCAAATATCTTCGGCGTCTGCTTGGAGTGTTTAACCTCGGGGAATACCGCCCGAGATCCATACCTATACCGATGACAGCGCCGACGGCGATATTGACGATTTTTGAGCGGAAAATGGCGCACGGCGGCTGATCCTGCGCCCTCACACGCGGTGGCTGCGCATCTGAATACCGGTTCGCCACAGACGGCAGGCAGAAAATGCGTAAGCTGGGTCTACAGCATACACTGTGCCGCTTTGCAGAGGCTTGCCATCCGTGACGCCAATCCACCCTGGCGCGGCAGCGGACAATTCACCGCCAGTTCGATGGCGCTGGATAAGGCTTGTCGTAGCTGGCGGCACCGATTCGACGCGTGCGTACCGGTAATGGCGCGCATATCGGAATATCATATCACCGGTATAGCGTCAGGGTTTCCGTGCGTAAACCCGCCAGCGAGCGGCTCGCGCATCTGGAAGCGCTGGCGTTATATACATCATACGCCTGGCTCGCCTTTGCCGAACAACACCGGGGCCAGCTTAGCGTCCGGAAAACAGGCCGATCTGGCGGTACTCAATCAGCCATTATGACGATGCCGGAAGACAGAATTGATACCATTCGCTGTTTTGACACTTGTCGATGGACGCATTGTTCACGGTCGGACCTTAACGCCGGACAATGAGGTTGTCAGTTGACGCAAAGCCTGCTTTTATAACCACCAGCGGCTGACGGAAATCTTTTCCGTCAGCGCCGCGATAAATTGTATTCCCGAGATCCGGTGGCGTTGATGAAAAGCTGTTGTACAGTTTTACCTTACGCTGTTCGTCTGTTTTCCTGGTCGCTGACCTTATTGGTCGGACTGGTGTACAAATTCACTGCCGAACGCGGGCAGGCAATCGCTCGACGATCTGATGAAAAGTTCGCTCTATCGATGCGCAGCGAACTGCGGGAAATTCCGCCGCGTGAATGGGGAAAAACGTTAAAGAGATGGACCTGAACCTCTCTTTCGATTTACGCGTTGAACCGTTAAATCACTATAAGCGATACCGCCACCACGCAGCGACTGCGTGAAGGCGATATCGTCGCCTTGACGATCAATACACCTTTTATTCGTATCCCGCGCAGCCATTACGTCCTGGCGGTTGGCTGGTTCCCTATCTCTATTTCCTGCATCAAATGCGGCTGCTGGATATCGCCTTAATGGCAGCTCATTGCCTCTCGCCTTTCCTGTGTTTATCTGGATGCGCCCACTGGCAAGAAGAAATGCTCAGGCTCGAATCACCGCGCAGCTGCGGCGAGGTCATTTGACTGAACGCCTGCATTTCGATAATGGTTCAGTTTCGAACGTCTGGGCGTCGCGTTCAACCAGATGGCGGATAACATTAATGCCCTGATCGCCAGTGCGATTGACGGTATCGCCCATGAGCTTCGCCGCCGCTGGTACGTTTACGCTATCGGCAGGAAATGAGTGAAAATCTGACACCGCCGGAATCACAGGCGCTCAATCGCGATATTGGTCAGCTGGAGGCGCTGATTGAAGAACTGCTCACCTATGCGCGCCTTGACCGACCACCAAAATGAACTGATGTTGACCAGACCCGACCTCCGCTGGTTGCTCATGCATTTTGCAGGATGTGCAAGCGTGACGGAACGCGCCGTAAACCTTGTCACATGTGTAGTCGGCGATTATGGCGCGGTATACGGTTAATGTCGCGAGTGCTGGATAACCTGCTGAACGTTGCGCTACAGTCGTGCCACCGTTCAAGTCAGCCTGTTACTCGACGGCAGTCGACAACGCTGATAGTTGAAGCAACCCCGGCATTGAAGCGGATGCGCTGAGCGTATTTGAACCTTTGTCGAACTCGACCCCAGTAGAGACCGGGCGACAGGCGGCTGCAGGCTGGGCTCGCCAATCGTACACTCTATCGCCCTGGCGATGGGCGGTAGCGTCGTCTGTGACGAGGCGAACTGGGCGGCGCAAAATTTAGCTTTCACTGGCCGGTCTGGCACGCGATGTCCAACGTGACCGCTGCCTGACGTATTTGCGCTGGTTGAGCAT ++ 
+'((+&%##$$$$'3681#$'.)%'#%')033300-&$###'-&())-$))()6'$$$).856988655<:9;000-/$$6*'();9//480(-$-.0034)#&%#%4>=;9342::9>@42A>5865**99:@?;*(($%&5778:9+(,::;9700..%%/%$%%($$#&$#$#%#%%#$308953)32:4,,2++)*(%$&)%$#%'$&((-3*3348.77'=4=84;9+&&-#$$$###'$$$--,->?:>:./21//(('$$697=?78958:')+789(9<>62/)%%*+0+-1:82-/,559@<7794+:7A=..+%%/0::<<;4-+%#$/$%%8&-1'&+$$&#%#$#$##&$/..3459=;@-/443%/037>73(74366.82643133.;2+++154<=8=?B60#&-*&&(005$$#$$//04//.*122)&)982;;8-(('$/-#$%-)21/-08;;60/+./35634449+&%',)059<<*('''$389:=?4<2098.-4A731'498>=45@<4)/8BA<::++*2//+-**58@>=?>>BLF>2(')%&',1)5(02*+--,<<=;<==<9*(,--88/1,<:??>0.)02#,,+**(157,;@;;8;A@>42.(85120,,-0:30-6=-=?=802$+/01+0+->34@C<=8B031,---6--*(3<3847787;<41,.64/02/098<:4**+-&%$$.##"$,$$$)(*09><<6?@@>A<::9:7::74-87<:3+*$%&76;@;,&'$$')')26:=2982$'-5@@;(13-70&$%5-/-))*'.83370643/0+4%()1;<3$$01?>747.-229,%%/348756''05<389:85*2109:;95:8=91472*/537077;;0693:99?=:=8<=3574(1287-'.1.99,-'''/8:222+../02453:93('(%%%312458=B9<:3??68;;:6,#$%==$:0==>?;.)$#'49:::::510758E@@=613-57%(/;;==556%%$$/,,2,,*)$$78155,*().89523.5<>:=A@>;9;6:9:;1,47:;<::,8:<(6::7+,$049+553:%%334:2//5/',-2)'6&%&($"#%'+$/29/.226:1/*012:9=0+,8.=>A;>B?><;=>;A=,<9;=8&%'87:<<<<::<=@9=07@>9(+134;=DB<==?A@>=000(&-#+--/30075=9A;=9:97011,4:;96*(%.884582/7187=CE0699;<7;=/&&-)89/4**599@;A@==:=64;AB9@?5FF@::5--49;;@;8CA=@;9938*))(:9?6/;78;4()+&(7:;B>@>:;9<==<26-36&-=9.:80/..&&46;>@@A:695;,667.985408.9-+))#-996889957168=?,-/83)215;7469<>@>;:)57,*+,7;:=>C?C=G=@217=;4**,*1870*((&''+87&,#6.&9;;>=:4/,3.,*656=65#$+6?:=@2,,676%$(+(()48630)0/*$$$$%&%001243221-&/0,95.,95('005122./4888+)*%&&*:8<=445:7-:69-284,))61/+.*004022;7;-#-(##%%%.('-011&$$$&-))-'),4;<19=<554/*<=6(*<:62))*40#+,9;=?9?:>76:6)$468<><82(6..221(94*+(/9498=?@=:>;A=756:;?:<;87789==9:,,+,8?99++4:<;8;=938:?AB9$)5<>A02@9;,92,$$.)%##%/522,+$$##&$##&#&$$&''%&&$'')*#%"$#%$$%#"$$%"%,()3/)55//17(7%#"#%-74419:59;789D>?;=0)<;?9=<334100005.)#%%+$$&.-852()+*/&#$$$$%$%%&%&&#%$&&$$%''%('#$2<=62,/&&.---(&$65.#%(&&&(*%%$##%###"#$$(*)($)-.*$('$6,-.'$*,,&%%&)%2'//*&*(4477'$+.4::9,/889?<:=20/21+'##&#%$%&"#$$&%%),2,(&&266*))*-*#)#)'##%&&#&)&$&'&$$%$$#$$#%&$'#%/2.3-2%$'-)%#$%#$-##($#*%%$"$'$$$*/.'516%#%%&#$%#$%&'&+,>;C;4$.07'%%35-//'%01,'$#'&$%%,#'$%%%('%$$'#'(('#902*),66.$"%$(+)((.-,;7/++*68%$#(+%&&%&&#$$#$(-0;35&&%-/;@<;>84++'&)*+*+)('$#$&&%'$#$$%&%'(()&$3-%&(-3,*%%&$*4-(1/$),2$#&%*&25=7981//.(4;;<>?69=>2533/*)6/70*'6;;2/+*,*+&&'%$$$::??>56;90/%,/,((&%$24;;?85,3576662.,///%$%.-7::3)#.%$##(('%&''+*##$+*/.1/99>A>6781176((*$$&%'/6664.**%67;9@@>;/21')*%#%$$$//,''+%'&)#%866:*36711/66+$%(-+)-55%$')6-:::4348867786$%$27A@<<1//61'&&%#)%%-,(0.*$#$""$%%-53.+212-7<:E557;@@:940%,$%##%$,#$%$$$%&('####"(2:9?622'0662:75=952/&$###$#$()%"#&%%#$)&%&&)+,/.3837+'&&,$'/(',.-.%%#$#$$&.$*-+..*.4244-.,*)*1875$&$##%&%"$#"#"$&&%#'-..+%(++4/%%)$%'((,3)762$$',#*)%'$$&&-./00)#%'$#%$+$2399657+;;88:>91,%%%%$%%'$'&$0$%12$%"&%%#$%'1&&&%%04)%%#$$%#$###%&&')%(.+(&%&##('&%$*(.$%)-*,0005:;@=:=;31:/13-/5)(*1+%1%)26-4=7><9;:..))(&/581$$'%/?;737:;2;;658'$$%(.*19>*$#%:98(301'')0)))./)$(%,++-007AE;AA>:>677)//0.'&*&&,'%%%%#%&..9:;<>9656454'##$%)675D<79:?<22/(//+,,,44/---;)+'064/'..+,.&*$##"####$#" diff --git a/test/data/basecaller_output/fastq_runid_9a076f39fd3254aeacc15a915c736105296275f3_3.fastq b/test/data/basecaller_output/fastq_runid_9a076f39fd3254aeacc15a915c736105296275f3_3.fastq new file mode 100644 index 0000000..2168925 --- /dev/null +++ b/test/data/basecaller_output/fastq_runid_9a076f39fd3254aeacc15a915c736105296275f3_3.fastq @@ -0,0 +1,4 @@ +@1f1a0f33-e2ac-431a-8f48-c3c687a7a7dc 
runid=9a076f39fd3254aeacc15a915c736105296275f3 sampleid=AMW_4b4 read=5131 ch=173 start_time=2019-02-07T17:46:53Z +TCATACGTTCATTATCAAGTGGGTGTTTATGATTTGCCTGTCGCGCTCTATCTTCGGCGTCTGCTTGGGTGTTTAACCTCCTCCATTTAATTTGCATGTTGTTCAGCATCAGATTTATTTTTTCCAAAACGCTTTATCGGTTTCTCTTGCACGACTGCAATATCATGTGCATTTCCTATGTTTGTTCCCCATTACCGACTCCATCAGTTTTCCTTGATTTACCCCACTTTGGTTAAAATATATGGAAAATTCGAGTTTTTTTCAATAAAGAAGGTGACATGTTGCACGATCAATCTCACACTCTCATTGGAGTTACGAAAGCGCCGCTGCATACTTTTTATACGCCAATACTTGCAATCATCAATTTAATCAGGATATTTTATTCCGCAGTTTGTGAAATGGGCATTATCCCTCCGTGGATTCCTTTTGAAGGACACTCCGGCTGGTCCTACCACCTCTTTCGAGCAGCTGGGCATCTCGCTTCTCGGAATGTCTCGGGTTTGTTTCATATTCTGTTTAATGCCAATGCTGCTGTCTGCTGTCTAACAAATGTACTGATGGAATTAACCAAAGGAAAAAAGTGCAGACAGTCGTTCCGACGATCGCTCTTGTGTTTATCGTCACAAACGGCTTATTTGCCAGGAACAGCGGATATGAAATAGCTCGGAAGAAATTGATGAAAAACCAGTAAACGTCGAAGGCCTTTAGACATCATCATTACGAATGGTCAGCAGATGGTGATCGTTTAGAACCAATTTCGCCGGTGGATACCGGATGCACCAGACAACCAGACCTCTCGCCAGGCGCCCATACTTTGATGAAAGCTCGTCCGCAATGCACTGAAAGACGAGAAACAGACGAGGTTGAAGAATTCCGTCTCGAACTGGCACAGCTGGGTATCGTTGTTCGAGACGAAGGAACACCCATAATATTGGCGAAAAGCACAAACCTACCCTCCGAACATTCAGCATCGCGAAGGAACATAACAGGGCCCAAAGAGCCTCTTCTTAGTACTGCACAATAAAGAGCACTCTGGATGCCGTCAAGGATAACAATCATTTTGATTCCCTACCCTGAGCCGGCCCCGCGATTATCAGAGCGTGACATATCGGACATCAGCTCCTGCGCCGCCTTCCTCACAAACCGCCATCAGCCACTCATCCCGGTCATAGCCTGGCTAAAATCAGGTCTTTATGATTCTGCCGCCTTTTGTCACTCCATCCTGTCGATGGTGCAAATATCAGAATGTCACTGCAATCGCATCTTTAATATGATCTGTTTCTAGGATAATAGCAAGGACAGCGGAAGAATACAACCTTTGTCATACGAAGCGCCTTGGATCTGTTGCGGAACCATTAAGCAAGAAGAACTGCTGCAGCAAGAAACAGGCCTGCCATCCATTGACGATGGCGCTCTGCGGCCACTTCAGTTGTGGGTTTGCTCACCGTCCCGCTGCGGGCGGAACACCTCGCGAGGTGGGGTGCAGTTCAAAACGAATTGAAAAAGGTAAAGTGCTGGTCCTGCCCTTGTGCTGCGCGTTGTCTTGTACAAGCAGCGAGTCTGCATCAGCAGGTATTCCTGATTGACTTCAGCGGTGGGATAGTTTTCCAGCGTGGTATAACCGGTCATGAGTGTGCGGATATTCCCTCCCCCAGCACCCGACTGCCGGGGCGCGCTGCGCTTCCATACGAATGCGCGTGTTTATCTCTGCGCCTTCTGCTCTTGTCGAAGTAGTCTCCCGGCCACTCCTTAATGCTCGTAGGTGGCATGACCGGTTTACGCAGGTTTGCCACGGTGTTGCAACAATGAACGTGGCTTCGTAAATCGAAATCATCAGCACCCACTGGCCAGTCCGCAGACTCTCGTTTGCCGTGATAGTGTGGATAAACTCCTTGTCCAGCTTCAGCCTTCCTGGTGCCACTCATTGGCGGCGAGTCCGGACATGCTGTGAGCGCTGATGGCATCCGCCAGCACCAGCGTGTTGGCTGTCCTCGCTGTGCTCAAACCACGAAGTAGTGCCCCCTCCTGCATCAGTCGCTGAAGAAAATCCGTCCAGTTTCACCGTGCTACCTGCCAGGTGCGTACCGGGTAGCTTTCCACCAGCCGCTTTTCCGCAGGTAGGGATATTCCGCCAGAACCTCATCCAGAATATCCACCACGGTTTATTCTAGATGCTGGTAGTCGCTGGTATGGGTCAGCAGTTTTAGCCACGGCTCCATACGCTCATGGTAACGAGCACGCCCTTCATGGCCTACCCGCGCCACCGTGACCAGCCCGCTGATCTGTCGTTTACCGCCACCATCCAGTTCGATGTTGACGCACAGATCTTTGTCTTACCATCGGTTTGAGCGGCAGGTTGGCCACGCCGGAAGGCGCAGCCAGATTCAGGTGTCAGGCGTTTTAGCTGTACCACATCGCTGAACAGTTCATCTTAATGTTTCGCCGCCGTCCGAATGGCGAACAATGTCGCCCGGCGGTACGGCGCTGCCGTAACCGTCACATTATGCGCCCCCGGCCCGCCTGCAACAGCAGCAGAACGATAGCCGATACCGTATCCCTGCGCCATACCTGCGGCGGCGGCAGCCTGCCGCTGCTACCGCGGCTTTCACCAGTCCTGATAAACCGACAATGCGCCAAATAGCGATCCGCAGCCTGTTCAGCGGCTTCCGCCGCATCTGCCTTCCGGCGACTGAATCTGCGGTGACACCGCCACTTTCTCCGGTTATCGGCGGCGTGGTGTCGTCGGCCCTCTTCATACCGGATTTATTATTTGTGGATACAAAACTCATGGTAGTGACTCCTTTAATACCGGATTAAATAAATAGACTCACCGTCAGGAAACGGTGCCCACTTTATAGTAAACCAGCCTGTTATATTAAACTGTTTACCAGAAGAACGGGGTGTATTGCTGACAAGAGCCGTAATAAAACATAAAAATTAGCACATAACATAAATACTGAATTACATAAATAAAATGAATATAACAATATTATGATTTTATTTTTATTACGCTTACCAGATTCACATTGCATTTTATGAAATTTGATTAATAACCAGCACACCGCCAGTCATATACTTTATCATCATTATATTACTTATCTGGCCACCATTGTAATCATTGTCAACCACAGCGCGTTCAGACACGATAATTTGAGTCCGGTTTTAACGGTATTGTTTTGTTCACCCATTCGTTTGCTGACCAGAAACATACCAACATTAACCGTAAATGTCTTTTCACCATTTCTTTTTCCGTCTCCTGTTTCAGAATAGGTTTCGGTTACTTCAATACAACCATGGCTGCCTTTGGTGGAATCATGGATATAAAAACCACCCCGCACTCTACAGGCTTTTCGGGTCTTCTCATCCGCTTCAGCCGGATACGGTATATCCCCAGTTCGCCCAGTATATCTCGCATGTTCCGGCAACCTGGCCTCGCGGAAT
GGTACTCCAGCCCCATGATGATTCCCAGATCACAATCCGCAGCATTTCTTATCGGTGCCTCCCCTGAAACTGAATAAACAATTTATAATTCCCTTCCGGTATCGCCGCATCCGGCACACACTGTTTATCTGGCCTGTAAATCCGGTAACCCCGTGAATTATAGCCGTATCGATGAGCCATGTCACTTTTGCCCATCCGAGTCAGTCTTACTGCCATCATCTTCTCCCTGACTATTCTGTGTATTACAGCGGTGGTGCAACGTATCATTAGCGCCAGCTTCTGATCGGCCAGACTGGTTATCCACCGATAATACGAAGCTGAAAGGATAACATTCCTGCATCTTCGCCAGTGAGATATTCCCCTCCTGAAGAATATATGCACTTCCGCGTCGTAACCCTCAACATCCTGCGCCAGTGAGCTGGTATGTGATTGTGCTGAAACAGCCACATTCACCAGTGCTTTTTCAGTGTATTACGTTGTTTTCAAATGGTAAGGTTATCCATTTCCTGACGATGAACGGGATGGATATCCACACACTCCGCCAACGGGTTATCAATGTTTACATCTGTTTCCGCCCAGGCAAATACGGGCAGGAACATTCCTGCACAGGCAGACGCCGTAGAACCTCTACAGCGTACTTTCACAGCATCTGACCTCTATTTTTCATCCACCAGCGCAGCGCATCCACGCCCACCGCCCTTCGCTGTTCTGCTGCCCAGAACCAGCCTTGCCCTGGTGTTGCCTGCCCCAGCGTCTTACCCGGCCAGCCCGGCCACGGCGAATAACGCACGCCCGGCAGTTCACCGGTGCTTCATCCGGCAGCGGTCAGTTCCCGGATGCGCTACGTGTACGTCGTAACCCACACCAGCCACTGCACCCTGCAACCCAGCGCCATCGCCTCGCTGGTCCACCTGCTCCGTCAGCCGCCCGCCGCATCGTCGGCGTGATATGCTGCCGCCATCCTCCTCCAGCCAGTACAGCCAGCCCTGCGTCTCCTGCAAACTCTCGATCGCCCTCCCGACCGCAACGCTGGTAAATCACCAGCGGCCACGGGCATCCCCGCACTTATCCCGATGACGCCATCAGCACTCCGGTCAGCCAGTCAGTCGTCCCACCCATCGGCAAGTCCCGGCGACATCTACAAAGCTCCACGGCAGCGGATGGTGTTCCGACGCGGGCATATCATGTTCGCCACGCGGCGACAGGTACATCAGCCGTCACGCTCTTCATCGGTCGTCTGGCCTTCCGGCAACGATATGCCGCCCGCTGCCCACCCATGTCTGGCGGTTCCAGCCAGTCCTGCCATATTCGTCCCTCTTCAGCGCCCACCGCCAGCGCACATAGTCGCTCTGGTCCGGCAGCCTGCCGAACATCGCCGGAACA ++ +,($%$$$#&$#"##-(&$**)%%,+')0*+%"###&,,1797,$,2965,8:9:7&)78;885366.*+767;9B<:996,&..0/010004:.1.92::/8;;-+,,55(')0@A=<=BBB?:9./,&(&',-1675%5:9.#%#$$)(()**-)-0:9;7<758)638;:;;955(*+:C?9;:$$$&%&3,,/1,658-25735:?IC<956-*1:609-6))9,*436:6-+;53,+/20/'%%,-*633///.&&.<70$$$"$$%$$2.++-1473:6+00%%"##**43++(&.+04?>6::21146:=A98865/0+&-&4507;;?/).3-$,%%$565.+--.1/1.0*'$%,<:-005;52/'$"$%#--98774.%$$$$*..)&$#$&#"%""#&%(19:8::8<663&&(#)/+*+)#&+$&*'##$%%&#)$#&--4-+,+**2855+(-*10',++*)'2##()%%%825--$*00,2:;;5630($$$$*$##%&%$#1.59@>><67*%1--..%'+,554:;;74)3.,-//'($$$$&$$%$###&)9350.5.*))/03089993'058.81257A::9/.5:;<:<6?>9948701+?<.++2+%%$&/)).(&333;=339<8435;058:8&556;=A>???<-4/11)%"$#)./-3:85/0*69?:9?6;;5:79(%%%-*(++)(+,'&0125&59<=>686,++++-/&2231&%0/6731996042'$)*+30+*%+6767764.)#$($(++9:5-0:;>703(8)8<4-..+)7486775564.*%*($)+-&47:26.;=74+.20**+0,.0,*&&$%('$.$$+2(%)+,00#3689;98539<3,2:6&&'289>?611276$$&/3:9>5,+*(*('&$&+,+:;9')484:89:111)$5)%'/012;::65'#%%,&&121+*$'&'&&6711293=,*&'&)0379<<;7>?><;<9:.$$%.%2168;8:6/024@=789.(%/50386.+,*37803,)22&*::D:=B?@@:,;::;A89+./23)66+:;71+'(()))6.*1:::)8:867/2/576*%&&&&&,*$%#.%&(+$#(''(%&*/0:<;:@<804/'06:.79<85567:<;$042398<+984'$#&4,0/%&/&045;:01165332549''-)6+$%*.2.$%266:00*&''(((7+)(2(-863$#&.56:5()&+7>;9,1+.+*)488210045:677<:=47,BA<1402221%/110--94<;;,+70++/222/6?44384=;869:>=<<9:=A<:/'6*+$%15355.0)$#%%('&$1#*((,':?AA870*&-.:<=;92653$#)3%%0#%)/33893%$#&370$578989997-A:420''%'/:);1//7-653/23,::A543'($''$$((>=>A;15?=<:792(16948=:$**,$$#<>743@>;;6775,587/89342&%)%%.//6$9//<;.:?BC3>4D75;8*%.4<>8??7E>EDBC=7+-,+25466&+4522/++889?==73401696+%(/.5*,4814.526135514@?45==>7),)$$<;=>?7<863802189()43589:?6:34..0/*$+-1,6&',011/4:9'/1>539(%70+,$330,-*9;?611>A>=EA@.--..3(8+<0&*1371206*56899.31379:6/42('$$$+6/*4/102+'-9699547:%0:<003<8=>284131)&+68/?<=<='02*.$(-&77:5:A+79;:4+46;:'&*$%&)/02<88<9?2@>64KHG@<6#%$($&$*4979:::73:@A@768>=<5+:;9++%'/19<=133)+*9>@9,13/4)*'101750$%*:<988:;-405*,52)###/$776/$$$$+)/876/()$%%',1&&$&**%$&%*))*6<<::@<6824:3,.@?=-+6>=&5:8;:=@A<(B?>,3=??879==?>?C?621)..*2;7:>?7BCA??:@;;7915&%&()*(32---"""##$%)759,.59;;=:99+%%0?==3+''&'4:9(0'&%$$$$&/3423=;E=&/5:786&1689B@90%'&&$*&#)'(%'"##$*))-.056'%%''(CB
:5=@@;5679;:68:9&&&(+(*)''.512%,&*997'($$'&#$$&11,++++#&(),/3<=10225/,'(7?HFD-%%$%$*22434+*),%&*&++)(*&)8=#5:<5651)*59>>@>;/454<:;;77<)035+%,.D9;::;=:686.,+(/?:AB565326/259:38;B9>;65;<>;C:?:<<:67623:42%)467?9&(/+$'+%%'$''()/;2;<@578%---*',,+-%,6574.)'*7*03)*'./.3:..69,'7(68&5#-/1.5-.-'%//0451&&&+54$($'(')#(2''''$*+,,232&1*,29;>/740/$(*-&+/)(%#($%,1,/5<9,-7-0-$.+..&,&$*3129=7<+&*$,,&,&'$%'::67:E<>1)-7,&**+'**'#&(--..%).&&/85.&*(59@-:)&/9%%(5<6<1/=;<>47;=3/2?@40151..((1*--01268<=9>A=.&&(6;8=8BADA94%%(@>?=3//2;21:647;5:9,9:;988:<@2+/+-%#&)+*0+*35))*-.#&'##$+366;<;9<,56<>A83><>@==?=<<<121**,,'$%+.=4/;<;48563004237?@A876045;<5=C@<:8;===;(5=;;<;+0..3#&(##(0(4%(&'*/3>58.2?5840/%&&&)'&$&1*),,+&);=14<<<89470,,$%%(0000&&%$%&(,/235+*36;>):8<>BA;88-272$$%'546<<0426&(&*,-./4=456+.;@92124(1368>=6'+'*-;>@A;;6377.74581<<<>642>-'669;6=>6;<=94$$)**+++35.-2CC:94,''>>=:;C>=<252,'3*/24B;;0*../'&'4788I?5('('123(6529;:77=83;@<;;1::@5456-+2,###)#$$"$%$,&+0.*1%)-9;<9086,;;?,,,.73./86<@FB>8A/,7574:=><;?<5<<;8;81//+,46+12&--/21;>::,/239::<=,222532=;54&%%).097:68957:B:;?B:<<;73-.%561*21/)*19?;@7(%%)215:9677;=C565%7;:&.0:/29<987::)1/.2.#)1-%%)+()46786/2342$(*+5<:;AE55:;8848./1151-$$/32*+2%,%(1,79,4-,/)%&'+2%%%2369:8.)0'0:2=DDE=:332820##$#67==;998,64<5;:7.08:'8>?1;@>A;0-*)&$$')&/>;;::>),-12041876692>>?>;97878:;<<5+,'(''**+-0/24%00((-876538?965+('$$%%%&&12'''&'/2342337./..:427<:EE?20-,+5795(/.67:<9/19=>B:8=A>7:7&'&%%$%&%%#48/**''''$$&+/)$(,*#$+007177441',,&1,.880034468?<86'.00+,,//-39568557694')&/23./;?:7AB?3<>=01/)#'')5;44,,558<;<*.-.527'&*,*,91334709B:629:,/0;:<5=8=<98/54700(%+&-)533.1+1$$(0-556+&$$))&()#"#%$%21''$$$#),,&'$#$'-5.)/,*&'&4039:-%//386995468,.3.110$)-,'+(48.60'(.1:?=:54('+&22)1368<612&%11,((('),(,0.14589672;66935310/295+)($#/)$0/$+IA985=:,./.34>4;47/647/(1483//2;712,)'%$./,-%$$#""%)*433,#+3777D?/2)%2347B4,,(&$0777-)0176794$##&&&$+2;9+-.*&&'$1>@=46858874*+,*+'.,,,+032:<=>@:::912:67933($$&&2$'3699'.0/:?=>84::/5'(879930&)')455547:4+*++'():5=;&88;B=?8:6:;83*12203578)+-$#2/6;;:242-*+301154*+2/416;>968148<5<;ADCC+9:;?><:71--)))'*)#)3571724>4;4122><70;8=?@98=?:?9+0&$$" diff --git a/test/data/basecaller_output/sequencing_summary.txt b/test/data/basecaller_output/sequencing_summary.txt new file mode 100644 index 0000000..0c11d37 --- /dev/null +++ b/test/data/basecaller_output/sequencing_summary.txt @@ -0,0 +1,6 @@ +filename read_id run_id channel start_time duration num_events passes_filtering template_start num_events_template template_duration sequence_length_template mean_qscore_template strand_score_template median_template mad_template +FAK40126_2fd3b110ca0d020049836a61f0dfb2b9983808f9_0.fast5 db6b45aa-5d21-45cf-a435-05fb8f12e839 9a076f39fd3254aeacc15a915c736105296275f3 69 56602.257812 8.547000 17094 TRUE 56602.296875 17015 8.507500 3119 10.648611 0.000000 78.712509 9.594757 +FAK40126_2fd3b110ca0d020049836a61f0dfb2b9983808f9_0.fast5 0f776a08-1101-41d4-8097-89136494a46e 9a076f39fd3254aeacc15a915c736105296275f3 9 44004.265625 7.001250 14002 TRUE 44004.386719 13751 6.875750 2828 8.749918 0.000000 76.047302 9.594749 +FAK40126_2fd3b110ca0d020049836a61f0dfb2b9983808f9_0.fast5 de1508c4-755b-489e-9ffb-51af35c9a7e6 9a076f39fd3254aeacc15a915c736105296275f3 248 80274.460938 4.759750 9519 TRUE 80274.507812 9437 4.718750 2085 10.235295 0.000000 74.981216 8.884033 +FAK40126_2fd3b110ca0d020049836a61f0dfb2b9983808f9_0.fast5 b7096acd-b528-474e-a863-51295d18d3de 9a076f39fd3254aeacc15a915c736105296275f3 244 7037.082031 6.587000 13174 FALSE 7037.144043 13050 6.525000 2929 5.839684 0.000000 82.799164 8.884033 
+FAK40126_2fd3b110ca0d020049836a61f0dfb2b9983808f9_0.fast5 1f1a0f33-e2ac-431a-8f48-c3c687a7a7dc 9a076f39fd3254aeacc15a915c736105296275f3 173 989.189270 9.118500 18237 TRUE 989.277771 18060 9.030000 4824 9.933429 0.000000 79.600914 8.884033 diff --git a/test/data/multireads/FAK40126_2fd3b110ca0d020049836a61f0dfb2b9983808f9_0.fast5 b/test/data/multireads/FAK40126_2fd3b110ca0d020049836a61f0dfb2b9983808f9_0.fast5 new file mode 100644 index 0000000..982c2f7 Binary files /dev/null and b/test/data/multireads/FAK40126_2fd3b110ca0d020049836a61f0dfb2b9983808f9_0.fast5 differ diff --git a/test/data/readparams.tsv b/test/data/readparams.tsv new file mode 100755 index 0000000..745d7a6 --- /dev/null +++ b/test/data/readparams.tsv @@ -0,0 +1,6 @@ +UUID trim_start trim_end shift scale +db6b45aa-5d21-45cf-a435-05fb8f12e839 200 50 78.71251094341278 14.225180837774277 +de1508c4-755b-489e-9ffb-51af35c9a7e6 200 50 74.98121809959412 13.171463738679885 +1f1a0f33-e2ac-431a-8f48-c3c687a7a7dc 200 50 79.42323338985443 13.171463738679885 +b7096acd-b528-474e-a863-51295d18d3de 200 50 82.62148439884186 13.171463738679885 +0f776a08-1101-41d4-8097-89136494a46e 200 50 76.40266299247742 13.961751563000679 diff --git a/test/data/reads/0f776a08-1101-41d4-8097-89136494a46e.fast5 b/test/data/reads/0f776a08-1101-41d4-8097-89136494a46e.fast5 new file mode 100644 index 0000000..0f48391 Binary files /dev/null and b/test/data/reads/0f776a08-1101-41d4-8097-89136494a46e.fast5 differ diff --git a/test/data/reads/1f1a0f33-e2ac-431a-8f48-c3c687a7a7dc.fast5 b/test/data/reads/1f1a0f33-e2ac-431a-8f48-c3c687a7a7dc.fast5 new file mode 100644 index 0000000..ac5debb Binary files /dev/null and b/test/data/reads/1f1a0f33-e2ac-431a-8f48-c3c687a7a7dc.fast5 differ diff --git a/test/data/reads/b7096acd-b528-474e-a863-51295d18d3de.fast5 b/test/data/reads/b7096acd-b528-474e-a863-51295d18d3de.fast5 new file mode 100644 index 0000000..61b0b7d Binary files /dev/null and b/test/data/reads/b7096acd-b528-474e-a863-51295d18d3de.fast5 differ diff --git a/test/data/reads/db6b45aa-5d21-45cf-a435-05fb8f12e839.fast5 b/test/data/reads/db6b45aa-5d21-45cf-a435-05fb8f12e839.fast5 new file mode 100644 index 0000000..10e33f8 Binary files /dev/null and b/test/data/reads/db6b45aa-5d21-45cf-a435-05fb8f12e839.fast5 differ diff --git a/test/data/reads/de1508c4-755b-489e-9ffb-51af35c9a7e6.fast5 b/test/data/reads/de1508c4-755b-489e-9ffb-51af35c9a7e6.fast5 new file mode 100644 index 0000000..1809f84 Binary files /dev/null and b/test/data/reads/de1508c4-755b-489e-9ffb-51af35c9a7e6.fast5 differ diff --git a/test/data/strand_lists/strand_list.txt b/test/data/strand_lists/strand_list.txt new file mode 100644 index 0000000..fbda8c9 --- /dev/null +++ b/test/data/strand_lists/strand_list.txt @@ -0,0 +1,10 @@ +filename read_id +FAK40126_2fd3b110ca0d020049836a61f0dfb2b9983808f9_0.fast5 db6b45aa-5d21-45cf-a435-05fb8f12e839 +FAK40126_2fd3b110ca0d020049836a61f0dfb2b9983808f9_0.fast5 b4912dfe-755b-489e-a6c4-fcfaa4510096 +FAK40126_2fd3b110ca0d020049836a61f0dfb2b9983808f9_0.fast5 0f776a08-1101-41d4-8097-89136494a46e +FAK40126_2fd3b110ca0d020049836a61f0dfb2b9983808f9_0.fast5 de1508c4-755b-489e-9ffb-51af35c9a7e6 +FAK40126_2fd3b110ca0d020049836a61f0dfb2b9983808f9_0.fast5 b7096acd-b528-474e-a863-51295d18d3de +FAK40126_2fd3b110ca0d020049836a61f0dfb2b9983808f9_0.fast5 1f1a0f33-e2ac-431a-8f48-c3c687a7a7dc +FAK39447_9a076f39fd3254aeacc15a915c736105296275f3_1.fast5 32426499-1259-4187-a092-95fd065331f1 +FAK39447_9a076f39fd3254aeacc15a915c736105296275f3_27.fast5 13645279-95e6-4a5c-a6c4-fcf03f9269f4 
+FAK39447_9a076f39fd3254aeacc15a915c736105296275f3_1.fast5 80f53642-e86a-4815-90d7-759d431cd0f2 diff --git a/test/data/strand_lists/strand_list_no_filename.txt b/test/data/strand_lists/strand_list_no_filename.txt new file mode 100644 index 0000000..8df0d61 --- /dev/null +++ b/test/data/strand_lists/strand_list_no_filename.txt @@ -0,0 +1,9 @@ +read_id +db6b45aa-5d21-45cf-a435-05fb8f12e839 +0f776a08-1101-41d4-8097-89136494a46e +de1508c4-755b-489e-9ffb-51af35c9a7e6 +b7096acd-b528-474e-a863-51295d18d3de +1f1a0f33-e2ac-431a-8f48-c3c687a7a7dc +32426499-1259-4187-a092-95fd065331f1 +13645279-95e6-4a5c-a6c4-fcf03f9269f4 +80f53642-e86a-4815-90d7-759d431cd0f2 diff --git a/test/data/strand_lists/strand_list_no_read_id.txt b/test/data/strand_lists/strand_list_no_read_id.txt new file mode 100644 index 0000000..38b9311 --- /dev/null +++ b/test/data/strand_lists/strand_list_no_read_id.txt @@ -0,0 +1,5 @@ +filename +FAK40126_2fd3b110ca0d020049836a61f0dfb2b9983808f9_0.fast5 +FAK39447_9a076f39fd3254aeacc15a915c736105296275f3_1.fast5 +FAK39447_9a076f39fd3254aeacc15a915c736105296275f3_27.fast5 +FAK39447_9a076f39fd3254aeacc15a915c736105296275f3_1.fast5 diff --git a/test/data/strand_lists/strand_list_single.txt b/test/data/strand_lists/strand_list_single.txt new file mode 100644 index 0000000..920ac4c --- /dev/null +++ b/test/data/strand_lists/strand_list_single.txt @@ -0,0 +1,10 @@ +filename read_id +db6b45aa-5d21-45cf-a435-05fb8f12e839.fast5 db6b45aa-5d21-45cf-a435-05fb8f12e839 +b4912dfe-755b-489e-a6c4-fcfaa4510096.fast5 b4912dfe-755b-489e-a6c4-fcfaa4510096 +0f776a08-1101-41d4-8097-89136494a46e.fast5 0f776a08-1101-41d4-8097-89136494a46e +de1508c4-755b-489e-9ffb-51af35c9a7e6.fast5 de1508c4-755b-489e-9ffb-51af35c9a7e6 +b7096acd-b528-474e-a863-51295d18d3de.fast5 b7096acd-b528-474e-a863-51295d18d3de +1f1a0f33-e2ac-431a-8f48-c3c687a7a7dc.fast5 1f1a0f33-e2ac-431a-8f48-c3c687a7a7dc +32426499-1259-4187-a092-95fd065331f1.fast5 32426499-1259-4187-a092-95fd065331f1 +13645279-95e6-4a5c-a6c4-fcf03f9269f4.fast5 13645279-95e6-4a5c-a6c4-fcf03f9269f4 +80f53642-e86a-4815-90d7-759d431cd0f2.fast5 80f53642-e86a-4815-90d7-759d431cd0f2 diff --git a/test/unit/__init__.py b/test/unit/__init__.py new file mode 100644 index 0000000..07c2c56 --- /dev/null +++ b/test/unit/__init__.py @@ -0,0 +1,4 @@ +import os + +THIS_DIR = os.path.dirname(os.path.abspath(__file__)) +DATA_DIR = os.path.join(THIS_DIR, "data") diff --git a/test/unit/data b/test/unit/data new file mode 120000 index 0000000..b30610e --- /dev/null +++ b/test/unit/data @@ -0,0 +1 @@ +../../data/unit \ No newline at end of file diff --git a/test/unit/test_cmdargs.py b/test/unit/test_cmdargs.py new file mode 100644 index 0000000..1fff46b --- /dev/null +++ b/test/unit/test_cmdargs.py @@ -0,0 +1,84 @@ +"""Tests for cmdards module""" +import argparse +import sys +import unittest +from taiyaki import cmdargs + + +class CmdArgsTest(unittest.TestCase): + + @classmethod + def setUpClass(self): + self.EPS = sys.float_info.epsilon + + def test_positive_valid_float_values(self): + f = cmdargs.Positive(float) + for x in [1e-30, self.EPS, 1e-5, 1.0, 1e5, 1e30]: + self.assertAlmostEqual(x, f(x)) + + def test_positive_invalid_float_values(self): + f = cmdargs.Positive(float) + for x in [-1.0, -self.EPS, -1e-5, 0.0]: + with self.assertRaises(argparse.ArgumentTypeError): + f(x) + + def test_positive_valid_int_values(self): + f = cmdargs.Positive(int) + for x in [1, 10, 10000]: + self.assertAlmostEqual(x, f(x)) + + def test_positive_invalid_int_values(self): + f = cmdargs.Positive(int) + for x in 
[-1, 0]: + with self.assertRaises(argparse.ArgumentTypeError): + f(x) + + def test_nonnegative_valid_float_values(self): + f = cmdargs.NonNegative(float) + for x in [1e-30, self.EPS, 1e-5, 0.0, 1.0, 1e5, 1e30]: + self.assertAlmostEqual(x, f(x)) + + def test_nonnegative_invalid_float_values(self): + f = cmdargs.NonNegative(float) + for x in [-1.0, -self.EPS, -1e-5]: + with self.assertRaises(argparse.ArgumentTypeError): + f(x) + + def test_nonegative_valid_int_values(self): + f = cmdargs.NonNegative(int) + for x in [0, 1, 10, 10000]: + self.assertAlmostEqual(x, f(x)) + + def test_nonegative_invalid_int_values(self): + f = cmdargs.NonNegative(int) + for x in [-1, -10]: + with self.assertRaises(argparse.ArgumentTypeError): + f(x) + + def test_proportion_valid_float_values(self): + f = cmdargs.proportion + for x in [1e-30, self.EPS, 1e-5, 0.0, 1.0, 1.0 - 1e-5, 1.0 - self.EPS, 1.0 - 1e-30]: + self.assertAlmostEqual(x, f(x)) + + def test_proportion_invalid_float_values(self): + f = cmdargs.proportion + for x in [-1e-30, -self.EPS, -1e-5, 1.0 + 1e-5, 1.0 + self.EPS]: + with self.assertRaises(argparse.ArgumentTypeError): + f(x) + + def test_bounded_valid_int_values(self): + f = cmdargs.Bounded(int, 0, 10) + for x in range(0, 11): + self.assertEqual(x, f(x)) + + def test_bounded_invalid_int_values(self): + f = cmdargs.Bounded(int, 0, 10) + for x in [-2, -1, 11, 12]: + with self.assertRaises(argparse.ArgumentTypeError): + f(x) + + def test_device_action_conversions(self): + parser = argparse.ArgumentParser() + parser.add_argument('device', action=cmdargs.DeviceAction) + self.assertEqual(2, parser.parse_args(['2']).device) + self.assertEqual(2, parser.parse_args(['cuda2']).device) diff --git a/test/unit/test_flipflop_remap.py b/test/unit/test_flipflop_remap.py new file mode 100644 index 0000000..b5b7b7f --- /dev/null +++ b/test/unit/test_flipflop_remap.py @@ -0,0 +1,80 @@ +import numpy as np +import unittest + +from taiyaki import flipflop_remap + + +class TestFlipFlopMapping(unittest.TestCase): + + def test_flipflop_mapping(self): + """Test that global flipflop remapping works as expected + + Sequence is AABA from an alphabet {A, B} + + Transition scores from 6 time points are used + + The best path is AaaBBAA where upper-case is a flip and + lower-case is a flop + + All transition scores are set to 1 (on best path) and 0 + otherwise so the score for the best path should be exactly 6. 
+ """ + sequence = 'AABA' + alphabet = 'AB' + log_transitions = np.array([ + [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], # Aa step + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], # aa stay + [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0], # aB step + [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0], # BB stay + [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # BA step + [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # AA stay + ], dtype='f4') + score, path = flipflop_remap.flipflop_remap(log_transitions, sequence, alphabet=alphabet, localpen=-0.5) + self.assertEqual(score, 6.0) + self.assertEqual(path.tolist(), [0, 1, 1, 2, 2, 3, 3]) + + # Check we get the same with the lower-level interface + step_index = [8, 6, 1] + step_score = log_transitions[:, step_index] + stay_index = [0, 10, 5, 0] + stay_score = log_transitions[:, stay_index] + score2, path2 = flipflop_remap.map_to_crf_viterbi(log_transitions, step_index, stay_index, localpen=-0.5) + self.assertEqual(score, score2) + self.assertEqual(path.tolist(), path2.tolist()) + + def test_flipflop_mapping_glocal(self): + """Test the glocal flipflop remapping works as expected + + Sequence is BA from an alphabet {A, B} + + Transition scores from 5 time points are used + + The best path is --BA- where upper-case is a flip and + lower-case is a flop, and - represents parts that should be + clipped by the local mapping + + All transition scores are set to 1 (on best path) and 0 + otherwise. Scores in the local state are set to 0.5, so the + best path should have a score of 3.5. + """ + sequence = 'BA' + alphabet = 'AB' + log_transitions = np.array([ + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # clip + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # clip + [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0], # BB stay + [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # BA step + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # clip + ], dtype='f4') + score, path = flipflop_remap.flipflop_remap(log_transitions, sequence, alphabet=alphabet, localpen=-0.5) + self.assertEqual(score, 3.5) + self.assertEqual(path.tolist(), [-1, -1, 0, 0, 1, -1]) + + # Check we get the same with the lower-level interface + step_index = [1] + step_score = log_transitions[:, step_index] + stay_index = [5, 0] + stay_score = log_transitions[:, stay_index] + score2, path2 = flipflop_remap.map_to_crf_viterbi(log_transitions, step_index, stay_index, localpen=-0.5) + self.assertEqual(score, score2) + self.assertEqual(path.tolist(), path2.tolist()) diff --git a/test/unit/test_iterate_fast5_reads.py b/test/unit/test_iterate_fast5_reads.py new file mode 100644 index 0000000..1bdc4a7 --- /dev/null +++ b/test/unit/test_iterate_fast5_reads.py @@ -0,0 +1,52 @@ +import os +import unittest + +from . 
import THIS_DIR +from taiyaki.fast5utils import iterate_fast5_reads + + +class TestStrandList(unittest.TestCase): + READ_DIR = os.path.join(THIS_DIR, "../data/reads") + MULTIREAD_DIR = os.path.join(THIS_DIR, "../data/multireads") + EXPECTED_READ_IDS = [ + '0f776a08-1101-41d4-8097-89136494a46e', + '1f1a0f33-e2ac-431a-8f48-c3c687a7a7dc', + 'b7096acd-b528-474e-a863-51295d18d3de', + 'db6b45aa-5d21-45cf-a435-05fb8f12e839', + 'de1508c4-755b-489e-9ffb-51af35c9a7e6', + ] + STRAND_LIST_DIR = os.path.join(THIS_DIR, "../data/strand_lists") + SEQUENCING_SUMMARY = os.path.join(THIS_DIR, "../data/basecaller_output/sequencing_summary.txt") + + def _check_found_read_ids(self, found_reads): + found_read_ids = sorted([rid for _, rid in found_reads]) + self.assertEqual(found_read_ids, self.EXPECTED_READ_IDS) + + def test_no_strand_list_multiread(self): + self._check_found_read_ids(iterate_fast5_reads(self.MULTIREAD_DIR)) + + def test_no_strand_list_single_reads(self): + self._check_found_read_ids(iterate_fast5_reads(self.READ_DIR)) + + def test_sequencing_summary_multiread(self): + self._check_found_read_ids(iterate_fast5_reads(self.MULTIREAD_DIR, strand_list=self.SEQUENCING_SUMMARY)) + + def test_strand_list_single_reads(self): + strand_list = os.path.join(self.STRAND_LIST_DIR, "strand_list_single.txt") + self._check_found_read_ids(iterate_fast5_reads(self.READ_DIR, strand_list=strand_list)) + + def test_strand_list_multiread(self): + strand_list = os.path.join(self.STRAND_LIST_DIR, "strand_list.txt") + self._check_found_read_ids(iterate_fast5_reads(self.MULTIREAD_DIR, strand_list=strand_list)) + + def test_strand_list_no_filename_multiread(self): + strand_list = os.path.join(self.STRAND_LIST_DIR, "strand_list_no_filename.txt") + self._check_found_read_ids(iterate_fast5_reads(self.MULTIREAD_DIR, strand_list=strand_list)) + + def test_strand_list_no_filename_single_reads(self): + strand_list = os.path.join(self.STRAND_LIST_DIR, "strand_list_no_filename.txt") + self._check_found_read_ids(iterate_fast5_reads(self.READ_DIR, strand_list=strand_list)) + + def test_strand_list_no_read_id_multiread(self): + strand_list = os.path.join(self.STRAND_LIST_DIR, "strand_list_no_read_id.txt") + self._check_found_read_ids(iterate_fast5_reads(self.MULTIREAD_DIR, strand_list=strand_list)) diff --git a/test/unit/test_iterators.py b/test/unit/test_iterators.py new file mode 100644 index 0000000..b7a1da3 --- /dev/null +++ b/test/unit/test_iterators.py @@ -0,0 +1,32 @@ +import unittest +import numpy as np + +from taiyaki import iterators + + +class IteratorsTest(unittest.TestCase): + + @classmethod + def setUpClass(self): + pass + + def f(self, L, x): + return list(iterators.centered_truncated_window(L, x)) + + def test_centered_truncated_window_even(self): + def f(L, x): + return list(iterators.centered_truncated_window(L, x)) + L = [1, 2, 3, 4] + self.assertRaises(AssertionError, self.f, L, 0) + self.assertEqual(self.f(L, 1), [(1,), (2,), (3,), (4,)]) + self.assertEqual(self.f(L, 2), [(1, 2), (2, 3), (3, 4), (4,)]) + self.assertEqual(self.f(L, 3), [(1, 2), (1, 2, 3), (2, 3, 4), (3, 4)]) + self.assertEqual(self.f(L, 4), [(1, 2, 3), (1, 2, 3, 4), (2, 3, 4), (3, 4)]) + + def test_centered_truncated_window_odd(self): + L = [1, 2, 3, 4, 5, 6, 7] + self.assertEqual(self.f(L, 6), [(1, 2, 3, 4), (1, 2, 3, 4, 5), (1, 2, 3, 4, 5, 6), + (2, 3, 4, 5, 6, 7), (3, 4, 5, 6, 7), (4, 5, 6, 7), (5, 6, 7)]) + +if __name__ == '__main__': + unittest.main() diff --git a/test/unit/test_layers.py b/test/unit/test_layers.py new file mode 100644 index 
0000000..3979608 --- /dev/null +++ b/test/unit/test_layers.py @@ -0,0 +1,288 @@ +import abc +import pickle +import json +import numpy as np +import tempfile +import torch +import unittest + +from taiyaki import activation +from taiyaki.config import taiyaki_dtype, torch_dtype, numpy_dtype +from taiyaki.json import JsonEncoder +import taiyaki.layers as nn + + +def rvs(dim): + ''' + Draw random samples from SO(N) + + Taken from + + http://stackoverflow.com/questions/38426349/how-to-create-random-orthonormal-matrix-in-python-numpy + ''' + random_state = np.random + H = np.eye(dim) + D = np.ones((dim,)) + for n in range(1, dim): + x = random_state.normal(size=(dim - n + 1,)) + D[n - 1] = np.sign(x[0]) + x[0] -= D[n - 1] * np.sqrt((x * x).sum()) + # Householder transformation + Hx = (np.eye(dim - n + 1) - 2. * np.outer(x, x) / (x * x).sum()) + mat = np.eye(dim) + mat[n - 1:, n - 1:] = Hx + H = np.dot(H, mat) + # Fix the last sign such that the determinant is 1 + D[-1] = (-1) ** (1 - (dim % 2)) * D.prod() + # Equivalent to np.dot(np.diag(D), H) but faster, apparently + H = (D * H.T).T + return H + + +class ANNTest(unittest.TestCase): + + @classmethod + def setUpClass(self): + np.random.seed(0xdeadbeef) + self._NSTEP = 25 + self._NFEATURES = 3 + self._SIZE = 64 + self._NBATCH = 2 + + self.W = np.random.normal(size=(self._SIZE, self._NFEATURES)).astype(numpy_dtype) + self.b = np.random.normal(size=self._SIZE).astype(numpy_dtype) + self.x = np.random.normal(size=(self._NSTEP, self._NBATCH, self._NFEATURES)).astype(numpy_dtype) + self.res = self.x.dot(self.W.transpose()) + self.b + + def test_000_single_layer_linear(self): + network = nn.FeedForward(self._NFEATURES, self._SIZE, has_bias=True, + fun=activation.linear) + nn.init_(network.linear.weight, self.W) + nn.init_(network.linear.bias, self.b) + with torch.no_grad(): + y = network(torch.tensor(self.x)).numpy() + np.testing.assert_almost_equal(y, self.res, decimal=5) + + def test_001_single_layer_tanh(self): + network = nn.FeedForward(self._NFEATURES, self._SIZE, has_bias=True, + fun=activation.tanh) + nn.init_(network.linear.weight, self.W) + nn.init_(network.linear.bias, self.b) + with torch.no_grad(): + y = network(torch.tensor(self.x)).numpy() + np.testing.assert_almost_equal(y, np.tanh(self.res), decimal=5) + + def test_002_parallel_layers(self): + l1 = nn.FeedForward(self._NFEATURES, self._SIZE, has_bias=True) + nn.init_(l1.linear.weight, self.W) + nn.init_(l1.linear.bias, self.b) + l2 = nn.FeedForward(self._NFEATURES, self._SIZE, has_bias=True) + nn.init_(l2.linear.weight, self.W) + nn.init_(l2.linear.bias, self.b) + network = nn.Parallel([l1, l2]) + + with torch.no_grad(): + res = network(torch.tensor(self.x)).numpy() + np.testing.assert_almost_equal(res[:, :, :self._SIZE], res[:, :, self._SIZE:]) + + def test_003_simple_serial(self): + W2 = np.random.normal(size=(self._SIZE, self._SIZE)).astype(taiyaki_dtype) + res = self.res.dot(W2.transpose()) + + l1 = nn.FeedForward(self._NFEATURES, self._SIZE, has_bias=True, + fun=activation.linear) + nn.init_(l1.linear.weight, self.W) + nn.init_(l1.linear.bias, self.b) + l2 = nn.FeedForward(self._SIZE, self._SIZE, fun=activation.linear, has_bias=False) + nn.init_(l2.linear.weight, W2) + network = nn.Serial([l1, l2]) + + with torch.no_grad(): + y = network(torch.tensor(self.x)).numpy() + np.testing.assert_almost_equal(y, res, decimal=4) + + def test_004_reverse(self): + network1 = nn.FeedForward(self._NFEATURES, self._SIZE, has_bias=True) + nn.init_(network1.linear.weight, self.W) + 
nn.init_(network1.linear.bias, self.b) + network2 = nn.Reverse(network1) + with torch.no_grad(): + res1 = network1(torch.tensor(self.x)).numpy() + res2 = network2(torch.tensor(self.x)).numpy() + + np.testing.assert_almost_equal(res1, res2, decimal=5) + + def test_005_poormans_birnn(self): + layer1 = nn.FeedForward(self._NFEATURES, self._SIZE, has_bias=True) + nn.init_(layer1.linear.weight, self.W) + nn.init_(layer1.linear.bias, self.b) + layer2 = nn.FeedForward(self._NFEATURES, self._SIZE, has_bias=True) + nn.init_(layer2.linear.weight, self.W) + nn.init_(layer2.linear.bias, self.b) + network = nn.birnn(layer1, layer2) + + with torch.no_grad(): + res = network(torch.tensor(self.x)).numpy() + np.testing.assert_almost_equal(res[:, :, :self._SIZE], res[:, :, self._SIZE:], decimal=5) + + def test_006_softmax(self): + network = nn.Softmax(self._NFEATURES, self._SIZE, has_bias=True) + + with torch.no_grad(): + res = network(torch.tensor(self.x)).numpy() + res_sum = np.exp(res).sum(axis=2) + self.assertTrue(np.allclose(res_sum, 1.0)) + + def test_016_window(self): + _WINLEN = 3 + network = nn.Window(_WINLEN) + with torch.no_grad(): + res = network(torch.tensor(self.x)).numpy() + # Window is now 'SAME' not 'VALID'. Trim + wh = _WINLEN // 2 + res = res[wh: -wh] + for j in range(self._NBATCH): + for i in range(_WINLEN - 1): + try: + np.testing.assert_almost_equal( + res[:, j, i * _WINLEN: (i + 1) * _WINLEN], self.x[i: 1 + i - _WINLEN, j]) + except: + win_max = np.amax(np.fabs(res[:, :, i * _WINLEN: (i + 1) * _WINLEN] - self.x[i: 1 + i - _WINLEN])) + print("Window max: {}".format(win_max)) + raise + np.testing.assert_almost_equal(res[:, j, _WINLEN * (_WINLEN - 1):], self.x[_WINLEN - 1:, j]) + # Test first and last rows explicitly + np.testing.assert_almost_equal(self.x[:_WINLEN, j].ravel(), res[0, j].transpose().ravel()) + np.testing.assert_almost_equal(self.x[-_WINLEN:, j].ravel(), res[-1, j].transpose().ravel()) + + @unittest.skip('Decoding needs fixing') + def test_017_decode_simple(self): + _KMERLEN = 3 + network = nn.Decode(_KMERLEN) + f = network.compile() + res = f(self.res) + + def test_018_studentise(self): + network = nn.Studentise() + with torch.no_grad(): + res = network(torch.tensor(self.x)).numpy() + + np.testing.assert_almost_equal(np.mean(res, axis=(0, 1)), 0.0) + np.testing.assert_almost_equal(np.std(res, axis=(0, 1)), 1.0, decimal=4) + + def test_019_identity(self): + network = nn.Identity() + res = network(torch.tensor(self.res)).numpy() + + np.testing.assert_almost_equal(res, self.res) + + +class LayerTest(metaclass=abc.ABCMeta): + """Mixin abstract class for testing basic layer functionality + Writing a TestCase for a new layer is easy, for example: + + class RecurrentTest(LayerTest, unittest.TestCase): + # Inputs for testing the Layer.run() method + _INPUTS = [np.zeros((10, 20, 12)), + np.random.uniform(size=(10, 20, 12)),] + + # The setUp method should instantiate the layer + def setUp(self): + self.layer = nn.Recurrent(12, 64) + """ + + _INPUTS = None # List of input matrices for testing the layer's run method + + @abc.abstractmethod + def setUp(self): + """Create the layer as self.layer""" + return + + def test_000_run(self): + if self._INPUTS is None: + raise NotImplementedError("Please specify layer inputs for testing, or explicitly skip this test.") + f = self.layer.train(False) + outs = [f(torch.tensor(x, dtype=torch_dtype)) for x in self._INPUTS] + + def test_001_train(self): + if self._INPUTS is None: + raise NotImplementedError("Please specify layer inputs for testing, 
or explicitly skip this test.") + f = self.layer.train(True) + outs = [f(torch.tensor(x, dtype=torch_dtype)) for x in self._INPUTS] + + def test_002_json_dumps(self): + js = json.dumps(self.layer.json(), cls=JsonEncoder) + js2 = json.dumps(self.layer.json(params=True), cls=JsonEncoder) + + def test_003_json_decodes(self): + props = json.JSONDecoder().decode(json.dumps(self.layer.json(), cls=JsonEncoder)) + props2 = json.JSONDecoder().decode(json.dumps(self.layer.json(params=True), cls=JsonEncoder)) + + +class LstmTest(LayerTest, unittest.TestCase): + _INPUTS = [np.zeros((10, 20, 12)), + np.random.uniform(size=(10, 20, 12)), ] + + def setUp(self): + self.layer = nn.Lstm(12, 64) + + +class GruModTest(LayerTest, unittest.TestCase): + _INPUTS = [np.zeros((10, 20, 12)), + np.random.uniform(size=(10, 20, 12)), ] + + def setUp(self): + self.layer = nn.GruMod(12, 64) + + +class ConvolutionTest(LayerTest, unittest.TestCase): + _INPUTS = [np.random.uniform(size=(100, 20, 12))] + + def setUp(self): + self.layer = nn.Convolution(12, 32, 11, 5, has_bias=True) + + +class ResidualTest(LayerTest, unittest.TestCase): + _INPUTS = [np.random.uniform(size=(100, 20, 12))] + + def setUp(self): + sublayer = nn.FeedForward(12, 12, has_bias=True) + self.layer = nn.Residual(sublayer) + + +class DeltaSampleTest(LayerTest, unittest.TestCase): + _INPUTS = [np.random.uniform(size=(100, 20, 12))] + + def setUp(self): + self.layer = nn.DeltaSample() + + +class GlobalNormFlipFlopTest(LayerTest, unittest.TestCase): + _INPUTS = [np.random.uniform(size=(100, 20, 12))] + + def setUp(self): + self.layer = nn.GlobalNormFlipFlop(12, 4) + + @unittest.skip("Test requires GPU") + def test_cupy_and_non_cupy_same(self): + layer = nn.GlobalNormFlipFlop(12, 4).cuda() + + # Perform calculation using cupy + x1 = torch.randn((100, 4, 12)).cuda() + x1.requires_grad = True + loss1 = layer(x1).sum() + loss1.backward() + + # Repeat calculation using pure pytorch + x2 = x1.detach() + x2.requires_grad = True + layer._never_use_cupy = True + loss2 = layer(x2).sum() + loss2.backward() + + # Results and gradients should match + self.assertTrue(torch.allclose(loss1, loss2)) + # Higher atol on gradient because the final operation is a softmax, and + # rtol before softmax = atol after softmax. Therefore I've replaced + # the atol with the default value for rtol. + self.assertTrue(torch.allclose(x1.grad, x2.grad, atol=1e-05)) diff --git a/test/unit/test_mapped_signal_files.py b/test/unit/test_mapped_signal_files.py new file mode 100644 index 0000000..5a87ee0 --- /dev/null +++ b/test/unit/test_mapped_signal_files.py @@ -0,0 +1,173 @@ +import numpy as np +import os +import unittest +import matplotlib as mpl +mpl.use("Agg") +import matplotlib.pyplot as plt + +# To run as a single test, in taiyaki dir and in venv do +# pytest test/unit/test_mapped_signal_files.py + +# lines which plot the ref_to_sig mapping and compare +# with result of searches to obtain sig_to_ref +# and with chunk limits are commented out with an 'if False' +# may be useful in debugging + +from taiyaki import mapped_signal_files + + +def vectorprint(x): + print('[' + (' '.join([str(i) for i in x])) + ']') + + +def construct_mapped_read(): + """Test data for a mapped read file. + Returns a dictionary containing the data""" + Nsig = 20 + Nref = 16 + reftosigstart = np.concatenate(( + np.array([-1, -1], dtype=np.int32), # Start marker + np.arange(2, 5, dtype=np.int32), # Steps, starting at 2 + np.full(4, 5, dtype=np.int32), # Stays (this is four fives, not five fours!) 
+ np.arange(7, 11, dtype=np.int32) # Skip followed by steps + )) + reftosig = np.full(Nref + 1, Nsig, dtype=np.int32) # Note length of reftosig is 1+reflen + reftosig[:len(reftosigstart)] = reftosigstart + return { + 'alphabet': 'ACGT', + 'collapse_alphabet': 'ACGT', + 'shift_frompA': 0.0, + 'scale_frompA': 0.001, + 'range': 1.0, + 'offset': 0.0, + 'digitisation': float(1000), + 'Dacs': np.arange(Nsig, dtype=np.int16), + 'Ref_to_signal': reftosig, + 'Reference': np.arange(Nref, dtype=np.int16), + 'read_id': '11b284b3-397f-45e1-b065-9965c10857ac' + } + + +class TestMappedReadFiles(unittest.TestCase): + + @classmethod + def setUpClass(self): + self.test_directory = os.path.splitext(__file__)[0] + self.testset_name = os.path.basename(self.test_directory) + self.testset_work_dir = self.testset_name + os.makedirs(self.testset_work_dir, exist_ok=True) + self.testfilepath = os.path.join(self.testset_work_dir, 'test_mapped_read_file.hdf5') + self.plotfilepath = os.path.join(self.testset_work_dir, 'test_mapped_read_file.png') + try: + os.remove(self.testfilepath) + print("Previous test file removed") + except: + print("No previous test file to remove") + + def test_HDF5_mapped_read_file(self): + """Test that we can save a mapped read file, open it again and + use some methods to get data from it. Plot a picture for diagnostics. + """ + + print("Creating Read object from test data") + read_dict = construct_mapped_read() + read_object = mapped_signal_files.Read(read_dict) + print("Checking contents") + check_text = read_object.check() + print("Check result on read object:") + print(check_text) + self.assertEqual(check_text, "pass") + + print("Writing to file") + with mapped_signal_files.HDF5(self.testfilepath, "w") as f: + f.write_read(read_object['read_id'], read_object) + f.write_version_number(7) + + print("Current dir = ", os.getcwd()) + print("File written to ", self.testfilepath) + + print("\nOpening file for reading") + with mapped_signal_files.HDF5(self.testfilepath, "r") as f: + ids = f.get_read_ids() + print("Read ids=", ids[0]) + print("Version number = ", f.get_version_number()) + self.assertEqual(ids[0], read_dict['read_id']) + + file_test_report = f.check() + print("Test report:", file_test_report) + self.assertEqual(file_test_report, "pass") + + read_list = f.get_multiple_reads("all") + + recovered_read = read_list[0] + reflen = len(recovered_read['Reference']) + siglen = len(recovered_read['Dacs']) + + # Get a chunk - note that chunkstart is relative to the start of the mapped + # region, not relative to the start of the signal + chunklen, chunkstart = 5, 3 + chunkdict = recovered_read.get_chunk_with_sample_length(chunklen, chunkstart) + + # Check that the extracted chunk is the right length + self.assertEqual(len(chunkdict['current']), chunklen) + + # Check that the mapping data agrees with what we put in + self.assertTrue(np.all(recovered_read['Ref_to_signal']==read_dict['Ref_to_signal'])) + + # Plot a picture showing ref_to_sig from the read object, def setup(): + # and the result of searches to find the inverse + if False: + plt.figure() + plt.xlabel('Signal coord') + plt.ylabel('Ref coord') + ix = np.array([0, -1]) + plt.scatter(chunkdict['current'][ix], chunkdict['sequence'][ix], + s=50, label='chunk limits', marker='s', color='black') + plt.scatter(recovered_read['Ref_to_signal'], np.arange(reflen + 1), label='reftosig (source data)', + color='none', edgecolor='blue', s=60) + siglocs = np.arange(siglen, dtype=np.int32) + sigtoref_fromsearch = 
recovered_read.get_reference_locations(siglocs) + plt.scatter(siglocs, sigtoref_fromsearch, label='from search', color='red', marker='x', s=50) + plt.legend() + plt.grid() + plt.savefig(self.plotfilepath) + print("Saved plot to", self.plotfilepath) + + #raise Exception("Fail so we can read output") + return + + def test_check_HDF5_mapped_read_file(self): + """Check that constructing a read object which doesn't conform + leads to errors. + """ + print("Creating flawed Read object from test data") + read_dict = construct_mapped_read() + read_dict['Reference'] = "I'm not a numpy array!" # Wrong type! + read_object = mapped_signal_files.Read(read_dict) + print("Checking contents") + check_text = read_object.check() + print("Check result on read object: should fail") + print(check_text) + self.assertNotEqual(check_text, "pass") + + print("Writing to file") + with mapped_signal_files.HDF5(self.testfilepath, "w") as f: + f.write_read(read_object['read_id'], read_object) + f.write_version_number(7) + + print("Current dir = ", os.getcwd()) + print("File written to ", self.testfilepath) + + print("\nOpening file for reading") + with mapped_signal_files.HDF5(self.testfilepath, "r") as f: + ids = f.get_read_ids() + print("Read ids=", ids[0]) + print("Version number = ", f.get_version_number()) + self.assertEqual(ids[0], read_dict['read_id']) + + file_test_report = f.check() + print("Test report (should fail):", file_test_report) + self.assertNotEqual(file_test_report, "pass") + + #raise Exception("Fail so we can read output") + return diff --git a/test/unit/test_maths.py b/test/unit/test_maths.py new file mode 100644 index 0000000..58d9f88 --- /dev/null +++ b/test/unit/test_maths.py @@ -0,0 +1,73 @@ +import unittest +import numpy as np +from taiyaki import maths + + +class MathsTest(unittest.TestCase): + + @classmethod + def setUpClass(self): + print('* Maths routines') + np.random.seed(0xdeadbeef) + + def test_001_studentise(self): + sh = (7, 4) + x = np.random.normal(size=sh) + x2 = maths.studentise(x) + self.assertTrue(x2.shape == sh) + self.assertAlmostEqual(np.mean(x2), 0.0) + self.assertAlmostEqual(np.std(x2), 1.0) + + def test_002_studentise_over_axis0(self): + sh = (7, 4) + x = np.random.normal(size=sh) + x2 = maths.studentise(x, axis=0) + self.assertTrue(x2.shape == sh) + self.assertTrue(np.allclose(np.mean(x2, axis=0), 0.0)) + self.assertTrue(np.allclose(np.std(x2, axis=0), 1.0)) + + def test_003_studentise_over_axis1(self): + sh = (7, 4) + x = np.random.normal(size=sh) + x2 = maths.studentise(x, axis=1) + self.assertTrue(x2.shape == sh) + self.assertTrue(np.allclose(np.mean(x2, axis=1), 0.0)) + self.assertTrue(np.allclose(np.std(x2, axis=1), 1.0)) + + def test_004_med_mad(self): + x = np.array([[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 1.0, 1.0], [0.0, 0.5, 0.5, 1.0]]) + factor = 1 + loc, scale = maths.med_mad(x, factor=factor) + self.assertTrue(np.allclose(loc, 0.5)) + self.assertTrue(np.allclose(scale, 0)) + + def test_005_med_mad_over_axis0(self): + x = np.array([[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 1.0, 1.0], [0.5, 1.0, 0.5, 1.0]]) + factor = 1 + loc, scale = maths.med_mad(x, factor=factor, axis=0) + expected_loc = [0.5, 0.5, 0.5, 1.0] + expected_scale = [0, 0, 0, 0] + self.assertTrue(np.allclose(loc, expected_loc)) + self.assertTrue(np.allclose(scale, expected_scale)) + + def test_006_med_mad_over_axis1(self): + x = np.array([[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 1.0, 1.0], [0.0, 0.5, 0.5, 1.0]]) + factor = 1 + loc, scale = maths.med_mad(x, factor=factor, axis=1) + expected_loc = [0.5, 0.75, 0.5] + 
expected_scale = [0, 0.25, 0.25] + self.assertTrue(np.allclose(loc, expected_loc)) + self.assertTrue(np.allclose(scale, expected_scale)) + + def test_007_mad_keepdims(self): + x = np.zeros((5, 6, 7)) + self.assertTrue(np.allclose(maths.mad(x, axis=0, keepdims=True), + np.zeros((1, 6, 7)))) + self.assertTrue(np.allclose(maths.mad(x, axis=1, keepdims=True), + np.zeros((5, 1, 7)))) + self.assertTrue(np.allclose(maths.mad(x, axis=2, keepdims=True), + np.zeros((5, 6, 1)))) + + +if __name__ == '__main__': + unittest.main() diff --git a/workflow/Makefile b/workflow/Makefile new file mode 100644 index 0000000..ee770bc --- /dev/null +++ b/workflow/Makefile @@ -0,0 +1,295 @@ +SHELL=/bin/bash -o pipefail + +MAKEFLAGS += --warn-undefined-variables + +# Makefile to prepare data and run training +# +############## +# USE +############## +# Should be run with the current directory being the taiyaki root directory +# +# make -f workflow/Makefile DEVICE=2 MAXREADS=1000 READDIR=readdir REFERENCEFILE=mygenome.fa BAMFILE=mybam.bam train_remap_samref +# or alternatively run from another directory specifying the root of the taiyaki installation +# +# make -f software/taiyaki/workflow/Makefile TAIYAKI_ROOT=software/taiyaki DEVICE=2 MAXREADS=1000 READDIR=readdir REFERENCEFILE=mygenome.fa BAMFILE=mybam.bam train_remap_samref +# +# All assignments on the make command line (like MAXREADS=1000 above) override variables defined with := in this file. +# +# The training results and training ingredients are placed in ${TAIYAKI_ROOT}/RESULTS +# +# This destination can be changed with the optional argument RESULT_DIR +# +# This Makefile will: +# --prepare per-read-parameter file, per-read-reference file and mapped-signal files +# --run training +# prepared data goes in directory RESULTS/training_ingredients/ +# training results go in RESULTS/remap_training or RESULTS/f5map_training +# both directories are created by the makefile +# The DEVICE is used for training, but not for remapping. +# +# It's also possible to specify your own per-read-reference file - see the variable USER_PER_READ_REFERENCE_FILE below. +# +# If you want to run on a UGE cluster, then use UGE=Yes rather than DEVICE=1 +# The Makefile will look to see which GPU has been allocated and act accordingly. +# NCPU should also be set. See comments near the definition of variable UGE below. +# +################# +# REQUIREMENTS +################# +# +######## DATA +# +# The variable READDIR must point to a directory +# of fast5 files. For example make -f workflow/Makefile READDIR=myreads DEVICE=2 MAXREADS=100 .... +# +# A bam or sam file containing alignments to a genomic reference are also required if using get_refs_from_sam.py. +# +# The bam is specified with BAMFILE= and the reference fasta with REFERENCEFILE= +# +######## REMAPPING MODEL +# +# For remapping, we also need a pytorch flip-flop model in the location +# REMAPMODELFILE below. A suitable model is at the location specified +# The model can be specified on the make command line: +# +# make -f workflow/Makefile READDIR=myreads REMAPMODELFILE=mymodel ... 
+# + +TAIYAKI_ROOT := $(shell pwd) +NCPU := $(shell nproc) +OPENBLAS_NUM_THREADS := 1 +export OPENBLAS_NUM_THREADS +OMP_NUM_THREADS := 1 +export OMP_NUM_THREADS + +##################################### +# INGREDIENTS AND PARAMETERS +##################################### +#Max number of reads to process - use small number for testing +MAXREADS := 10000000000 +#Max number of training iterations - use small number for testing +MAX_TRAINING_ITERS := 50000 +# Chunk logging threshold - if set to 0 then all chunks logged. Value of 10 means only batches with unusually high loss are recorded +CHUNK_LOGGING_THRESHOLD := 0 + +# Which device to use for remapping and for training (cpu for CPU, or integer for GPU number) +# Run nvidia-smi to get a summary of which GPUs are in use. Specify DEVICE=cpu if no GPU available. +DEVICE := 0 + +########################################################## +# RUNNING ON A UGE CLUSTER +########################################################## +# If the variable UGE is specified then we look to see +# which GPU is available on the current node. +# The variable SGE_HGR_gpu contains either cuda0 or cuda1 +# so we set DEVICE accordingly. +# +# Example of a UGE command-line: +# +# qsub -l gpu=1 -b y -P research -cwd -o SGEout.txt -e SGEerr.txt -pe mt 8 make UGE=Yes NCPU=8 MAXREADS=1000 train_remap_samref +# +# The option -l gpu=1 makes the system wait for a node that has at least one GPU available. +# +# Note that we need to specify the number of processors to use separately (NCPU=8), since on the UGE cluster the bash command nproc returns the total number +# of processors rather than the number allocated to a job. +# +########################################################## + +ifdef UGE + DEVICE:= ${SGE_HGR_gpu} +endif + +#The variables below should be set using command-line options to make +#E.g. make -f workflow/Makefile READDIR=myreaddir BAMFILE=mybam.bam REFERENCEFILE=mygenome.fa train_remap_samref +#In most use cases, the per-read-reference file is generated by the Makefile using taiyaki scripts. +#But if you want to specify your own, then use USER_PER_READ_REFERENCE_FILE and the make target train_remapuser_ref + + +READDIR = READDIR_SHOULD_POINT_TO_DIRECTORY_CONTAINING_READS +BAMFILE := BAMFILE_SHOULD_POINT_TO_BAM_ALIGNMENT_IF_USING_get_refs_from_sam +REFERENCEFILE := REFERENCEFILE_SHOULD_POINT_TO_GENOMIC_REFERENCE_IF_USING_get_refs_from_sam +USER_PER_READ_REFERENCE_FILE:= FOR_USER_PER_READ_REFERENCE_SET_THIS_AND_USE_MAKE_TARGET_train_remapuser_ref + + +#Pytorch flip-flop model for remapping +REMAPMODELFILE := ${TAIYAKI_ROOT}/models/mGru256_flipflop_remapping_model_r9_DNA.checkpoint + +#Model definition for training +TRAININGMODEL := ${TAIYAKI_ROOT}/models/mGru256_flipflop.py + +###################### +# WHERE TO PUT THINGS +###################### + +# Root directory for training ingredients and results. +# Training ingredients and training results directories will be created below +# training results are placed in +# ${RESULT_DIR}/remap_training and ${RESULT_DIR}/f5map_training +RESULT_DIR := ${TAIYAKI_ROOT}/RESULTS +#Directory to place TSV per-read files and mapped-read files +# chunk files will be placed in ${INGREDIENT_DIR}/mapped_f5map.hdf5 or ${INGREDIENT_DIR}/mapped_remap.hdf5 +INGREDIENT_DIR := ${RESULT_DIR}/training_ingredients +#Where to put TSV per-read files +PERREADFILE := ${INGREDIENT_DIR}/readparams.tsv + +# If the variable STRANDLIST is set on the make command line +# (e.g. 
make -f workflow/Makefile STRANDLIST=my_strand_list.tsv ) then we use a strand list +STRANDLISTOPT := +ifdef STRANDLIST + STRANDLISTOPT := --input_strand_list ${STRANDLIST} +endif + +# If the variable SEED is set on the make command line +# (e.g. make -f workflow/Makefile SEED=1 ) then we seed the random number generator used to select chunks in training. +# This is here so that in acceptance testing we can make the behaviour reproducible +SEEDOPT := +ifdef SEED + SEEDOPT := --seed ${SEED} +endif + + +###################### +# TAIYAKI PACKAGE +###################### +# This Makefile assumes taiyaki already installed +# with command line like the one below +#taiyaki: +# git clone https://github.com/nanoporetech/taiyaki +# (cd taiyaki && make install) +# TAIYAKIACTIVATE is placed before all taiyaki script invocations +# to activate the venv. If the venv is already activated, or if not using a venv, then +# set this variable to blank (e.g. make -f workflow/Makefile DEVICE=2 TAIYAKIACTIVATE= READDIR=myreads <...other params...> train_remap_samref +TAIYAKIACTIVATE := source ${TAIYAKI_ROOT}/venv/bin/activate && + +# Use +# make MAXREADS=1000 listparams +# to list make variables +listparams: + @echo "" + @echo "Listing parameter values...." + @echo "RESULT_DIR="${RESULT_DIR} + @echo "TESTPARAM="${TESTPARAM} + @echo "UGE="${UGE} + @echo "DEVICE="${DEVICE} + @echo "NCPU="${NCPU} + @echo "SGE_HGR_gpu="${SGE_HGR_gpu} + @echo "REMAPMODELFILE="${REMAPMODELFILE} + @echo "TRAININGMODEL="${TRAININGMODEL} + @echo "READDIR="${READDIR} + @echo "BAMFILE="${BAMFILE} + @echo "REFERENCEFILE="${REFERENCEFILE} + @echo "REMAPOPT="${REMAPOPT} + @echo "F5MAPOPT="${F5MAPOPT} + @echo "RESULT_DIR="${RESULT_DIR} + @echo "INGREDIENT_DIR="${INGREDIENT_DIR} + @echo "STRANDLIST="${STRANDLIST} + @echo "STRANDLISTOPT="${STRANDLISTOPT} + @echo "TAIYAKI_ROOT="${TAIYAKI_ROOT} + @echo "TAIYAKIACTIVATE="${TAIYAKIACTIVATE} + @echo "MAX_TRAINING_ITERS="${MAX_TRAINING_ITERS} + @echo "" + + + +###################### +# CREATE DIRECTORIES +###################### +${RESULT_DIR}: + @echo "" + @echo "------------Setting up directory ${RESULT_DIR}" + @echo "" + mkdir ${RESULT_DIR} + +${INGREDIENT_DIR}: | ${RESULT_DIR} + @echo "" + @echo "------------Setting up directory ${INGREDIENT_DIR}" + @echo "" + mkdir ${INGREDIENT_DIR} + +####################### +# DATA PREPARATION +####################### + +#Make TSV file with trimming and scaling parameters +${PERREADFILE}: | ${INGREDIENT_DIR} + @echo "" + @echo "------------Creating per-read parameter file for ${MAXREADS} reads" + @echo "" + ${TAIYAKIACTIVATE} ${TAIYAKI_ROOT}/bin/generate_per_read_params.py --limit ${MAXREADS} ${STRANDLISTOPT} --overwrite ${READDIR} $@ + + +#Make file containing reference segment for each read, using alignment in sam or bam +${INGREDIENT_DIR}/per_read_references_from_sam.fa: | ${INGREDIENT_DIR} + @echo "" + @echo "------------Creating reference file from sam or bam at ${BAMFILE}" + @echo "" + ${TAIYAKIACTIVATE} ${TAIYAKI_ROOT}/bin/get_refs_from_sam.py ${REFERENCEFILE} ${BAMFILE} > $@ + +#A third alternative is to supply your own per-read reference file. 
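+#As a sketch (file names here are placeholders, e.g. my_per_read_refs.fa is not a file shipped with taiyaki), the user-supplied route might be invoked as: +# make -f workflow/Makefile READDIR=myreads USER_PER_READ_REFERENCE_FILE=my_per_read_refs.fa DEVICE=0 train_remapuser_ref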
+#Make mapped-read file using flip-flop remapping, using any of these options to generate the per-read-reference file + +.PHONY: remapped_samref +.PHONY: remapped_userref + +#Cases where per-read-reference file generated by taiyaki scripts +remapped_samref: ${INGREDIENT_DIR}/mapped_remap_samref.hdf5 +#Case where per-read-reference file supplied by the user +#Note that this file has _ in the wrong place so doesn't fit the first template below - it has its own recipe +remapped_userref: ${INGREDIENT_DIR}/mapped_remapuser_ref.hdf5 + +#Recipe for cases where per-read-reference file generated by taiyaki scripts +${INGREDIENT_DIR}/mapped_remap_%ref.hdf5: ${INGREDIENT_DIR}/per_read_references_from_%.fa ${PERREADFILE} | ${INGREDIENT_DIR} + @echo "" + @echo "------------Creating mapped read file by flip=flop remapping for ${MAXREADS} reads from $<. Using ${NCPU} threads." + @echo "" + ${TAIYAKIACTIVATE} ${TAIYAKI_ROOT}/bin/prepare_mapped_reads.py --device cpu --limit ${MAXREADS} ${STRANDLISTOPT} --overwrite ${READDIR} --jobs ${NCPU} ${PERREADFILE} $@ ${REMAPMODELFILE} $< + +#Recipe for case where per-read-reference file supplied by the user +${INGREDIENT_DIR}/mapped_remapuser_ref.hdf5: ${PERREADFILE} | ${INGREDIENT_DIR} + @echo "" + @echo "------------Creating mapped read file by flip=flop remapping for ${MAXREADS} reads from $<. Using ${NCPU} threads." + @echo "" + ${TAIYAKIACTIVATE} ${TAIYAKI_ROOT}/bin/prepare_mapped_reads.py --device cpu --limit ${MAXREADS} ${STRANDLISTOPT} --overwrite ${READDIR} --jobs ${NCPU} $< $@ ${REMAPMODELFILE} ${USER_PER_READ_REFERENCE_FILE} + + +############################## +# BASECALL NETWORK TRAINING +############################## +# +# make train_remap_samref # to train with remap-derived chunks where the per-read-reference file is derived from a sam or bam +# make train_remapuser_ref # to train with remap-derived chunks where the per-read-reference file is supplied by the user +# +# The recipe makes a file (using touch) to signal that it's finished. +# It's likely that we'll stop training manually before this point is reached. +# The training directory (train_xxx) is created automatically by the training script if needed. + +.PHONY: train_remap_samref +.PHONY: train_remapuser_ref +#Note that the placement of the _ (remapuser_ref, not remap_userref) is not a mistake and is necessary to make the different paths through the Makefile work. 
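+#For example (a sketch using the placeholder names from the usage notes above; exact checkpoint file names are chosen by bin/train_flipflop.py), running +# make -f workflow/Makefile READDIR=myreads BAMFILE=mybam.bam REFERENCEFILE=mygenome.fa DEVICE=0 train_remap_samref +#writes its output (including model.log and chunklog.tsv) under ${RESULT_DIR}/train_remap_samref/ and touches the file 'trained' once the iteration limit is reached.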
+ +train_remap_samref: ${RESULT_DIR}/train_remap_samref/trained +train_remapuser_ref: ${RESULT_DIR}/train_remapuser_ref/trained + +${RESULT_DIR}/train_%/trained: ${INGREDIENT_DIR}/mapped_%.hdf5 + @echo "" + @echo "------------Training with $* chunks" + @echo "" + ${TAIYAKIACTIVATE} ${TAIYAKI_ROOT}/bin/train_flipflop.py --overwrite --chunk_logging_threshold ${CHUNK_LOGGING_THRESHOLD} --niteration ${MAX_TRAINING_ITERS} --device ${DEVICE} ${SEEDOPT} ${TRAININGMODEL} $(dir $@) $< + touch $@ + + +######################################## +# SQUIGGLE-PREDICTOR NETWORK TRAINING +######################################## + +.PHONY: squiggletrain_remap_samref + +squiggletrain_remap_samref: ${RESULT_DIR}/squiggletrain_remap_samref/trained + +${RESULT_DIR}/squiggletrain_%/trained: ${INGREDIENT_DIR}/mapped_%.hdf5 + @echo "" + @echo "------------Training squiggle model with $* map chunks" + @echo "" + ${TAIYAKIACTIVATE} ${TAIYAKI_ROOT}/bin/train_squiggle.py --overwrite --chunk_logging_threshold ${CHUNK_LOGGING_THRESHOLD} --niteration ${MAX_TRAINING_ITERS} --device ${DEVICE} ${SEEDOPT} $< $(dir $@) + touch $@ diff --git a/workflow/remap_from_samrefs_then_train_multireadf5_test_workflow.sh b/workflow/remap_from_samrefs_then_train_multireadf5_test_workflow.sh new file mode 100755 index 0000000..abeb671 --- /dev/null +++ b/workflow/remap_from_samrefs_then_train_multireadf5_test_workflow.sh @@ -0,0 +1,54 @@ +#! /bin/bash -eux +set -o pipefail + +# Test workflow from fast5 files to trained model +# This is done with just a few reads so the model +# won't be useful for anything. +# This script must be executed with the current directory being the taiyaki base directory + +echo "" +echo "Test of fast5 map extraction from multi-read fast5s followed by basecall network training starting" +echo "" + + +# Execute the whole workflow, extracting references, generating per-read-params and mapped-read files and then training +READ_DIR=test/data/multireads +SAM_DIR=test/data/aligner_output +# The |xargs puts spaces rather than newlines between the filenames +SAMFILES=$(ls ${SAM_DIR}/*.sam |xargs) +REFERENCEFILE=test/data/genomic_reference.fasta + +echo "SAMFILES=${SAMFILES}" +echo "REFERENCEFILE=${REFERENCEFILE}" + +TAIYAKI_DIR=`pwd` +RESULT_DIR=${TAIYAKI_DIR}/RESULTS/train_remap_samref + +rm -rf $RESULT_DIR +rm -rf ${TAIYAKI_DIR}/RESULTS/training_ingredients + +#TAIYAKIACTIVATE=(nothing) makes the test run without activating the venv at each step. Necessary for running on the git server. 
+make -f workflow/Makefile MAXREADS=10 READDIR=${READ_DIR} TAIYAKI_ROOT=${TAIYAKI_DIR} DEVICE=cpu MAX_TRAINING_ITERS=2 BAMFILE="${SAMFILES}" REFERENCEFILE=${REFERENCEFILE} SEED=1 TAIYAKIACTIVATE= train_remap_samref
+
+
+# Check that training chunk log and training log exist and have enough rows for us to be sure something useful has happened
+
+chunklog_lines=`wc -l ${RESULT_DIR}/chunklog.tsv | cut -f1 -d' '`
+echo "Number of lines in training chunk log: ${chunklog_lines}"
+if [ "$chunklog_lines" -lt "20" ]
+then
+    echo "Training chunk log too short - not enough chunks generated"
+    exit 1
+fi
+
+traininglog_lines=`wc -l ${RESULT_DIR}/model.log | cut -f1 -d' '`
+echo "Number of lines in training log: ${traininglog_lines}"
+if [ "$traininglog_lines" -lt "9" ]
+then
+    echo "Training log too short - training not started properly"
+    exit 1
+fi
+
+echo ""
+echo "Test of fast5 map extraction from multi-read fast5s followed by basecall network training completed successfully"
+echo ""
diff --git a/workflow/remap_from_samrefs_then_train_squiggle_test_workflow.sh b/workflow/remap_from_samrefs_then_train_squiggle_test_workflow.sh
new file mode 100755
index 0000000..98d0fa3
--- /dev/null
+++ b/workflow/remap_from_samrefs_then_train_squiggle_test_workflow.sh
@@ -0,0 +1,57 @@
+#! /bin/bash -eux
+set -o pipefail
+
+# Test workflow from fast5 files with remapping to trained squiggle model
+# This is done with just a few reads so the model
+# won't be useful for anything.
+# This script must be executed with the current directory being the taiyaki base directory
+
+echo ""
+echo "Test of remapping using references extracted from sam files followed by squiggle network training starting"
+echo ""
+
+
+
+# Execute the whole workflow, extracting references, generating per-read-params and mapped-read files and then training
+READ_DIR=test/data/reads
+SAM_DIR=test/data/aligner_output
+# The |xargs puts spaces rather than newlines between the filenames
+SAMFILES=$(ls ${SAM_DIR}/*.sam |xargs)
+REFERENCEFILE=test/data/genomic_reference.fasta
+
+echo "SAMFILES=${SAMFILES}"
+echo "REFERENCEFILE=${REFERENCEFILE}"
+
+TAIYAKI_DIR=`pwd`
+RESULT_DIR=${TAIYAKI_DIR}/RESULTS/squiggletrain_remap_samref
+
+rm -rf $RESULT_DIR
+rm -rf ${TAIYAKI_DIR}/RESULTS/training_ingredients
+
+#TAIYAKIACTIVATE=(nothing) makes the test run without activating the venv at each step. Necessary for running on the git server.
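+# (Explanatory comment, not in the original script.) This is the same make invocation as in the
+# basecall training test, except that the squiggletrain_remap_samref target is built, so the
+# Makefile recipe runs bin/train_squiggle.py on the mapped-read file instead of bin/train_flipflop.py.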
+make -f workflow/Makefile MAXREADS=10 READDIR=${READ_DIR} TAIYAKI_ROOT=${TAIYAKI_DIR} DEVICE=cpu MAX_TRAINING_ITERS=2 BAMFILE="${SAMFILES}" REFERENCEFILE=${REFERENCEFILE} SEED=1 TAIYAKIACTIVATE= squiggletrain_remap_samref
+
+
+# Check that training chunk log and training log exist and have enough rows for us to be sure something useful has happened
+
+
+
+chunklog_lines=`wc -l ${RESULT_DIR}/chunklog.tsv | cut -f1 -d' '`
+echo "Number of lines in training chunk log: ${chunklog_lines}"
+if [ "$chunklog_lines" -lt "20" ]
+then
+    echo "Training chunk log too short - not enough chunks generated"
+    exit 1
+fi
+
+traininglog_lines=`wc -l ${RESULT_DIR}/model.log | cut -f1 -d' '`
+echo "Number of lines in training log: ${traininglog_lines}"
+if [ "$traininglog_lines" -lt "9" ]
+then
+    echo "Training log too short - training not started properly"
+    exit 1
+fi
+
+echo ""
+echo "Test of remapping using references extracted from sam files followed by squiggle network training completed successfully"
+echo ""
diff --git a/workflow/remap_from_samrefs_then_train_test_workflow.sh b/workflow/remap_from_samrefs_then_train_test_workflow.sh
new file mode 100755
index 0000000..680079b
--- /dev/null
+++ b/workflow/remap_from_samrefs_then_train_test_workflow.sh
@@ -0,0 +1,53 @@
+#! /bin/bash -eux
+set -o pipefail
+
+# Test workflow from fast5 files to trained model using flip-flop remapping with refs extracted from sam
+# This is done with just a few reads so the model
+# won't be useful for anything.
+# This script must be executed with the current directory being the taiyaki base directory
+
+echo ""
+echo "Test of extract-ref-from-sam followed by flip-flop remap and basecall network training starting"
+echo ""
+
+# Execute the whole workflow, extracting references, generating per-read-params and mapped-read files and then training
+READ_DIR=test/data/reads
+SAM_DIR=test/data/aligner_output
+# The |xargs puts spaces rather than newlines between the filenames
+SAMFILES=$(ls ${SAM_DIR}/*.sam |xargs)
+REFERENCEFILE=test/data/genomic_reference.fasta
+
+echo "SAMFILES=${SAMFILES}"
+echo "REFERENCEFILE=${REFERENCEFILE}"
+
+TAIYAKI_DIR=`pwd`
+RESULT_DIR=${TAIYAKI_DIR}/RESULTS/train_remap_samref
+
+rm -rf $RESULT_DIR
+rm -rf ${TAIYAKI_DIR}/RESULTS/training_ingredients
+
+#TAIYAKIACTIVATE=(nothing) makes the test run without activating the venv at each step. Necessary for running on the git server.
+make -f workflow/Makefile MAXREADS=10 READDIR=${READ_DIR} TAIYAKI_ROOT=${TAIYAKI_DIR} DEVICE=cpu MAX_TRAINING_ITERS=2 BAMFILE="${SAMFILES}" REFERENCEFILE=${REFERENCEFILE} SEED=1 TAIYAKIACTIVATE= train_remap_samref
+
+# Check that training chunk log and training log exist and have enough rows for us to be sure something useful has happened
+
+
+chunklog_lines=`wc -l ${RESULT_DIR}/chunklog.tsv | cut -f1 -d' '`
+echo "Number of lines in training chunk log: ${chunklog_lines}"
+if [ "$chunklog_lines" -lt "20" ]
+then
+    echo "Training chunk log too short - not enough chunks generated"
+    exit 1
+fi
+
+traininglog_lines=`wc -l ${RESULT_DIR}/model.log | cut -f1 -d' '`
+echo "Number of lines in training log: ${traininglog_lines}"
+if [ "$traininglog_lines" -lt "9" ]
+then
+    echo "Training log too short - training not started properly"
+    exit 1
+fi
+
+echo ""
+echo "Test of extract-ref-from-sam followed by flip-flop remap and basecall network training completed successfully"
+echo ""
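+# (Illustrative usage note, not part of the original script.) All three of these test workflows are
+# expected to be run from the taiyaki base directory, e.g.
+#   cd /path/to/taiyaki        # hypothetical path to a taiyaki checkout
+#   ./workflow/remap_from_samrefs_then_train_test_workflow.sh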