A collection of command-line tools for hybrid transcriptome assembly.
TODO
Requirements
- Minimap2 (https://github.com/lh3/minimap2)
- SeqKit (https://github.com/shenwei356/seqkit)
- Racon (https://github.com/isovic/racon)
- Python (>=3.6)
- NumPy
- NetworkX
TODO
Short read assembly graph guided partitioning of long read data.
The driving idea is that the components of an assembly graph
constructed from short reads should each correspond to the isoforms
of either a single gene or a group of genes that share parts of their
sequence. By mapping long reads to the assembly graph, they can be
separated into smaller clusters. This can greatly improve the speed
of downstream long read assembly or correction methods that would
otherwise require an all-versus-all mapping.
TODO
- Assembly graph construction
Options:
a) Provide a precomputed assembly graph from SPAdes in .fastg format
b) Provide short reads and Ivocluster will construct a compressed
de Bruijn graph.
Using a precomputed assembly graph should be preferred, as it is
expected to contain fewer sequencing artifacts and usually has
a less complex structure, which makes Ivocluster a lot faster and
possibly more accurate.
- Separation of the non-connected components of the assembly graph
By parsing the data from the previous step into a networkx.Graph
object, its components can be separated from each other. The contig
sequences are written to a new FASTA file where the header of each
entry shows the index of the corresponding graph component.
- Mapping longreads to the assembly graph
The long reads are mapped to the contigs of the assembly graph
using Minimap2.
Multithreading:
Unfortunately, mapping must be performed using only a single thread
in order to guarantee that mappings are reported in the same order
as their corresponding longreads appear in their file.
However, multiple threads can be used for the construction of the
index. By using SeqKit to split the longreads into smaller files first
and mapping each of them separately, multiple threads can be utilized.
The catch is that each thread has to load the index of the assembly
graph, which is often several gigabytes large. Using many threads will
therefore eat your RAM like a bachelor student whose grandma pays for the
all-you-can-eat buffet.
- Assigning long reads
The SequenceMappingQueue allows sequential iteration through all
long reads and their corresponding mappings to the assembly graph.
For each longread and its reported mappings, a greedy heuristic
attempts to find a set of longread-to-contig mappings where the
individual mappings do not overlap and have the highest possible combined
alignment score.
The longread then gets assigned to the cluster corresponding to the
component of the assembly graph with the best set of non-overlapping
mappings.
Longreads that could not be mapped to the assembly graph are stored
in a separate file. Similarly, longreads where different regions map
to different components of the assembly graph are also written to a separate file.
Why could that happen?
- Longreads are chimeras, transcript fragments unintentionally fused
together.
- Longreads cover transcriptomic regions that have insufficient short
read coverage, which leads to graph components breaking apart.
What will Ivocluster do in cases like this? Nothing.
Ideally, a statistical analysis would determine which of the two cases
applies. For case 1, longreads would be discarded and for case 2 the
corresponding read clusters would be merged and the longreads added.
I will implement this once this happens.
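The component separation step could be sketched roughly as follows. This is a minimal illustration, assuming the assembly graph has already been parsed into contig sequences and links between contigs. The function name `write_component_fasta`, the input format and the header layout `>component_<index>|<contig>` are hypothetical and not taken from Ivocluster.

```python
import io
import networkx as nx


def write_component_fasta(contigs, links, handle):
    """Write contigs as FASTA, tagging each header with its component index.

    contigs: dict mapping contig name -> sequence (assumed input format)
    links:   iterable of (contig_a, contig_b) pairs that are connected
    """
    graph = nx.Graph()
    graph.add_nodes_from(contigs)  # unlinked contigs form their own component
    graph.add_edges_from(links)

    contig_to_component = {}
    # Order components by their smallest contig name for reproducible indices.
    components = sorted(nx.connected_components(graph), key=min)
    for index, component in enumerate(components):
        for name in sorted(component):
            contig_to_component[name] = index
            handle.write(f">component_{index}|{name}\n{contigs[name]}\n")
    return contig_to_component


contigs = {"c1": "ACGT", "c2": "GTTA", "c3": "CCCC"}
links = [("c1", "c2")]
out = io.StringIO()
mapping = write_component_fasta(contigs, links, out)
```

Here `c1` and `c2` end up in one component and `c3` in another, so a long read mapping to `c3` would belong to a different cluster than one mapping to `c1`.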
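The assignment step relies on a greedy heuristic whose details are not spelled out here. One possible variant, shown purely as a sketch, picks mappings in order of descending alignment score and skips any mapping that overlaps an already chosen one on the read. The `Mapping` record, the function names and the half-open interval convention are assumptions for illustration, and the handling of reads that hit several components is left out.

```python
from collections import defaultdict, namedtuple

# Hypothetical record: read interval [start, end), alignment score, graph component.
Mapping = namedtuple("Mapping", "start end score component")


def greedy_non_overlapping(mappings):
    """Pick mappings by descending score, skipping overlaps on the read."""
    chosen = []
    for m in sorted(mappings, key=lambda m: m.score, reverse=True):
        if all(m.end <= c.start or m.start >= c.end for c in chosen):
            chosen.append(m)
    return chosen


def best_component(mappings):
    """Return (component, combined score) of the best set, or None if unmapped."""
    by_component = defaultdict(list)
    for m in mappings:
        by_component[m.component].append(m)
    best = None
    for component, group in by_component.items():
        total = sum(m.score for m in greedy_non_overlapping(group))
        if best is None or total > best[1]:
            best = (component, total)
    return best


read_mappings = [
    Mapping(0, 500, 450, component=3),
    Mapping(400, 900, 300, component=3),  # overlaps the first mapping, skipped
    Mapping(520, 900, 250, component=3),
    Mapping(0, 800, 600, component=7),
]
result = best_component(read_mappings)
```

In this example, component 3 wins with a combined score of 700 from two non-overlapping mappings, even though the single best mapping belongs to component 7. A greedy choice like this is fast but not guaranteed to find the optimal set.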
Issues
It sucks! (Only ~1/3 of the reads are successfully assigned to a cluster, probably because of the chosen mapping strategy)
Limitations
- Multithreading: Depending on the size of the assembly graph, a lot of RAM is required per thread
Future plans
Further plans (other than fixing known issues) sorted by priority:
- Support for assembly graphs in GFA format
(GFA is probably the most used file format for assembly graphs
and would allow the use of a wide variety of tools for the construction
of the short read assembly graph)
- Improving multithreading
(The user could provide any number of threads to be used for the
construction of the Minimap2 index, while Ivocluster would limit the
number of threads used for the mapping according to the available RAM,
the size of the index and the size of the long read file.)
- Clustering short read data.
- Include construction of an actual assembly graph from user-provided
short reads.
- Support for different mapping tools or user-provided alignment files
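The planned RAM-aware thread limit could look roughly like this. It is only a sketch of the idea, assuming that every mapping job holds one copy of the index plus an equal share of the long read file in memory. The function name, its parameters and this memory model are illustrative assumptions, not existing Ivocluster behaviour.

```python
def max_mapping_workers(requested, available_ram, index_size, longread_size):
    """Estimate how many single-threaded mapping jobs fit into RAM.

    All sizes are in bytes. Assumes (hypothetically) that each job loads one
    copy of the index and that the long reads are split evenly across jobs.
    """
    if requested < 1:
        raise ValueError("requested must be at least 1")
    for workers in range(requested, 0, -1):
        needed = workers * index_size + longread_size
        if needed <= available_ram:
            return workers
    return 1  # always run at least one job, even if RAM looks tight


GIB = 1024 ** 3
workers = max_mapping_workers(16, 32 * GIB, 4 * GIB, 6 * GIB)
```

With 32 GiB of RAM, a 4 GiB index and 6 GiB of long reads, the 16 requested threads would be capped at 6 mapping jobs, while the index construction could still use all 16.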
"Hybrid self-correction" of clustered longreads with spike-in contigs
using Minimap2 and Racon.
- Spiking longread clusters with the contigs from the corresponding
components of the short read assembly graph at a user-defined rate.
- All-versus-all mapping with Minimap2
- Self-correction with Racon
- Removing spike-in contigs from the files of corrected long-reads.
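The first and last of these steps could be sketched as below. How the user-defined rate is meant to work is not specified, so this sketch interprets it as the number of copies of each contig added to a cluster. That interpretation, the `spike_in|` header tag and both function names are assumptions made for illustration only.

```python
SPIKE_PREFIX = "spike_in|"  # hypothetical header tag marking spiked contigs


def spike_cluster(longreads, contigs, copies=1):
    """Return the cluster's records plus `copies` tagged copies of each contig.

    Records are (header, sequence) tuples; `copies` stands in for the
    user-defined rate (an assumed interpretation).
    """
    spiked = list(longreads)
    for header, sequence in contigs:
        for i in range(copies):
            # Number the copies so that every header stays unique.
            spiked.append((f"{SPIKE_PREFIX}{i}|{header}", sequence))
    return spiked


def remove_spike_ins(records):
    """Drop every record whose header carries the spike-in tag."""
    return [(h, s) for h, s in records if not h.startswith(SPIKE_PREFIX)]


longreads = [("read1", "ACGTACGT"), ("read2", "TTGACCA")]
contigs = [("contig9", "ACGTACGTTTGACCA")]
spiked = spike_cluster(longreads, contigs, copies=2)
cleaned = remove_spike_ins(spiked)
```

Tagging the spiked contigs in the header means they can be filtered out again after correction without keeping a separate list of names.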