Skip to content

Algorithm

Quarkins edited this page Aug 23, 2016 · 7 revisions

The algorithm to construct the SuperTranscript came be conceptualized with the figure below:

Breaking this down into steps: 1) Input a list of contigs/trancsripts and their sequence in a fasta file and a text file with the clustering information for which gene/cluster each transcript belongs to.

For each gene (or a defined cluster):
2) Using BLAT (https://genome.ucsc.edu/FAQ/FAQblat.html) pairwise align each transcript in the cluster to find the regions which overlap.
3) Construct a directed graph, where each node is a base in one of the transcripts and the directed edge retains the ordering of the bases in each transcript. Using the pairwise alignments of all clusters merge shared bases (nodes) together.
4) Simplify graph and remove all cycles (see below) in order to create a Directed Acyclic Graph (DAG), necessary for sorting. 5) Topologically sort the nodes ( each node know is a string of bases from the original unsimplified graph) using Khan's algorithm, which will give a non-unique sorting of the bases.
6) Extract the annotations (both SuperBlock style and transcript style)

Example of a cycle break: