-
Notifications
You must be signed in to change notification settings - Fork 19
Example: Long Read superTranscriptome Construction and Visualisation
In this tutorial i will try and go through the steps on how one might construct superTranscripts from long read RNA reads (for example from Oxford Nanopore or PacBio).
If you don't want to read the details but just want the recipe for making a tasty long read super transcriptome just read the following summary:
-
De Novo assembled reads into contigs (For example with Trinity or Canu )
-
Cluster contigs (For examples with RapClust or [Corset] (https://github.com/Oshlack/Corset/wiki) )
-
Weave together superTranscripts with Lace
-
Visualise clusters with STViewer (part of the Lace suite)
Here we use Canu which specialise in Long Read data genome assembly, but we could have just as easily used Trinity but we already go through how to use Trinity in another tutorial, so it is good to promote other excellent tools.
N.B. You can download this play dataset here
canu \
-p MCF7 -d MCF7-pac \
genomeSize=3000m \
-pacbio-raw IsoSeq_MCF7_2015edition_polished.unimapped.fasta
Here you can use either RapClust or Corset (in fact they are very similar in approach) RapClust the main difference (apart from speed) is that RapClust uses equivalence classes from pseudo-aligned reads to the transcriptome to cluster using the MCL whereas Corset uses aligned reads to the transcriptome to group using a hierarchical clustering method. Here i will show how to use RapClust (since i highlighted how to use Corset in another tutorial)
##Step 2a - Pseudo-Allign
Before clustering we need to pseudo-align (or align using STAR if we wanted to use Trinity) Using Salmon or Sailfish one can index the transcriptome and pseudoalign using the flag to print the equivalence classes neeed by RapClust(--dumpEq).
cd MCF7-pac
salmon index -t MCF7.contigs.fasta -i transcripts_index --type quasi -k 31
salmon quant -i transcripts_index -l A -r ../IsoSeq_MCF7_2015edition_polished.unimapped.fasta -o transcripts_quant --dumpEq
Now because (at the current time of writing) the latest binary version of salmon is not 100% compatible with RapClust we need to rename the folder aux_info -> aux, furthermore we need to ensure we are using python 2.7 since that is what RapClust was written in.
mv transcripts_quant/aux_info/ transcripts_quant/aux
In order to run RapClust one should provide a config.yaml file with details on the samples conditions and output directory, for a single sample as we have this is easy. config.yaml
conditions:
-Control
samples:
-Control:
- transcripts_quant
outdir: canu_rapclust
Now we can cluster! Hooray! (In a python 2.7 environment)
RapClust --config config.yaml
Now we can use out trusty lace to construct a superTranscript per cluster. But first in order to do so we need to collate the cluster info from RapClust into the right format for Lace, I made a quick script to do that:
python CollateClusters.py canu_rapclust/mag.clust
The script contains:
import sys
if(len(sys.argv) <2):
print("Require cluster output file from RapClust")
sys.exit()
else:
f = open(sys.argv[1],'r')
fout = open('clusters.txt','w')
count = 1
for line in f:
contigs = line.split()
for contig in contigs:
fout.write(contig + "\t" + "Cluster" + str(count) + "\n")
count = count + 1
f.close()
fout.close()
Now run Lace: <Lace Requires Python3 so you will have to switch environments, this is easy with something like conda>
mkdir Laced
python Lace.py MCF7.contigs.fasta clusters.txt -o Laced
Now say we want to visualise how the contigs build up the superTranscripts, we can use STViewer.py to take a peek at Cluster2:
cd Laced
python STViewer.py Cluster2.fasta