Overview
SQANTI-SIM is a long-read RNA-seq (lrRNA-seq) simulation tool that generates Nanopore dRNA and cDNA reads as well as PacBio cDNA reads. It offers precise control over transcript novelty based on SQANTI3 structural categories. SQANTI-SIM can also simulate orthogonal data that support both known and novel transcripts.
To use SQANTI-SIM, provide a reference genome and a GTF file containing the transcriptome annotation for the organism of interest. Users can adjust various parameters to control the generation of novel transcripts. SQANTI-SIM relies on NanoSim, PBSIM3, and IsoSeqSim to simulate long reads. Optionally, it can also generate short-read and CAGE peak data. The output includes the simulated long reads, a reduced GTF file that excludes the simulated novel transcripts, and the orthogonal datasets. SQANTI-SIM also generates a comprehensive report that assesses how well a transcript reconstruction algorithm performs on the simulated data.
SQANTI-SIM's approach is based on the structural categories defined by SQANTI3. For a detailed explanation of these categories, please refer to the SQANTI3 wiki. Briefly, degraded transcripts that lack exons at the 5' or 3' end are classified as Incomplete Splice Match (ISM). Novel isoforms are mainly classified as Novel In Catalog (NIC) or Novel Not in Catalog (NNC): NIC isoforms use a new combination of known donor/acceptor sites, while NNC isoforms have at least one donor or acceptor site that is not annotated. SQANTI-SIM can also simulate the other structural categories defined by SQANTI3.
Figure taken from the SQANTI3 wiki
SQANTI-SIM is a wrapper script (sqanti-sim.py) whose modules run in the following order: (i) the classif module assigns each transcript annotated in the reference GTF the SQANTI3 structural category it would receive if it were missing from the annotation, that is, when compared with the other transcripts of the same gene rather than with itself; (ii) the design module lets the user specify which known and novel transcripts to simulate and outputs a reduced GTF file that excludes the "novel" isoforms; (iii) the sim module runs NanoSim, PBSIM3, or IsoSeqSim to simulate reads from the complete reference annotation; and (iv) the eval module assesses the accuracy of the transcriptome reconstructed from the long reads.
1. Transcript classification into SQANTI3 structural categories
Transcripts annotated in the reference GTF are classified according to their potential SQANTI3 structural category when compared to other transcripts of the same gene.
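The idea of a "potential" category can be illustrated with a minimal Python sketch. It is not SQANTI-SIM's implementation: the function names, the representation of a transcript as a list of (donor, acceptor) junction coordinates, and the reduction to four labels are assumptions made for illustration. Strand, mono-exon transcripts, and the remaining SQANTI3 categories are ignored, and FSM (Full Splice Match) is the SQANTI3 label for an identical junction chain.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: a splice junction is (donor, acceptor).
Junction = Tuple[int, int]


def is_contiguous_subchain(query: List[Junction], reference: List[Junction]) -> bool:
    """True if query is a strictly shorter, consecutive run of reference junctions."""
    n, m = len(query), len(reference)
    return 0 < n < m and any(reference[i:i + n] == query for i in range(m - n + 1))


def potential_category(query: List[Junction], others: List[List[Junction]]) -> str:
    """Category a transcript would get when compared only to the OTHER
    transcripts of its gene (simplified to FSM / ISM / NIC / NNC)."""
    if not query or not others:
        return "other"  # mono-exon or single-transcript gene: outside this sketch
    if any(query == ref for ref in others):
        return "FSM"  # identical junction chain exists elsewhere in the gene
    if any(is_contiguous_subchain(query, ref) for ref in others):
        return "ISM"  # looks like a 5'/3' truncated version of another transcript
    known_donors = {donor for ref in others for donor, _ in ref}
    known_acceptors = {acceptor for ref in others for _, acceptor in ref}
    if all(d in known_donors and a in known_acceptors for d, a in query):
        return "NIC"  # only known sites, but in a new combination
    return "NNC"  # at least one donor or acceptor is not annotated


def classify_gene(gene: Dict[str, List[Junction]]) -> Dict[str, str]:
    """Classify every transcript of a gene against its siblings."""
    return {
        tx_id: potential_category(chain, [c for other, c in gene.items() if other != tx_id])
        for tx_id, chain in gene.items()
    }


# Toy gene with made-up coordinates.
toy_gene = {
    "tx1": [(100, 200), (300, 400), (500, 600)],
    "tx2": [(300, 400), (500, 600)],
    "tx3": [(100, 400), (500, 600)],
    "tx4": [(100, 200), (300, 450)],
}
print(classify_gene(toy_gene))
# {'tx1': 'NIC', 'tx2': 'ISM', 'tx3': 'NIC', 'tx4': 'NNC'}
```

In this toy gene, tx2 would be seen as an ISM of tx1 if it were removed from the annotation, while tx4 would be NNC because its acceptor at 450 appears in no other transcript.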
Information about running the classification step can be found at SQANTI-SIM: classif.
2. Simulation experiment design
Based on this classification, a user-defined number of transcripts classified as ISM, NIC, NNC, or other novel structural categories are removed from the annotation, resulting in a “reduced” GTF file. SQANTI-SIM also assigns transcript expression values using one of three computation modes. The equal mode assigns the same expression value to all simulated transcripts. The custom mode lets the user define different expression values for novel and known transcripts by setting the parameters of two negative binomial distributions. The sample mode uses inverse transform sampling from an empirical raw counts distribution.
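The three expression modes can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than SQANTI-SIM's code: the function names, the (r, p) parametrization of the negative binomial (number of failures before the r-th success, with success probability p), and the example counts are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def equal_mode(n_transcripts: int, count: int) -> List[int]:
    """Equal mode: every simulated transcript gets the same count."""
    return [count] * n_transcripts


def negative_binomial(rng: random.Random, r: int, p: float) -> int:
    """Draw from NB(r, p) by counting failures before the r-th success.
    Assumed parametrization; requires integer r >= 1 and 0 < p <= 1."""
    if r < 1 or not 0.0 < p <= 1.0:
        raise ValueError("need r >= 1 and 0 < p <= 1")
    failures = successes = 0
    while successes < r:
        if rng.random() < p:
            successes += 1
        else:
            failures += 1
    return failures


def custom_mode(n_known: int, n_novel: int,
                known_nb: Tuple[int, float], novel_nb: Tuple[int, float],
                rng: random.Random) -> Tuple[List[int], List[int]]:
    """Custom mode: known and novel transcripts use separate NB distributions."""
    known = [negative_binomial(rng, *known_nb) for _ in range(n_known)]
    novel = [negative_binomial(rng, *novel_nb) for _ in range(n_novel)]
    return known, novel


def sample_mode(n_transcripts: int, empirical_counts: Sequence[int],
                rng: random.Random) -> List[int]:
    """Sample mode: inverse transform sampling from an empirical distribution.
    A uniform draw u is mapped to the order statistic at quantile u."""
    ordered = sorted(empirical_counts)
    draws = []
    for _ in range(n_transcripts):
        u = rng.random()  # uniform in [0, 1)
        index = min(int(u * len(ordered)), len(ordered) - 1)
        draws.append(ordered[index])
    return draws


rng = random.Random(0)  # fixed seed so the sketch is reproducible
equal = equal_mode(4, 10)
known, novel = custom_mode(3, 2, known_nb=(5, 0.1), novel_nb=(2, 0.2), rng=rng)
sampled = sample_mode(5, [0, 1, 1, 3, 8, 21, 55], rng)
print(equal, known, novel, sampled)
```

Using separate distributions for known and novel transcripts makes it possible, for example, to simulate novel isoforms at lower expression than annotated ones.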
Information about running the design step can be found at SQANTI-SIM: design.
3. Data simulation
Long reads are simulated with NanoSim, PBSIM3, or IsoSeqSim operating on the complete GTF annotation file. SQANTI-SIM can also optionally simulate matching short reads and CAGE-peak data, taking parameters from suitable reference datasets. The transcriptome reconstruction algorithm under evaluation then takes the simulated reads and the “reduced” reference annotation generated by SQANTI-SIM as input to predict transcript models. If the algorithm permits, the simulated orthogonal data may also be incorporated into the transcript model prediction.
Information about running the simulation step can be found at SQANTI-SIM: sim.
4. SQANTI-SIM evaluation report
The performance of the method is assessed for each novel structural category using the SQANTI-SIM evaluation function, which identifies true and false novel transcripts based on the simulated ground truth. Simulated transcripts that are missing from the reduced GTF should be detected as novel (true calls), and any reported novel transcript that was not simulated counts as a false call. A full performance evaluation report is provided.
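The true/false call logic described above can be sketched with set operations. This is a simplified illustration, not the eval module itself: it assumes each transcript has already been reduced to a hashable key (for example, its splice-junction chain), the function name is hypothetical, and sensitivity and precision are shown as common summary metrics, not as a statement of what the report contains.

```python
from typing import Dict, Hashable, Set


def evaluate_novel_calls(simulated_novel: Set[Hashable],
                         reported_novel: Set[Hashable]) -> Dict[str, float]:
    """Compare reported novel transcripts with the simulated ground truth."""
    true_pos = simulated_novel & reported_novel   # simulated and recovered
    false_pos = reported_novel - simulated_novel  # reported but never simulated
    false_neg = simulated_novel - reported_novel  # simulated but missed
    sensitivity = len(true_pos) / len(simulated_novel) if simulated_novel else 0.0
    precision = len(true_pos) / len(reported_novel) if reported_novel else 0.0
    return {
        "TP": len(true_pos),
        "FP": len(false_pos),
        "FN": len(false_neg),
        "sensitivity": sensitivity,
        "precision": precision,
    }


# Toy example with made-up transcript keys.
simulated = {"novel_a", "novel_b", "novel_c"}  # removed from the reduced GTF
reported = {"novel_b", "novel_c", "novel_d"}   # called novel by the tool under test
metrics = evaluate_novel_calls(simulated, reported)
print(metrics)
```

Running the same comparison separately on the transcripts of each structural category (ISM, NIC, NNC, and so on) gives the per-category view described above.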
Information about running the evaluation step can be found at SQANTI-SIM: eval.