Overview

jmestret edited this page Sep 13, 2022 · 2 revisions

What is SQANTI-SIM?

SQANTI-SIM is a simulator of controlled novelty and degradation of transcripts sequenced by long reads. It is a wrapper tool for long-read RNA-seq simulators such as IsoSeqSim and NanoSim (formerly Trans-NanoSim) to simulate transcripts based on the SQANTI3 structural categories (publication and code repository).

SQANTI-SIM uses IsoSeqSim and NanoSim to simulate PacBio cDNA reads and Nanopore cDNA and dRNA reads, respectively, and implements a strategy to simulate novel transcripts based on SQANTI structural categories. To simulate reliable novel transcripts, SQANTI-SIM screens the provided reference annotation for real transcripts that, if excluded from the reference but detected in a sequencing experiment, would be labeled as novel or incomplete under the SQANTI3 classification of structural categories. This provides a robust ground truth and makes it possible to use orthogonal data (short-read RNA-seq data, CAGE peak data, poly(A) motifs...) to support the discovery and reconstruction of these simulated "novel" transcripts.

[Figure: SQANTI-SIM workflow]

SQANTI-SIM's strategy is based on the structural categories of SQANTI3. A detailed explanation of the SQANTI3 structural categories can be found in the SQANTI3 wiki. Briefly, degraded transcripts that lack exons at the 5' or 3' end are classified as Incomplete Splice Match (ISM). Novel isoforms, on the other hand, are mainly classified as Novel In Catalog (NIC) or Novel Not in Catalog (NNC). NIC isoforms use a new combination of known donor/acceptor sites, while NNC isoforms have at least one donor or acceptor site that is not annotated. SQANTI-SIM can also simulate the remaining structural categories defined by SQANTI3.

Figure taken from the SQANTI3 wiki

How does it work?

SQANTI-SIM is a wrapper script (sqanti-sim.py) whose modules are run in the following order: (i) the classif module assigns every transcript annotated in the reference GTF its potential SQANTI3 structural category, that is, the category it would receive if it could not match itself; (ii) the design module lets the user specify which known and novel transcripts should be simulated and outputs a reduced GTF file that excludes the "novel" isoforms; (iii) the sim module runs NanoSim or IsoSeqSim to simulate reads based on the complete reference annotation; and (iv) the eval module assesses the accuracy of the transcriptome reconstructed from the long reads by running SQANTI3 with the reduced GTF and an additional file containing the structural annotation of the simulated novel transcripts.
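To make the idea of the reduced GTF from step (ii) concrete, here is a minimal Python sketch. It is not SQANTI-SIM's own code: the function `reduce_gtf`, the toy records and the transcript IDs are hypothetical, and it only assumes the standard nine-column GTF layout with a `transcript_id` attribute.

```python
import re

# Hypothetical helper for illustration; not SQANTI-SIM's own implementation.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')

def reduce_gtf(gtf_lines, novel_ids):
    """Return the GTF lines that do not belong to a transcript in novel_ids."""
    kept = []
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        # Comment lines and records without a transcript_id (e.g. gene
        # records) are passed through unchanged.
        match = TRANSCRIPT_ID.search(fields[8]) if len(fields) >= 9 else None
        if match and match.group(1) in novel_ids:
            continue  # record of a transcript to be simulated as "novel"
        kept.append(line)
    return kept

# Toy annotation: one gene with two transcripts, T1 and T2.
toy_gtf = [
    'chr1\ttoy\tgene\t100\t900\t.\t+\t.\tgene_id "G1";',
    'chr1\ttoy\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\ttoy\texon\t800\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\ttoy\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T2";',
    'chr1\ttoy\texon\t100\t300\t.\t+\t.\tgene_id "G1"; transcript_id "T2";',
    'chr1\ttoy\texon\t800\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T2";',
]

# T2 is selected as "novel", so its records are dropped from the reference.
reduced_gtf = reduce_gtf(toy_gtf, novel_ids={"T2"})
```

In this sketch the gene record is kept even if all of its transcripts are removed; whether that is desirable depends on the downstream tool.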

1. Transcript classification in SQANTI3 structural category

The SQANTI-SIM protocol starts by classifying every transcript in a given transcriptome annotation into its potential SQANTI3 structural category, that is, the category assigned to the query transcript under the assumption that it cannot match itself as a Full Splice Match (FSM). This generates an index file with the structural annotation of each reference transcript and its potential SQANTI3 structural category.
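The leave-one-out idea can be sketched as follows. This is a deliberately simplified illustration, not the SQANTI3 algorithm: each transcript is reduced to its chain of splice junctions as (donor, acceptor) coordinates, all transcripts are assumed to lie in the same gene and strand, only FSM, ISM, NIC and NNC are considered, and mono-exon transcripts are ignored. The names `classify` and `build_index` and the toy coordinates are hypothetical.

```python
def contiguous_subchain(query, ref):
    """True if the junction chain `query` appears as a consecutive run in `ref`."""
    n = len(query)
    return any(ref[i:i + n] == query for i in range(len(ref) - n + 1))

def classify(query, reference):
    """Simplified category of `query` against `reference` (dict id -> junction chain)."""
    refs = list(reference.values())
    if any(query == r for r in refs):
        return "FSM"  # identical junction chain
    if any(len(query) < len(r) and contiguous_subchain(query, r) for r in refs):
        return "ISM"  # consecutive subset of a longer annotated chain
    donors = {d for r in refs for d, _ in r}
    acceptors = {a for r in refs for _, a in r}
    if all(d in donors and a in acceptors for d, a in query):
        return "NIC"  # only known splice sites, in a new combination
    return "NNC"      # at least one unannotated donor or acceptor

def build_index(annotation):
    """Classify each transcript against the annotation with itself left out."""
    index = {}
    for tid, junctions in annotation.items():
        others = {k: v for k, v in annotation.items() if k != tid}
        index[tid] = classify(junctions, others)
    return index

# Toy annotation: junction chains as tuples of (donor, acceptor) positions.
annotation = {
    "T1": ((200, 300), (400, 500), (600, 700)),
    "T2": ((200, 300), (400, 500)),   # lacks the last junction of T1
    "T3": ((200, 300), (400, 700)),   # known sites, new combination
    "T4": ((200, 300), (450, 500)),   # donor 450 is found nowhere else
}
index = build_index(annotation)
# index == {"T1": "NNC", "T2": "ISM", "T3": "NIC", "T4": "NNC"}
```

T1 comes out as NNC because, once T1 itself is excluded, no other transcript annotates the donor at position 600.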

Information about running the classification step can be found at SQANTI-SIM: classif.

2. Simulation experiment design

Next, the simulation experiment is designed. The user selects the structural categories to simulate and the level of novelty, and SQANTI-SIM modifies the annotation in accordance with those selections. The expression levels are then determined using one of several implemented methods: assigning a uniform distribution to the count values, using two custom negative binomial distributions (one for the novel transcripts and the other for the known ones), or replicating the count distribution of a real sequenced sample whose reads are mapped with Minimap2 against the reference transcriptome.
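As an illustration of the second method, the sketch below draws read counts from two negative binomial distributions, one for known and one for novel transcripts. It is a sketch under stated assumptions, not SQANTI-SIM's implementation: the function names and the (r, p) parameter values are hypothetical, and the sampler uses the textbook definition (number of failures before the r-th success) with only the Python standard library.

```python
import random

def negative_binomial(rng, r, p):
    """Draw the number of failures before the r-th success (success prob. p)."""
    failures = 0
    successes = 0
    while successes < r:
        if rng.random() < p:
            successes += 1
        else:
            failures += 1
    return failures

def assign_counts(transcripts, rng, known_params=(5, 0.05), novel_params=(3, 0.1)):
    """Assign a requested read count to each transcript.

    transcripts: dict transcript_id -> True if simulated as novel, else False.
    The default (r, p) pairs are made-up values for illustration only.
    """
    counts = {}
    for tid, is_novel in transcripts.items():
        r, p = novel_params if is_novel else known_params
        counts[tid] = negative_binomial(rng, r, p)
    return counts

rng = random.Random(42)  # fixed seed so the toy run is reproducible
transcripts = {"T1": False, "T2": False, "T3": True, "T4": True}
requested_counts = assign_counts(transcripts, rng)
```

With these made-up parameters the expected count is r(1 - p)/p, i.e. 95 reads for known and 27 reads for novel transcripts, so novel isoforms are on average less expressed than known ones.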

Information about running the design step can be found at SQANTI-SIM: design.

3. Data simulation

This stage uses NanoSim or IsoSeqSim to simulate ONT or PacBio reads, respectively, from the index file with the requested counts and the original GTF reference annotation. To complement the long-read data, you may also simulate Illumina reads using Polyester. All three simulators use the requested_tpm column of the index file to set the expression level of each transcript. The simulated reads should then be used as input for your transcript discovery and reconstruction pipeline, together with the modified GTF reference annotation created in the design step.
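How requested counts relate to a TPM value can be sketched as below. This is one plausible normalization, stated as an assumption rather than as SQANTI-SIM's documented behavior: each long read is taken to represent one transcript molecule, so no length correction is applied and the TPM is simply the share of the total count scaled to one million. The function name `counts_to_tpm` is hypothetical.

```python
def counts_to_tpm(requested_counts):
    """Scale per-transcript counts so that they sum to one million.

    Assumption for this sketch: no transcript-length normalization.
    """
    total = sum(requested_counts.values())
    if total == 0:
        raise ValueError("at least one transcript must have a non-zero count")
    return {tid: count * 1_000_000 / total
            for tid, count in requested_counts.items()}

toy_counts = {"T1": 50, "T2": 30, "T3": 20}
toy_tpm = counts_to_tpm(toy_counts)
# toy_tpm == {"T1": 500000.0, "T2": 300000.0, "T3": 200000.0}
```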

Information about running the simulation step can be found at SQANTI-SIM: sim.

4. SQANTI-SIM evaluation report

This final step takes the reduced GTF reference annotation and the transcript models obtained by your pipeline as input. It runs the SQANTI3 pipeline and produces a report with performance metrics on how successfully your pipeline recovered novel and known transcripts, including sensitivity and precision for each SQANTI3 structural category.
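The per-category metrics can be pictured with the following sketch, which uses the usual definitions sensitivity = TP / (TP + FN) and precision = TP / (TP + FP). The data layout and the function `metrics_by_category` are hypothetical and much simpler than the actual report: every reconstructed model is given as a (category, matched simulated transcript or None) pair, and each simulated transcript is assumed to be matched by at most one model.

```python
from collections import defaultdict

def metrics_by_category(simulated, detected):
    """Compute TP, FP, FN, sensitivity and precision per structural category.

    simulated: dict transcript_id -> category of each simulated transcript.
    detected:  list of (category, matched_transcript_id or None), one per
               reconstructed transcript model.
    """
    true_pos = defaultdict(set)
    false_pos = defaultdict(int)
    for category, matched_id in detected:
        if matched_id is not None and simulated.get(matched_id) == category:
            true_pos[category].add(matched_id)
        else:
            false_pos[category] += 1  # model with no simulated counterpart

    report = {}
    for category in set(simulated.values()) | set(false_pos):
        n_sim = sum(1 for c in simulated.values() if c == category)
        tp = len(true_pos[category])
        fp = false_pos[category]
        report[category] = {
            "TP": tp,
            "FP": fp,
            "FN": n_sim - tp,
            "sensitivity": tp / n_sim if n_sim else None,
            "precision": tp / (tp + fp) if tp + fp else None,
        }
    return report

simulated = {"T1": "FSM", "T2": "FSM", "T3": "NIC", "T4": "NNC"}
detected = [("FSM", "T1"), ("FSM", None), ("NIC", "T3"), ("NNC", None)]
report = metrics_by_category(simulated, detected)
# report["FSM"]: sensitivity 0.5, precision 0.5
# report["NIC"]: sensitivity 1.0, precision 1.0
# report["NNC"]: sensitivity 0.0, precision 0.0
```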

Information about running the evaluation step can be found at SQANTI-SIM: eval.