Huan ZHONG ([email protected])
We introduce TagSeqTools, a flexible, general pipeline for identifying and exploring tagged RNA (i.e. NAD-capped RNA) from NAD tagSeq data. TagSeqTools differentiates tagged from untagged reads and performs quantitative analysis in only two steps. Besides its two major modules, TagSeek and TagSeqQuant, the pipeline includes advanced modules for detecting isoforms, antisense transcripts, pre-mRNA (unspliced transcripts), and other features. The package also automatically generates plots and tables for visualization and further analysis. TagSeqTools thus provides a convenient and comprehensive workflow for researchers studying data produced by NAD tagSeq or similar methods that use Nanopore sequencing.
Ubuntu 18.04.3 LTS, Linux-based operating system (https://ubuntu.com/download)
FastQC > v0.11.4 (https://www.bioinformatics.babraham.ac.uk/projects/download.html#fastqc)
samtools > 1.7 (http://www.htslib.org/download/)
minimap2 > 2.12 (https://github.com/lh3/minimap2)
curl -L https://github.com/lh3/minimap2/releases/download/v2.17/minimap2-2.17_x64-linux.tar.bz2 | tar -jxvf -
Then add the extracted minimap2 directory to the PATH environment variable:
export PATH=$PATH:$DIRECTORY/minimap2-2.17_x64-linux
Python 2.7 and R > 3.2.1 are recommended.
Python modules used by the pipeline: os, sys, re, Bio, SeqIO, regex, argparse. Of these, only Biopython (which provides Bio and Bio.SeqIO) and regex need to be installed (e.g. pip install biopython regex); the others ship with Python's standard library. It is recommended to install the Python modules in a clean environment, for example one created with "virtualenv", so that the required modules do not conflict with software in the user's system environment.
virtualenv tag_env
source tag_env/bin/activate
pip install biopython regex
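After activating the environment, a quick check that the modules can be imported may save a failed run later. The snippet below is only a minimal sketch: the helper name `missing_modules` is hypothetical and not part of TagSeqTools, and the module list is taken from the requirements above.

```python
import importlib

# Third-party and standard-library modules named in the requirements above.
REQUIRED = ("Bio", "regex", "argparse")


def missing_modules(names=REQUIRED):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing

# Example: print(missing_modules()) prints [] when everything is installed.
```

If the returned list is not empty, installing the listed packages with pip inside the virtual environment should resolve it.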
Some R packages, such as "ggplot", "gplots", and "corrplot", are also required; the pipeline installs them automatically.
No further installation is needed. You only need to format the input files and directory according to the requirements below, and then run two scripts on these files.
For visualization: genome fasta file.
For quantification: transcriptome fasta file.
Fastq files produced by Nanopore sequencing are usually split across 2 or 3 folders, including "fastq_fail" and "fastq_pass", whose files contain 4000 reads each. Users may need the following commands to produce a single, final fastq file.
mkdir analysis
cd analysis
cat $DIRECTORY/fastq_fail/*.fastq $DIRECTORY/fastq_pass/*.fastq > all.fastq
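The same merge can be done from Python when a shell is not convenient. This is a minimal sketch, not part of TagSeqTools: the function name `merge_fastq` is hypothetical, and it assumes uncompressed files ending in `.fastq`, as in the `cat` command above.

```python
import glob
import os
import shutil


def merge_fastq(folders, out_path):
    """Concatenate every *.fastq file found in `folders` into `out_path`.

    Returns the number of input files that were merged.
    """
    inputs = []
    for folder in folders:
        # sorted() keeps the merge order reproducible between runs
        inputs.extend(sorted(glob.glob(os.path.join(folder, "*.fastq"))))
    with open(out_path, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                # stream the file in chunks instead of loading it into memory
                shutil.copyfileobj(src, out)
    return len(inputs)

# Example (paths are placeholders):
# merge_fastq(["fastq_fail", "fastq_pass"], "all.fastq")
```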
python TagSeek.py --fastq INPUT_FILE_NAME --tag TAG_SEQUENCE --similarity SIMILARITY_CUTOFF
The files *tag.fastq and *nontag.fastq will be generated for tagged-RNA and non-tagged-RNA reads, respectively.
--fastq (or -f): the prefix of the input fastq file name. For example, if the file is "all.fastq", INPUT_FILE_NAME should be "all".
--tag (or -t): the synthetic tag RNA sequence.
--similarity (or -s): the cutoff for the number of exact, consecutively matched bases between the tag RNA sequence and the first 40 bases of a read.
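To make the `--similarity` criterion concrete, the sketch below computes the longest run of exactly matching consecutive bases between a tag and the first 40 bases of a read. It illustrates the idea only and is not the TagSeek implementation. The function names are hypothetical, and two details are assumptions: that `U` in the tag is compared as `T`, and that a read counts as tagged when the run length is at least the cutoff.

```python
def longest_exact_match(tag, read, window=40):
    """Length of the longest stretch of consecutive bases shared by `tag`
    and the first `window` bases of `read` (longest common substring)."""
    # Assumption: reads are written with T, so U in the tag is compared as T.
    tag = tag.upper().replace("U", "T")
    head = read[:window].upper().replace("U", "T")
    best = 0
    prev = [0] * (len(head) + 1)
    for i in range(1, len(tag) + 1):
        curr = [0] * (len(head) + 1)
        for j in range(1, len(head) + 1):
            if tag[i - 1] == head[j - 1]:
                # extend the match that ended at the previous base of both
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def is_tagged(tag, read, similarity=12, window=40):
    """Assumed rule: tagged if the longest exact match reaches the cutoff."""
    return longest_exact_match(tag, read, window) >= similarity
```

With a cutoff of 12, a read that starts with two or more intact copies of a 6-base repeat would pass, while a read carrying the tag only beyond base 40 would not.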
python TagSeqQuant.py --name INPUT_FILE_NAME --trans TRANSCRIPTOME_REFERENCE --genome GENOME_REFERENCE
--name (or -n): the prefix of the input file names. The tagged and non-tagged fastq files must share the same sample-name prefix; for example, for "demo.tag.fastq" and "demo.nontag.fastq", INPUT_FILE_NAME should be "demo".
--trans (or -tr): the transcriptome reference fasta file, containing the full cDNA sequences of all annotated genes.
--genome (or -g): the genome reference fasta file.
For TagSeek.github.py, 4 seconds for 20,000 fastq reads. For TagSeqQuant.github.py, 43 seconds for 20,000 fastq reads.
The quality control results will be deposited in the "QC_results" directory.
Tag_statistics.txt: Tagging statistics, including the total number of reads and the number of tagged reads.
fastqc.html: FastQC results, including quality scores across all bases, GC content per base, sequence duplication levels and so on.
The mapping statistics will be deposited in the "Mapping_statistics" directory.
NAD_map.html: The mapping statistics of tagged reads to the whole genome, including mapping ratio, duplication, base mapping status, error rate, indel information and so on.
nonNAD_map.html: The mapping statistics of non-tagged reads to the whole genome, including mapping ratio, duplication, base mapping status, error rate, indel information and so on.
The mapping results will be deposited in the "Mapping_results" directory.
NAD.sort.bam: visualization file for NAD-RNA genes/isoforms; it can be opened in IGV together with NAD.sort.bam.bai.
nonNAD.sort.bam: visualization file for non-NAD-RNA genes/isoforms; it can be opened in IGV together with nonNAD.sort.bam.bai.
d) Quantification results of genes and isoforms: "NAD_total_counts.txt" and "NAD_total_isoform_counts.txt".
The quantification results will be deposited in the "Quantification_results" directory.
#NAD_total_counts.txt
Gene NAD.count total.count
AT1G01100 11 13
AT1G03130 3 10
#NAD_total_isoform_counts.txt
Gene NAD.count total.count
AT1G01100.1 3 4
AT1G01100.2 3 4
Gene: Gene/isoform names.
NAD.count: The number of tagged reads mapped to the gene/isoform.
total.count: The number of total reads mapped to the gene/isoform.
Counting_statistics.txt: summary statistics, including the total number of counts, the total number of genes, the total number of NAD counts, and the total number of NAD genes.
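For downstream analysis, the count tables can be loaded with a few lines of Python. The sketch below is an illustration, not part of TagSeqTools. It assumes the tables are whitespace-delimited with the three columns shown above, and that any line starting with `#` is a label. The function names `read_counts` and `nad_fraction` are hypothetical.

```python
def read_counts(path):
    """Parse a count table into {name: (nad_count, total_count)}."""
    counts = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            # skip blank lines, '#' labels, and the header row
            if len(fields) != 3 or fields[0].startswith("#") or fields[0] == "Gene":
                continue
            counts[fields[0]] = (int(fields[1]), int(fields[2]))
    return counts


def nad_fraction(counts):
    """Fraction of reads that are tagged, per gene or isoform."""
    return {
        name: nad / float(total)
        for name, (nad, total) in counts.items()
        if total > 0  # avoid division by zero for entries without reads
    }

# Example (file name as produced by the pipeline):
# fractions = nad_fraction(read_counts("NAD_total_counts.txt"))
```

For the example rows shown above, AT1G01100 would have a tagged fraction of 11/13 and AT1G03130 a fraction of 3/10.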
Download the demo folder, change into it, and run:
tar -zxvf TAIR10.genome.fa.tar.gz ### un-compress reference fasta files
tar -zxvf TAIR10.trans.fa.tar.gz ### un-compress reference fasta files
python TagSeek.py --fastq demo --tag 'CCUGAACCUGAACCUGAACCUGAACCUGAACCUGAACCUGAACCUGAACCUGAACCUGAACCUGAA' --similarity 12
python TagSeqQuant.py --name demo --trans TAIR10.trans.fa --genome TAIR10.genome.fa
The human-friendly tables "NAD_total_counts.txt" and "NAD_total_isoform_counts.txt" and the bam files for visualization will be generated.
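When processing several samples, the two steps can be wrapped in a small driver script. The following is a minimal sketch under the assumption that `TagSeek.py` and `TagSeqQuant.py` are in the working directory and that `python` points to the interpreter set up earlier. The helper names `build_commands` and `run_pipeline` are hypothetical; only the flags documented above are used.

```python
import subprocess


def build_commands(sample, tag, similarity, trans_fa, genome_fa):
    """Return the two pipeline commands as argument lists, in run order."""
    tagseek = ["python", "TagSeek.py", "--fastq", sample,
               "--tag", tag, "--similarity", str(similarity)]
    quant = ["python", "TagSeqQuant.py", "--name", sample,
             "--trans", trans_fa, "--genome", genome_fa]
    return [tagseek, quant]


def run_pipeline(commands):
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        # check_call raises CalledProcessError on a non-zero exit code,
        # so quantification never runs after a failed tagging step
        subprocess.check_call(cmd)

# Example for the demo data:
# run_pipeline(build_commands("demo", "CCUGAA" * 11, 12,
#                             "TAIR10.trans.fa", "TAIR10.genome.fa"))
```

Passing the arguments as lists avoids shell quoting problems with the tag sequence and makes it harder to swap the transcriptome and genome references by accident.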
Step | Description | Software | command | input_files | output_files | demo files |
---|---|---|---|---|---|---|
1 | Quality control | fastqc | fastqc demo.fastq | demo.fastq | demo_fastqc.html, demo_fastqc.zip | demo_fastqc.html |
2 | Differentiate tagged and non-tagged reads | TagSeek | python TagSeek.py --fastq demo --tag 'CCUGAACCUGAACCUGAACCUGAACCUGAACCUGAACCUGAACCUGAACCUGAACCUGAACCUGAA' --similarity 12 | demo.fastq | demo.tag.fastq, demo.nontag.fastq, Tag_statistics.txt | Tag_statistics.txt |
3 | Quantification of genes and isoforms | TagSeqQuant | python TagSeqQuant.py --name demo --trans TAIR10.trans.fa --genome TAIR10.genome.fa | Input sample name, reference files (transcriptome and genome files) | NAD_map.html, nonNAD_map.html, Counting_statistics.txt, NAD_total_counts.txt, NAD_total_isoform_counts.txt, NAD_sort.bam, nonNAD_sort.bam | NAD_map.html, Counting_statistics.txt, NAD_total_counts.txt, NAD_total_isoform_counts.txt |
Zhang, Hailei*, Huan Zhong*, Shoudong Zhang, Xiaojian Shao, Min Ni, Zongwei Cai, Xuemei Chen, and Yiji Xia. 2019. “NAD TagSeq Reveals That NAD+-Capped RNAs Are Mostly Produced from a Large Number of Protein-Coding Genes in Arabidopsis.” Proceedings of the National Academy of Sciences, May, 201903683. https://doi.org/10.1073/pnas.1903683116.
Huan Zhong, Zongwei Cai, Zhu Yang, Yiji Xia. 2020. "TagSeqTools: a flexible and comprehensive analysis pipeline for NAD tagSeq data" bioRxiv 2020.03.09.982934; doi: https://doi.org/10.1101/2020.03.09.982934
Only demo for Nature protocol released.