Skip to content

RNA Seq analysis tutorial

CandiceChuDVM edited this page Aug 5, 2016 · 2 revisions

Howdy, welcome to the RNA-seq analysis tutorial.
I will guide you through the process of RNA-seq analysis.
RNA-seq analysis flow chart

(1) Log in the super computer

For mac user, open "terminal" then type "ssh [email protected]".
For Windows user, please download PuTTY (http://www.putty.nl/download.html). Put "[email protected]" in the "host name" and choose "ssh" as your "conneciton type" then hit "open".
Once you enter the terminal, please go to your scratch folder by typing cd ../../scratch/user/yourNetID
Now you can either create a new folder by typeing mkdir TheFolderNameYouWant.
For basic Linux/UNIX operation, please refer to the handout provided by TAMU HPCR: https://sc.tamu.edu/register/classlist.php

(2) Download .fastq files and genome files

Download the reads (.fastq)

Enter the folder that you want to store fastq files
Download your data from the sequencing facility page, go all the way down to Retrieve data to copy the actual link. It would be somting like wget https://download.txgen.tamu.edu/*.tar

Unzip the reads

tar -xvf "read filename"
The download link has been provided in the email.

Download the annotation (.gtf) and genome (.fa) from ensembl

Google "ensembl dog genome"
FTP download

annotation: wget ftp://ftp.ensembl.org/pub/current_gtf/canis_familiaris/
genome: wget ftp://ftp.ensembl.org/pub/current_fasta/canis_familiaris/dna/Canis_familiaris.CanFam3.1.dna.toplevel.fa.gz

Rename genome and annotation files

genome: canFam3.fa
annotation: canFam.gtf

(2) Mapping/Alignment

module spider name of the software

module load version of the software e.g. module load hisat2/2.0.4-intel-2015B

Hisat2-2.0.3-beta

Official website: https://ccb.jhu.edu/software/hisat2/index.shtml
$ mkdir hisat2

Create a job file to build index with ensembl fasta file

vim com.hisat2.job

Editing the job file

The followings are the content of your job file:
#####################################################################
BSUB -J com.hisat2.job
BSUB -L /bin/bash
BSUB -W 24:00
BSUB -o com.hisat2.%J
BSUB -n 20
BSUB -M 2700
BSUB -R "rusage[mem=2700]"
BSUB -R "span[ptile=20]]"
BSUB -u "your email" BSUB -B -N
DATADIR=/scratch/user/YourNetID/fastq
WORKDIR=/scratch/user/YourNetID/hisat2
HST2IDX=/scratch/user/YourNetID/hisat2_index/canFam3
ANNOT=/scratch/user/YourNetID/hisat2_index/canFam3.gtf
GENOME=/scratch/user/YourNetID/hisat2_index/canFam3.fa
hisat2-build -p 20 $GENOME /scratch/user/YourNetID/hisat2_index/canFam3
$ hisat2-build -p 20 $GENOME /scratch/user/candicechu/dog/hisat2_index/canFam3

Create splicesites.txt

  $ python /scratch/user/candicechu/dog/hisat2-2.0.3-beta/hisat2_extract_splice_sites.py $ANNOT > /scratch/user/candicechu/dog/hisat2/splicesites.txt   

Mapping with hisat2

  $ mkdir -p $WORKDIR/t1.rapid.czar  
  $ cd $WORKDIR/t1.rapid.czar  
  $ hisat2 -p 20 \  
   -q \  
   --dta \  
   --known-splicesite-infile $WORKDIR/splicesites.txt \  
   -x $HST2IDX \  
   -1 $DATADIR/czar1_I4_TGACCA_R1_combined.fastq.gz \  
   -2 $DATADIR/czar1_I4_TGACCA_R2_combined.fastq.gz \  
   -S align.sam  

(5) Samtools-1.2

The SAM output of each HISAT2 run must converted to BAM and sorted using samtools
$ samtools view -S -u $HSDIR/t1.rapid.czar/align.sam -o $HSDIR/t1.rapid.czar/align.bam
$ samtools sort -n -@ 20 -m 2G $HSDIR/t1.rapid.czar/align.bam -o $HSDIR/t1.rapid.czar/aligns.namesorted.bam

(4) Get raw read counts

HTseq-0.6.1

Offical website: http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html
$ htseq-count -f bam -i gene_id -s no -t exon $HSDIR/t1.rapid.santana/aligns.namesorted.bam $ANNOT > $HTDIR/t1.rapid.santana.count.txt
Download the HTseq output files into your computer by typing:
$ scp -r [email protected]:(PATH TO YOUR FILE IN TERMINAL) (PATH TO YOUR LOCAL FOLDER IN YOUR COMPUTER)
For example, it's mine:
$ scp -r [email protected]:/scratch/user/candicechu/dog/stringtie2/ /Users/CandiceChu/Desktop/R

(5) Obtain the count table

You would like to have a table that the columns denote every samples you have (e.g. Control_Dog1@T1, Control_Dog2@T2, etc.) and the raws denote every genes and their read counts.
This can be done in the terminal or in Excel.

(6) Do the diffirential analysis in R by using DESeq2

Offical website: https://bioconductor.org/packages/release/bioc/html/DESeq2.html
Differential analysis of count data – the DESeq2 package: https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf
Install DESeq2 in R.
Load the DESeq2 library in R.
The window should show the following messages:
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel
Use read.table function to import your count table.
Use data.frame function to make your sampleTable.
Use DESeqDataSetFromMatrix function to make DESeq data.
class: DESeqDataSet dim: 24580 24 metadata(0):
assays(1): counts
rownames(24580): ENSCAFG00000000001 ENSCAFG00000000002 ...
ENSCAFG00000032766 ENSCAFG00000032767
rowRanges metadata column names(0):
colnames(24): t1.control_1 t1.control_2 ... t3.slow_2
t3.slow_3
colData names(2): condition timepoints
Impletement DESeq function.
When you inspect the result from DESeq function, you should see the following information:
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
Use results function to generate the result.
The results should look like this:
log2 fold change (MAP): condition rapid vs slow
Wald test p-value: condition rapid vs slow
DataFrame with 24580 rows and 6 columns
baseMean log2FoldChange lfcSE stat

ENSCAFG00000000001 634.0943577 -0.3745988 0.2051203 -1.8262394
ENSCAFG00000000002 0.7144562 -0.4465013 0.4072834 -1.0962914
ENSCAFG00000000005 16.6377634 0.2360801 0.2421617 0.9748863
ENSCAFG00000000006 3.0631424 0.2209541 0.3512914 0.6289766
ENSCAFG00000000007 496.3284186 -0.1516945 0.1195441 -1.2689416
... ... ... ... ...
ENSCAFG00000032763 224.7980930 -0.11361131 0.3225751 -0.35220107
ENSCAFG00000032764 0.0000000 NA NA NA
ENSCAFG00000032765 5.5146917 -0.01163694 0.2937528 -0.03961474
ENSCAFG00000032766 4.3084496 -0.29644356 0.3992483 -0.74250421
ENSCAFG00000032767 0.1964307 -0.06024762 0.2006918 -0.30019971
pvalue padj

ENSCAFG00000000001 0.06781422 0.7757870
ENSCAFG00000000002 0.27295129 0.8231375
ENSCAFG00000000005 0.32961668 0.8502714
ENSCAFG00000000006 0.52936436 0.9198695
ENSCAFG00000000007 0.20446190 0.7939822
... ... ...
ENSCAFG00000032763 0.7246875 0.9476441
ENSCAFG00000032764 NA NA
ENSCAFG00000032765 0.9684003 1.0000000
ENSCAFG00000032766 0.4577819 0.8918822
ENSCAFG00000032767 0.7640248 0.9555973