-
Notifications
You must be signed in to change notification settings - Fork 23
RNA Seq analysis tutorial
Howdy, welcome to the RNA-seq analysis tutorial.
I will guide you through the process of RNA-seq analysis.
For mac user, open "terminal" then type "ssh [email protected]".
For Windows user, please download PuTTY (http://www.putty.nl/download.html). Put "[email protected]" in the "host name" and choose "ssh" as your "conneciton type" then hit "open".
Once you enter the terminal, please go to your scratch folder by typing cd ../../scratch/user/yourNetID
Now you can either create a new folder by typeing mkdir TheFolderNameYouWant
.
For basic Linux/UNIX operation, please refer to the handout provided by TAMU HPCR: https://sc.tamu.edu/register/classlist.php
Enter the folder that you want to store fastq files
Download your data from the sequencing facility page, go all the way down to Retrieve data to copy the actual link. It would be somting like wget https://download.txgen.tamu.edu/*.tar
tar -xvf "read filename"
The download link has been provided in the email.
Google "ensembl dog genome"
FTP download
annotation: wget ftp://ftp.ensembl.org/pub/current_gtf/canis_familiaris/
genome: wget ftp://ftp.ensembl.org/pub/current_fasta/canis_familiaris/dna/Canis_familiaris.CanFam3.1.dna.toplevel.fa.gz
genome: canFam3.fa
annotation: canFam.gtf
Official website: https://ccb.jhu.edu/software/hisat2/index.shtml
$ mkdir hisat2
vim com.hisat2.job
The followings are the content of your job file:
#####################################################################
BSUB -J com.hisat2.job
BSUB -L /bin/bash
BSUB -W 24:00
BSUB -o com.hisat2.%J
BSUB -n 20
BSUB -M 2700
BSUB -R "rusage[mem=2700]"
BSUB -R "span[ptile=20]]"
BSUB -u "your email"
BSUB -B -N
DATADIR=/scratch/user/YourNetID/fastq
WORKDIR=/scratch/user/YourNetID/hisat2
HST2IDX=/scratch/user/YourNetID/hisat2_index/canFam3
ANNOT=/scratch/user/YourNetID/hisat2_index/canFam3.gtf
GENOME=/scratch/user/YourNetID/hisat2_index/canFam3.fa
hisat2-build -p 20 $GENOME /scratch/user/YourNetID/hisat2_index/canFam3
$ hisat2-build -p 20 $GENOME /scratch/user/candicechu/dog/hisat2_index/canFam3
$ python /scratch/user/candicechu/dog/hisat2-2.0.3-beta/hisat2_extract_splice_sites.py $ANNOT > /scratch/user/candicechu/dog/hisat2/splicesites.txt
$ mkdir -p $WORKDIR/t1.rapid.czar
$ cd $WORKDIR/t1.rapid.czar
$ hisat2 -p 20 \
-q \
--dta \
--known-splicesite-infile $WORKDIR/splicesites.txt \
-x $HST2IDX \
-1 $DATADIR/czar1_I4_TGACCA_R1_combined.fastq.gz \
-2 $DATADIR/czar1_I4_TGACCA_R2_combined.fastq.gz \
-S align.sam
The SAM output of each HISAT2 run must converted to BAM and sorted using samtools
$ samtools view -S -u $HSDIR/t1.rapid.czar/align.sam -o $HSDIR/t1.rapid.czar/align.bam
$ samtools sort -n -@ 20 -m 2G $HSDIR/t1.rapid.czar/align.bam -o $HSDIR/t1.rapid.czar/aligns.namesorted.bam
Offical website: http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html
$ htseq-count -f bam -i gene_id -s no -t exon $HSDIR/t1.rapid.santana/aligns.namesorted.bam $ANNOT > $HTDIR/t1.rapid.santana.count.txt
Download the HTseq output files into your computer by typing:
$ scp -r [email protected]:(PATH TO YOUR FILE IN TERMINAL) (PATH TO YOUR LOCAL FOLDER IN YOUR COMPUTER)
For example, it's mine:
$ scp -r [email protected]:/scratch/user/candicechu/dog/stringtie2/ /Users/CandiceChu/Desktop/R
You would like to have a table that the columns denote every samples you have (e.g. Control_Dog1@T1, Control_Dog2@T2, etc.) and the raws denote every genes and their read counts.
This can be done in the terminal or in Excel.
Offical website: https://bioconductor.org/packages/release/bioc/html/DESeq2.html
Differential analysis of count data – the DESeq2 package: https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf
Install DESeq2 in R.
Load the DESeq2 library in R.
The window should show the following messages:
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel
Use read.table
function to import your count table.
Use data.frame
function to make your sampleTable
.
Use DESeqDataSetFromMatrix
function to make DESeq
data.
class: DESeqDataSet
dim: 24580 24
metadata(0):
assays(1): counts
rownames(24580): ENSCAFG00000000001 ENSCAFG00000000002 ...
ENSCAFG00000032766 ENSCAFG00000032767
rowRanges metadata column names(0):
colnames(24): t1.control_1 t1.control_2 ... t3.slow_2
t3.slow_3
colData names(2): condition timepoints
Impletement DESeq
function.
When you inspect the result from DESeq
function, you should see the following information:
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
Use results
function to generate the result.
The results should look like this:
log2 fold change (MAP): condition rapid vs slow
Wald test p-value: condition rapid vs slow
DataFrame with 24580 rows and 6 columns
baseMean log2FoldChange lfcSE stat
ENSCAFG00000000001 634.0943577 -0.3745988 0.2051203 -1.8262394
ENSCAFG00000000002 0.7144562 -0.4465013 0.4072834 -1.0962914
ENSCAFG00000000005 16.6377634 0.2360801 0.2421617 0.9748863
ENSCAFG00000000006 3.0631424 0.2209541 0.3512914 0.6289766
ENSCAFG00000000007 496.3284186 -0.1516945 0.1195441 -1.2689416
... ... ... ... ...
ENSCAFG00000032763 224.7980930 -0.11361131 0.3225751 -0.35220107
ENSCAFG00000032764 0.0000000 NA NA NA
ENSCAFG00000032765 5.5146917 -0.01163694 0.2937528 -0.03961474
ENSCAFG00000032766 4.3084496 -0.29644356 0.3992483 -0.74250421
ENSCAFG00000032767 0.1964307 -0.06024762 0.2006918 -0.30019971
pvalue padj
ENSCAFG00000000001 0.06781422 0.7757870
ENSCAFG00000000002 0.27295129 0.8231375
ENSCAFG00000000005 0.32961668 0.8502714
ENSCAFG00000000006 0.52936436 0.9198695
ENSCAFG00000000007 0.20446190 0.7939822
... ... ...
ENSCAFG00000032763 0.7246875 0.9476441
ENSCAFG00000032764 NA NA
ENSCAFG00000032765 0.9684003 1.0000000
ENSCAFG00000032766 0.4577819 0.8918822
ENSCAFG00000032767 0.7640248 0.9555973