Skip to content

Nuwah12/Demo-RNAseq-Pipeline

Repository files navigation

RNA-seq Demonstration Pipeline

Bash scripts describing a simple RNA-seq pipeline, from adapter trimming to visualization on a genome browser.

Here is a useful article on common filetypes used in bioinformatics

Workflow steps:

  1. Adapter trimming and fastq Quality Control (Trim Galore!)
    INPUT: raw .fastq.gz files
    OUTPUT: trimmed .fq.gz files, FastQC results in .html
  2. Alignment (STAR)
    INPUT: trimmed .fq.gz files, STAR index directory
    OUTPUT: raw .bam files
  3. Quality Control (samtools, Picard, bamtools)
    INPUT: raw .bam files
    OUTPUT: filtered .bam files
  4. Quantification (Subread, awk)
    INPUT: filtered .bam files, .gtf file (Gene Transfer Format)
    OUTPUT: by-gene read counts in a simple text .tsv
  5. Visualization (bedtools, bedGraphToBigWig)
    INPUT: filtered .bam files, Chromosome Sizes file in .bed format
    OUTPUT: .bw file that can be uploaded to a genome browser for visalization of read coverage.

Detailed methods

1. Adapter trimming is performed with Trim Galore! package.
      -q 15 - Trim base calls that are below this threshold in addition to adapter sequences.
      --stringency 5 - Minimum overlap of adapter sequence to be trimmed.
      -e 0.1 - Max. allowed error rate (no. of errors divided by the length of the matching region)
      --length 20 - Discard reads that become smaller than this after trimming
      --illumina - Adapter sequence to be trimmed is the first 13bp of the Illumina universal adapter AGATCGGAAGAGC
2. Alignment is performed with STAR aligner. This takes reads from our trimmed .fastq files and assigns them to regions on the genome.
      --genomeDir - Directory which has the STAR index files, is generated by the STAR command genomeGenerate.
      --outFilterType,--outFilterIntronMotifs, --alignIntronMax, --outSAMstrandField, --outSAMunmapped, --chimSegmentMin, --chimJunctionOverhangMin - Flags that control to what degree the output is filtered.
3. Quality control of BAM file. It is important to document these steps, as they can affect reproducibility in the results, and there are many options to choose from.
      a. Filter out any non-chromosomal sequences (contigs) and sort BAM
      b. Mark duplicate reads (set FLAG) via Picard MarkDuplicates and sort
      c. Remove the marked duplicates, as well as PCR artifacts, and reads failing vendor quality checks (-F 1540) and sort
      d. Index the final BAM
4. Read counting (quantification) counts reads over gene regions, given a BAM file. We will use these to perform downstream statistical analyses.       

featureCounts \
        -p \ # Paired-end reads
        -t exon \ # Feature types to consider in GTF file
        -g gene_id \ # Attribute to group features (e.g. exons) into meta-features (e.g. genes)
        -s 1 \ # Input data is strand-specific
        -O \ # Count reads that overlap more than one meta-feature more than once.
        -T <# of CPUs> \
        -a <GTF File> \ # Path to GTF file
        -o <outfile> \
        <BAM File>

5. We construct a bigWig file, which is a binary file format for viewing the data on a genome browser track.

##### Normalization #####
READS=$(samtools view -c <bam>) # Calculate number of reads in the bam file 
FACTOR=$(echo "scale=10; 1000000/${READS}" | bc -l) # Calculate a normalization factor 
#########################

bamToBed -i <bam> -bed12 | bed12ToBed6 -i stdin | genomeCoverageBed -bg -i stdin -g <ChromosomeSizes> -scale ${FACTOR} | sort -k1,1 -k2,2n > <temporary output> # Long pipe of commands to convert bam file to sorted text .bed file
bedGraphToBigWig ${outdir}/${prefix}.tmp.bg ${chromsizes} ${outdir}/${prefix}.bw # Convert the newly created bed file into a bigwig file

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages