Bash scripts describing a simple RNA-seq pipeline, from adapter trimming to visualization on a genome browser.
Here is a useful article on common filetypes used in bioinformatics
- Adapter trimming and fastq Quality Control (Trim Galore!)
INPUT: raw.fastq.gz
files
OUTPUT: trimmed.fq.gz
files, FastQC results in.html
- Alignment (STAR)
INPUT: trimmed.fq.gz
files, STAR index directory
OUTPUT: raw.bam
files - Quality Control (samtools, Picard, bamtools)
INPUT: raw.bam
files
OUTPUT: filtered.bam
files - Quantification (Subread, awk)
INPUT: filtered.bam
files,.gtf
file (Gene Transfer Format)
OUTPUT: by-gene read counts in a simple text.tsv
- Visualization (bedtools, bedGraphToBigWig)
INPUT: filtered.bam
files, Chromosome Sizes file in.bed
format
OUTPUT:.bw
file that can be uploaded to a genome browser for visalization of read coverage.
1. Adapter trimming is performed with Trim Galore! package.
-q 15
- Trim base calls that are below this threshold in addition to adapter sequences.
--stringency 5
- Minimum overlap of adapter sequence to be trimmed.
-e 0.1
- Max. allowed error rate (no. of errors divided by the length of the matching region)
--length 20
- Discard reads that become smaller than this after trimming
--illumina
- Adapter sequence to be trimmed is the first 13bp of the Illumina universal adapter AGATCGGAAGAGC
2. Alignment is performed with STAR aligner. This takes reads from our trimmed .fastq
files and assigns them to regions on the genome.
--genomeDir
- Directory which has the STAR index files, is generated by the STAR command genomeGenerate
.
--outFilterType
,--outFilterIntronMotifs
, --alignIntronMax
, --outSAMstrandField
, --outSAMunmapped
, --chimSegmentMin
, --chimJunctionOverhangMin
- Flags that control to what degree the output is filtered.
3. Quality control of BAM file. It is important to document these steps, as they can affect reproducibility in the results, and there are many options to choose from.
a. Filter out any non-chromosomal sequences (contigs) and sort BAM
b. Mark duplicate reads (set FLAG) via Picard MarkDuplicates and sort
c. Remove the marked duplicates, as well as PCR artifacts, and reads failing vendor quality checks (-F 1540
) and sort
d. Index the final BAM
4. Read counting (quantification) counts reads over gene regions, given a BAM file. We will use these to perform downstream statistical analyses.
featureCounts \
-p \ # Paired-end reads
-t exon \ # Feature types to consider in GTF file
-g gene_id \ # Attribute to group features (e.g. exons) into meta-features (e.g. genes)
-s 1 \ # Input data is strand-specific
-O \ # Count reads that overlap more than one meta-feature more than once.
-T <# of CPUs> \
-a <GTF File> \ # Path to GTF file
-o <outfile> \
<BAM File>
5. We construct a bigWig
file, which is a binary file format for viewing the data on a genome browser track.
##### Normalization #####
READS=$(samtools view -c <bam>) # Calculate number of reads in the bam file
FACTOR=$(echo "scale=10; 1000000/${READS}" | bc -l) # Calculate a normalization factor
#########################
bamToBed -i <bam> -bed12 | bed12ToBed6 -i stdin | genomeCoverageBed -bg -i stdin -g <ChromosomeSizes> -scale ${FACTOR} | sort -k1,1 -k2,2n > <temporary output> # Long pipe of commands to convert bam file to sorted text .bed file
bedGraphToBigWig ${outdir}/${prefix}.tmp.bg ${chromsizes} ${outdir}/${prefix}.bw # Convert the newly created bed file into a bigwig file