This document describes the output produced by the pipeline.
The pipeline is built using Nextflow and processes data using the following steps:
- FastQC - read quality control
- TrimGalore - adapter trimming
- BWA - alignment
- SAMtools - alignment result processing
- Bedtools - bam to bed file conversion
- Picard - duplicate reads removal
- Phantompeakqualtools - normalized strand cross-correlation (NSC) and relative strand cross-correlation (RSC)
- deepTools - fingerprint, correlation plots of reads over genome-wide bins; distribution of reads around genes
- MACS - peak calling
- MultiQC - aggregate report, describing results of the whole pipeline
FastQC gives general quality metrics about your reads. It provides information about the quality score distribution across your reads and the per base sequence content (%T/A/G/C). You also get information about adapter contamination and other overrepresented sequences.
For further reading and documentation see the FastQC help.
NB: The FastQC plots displayed in the MultiQC report show untrimmed reads. They may contain adapter sequence and potentially regions with low quality. To see how your reads look after trimming, look at the FastQC reports in the trim_galore directory.
Output directory: results/fastqc
sample_fastqc.html
- FastQC report, containing quality metrics for your untrimmed raw fastq files
sample_fastqc.zip
- zip file containing the FastQC report, tab-delimited data file and plot images
TrimGalore is used for removal of adapter contamination and trimming of low quality regions. TrimGalore uses Cutadapt for adapter trimming and runs FastQC after it finishes.
MultiQC reports the percentage of bases removed by TrimGalore in the General Statistics table, along with a line plot showing where reads were trimmed.
Output directory: results/trimgalore
Contains FastQ files with quality and adapter trimmed reads for each sample, along with a log file describing the trimming.
sample_val_1.fq.gz, sample_val_2.fq.gz
- Trimmed FastQ data, reads 1 and 2.
sample_val_1.fastq.gz_trimming_report.txt
- Trimming report (describes which parameters were used)
sample_val_1_fastqc.html
sample_val_1_fastqc.zip
- FastQC report for trimmed reads
Single-end data will have slightly different file names and only one FastQ file per sample:
sample_trimmed.fq.gz
- Trimmed FastQ data
sample.fastq.gz_trimming_report.txt
- Trimming report (describes which parameters were used)
sample_trimmed_fastqc.html
sample_trimmed_fastqc.zip
- FastQC report for trimmed reads
BWA, or Burrows-Wheeler Aligner, is designed for mapping low-divergent sequence reads against reference genomes. The resulting alignment files are further processed with SAMtools and Bedtools.
Output directory: results/bwa
sample.sorted.bam
- The sorted aligned BAM file
sample.sorted.bam.bai
- The index file for aligned BAM file
sample.sorted.bed
- The sorted aligned BED file
The MarkDuplicates module in the Picard toolkit distinguishes primary from duplicate reads using an algorithm that ranks reads by the sums of their base-quality scores. This helps to identify duplicates that arise during sample preparation, e.g. library construction using PCR.
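The ranking idea can be illustrated with a deliberately simplified sketch. It is not Picard's implementation: the read representation (dicts with `name`, `chrom`, `pos`, `strand`, `quals`) and the function name `mark_duplicates` are assumptions made for this example, and real duplicate marking also accounts for details such as read pairs and clipping.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    Reads that share chromosome, position and strand are treated as one
    duplicate set (a simplification). Within each set the read with the
    highest sum of base qualities is kept as the primary read.
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"])
        groups[key].append(read)

    duplicates = set()
    for group in groups.values():
        # Rank by summed base quality, best first; ties keep input order.
        ranked = sorted(group, key=lambda r: sum(r["quals"]), reverse=True)
        for read in ranked[1:]:
            duplicates.add(read["name"])
    return duplicates


# Made-up reads: r1 and r2 start at the same position on the same strand.
reads = [
    {"name": "r1", "chrom": "chr1", "pos": 100, "strand": "+", "quals": [30, 30, 30]},
    {"name": "r2", "chrom": "chr1", "pos": 100, "strand": "+", "quals": [38, 38, 38]},
    {"name": "r3", "chrom": "chr1", "pos": 250, "strand": "-", "quals": [20, 20, 20]},
]
print(mark_duplicates(reads))  # {'r1'}
```

In this toy input r2 outranks r1 because its summed base quality is higher, so r1 is the read flagged as a duplicate.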
The Picard section of the MultiQC report shows a bar plot with the numbers and proportions of primary reads, duplicate reads and unmapped reads.
Output directory: results/picard
sample.dedup.sorted.bam
- The sorted aligned BAM file after duplicate removal
sample.dedup.sorted.bam.bai
- The index file for aligned BAM file after duplicate removal
sample.dedup.sorted.bed
- The sorted aligned BED file after duplicate removal
sample.picardDupMetrics.txt
- The log report for duplicate removal
Phantompeakqualtools plots the strand cross-correlation of aligned reads for each sample. In a strand cross-correlation plot, reads are shifted in the direction of the strand they map to by an increasing number of base pairs, and the Pearson correlation between the per-position read count vectors for each strand is calculated at every shift. Two cross-correlation peaks are usually observed in a ChIP experiment, one corresponding to the read length ("phantom" peak) and one to the average fragment length of the library. The absolute and relative heights of the two peaks are useful indicators of the success of a ChIP-seq experiment. A high-quality IP is characterized by a ChIP peak that is much higher than the "phantom" peak, while in failed experiments the ChIP peak is often very small or absent.
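The shift-and-correlate procedure can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the phantompeakqualtools code: the function names, the toy count vectors and the `max_shift` value are invented for the example, and the toy data only models a fragment-length peak, not a "phantom" peak.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for a constant vector; report 0 in this sketch
    return cov / sqrt(var_x * var_y)


def strand_cross_correlation(forward, reverse, max_shift):
    """Map each shift (in bp) to the correlation between the two strands.

    forward and reverse are per-position read counts of equal length.
    For a given shift, forward position i is compared with reverse
    position i + shift.
    """
    n = len(forward)
    profile = {}
    for shift in range(max_shift + 1):
        profile[shift] = pearson(forward[: n - shift], reverse[shift:])
    return profile


# Toy data: reverse-strand signal sits 4 bp downstream of forward-strand signal.
forward = [0] * 20
reverse = [0] * 20
forward[2], forward[10] = 5, 3
reverse[6], reverse[14] = 5, 3

profile = strand_cross_correlation(forward, reverse, max_shift=8)
best_shift = max(profile, key=profile.get)
print(best_shift)  # 4
```

The shift with the highest correlation recovers the 4 bp offset built into the toy data, which plays the role of the fragment-length peak.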
Source: Landt SG et al, Genome Research (2012)
The normalized strand cross-correlation coefficient (NSC) is the ratio between the fragment-length cross-correlation peak and the background (minimum) cross-correlation. NSC values range from a minimum of 1 to larger positive numbers. 1.1 is the critical threshold. Datasets with NSC values much lower than 1.1 (< 1.05) tend to have low signal to noise or few peaks (this could be biological, e.g. a factor that truly binds only a few sites in a particular tissue type, or it could be due to poor quality). ENCODE cutoff: NSC > 1.05.
The relative strand cross-correlation coefficient (RSC) is the ratio between the fragment-length peak and the read-length ("phantom") peak, each measured above the background cross-correlation. RSC values range from 0 to larger positive values. 1 is the critical threshold. Datasets with RSC values significantly lower than 1 (< 0.8) tend to have low signal to noise. Low scores can be due to a failed or poor-quality ChIP, low read sequence quality and hence many mismappings, shallow sequencing depth (significantly below saturation) or a combination of these. As with NSC, datasets with few binding sites (< 200), which can be biologically justifiable, also show low RSC scores. ENCODE cutoff: RSC > 0.8.
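As a worked example, the two scores and the ENCODE cutoffs quoted above can be computed from three cross-correlation values. The sketch assumes the background-subtracted form of RSC; the function names and the input values are made up for illustration and are not read from a real sample.spp.out file.

```python
def nsc_rsc(cc_fragment, cc_phantom, cc_min):
    """Compute (NSC, RSC) from three cross-correlation values.

    cc_fragment: correlation at the fragment-length peak
    cc_phantom:  correlation at the read-length ("phantom") peak
    cc_min:      minimum (background) correlation
    """
    if cc_min <= 0 or cc_phantom == cc_min:
        raise ValueError("background must be positive and differ from the phantom peak")
    nsc = cc_fragment / cc_min
    rsc = (cc_fragment - cc_min) / (cc_phantom - cc_min)
    return nsc, rsc


def passes_encode_cutoffs(nsc, rsc):
    """Apply the cutoffs quoted in the text: NSC > 1.05 and RSC > 0.8."""
    return nsc > 1.05 and rsc > 0.8


# Made-up values for a sample with a clear fragment-length peak.
nsc, rsc = nsc_rsc(cc_fragment=0.24, cc_phantom=0.22, cc_min=0.20)
print(round(nsc, 2), round(rsc, 2), passes_encode_cutoffs(nsc, rsc))
```

Here the fragment-length peak rises twice as far above the background as the phantom peak does, so both scores clear their cutoffs.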
Output directory: results/phantompeakqualtools
sample.dedup.sorted.pdf
- The strand shift cross-correlation plot of aligned reads after duplicate removal
sample.spp.out
- Normalized strand cross-correlation (NSC) and relative strand cross-correlation (RSC) results
sample.spp.csv
- Raw data for creating the strand shift cross-correlation plot
deepTools visualizes the distribution of fragment sizes for paired-end datasets, the fingerprint of the sequence read distribution, the distribution of sequence reads around genes in the reference genome, and the pair-wise correlation and PCA clustering of samples based on genome-wide read counts.
In a fingerprint plot, a completely random distribution of reads along the genome (i.e. without enrichments in open chromatin etc.) should generate a straight diagonal line. A very specific and strong regional enrichment will be indicated by a prominent and steep rise of the cumulative sum towards the highest rank. This means that a big chunk of reads from the sample is located in few bins which corresponds to high, narrow enrichments which are typically seen for transcription factors.
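The shape of a fingerprint curve can be reproduced with a short sketch: sort the per-bin read counts, accumulate them, and express both axes as fractions. The function name and the toy bin counts are assumptions for illustration; deepTools itself samples bins from BAM files.

```python
def fingerprint_curve(bin_counts):
    """Return (fraction of bins, cumulative fraction of reads) points.

    Bins are ranked from lowest to highest read count, as in a
    fingerprint plot.
    """
    ordered = sorted(bin_counts)
    total = sum(ordered)
    if total == 0:
        raise ValueError("no reads in any bin")
    n_bins = len(ordered)
    points = []
    running = 0
    for rank, count in enumerate(ordered, start=1):
        running += count
        points.append((rank / n_bins, running / total))
    return points


# Evenly spread reads follow the diagonal.
print(fingerprint_curve([10, 10, 10, 10]))
# [(0.25, 0.25), (0.5, 0.5), (0.75, 0.75), (1.0, 1.0)]

# All reads in one bin: flat curve with a steep rise at the highest rank.
print(fingerprint_curve([0, 0, 0, 40]))
# [(0.25, 0.0), (0.5, 0.0), (0.75, 0.0), (1.0, 1.0)]
```

Real samples fall between these two extremes; the further the curve bends below the diagonal, the more the reads are concentrated in a few bins.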
The read distribution profile shows the distribution of sequence reads around genes plus upstream and downstream flanking regions in the reference genome. All genes have been normalized to the same length.
In the correlation heatmap, the Spearman correlation coefficient in each square is calculated from the read counts in genomic bins for the sample pair labelled at the top and the right. A higher correlation coefficient indicates higher similarity between the sample pair.
In the PCA plot, the upper panel shows the clustering of samples based on the top 2 principal components of the genome-wide distribution of sequence reads. The lower panel shows the contribution weights of the principal components.
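For reference, the value in one heatmap square can be sketched as a Pearson correlation computed on ranks. The helper names and the per-bin counts below are invented for this example; deepTools computes the same kind of statistic over genome-wide bins.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in ry))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


# Made-up read counts in five genomic bins for three samples.
sample_a = [5, 20, 8, 50, 1]
sample_b = [9, 35, 12, 90, 2]   # same bin ordering as sample_a
sample_c = [40, 10, 30, 5, 60]  # reversed bin ordering

print(spearman(sample_a, sample_b))
print(spearman(sample_a, sample_c))
```

Because only the ordering of bins matters, sample_a and sample_b correlate perfectly despite their different sequencing depths, while sample_c is perfectly anti-correlated with sample_a.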
Output directory: results/deepTools
fragment_size_histogram.pdf
- Histogram of fragment sizes for paired-end reads
fingerprint.pdf
- Fingerprint plot for sequence reads distribution
read_distribution_profile.pdf
- Distribution of sequence reads around genes
heatmap_SpearmanCorr.pdf
- Spearman pairwise correlation of samples based on read counts
pcaplot.pdf
- Sample clusters based on top 2 principal components
MACS, or Model-based Analysis of ChIP-Seq, is used for capturing the enriched regions of sequence reads. It takes the influence of genome complexity into consideration, and improves the spatial resolution of binding sites through combining the information of both sequencing tag position and orientation.
Output directory: results/macs
assay_peaks.xls
- Tabular file which contains information about called peaks. Information includes:
- chromosome name
- start position of peak
- end position of peak
- length of peak region
- absolute peak summit position
- pileup height at peak summit
- -log10(pvalue) for the peak summit (e.g. if pvalue = 1e-10, this value is 10)
- fold enrichment for this peak summit against random Poisson distribution with local lambda
- -log10(qvalue) at peak summit
assay_peaks.narrowPeak
- BED6+4 format file which contains the peak locations together with peak summit, pvalue and qvalue.
assay_summits.bed
- BED format file which contains the peak summit location for every peak.
assay_peaks.broadPeak
- BED6+3 format file which is similar to narrowPeak file, except for missing the column for annotating peak summits.
assay_peaks.gappedPeak
- BED12+3 format file which contains both the broad region and narrow peaks.
assay_model.r
- R script which can be run to produce a PDF image of the model built from your data.
.bdg
- bedGraph format files which can be imported into the UCSC Genome Browser or converted into smaller bigWig files.
Refer to https://github.com/taoliu/MACS for the specifications of the output fields.
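As a reading aid for the narrowPeak output, the sketch below parses one BED6+4 line and converts the -log10 scores back to plain values. The column order follows the ENCODE narrowPeak layout; the class name, function name and the example line are made up for illustration, so check them against the MACS documentation before relying on them.

```python
from typing import NamedTuple


class NarrowPeak(NamedTuple):
    chrom: str
    start: int              # 0-based start of the peak
    end: int
    name: str
    score: int
    strand: str
    signal_value: float     # enrichment signal for the peak
    neg_log10_pvalue: float
    neg_log10_qvalue: float
    summit_offset: int      # summit position relative to start


def parse_narrowpeak_line(line):
    """Parse one tab-separated narrowPeak (BED6+4) line."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 10:
        raise ValueError("expected 10 tab-separated fields, got %d" % len(f))
    return NarrowPeak(
        f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5],
        float(f[6]), float(f[7]), float(f[8]), int(f[9]),
    )


# Made-up example line, not taken from a real run.
example = "chr1\t1000\t1250\tassay_peak_1\t153\t.\t8.5\t10.0\t7.2\t120"
peak = parse_narrowpeak_line(example)

summit_position = peak.start + peak.summit_offset
pvalue = 10 ** -peak.neg_log10_pvalue

print(summit_position)  # 1120
print(pvalue)
```

The p-value conversion mirrors the example in the list above: a stored value of 10 corresponds to a p-value of 1e-10.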
MultiQC is a visualisation tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report, and further statistics are available in the report data directory.
Output directory: results/MultiQC
multiqc_report.html
- MultiQC report - a standalone HTML file that can be viewed in your web browser
multiqc_data/
- Directory containing parsed statistics from the different tools used in the pipeline
For more information about how to use MultiQC reports, see http://multiqc.info