nf-core/chipseq Output

This document describes the output produced by the pipeline.

Pipeline overview:

The pipeline is built using Nextflow and processes data using the following steps:

  • FastQC - read quality control
  • TrimGalore - adapter trimming
  • BWA - alignment
  • SAMtools - alignment result processing
  • Bedtools - bam to bed file conversion
  • Picard - duplicate reads removal
  • Phantompeakqualtools - normalized strand cross-correlation (NSC) and relative strand cross-correlation (RSC)
  • deepTools - fingerprint, correlation plots of reads over genome-wide bins; distribution of reads around genes
  • MACS - peak calling
  • MultiQC - aggregate report, describing results of the whole pipeline

FastQC

FastQC gives general quality metrics about your reads. It provides information about the quality score distribution across your reads and the per-base sequence content (%T/A/G/C), as well as adapter contamination and other overrepresented sequences.

For further reading and documentation see the FastQC help.

NB: The FastQC plots displayed in the MultiQC report show untrimmed reads. They may contain adapter sequence and regions of low quality. To see how your reads look after trimming, look at the FastQC reports in the results/trimgalore directory.

Output directory: results/fastqc

  • sample_fastqc.html
    • FastQC report, containing quality metrics for your untrimmed raw fastq files
  • sample_fastqc.zip
    • zip file containing the FastQC report, tab-delimited data file and plot images

TrimGalore

TrimGalore is used for removal of adapter contamination and trimming of low quality regions. TrimGalore uses Cutadapt for adapter trimming and runs FastQC after it finishes.

MultiQC reports the percentage of bases removed by TrimGalore in the General Statistics table, along with a line plot showing where reads were trimmed.

Output directory: results/trimgalore

Contains FastQ files with quality- and adapter-trimmed reads for each sample, along with a log file describing the trimming.

  • sample_val_1.fq.gz, sample_val_2.fq.gz
    • Trimmed FastQ data, reads 1 and 2.
  • sample_val_1.fastq.gz_trimming_report.txt
    • Trimming report (describes which parameters were used)
  • sample_val_1_fastqc.html
  • sample_val_1_fastqc.zip
    • FastQC report for trimmed reads

Single-end data will have slightly different file names and only one FastQ file per sample:

  • sample_trimmed.fq.gz
    • Trimmed FastQ data
  • sample.fastq.gz_trimming_report.txt
    • Trimming report (describes which parameters were used)
  • sample_trimmed_fastqc.html
  • sample_trimmed_fastqc.zip
    • FastQC report for trimmed reads

BWA

BWA, or Burrows-Wheeler Aligner, is designed for mapping low-divergent sequence reads against reference genomes. The resulting alignment files are further processed with SAMtools and Bedtools.

Output directory: results/bwa

  • sample.sorted.bam
    • The sorted aligned BAM file
  • sample.sorted.bam.bai
    • The index file for the aligned BAM file
  • sample.sorted.bed
    • The sorted aligned BED file

Picard

The MarkDuplicates module in the Picard toolkit differentiates primary and duplicate reads using an algorithm that ranks reads by the sums of their base-quality scores. This helps to identify duplicates that arise during sample preparation, e.g. library construction using PCR.

The Picard section of the MultiQC report shows a bar plot with the numbers and proportions of primary reads, duplicate reads and unmapped reads.

Output directory: results/picard

  • sample.dedup.sorted.bam
    • The sorted aligned BAM file after duplicate removal
  • sample.dedup.sorted.bam.bai
    • The index file for the aligned BAM file after duplicate removal
  • sample.dedup.sorted.bed
    • The sorted aligned BED file after duplicate removal
  • sample.picardDupMetrics.txt
    • The log report for duplicate removal

Phantompeakqualtools

Phantompeakqualtools plots the strand cross-correlation of aligned reads for each sample. In a strand cross-correlation plot, reads are shifted in the direction of the strand they map to by an increasing number of base pairs, and the Pearson correlation between the per-position read count vectors for each strand is calculated at every shift. Two cross-correlation peaks are usually observed in a ChIP experiment: one corresponding to the read length (the "phantom" peak) and one to the average fragment length of the library. The absolute and relative heights of the two peaks are useful determinants of the success of a ChIP-seq experiment. A high-quality IP is characterized by a ChIP peak that is much higher than the "phantom" peak, while failed experiments often show a very small ChIP peak or none at all.

Source: Landt SG et al., Genome Research (2012)
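
The shifting procedure described above can be illustrated with a minimal sketch. It is not the phantompeakqualtools implementation: the function names, the toy count vectors and the 5 bp offset are all assumptions made for illustration, and real data would use per-position read start counts for each strand.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def strand_cross_correlation(forward, reverse, max_shift):
    """Return {shift: correlation} for shifts 0..max_shift (hypothetical helper)."""
    profile = {}
    for shift in range(max_shift + 1):
        # Pair forward-strand position i with reverse-strand position i + shift.
        fwd = forward[: len(forward) - shift]
        rev = reverse[shift:]
        profile[shift] = pearson(fwd, rev)
    return profile


# Toy data: reverse-strand pile-ups sit 5 bp downstream of forward-strand pile-ups.
forward = [0] * 40
reverse = [0] * 40
for start in (5, 20, 30):
    forward[start] = 10
    reverse[start + 5] = 10

profile = strand_cross_correlation(forward, reverse, max_shift=10)
best_shift = max(profile, key=profile.get)
print(best_shift, round(profile[best_shift], 3))
```

In this toy case the correlation is highest at the shift that matches the built-in offset, which plays the role of the fragment-length peak in a real plot.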

Normalized strand cross-correlation coefficient (NSC) is the ratio between the fragment-length cross-correlation peak and the background cross-correlation. NSC values range from a minimum of 1 to larger positive numbers, and 1.1 is the critical threshold. Datasets with NSC values well below 1.1 (< 1.05) tend to have a low signal-to-noise ratio or few peaks. This can be biological (e.g. a factor that truly binds only a few sites in a particular tissue type) or it can be due to poor quality. ENCODE cutoff: NSC > 1.05.

Relative strand cross-correlation coefficient (RSC) is the ratio between the fragment-length peak and the read-length peak. RSC values range from 0 to larger positive values, and 1 is the critical threshold. Datasets with RSC values significantly lower than 1 (< 0.8) tend to have a low signal-to-noise ratio. Low scores can be due to a failed or poor-quality ChIP, low read sequence quality and hence many mismappings, shallow sequencing depth (significantly below saturation), or a combination of these. As with the NSC, datasets with few binding sites (< 200), which can be biologically justifiable, also show low RSC scores. ENCODE cutoff: RSC > 0.8.
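
As a rough sketch of how the two scores relate to the cross-correlation values, the snippet below computes NSC and RSC and checks them against the ENCODE cutoffs quoted above. It assumes the commonly cited phantompeakqualtools definitions, in which RSC measures both peaks relative to the minimum (background) cross-correlation. The function names and input numbers are made up for illustration.

```python
def nsc(cc_fragment, cc_min):
    """Fragment-length peak divided by the background (minimum) cross-correlation."""
    if cc_min <= 0:
        raise ValueError("background cross-correlation must be positive")
    return cc_fragment / cc_min


def rsc(cc_fragment, cc_phantom, cc_min):
    """Fragment-length peak over read-length peak, both measured above the background."""
    if cc_phantom == cc_min:
        raise ValueError("phantom peak equals background; RSC is undefined")
    return (cc_fragment - cc_min) / (cc_phantom - cc_min)


def passes_encode_cutoffs(nsc_value, rsc_value, nsc_cutoff=1.05, rsc_cutoff=0.8):
    """True when both scores exceed the cutoffs given in the text above."""
    return nsc_value > nsc_cutoff and rsc_value > rsc_cutoff


# Hypothetical cross-correlation values for one sample.
nsc_value = nsc(cc_fragment=0.30, cc_min=0.20)
rsc_value = rsc(cc_fragment=0.30, cc_phantom=0.25, cc_min=0.20)
print(round(nsc_value, 2), round(rsc_value, 2), passes_encode_cutoffs(nsc_value, rsc_value))
```

The values actually reported by the pipeline are in sample.spp.out; this sketch only shows how the scores and thresholds fit together.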

Output directory: results/phantompeakqualtools

  • sample.dedup.sorted.pdf
    • The strand shift cross-correlation plot of aligned reads after duplicate removal
  • sample.spp.out
    • Normalized strand cross-correlation (NSC) and relative strand cross-correlation (RSC) results
  • sample.spp.csv
    • Raw data for creating the strand shift cross-correlation plot

deepTools

deepTools visualizes the distribution of fragment sizes for paired-end datasets, the fingerprint of the sequence read distribution, the distribution of sequence reads around genes in the reference genome, and the pairwise correlation and PCA clustering of samples based on genome-wide read counts.

In a fingerprint plot, a completely random distribution of reads along the genome (i.e. without enrichments in open chromatin etc.) should generate a straight diagonal line. A very specific and strong regional enrichment is indicated by a prominent and steep rise of the cumulative sum towards the highest rank. This means that a large share of the reads from the sample is located in a few bins, which corresponds to the high, narrow enrichments typically seen for transcription factors.
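
The idea behind the fingerprint curve can be sketched in a few lines. This is an illustration only, not the deepTools code: the function name and the toy bin counts are assumptions. Bins are sorted by read count and the cumulative fraction of reads is tracked from the lowest to the highest rank.

```python
def fingerprint_curve(bin_counts):
    """Cumulative fraction of reads after sorting bins from lowest to highest count."""
    total = sum(bin_counts)
    if total == 0:
        raise ValueError("no reads in any bin")
    running = 0
    curve = []
    for count in sorted(bin_counts):
        running += count
        curve.append(running / total)
    return curve


# Ten bins with identical coverage: the curve rises in equal steps (a diagonal).
uniform = fingerprint_curve([10] * 10)

# Nine nearly empty bins and one strongly enriched bin: flat, then a steep final rise.
enriched = fingerprint_curve([1] * 9 + [91])

print(uniform)
print(enriched)
```

The uniform sample climbs evenly, while the enriched sample stays close to zero until the last bin, which is the steep rise towards the highest rank described above.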

Distribution of sequence reads around genes plus upstream and downstream flanking regions in the reference genome. All genes have been normalized to the same length.

Spearman correlation coefficient in each square is calculated with the read counts in genomic bins between the sample pair at the top and the right. A higher correlation coefficient value indicates higher similarity between the sample pair.
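
For readers who want to see what the coefficient in each square represents, here is a minimal sketch of a Spearman correlation between the per-bin read counts of two samples. It is not how deepTools computes it internally. The helper names and counts are hypothetical, and ties receive the average of their ranks.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical read counts in the same four genomic bins for two samples.
sample_a = [10, 200, 35, 4]
sample_b = [12, 900, 40, 3]
print(spearman(sample_a, sample_b))
```

Because only the ordering of bins matters, the two samples above correlate perfectly even though their absolute counts differ.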

The upper panel shows the clustering of samples based on the top 2 principal components of genome-wide distribution of sequence reads. The lower panel shows the contribution weights of principal components.

Output directory: results/deepTools

  • fragment_size_histogram.pdf
    • Histogram of fragment sizes for paired-end reads
  • fingerprint.pdf
    • Fingerprint plot for sequence reads distribution
  • read_distribution_profile.pdf
    • Distribution of sequence reads around genes
  • heatmap_SpearmanCorr.pdf
    • Spearman pairwise correlation of samples based on read counts
  • pcaplot.pdf
    • Sample clusters based on top 2 principal components

MACS

MACS, or Model-based Analysis of ChIP-Seq, is used for capturing the enriched regions of sequence reads. It takes the influence of genome complexity into consideration, and improves the spatial resolution of binding sites by combining the information of both sequencing tag position and orientation.

Output directory: results/macs

  • assay_peaks.xls
    • Tabular file which contains information about called peaks. The information includes:
      • chromosome name
      • start position of peak
      • end position of peak
      • length of peak region
      • absolute peak summit position
      • pileup height at peak summit
      • -log10(pvalue) for the peak summit (e.g. if pvalue = 1e-10, this value is 10)
      • fold enrichment for this peak summit against random Poisson distribution with local lambda
      • -log10(qvalue) at peak summit
  • assay_peaks.narrowPeak
    • BED6+4 format file which contains the peak locations together with peak summit, pvalue and qvalue.
  • assay_summits.bed
    • BED format file which contains the peak summit location for every peak.
  • assay_peaks.broadPeak
    • BED6+3 format file, similar to the narrowPeak file but without the column annotating peak summits.
  • assay_peaks.gappedPeak
    • BED12+3 format file which contains both the broad region and narrow peaks.
  • assay_model.r
    • R script that produces a PDF image of the model built from your data.
  • .bdg
    • bedGraph format files which can be imported into the UCSC genome browser or converted into smaller bigWig files.

Refer to https://github.com/taoliu/MACS for the specifications of the output fields.
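
As an example of working with these outputs, the sketch below parses one line of a narrowPeak file. It assumes the standard ENCODE narrowPeak (BED6+4) column order, with the last column giving the summit as an offset from the peak start. The class, the function name and the sample line are hypothetical.

```python
from typing import NamedTuple


class NarrowPeak(NamedTuple):
    chrom: str
    start: int            # 0-based start of the peak region
    end: int
    name: str
    score: int
    strand: str
    signal_value: float   # fold enrichment at the summit
    p_value: float        # -log10(pvalue)
    q_value: float        # -log10(qvalue)
    summit_offset: int    # summit position relative to start

    @property
    def summit(self):
        """Absolute summit position on the chromosome."""
        return self.start + self.summit_offset


def parse_narrowpeak_line(line):
    """Parse one tab-separated narrowPeak record into a NarrowPeak."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, found {len(fields)}")
    return NarrowPeak(
        fields[0], int(fields[1]), int(fields[2]), fields[3], int(fields[4]),
        fields[5], float(fields[6]), float(fields[7]), float(fields[8]), int(fields[9]),
    )


# Made-up record for illustration; real records come from assay_peaks.narrowPeak.
example = "chr1\t100\t350\tassay_peak_1\t120\t.\t6.5\t15.2\t12.0\t80\n"
peak = parse_narrowpeak_line(example)
print(peak.chrom, peak.summit, peak.q_value)
```

Filtering on the q-value column, for instance keeping records with `q_value` above 2 (qvalue < 0.01), is a common next step, though the right threshold depends on the experiment.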

MultiQC

MultiQC is a visualisation tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report, and further statistics are available in the report data directory.

Output directory: results/MultiQC

  • multiqc_report.html
    • MultiQC report - a standalone HTML file that can be viewed in your web browser
  • multiqc_data/
    • Directory containing parsed statistics from the different tools used in the pipeline

For more information about how to use MultiQC reports, see http://multiqc.info