HiCflow

Comprehensive bioinformatics analysis pipeline for processing raw HiC read data to publication-ready HiC maps.

HiCflow aims to provide an accessible and user-friendly experience to analyse HiC data using a wide range of published tools. The pipeline utilises the workflow management system Snakemake and automatically handles the installation of all required software with no user input. HiCflow can also be easily scaled to work in cluster environments. Current software utilised by HiCflow includes:

  • FastQC - A quality control tool for high throughput sequence data.
  • FastQ Screen - A tool to screen for species composition in FASTQ sequences.
  • Cutadapt - A tool to remove adapter sequences, primers, poly-A tails and others from high-throughput sequencing reads.
  • HiCUP - A tool for mapping and performing quality control on HiC data.
  • HiCExplorer - A set of tools for building, normalising and processing HiC matrices.
  • OnTAD - An optimised nested TAD caller for identifying hierarchical TADs in HiC data.
  • HiCRep - A tool for assessing the reproducibility of HiC data using a stratum-adjusted correlation coefficient.
  • HiCcompare - A tool for joint normalisation and comparison of HiC datasets.
  • pyGenomeTracks - A tool for plotting customisable, publication-ready genome tracks including HiC maps.
  • MultiQC - Aggregate results from bioinformatics analyses across many samples into a single report.

Installation

HiCflow works with Python >=3.6 and requires Snakemake.
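Both prerequisites can be checked before cloning. The sketch below is not part of HiCFlow: the function name `check_requirements` is hypothetical, and it assumes Snakemake is exposed as a `snakemake` executable on your PATH.

```python
import shutil
import sys


def check_requirements(min_python=(3, 6)):
    """Return a list of human-readable problems; empty if all checks pass."""
    problems = []
    # Compare the running interpreter against the minimum version stated above.
    if sys.version_info[:2] < min_python:
        problems.append("Python >= %d.%d is required" % min_python)
    # shutil.which returns None when no executable of that name is on PATH.
    if shutil.which("snakemake") is None:
        problems.append("snakemake executable not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print("Missing requirement:", problem)
```

An empty result means the interpreter is recent enough and a `snakemake` executable was found; it does not verify the Snakemake version.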

The HiCflow repository can be downloaded from GitHub as follows:

git clone https://github.com/StephenRicher/HiCFlow.git

Configuring HiCFlow

The HiCFlow pipeline is fully controlled through a single configuration file that describes parameter settings and paths to relevant files in the system. HiCFlow is bundled with a fully configured small HiC dataset (Wang et al., 2018) to test the pipeline and to serve as a template for configuring other datasets. The configuration file for the example dataset is provided at example/config/config.yaml.

Example Configurations

  • Typical HiC Analysis
    • Run standard HiC workflow.
  • HiC Analysis + Variant Calling + Haplotype Assembly
    • Run standard HiC workflow and full variant calling and haplotype assembly pipeline.
    • Phased VCF output compatible with ASHiC workflow.
  • HiC Analysis + Haplotype Assembly
    • Run standard HiC workflow and haplotype assembly pipeline.
    • Requires a set of pre-called variants.
    • If high quality calls from WGS data are available then we recommend using these rather than performing variant calling with HiCFlow.
  • Allele Specific HiC
    • Perform allele-specific HiC workflow.
    • Requires a set of phased variants, either from HiCFlow or another source.

Note: Relative file paths in the configuration file are resolved against the working directory. The working directory itself (defined by workdir) is relative to the directory in which Snakemake is executed. If not set, the working directory defaults to the directory containing the Snakefile. Relative paths can be confusing; they are used here to ensure the example dataset works for all users. If in doubt, simply provide absolute paths.
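The resolution rules in the note above can be sketched in a few lines of Python. This is an illustration of the stated rules only, not code taken from HiCFlow; the function name `resolve_config_path` and the example directories are hypothetical.

```python
import posixpath


def resolve_config_path(path, launch_dir, snakefile_dir, workdir=None):
    """Return the absolute location that a path in the config file refers to.

    launch_dir    -- directory in which snakemake is executed
    snakefile_dir -- directory containing the Snakefile
    workdir       -- value of workdir in the config, or None if not set
    """
    if workdir is None:
        # No workdir set: fall back to the directory containing the Snakefile.
        base = snakefile_dir
    else:
        # workdir is interpreted relative to the launch directory.
        # posixpath.join keeps workdir unchanged if it is already absolute.
        base = posixpath.join(launch_dir, workdir)
    # An absolute path ignores the base; a relative one is appended to it.
    return posixpath.normpath(posixpath.join(base, path))


# Hypothetical layout: snakemake launched from the repository root.
print(resolve_config_path("../data/sample-R1.fastq.gz",
                          launch_dir="/home/user/HiCFlow",
                          snakefile_dir="/home/user/HiCFlow",
                          workdir="example/analysis"))
```

Absolute paths pass through unchanged, which is why they are the safe choice when in doubt.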

Usage

Once Snakemake is installed, the example dataset can be processed using the following command. This command should be run from the HiCFlow base directory containing the Snakefile.

snakemake --use-conda --cores 4 --configfile example/config/config.yaml

This command will first install all relevant Conda environments within the defined working directory (example/analysis/); this may take some time. The pipeline should then run to completion, producing the figures shown in the example output below. You may instead want to install the Conda environments in a custom directory. A custom directory is helpful if you perform multiple independent analyses and do not want to install the same Conda environments repeatedly.

snakemake --use-conda --conda-prefix /path/envs/ --cores 4 --configfile example/config/config.yaml

Running FastQ Screen

By default, the example analysis will not run FastQ Screen as the reference genomes are too large to be packaged with GitHub. To obtain the reference genomes, first install FastQ Screen. Then run the following command from the HiCFlow home directory to install the reference genomes to the example/ directory.

fastq_screen --get_genomes --outdir example/

Finally, uncomment the fastqScreen : line in the config file and rerun the workflow. Note: if you have downloaded the reference genomes to a different path you will need to update the paths in example/config/fastq_screen.config.

Cluster Execution

All Snakemake-based pipelines, including HiCFlow, are compatible with cluster environments. Consult the official Snakemake documentation to learn more about running HiCFlow on your particular cluster environment.

Example output

HiC track

HiCflow utilises pyGenomeTracks to plot annotated HiC tracks with nested TAD domains, loops and TAD insulation scores. In addition, custom BED and Bedgraph files can be provided through the configuration file.

[Figure: HiC plot example]

HiCcompare track

HiCflow uses HiCcompare to produce joint normalised log fold-change subtraction matrices between pairs of samples.

[Figure: HiCcompare example]

Viewpoints

HiCFlow can also plot custom viewpoints of specific regions. Viewpoint regions must be provided as a BED file in the configuration file under plotParams -> viewpoints. The example below compares two samples using between-sample normalised contact frequencies provided by HiCcompare.

[Figure: Viewpoint example]
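A viewpoint BED file can be produced with a few lines of Python, as sketched below. The sketch assumes only the standard BED layout (tab-separated chromosome, 0-based start, exclusive end, optional name); whether HiCFlow expects additional columns is not stated here. The function names and the regions are hypothetical.

```python
def bed_record(chrom, start, end, name=None):
    """Format one BED line; start is 0-based and end is exclusive."""
    if start < 0 or end <= start:
        raise ValueError("BED intervals need 0 <= start < end")
    fields = [chrom, str(start), str(end)]
    if name is not None:
        fields.append(name)
    return "\t".join(fields)


def bed_text(regions):
    """Join several (chrom, start, end[, name]) tuples into BED file content."""
    return "".join(bed_record(*region) + "\n" for region in regions)


# Hypothetical viewpoint regions, for illustration only.
viewpoints = [
    ("chr3L", 5500000, 5600000, "viewpointA"),
    ("chr3L", 5750000, 5800000, "viewpointB"),
]
print(bed_text(viewpoints), end="")
```

Save the output to a file and give its path under plotParams -> viewpoints in the configuration file.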

Quality Control

MultiQC report

HiCflow utilises MultiQC to aggregate the QC and metric report across all samples and all compatible tools used in the pipeline. An example MultiQC report produced by HiCflow is shown here.

HiCRep

HiCflow uses HiCRep to assess sample reproducibility by calculating the stratum-adjusted correlation coefficient between all pairwise samples.

[Figure: HiCRep example]
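To give an intuition for the stratum-adjusted correlation coefficient, the sketch below computes a simplified version on two small square contact matrices. It is not HiCRep's implementation: each stratum is one diagonal of the matrix, the per-stratum Pearson correlations are combined with weights proportional to the stratum size times the product of the two standard deviations, and the smoothing and variance-stabilising steps of the published method are omitted. The function names and toy matrices are hypothetical.

```python
import math


def _stratum_stats(x, y):
    """Return (Pearson r, sqrt(var(x) * var(y))), or None if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    scale = math.sqrt(var_x * var_y)
    if scale == 0:
        return None
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return cov / scale, scale


def stratum_adjusted_correlation(m1, m2, max_distance):
    """Weighted average of per-diagonal correlations up to max_distance bins."""
    size = len(m1)
    weighted_sum = 0.0
    weight_total = 0.0
    for k in range(min(max_distance, size - 1) + 1):
        # Stratum k holds all contacts between bins that are k bins apart.
        x = [m1[i][i + k] for i in range(size - k)]
        y = [m2[i][i + k] for i in range(size - k)]
        stats = _stratum_stats(x, y)
        if stats is None:
            continue  # A constant stratum has no defined correlation.
        r, scale = stats
        weight = len(x) * scale
        weighted_sum += weight * r
        weight_total += weight
    if weight_total == 0:
        raise ValueError("no stratum with non-zero variance")
    return weighted_sum / weight_total


# Toy symmetric matrices, for illustration only.
a = [[5, 3, 1, 0], [3, 6, 2, 1], [1, 2, 7, 4], [0, 1, 4, 8]]
b = [[4, 3, 2, 0], [3, 7, 1, 1], [2, 1, 6, 5], [0, 1, 5, 9]]
print(stratum_adjusted_correlation(a, b, max_distance=2))
```

Stratifying by distance prevents the strong distance-dependent decay of contact frequency, which all HiC maps share, from inflating the correlation.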

Other QC Metrics

Insert Size Distribution

[Figure: insert size distribution]

Ditag Length

[Figure: ditag length]

References

Qi Wang, Qiu Sun, Daniel M. Czajkowsky, and Zhifeng Shao. Sub-kb Hi-C in D. melanogaster reveals conserved characteristics of TADs between insect and mammalian cells. Nature Communications, 2018. ISSN 20411723. doi: 10.1038/s41467-017-02526-9.