Skip to content

Latest commit

 

History

History
128 lines (86 loc) · 5.99 KB

README.md

File metadata and controls

128 lines (86 loc) · 5.99 KB

Snakemake Methylation Pipeline

A standardized, reproducible pipeline to process WGBS bisulfite & EM-seq data. This goes from .fastq to methylation calls (via biscuit) and includes extensive QC and plotting, using a Snakemake pipeline.

At a high level, this pipeline reproducibly:

  • Builds a reference genome
  • Trims & (minimally) filters reads
  • Aligns & calls methylation using biscuit
  • Flags non-converted reads
  • Generates standardized outputs & QC including
    • FastQC
    • fastp
    • Biscuit QC
    • samtools stats
    • MethylDackel mbias plots
    • Goleft covplots
    • epibed/epiread files
  • Runs multiqc across entire projects

Getting Started

Installation

  1. Clone this repo: git clone https://github.com/semenko/serpent-methylation-pipeline.git
  2. Install mamba: conda install -c conda-forge mamba or mambaforge
  3. Install snakemake: mamba install -c bioconda -c conda-forge snakemake
  4. (Optionally) Test install the dependencies: mamba env create -n test_env -f serpent-methylation-pipeline/workflow/envs/env.yaml

Test Run

$ nice snakemake --cores 50 --use-conda --printshellcmds --rerun-incomplete --rerun-triggers mtime --keep-going --dry-run

After removing the --dry-run flag, this will download reference genomes and build indices.

Data Definition

Expected Output

Raw data files from data are processed and analyzed by this snakemake workflow. Within each project directory, the output is (roughly) structured as:

SAMPLE_01/                  # e.g. melanoma / crc / healthy, etc.
│   SAMPLE.bam              # The final alignment file 
|   SAMPLE.bam.bai          #   (and its index)
|── biscuit_qc/             # biscuit QC.sh text files
|── epibeds/                # epibed files (bgzip-compressed & tabix-indexed)
├── fastp/                  # fastp statistics & logs
├── fastqc/                 # fastqc graphs 
├── goleft/                 # goleft coverage plots
├── logs/                   # runlogs from each pipeline component
├── methyldackel/           # mbias plots
├── raw/
│   ├── ...fastq.gz         # Raw reads
|   ├── ...md5.txt          # Checksums and validation
├── samtools/               # samtools statistics
SAMPLE_02/
...
...
multiqc/                    # A project-level multiqc stats across all data

Note each project also has a _subsampled directory with identical structure, which is the result of the pipeline run on only 10M reads/sample.

Production Runs

Pipeline Details

This pipeline was designed for highly reproducible, explainable alignments and analysis of epigenetic sequencing data.

Reference Genome

I chose GRCh38, with these specifics:

You can see a good explanation of the rationale for some of these components at this NCBI explainer.

Requirements

All software requirements are specified in env.yaml.

Most are relatively common, but a few are semi-unique:

biscuit was chosen after a comparison with bwa-meth and bismark — its latest version was the most flexible with extremely well annotated .bams (some critical tags are missing from bwa-meth for identifying read level methylation, and would require patching MethylDackel to extract data).

I briefly experimented with wgbs_tools (which defines nice .pat/.beta formats) but its licensing is too restrictive to use.

Trimming Approach

I chose a relatively conservative approach to trimming -- which is needed due to end-repair bias, adaptase bias, and more.

For EMseq, I trim 10 bp everywhere, after personal QC and offline discussions with NEB. See my notes here.

For BSseq, I trim 15 bp 5' R2, and 10 bp everywhere else due to adaptase bias.

For all reads, I set --trim_poly_g (due to two color bias) and set a --length_required (minimum read length) of 10 bp.

No Quality Filtering

Notably I do NOT do quality filtering here (I set --disable_quality_filtering), and save this for downstream analyses as desired.

I experimented with more stringent quality filtering early on, and found it had little yield / performance benefit.

Background & Inspiration

I strongly suggest reading work from Felix Krueger (author of Bismark) as background. In particular:

For similar pipelines and inspiration, see:

Pipeline Graph

Here's a high-level overview of the Snakemake pipeline (generated via snakemake --rulegraph | dot -Tpng > rules.png)

image