Emblask.nf

Haplotype-resolved dual assembly of ONT and Illumina reads

Emblask.nf is a diploid genome assembly pipeline that produces a set of two haplotype assemblies, also called a dual assembly, from parent-offspring trio data. Unlike most haplotype-resolved dual assembly tools, which rely primarily on accurate long reads, Emblask is a hybrid approach using both noisy long reads from the offspring and accurate short reads from all three members of the trio, specifically ONT R9.4 and paired-end Illumina reads. Assemblies generated by Emblask have been shown to reach Q56 (about one error per 400,000 bp), more than 99.4% genome completeness, less than 0.05% switch error rate and a haplotig N50 of up to 19 Mbp.

Requirements

The pipeline has been designed exclusively for ONT R9.4 and Illumina data. ONT R10 will most likely work too but has not been thoroughly tested; the pipeline is currently being updated for this type of data.

Data

The pipeline takes as input parent-offspring trio data: long and short reads for the offspring genome to assemble plus short reads for the parents.

Minimum coverage:

  • Proband: 20x ONT R9.4 and 30x Illumina
  • Paternal: 20x Illumina
  • Maternal: 20x Illumina

Recommended coverage:

  • Proband: 40x ONT R9.4 and 40x Illumina
  • Paternal: 30x Illumina
  • Maternal: 30x Illumina

Output genome assembly quality, completeness, N50 and switch error rate are dependent on the input read coverage, N50 and error rate.
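
To check whether your input meets the coverage thresholds above, an approximate coverage can be computed as total sequenced bases divided by genome size. A minimal sketch using only standard shell tools, with a placeholder file name and a human genome size of 3.1 Gbp:

# Sequence lines are every 4th line of a FASTQ file, starting at line 2.
zcat proband_long_reads.fastq.gz \
  | awk 'NR % 4 == 2 { bases += length($0) } END { printf "%.1fx\n", bases / 3100000000 }'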

Software

Emblask has been implemented as a Nextflow pipeline and its software dependencies have been gathered within Singularity containers for ease of use.
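
Nextflow and Singularity (or Apptainer) therefore need to be available on the system; no specific versions are stated here, so a reasonably recent release of each is assumed. A quick check:

nextflow -version
singularity --version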

  1. Clone the repository
git clone https://github.com/DecodeGenetics/emblask.git
cd emblask
  2. Pull the containers

    First, pull the Pepper-Margin-DeepVariant containers:

mkdir -p containers

singularity pull --dir containers docker://kishwars/pepper_deepvariant:r0.8
singularity pull --dir containers docker://kishwars/pepper_deepvariant:r0.7

Second, pull the Weaver & Emblask container. This container is currently hosted on Sylabs Cloud, for which you will need to create an account:

  • Go to https://cloud.sylabs.io/
  • Click "Sign up".
  • Select your method to sign up (Google, GitHub, GitLab, or Microsoft) and choose a username.
  • Once signed in, click on your username (top right corner) and then "Access Tokens".
  • Enter a token name such as "SylabsCloud" and then click "Create access token".
  • Copy the generated access token.
  • Add Sylabs Cloud as a remote endpoint in your Singularity installation:
singularity remote add SylabsCloud cloud.sylabs.io

When prompted, paste your previously generated access token.

  • Pull the Weaver & Emblask container:
singularity pull --dir containers --arch amd64 library://guillaumeholley/weaver_emblask/weaver_emblask:latest

After this, you should have 3 containers:

ls -lh containers
# weaver_emblask_latest.sif
# pepper_deepvariant_r0.7.sif
# pepper_deepvariant_r0.8.sif
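
As an optional sanity check, singularity remote list should show the SylabsCloud endpoint added earlier, and singularity inspect should print metadata for each pulled image, for example:

singularity remote list
singularity inspect containers/weaver_emblask_latest.sif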
  3. Decompress the Ratatosk-specific models for PMDV r0.7

These files are already included in this repository and just need to be decompressed:

cat pmdv/r07/models/ratatosk_r9_guppy5_sup/R9_GUPPY_SUP.tar.gz.* | tar -xvzf - -C pmdv/r07/models/ratatosk_r9_guppy5_sup

The output should be 5 files:

ls -lh pmdv/r07/models/ratatosk_r9_guppy5_sup
# R9_GUPPY_SUP_DEEPVARIANT.data-00000-of-00001
# R9_GUPPY_SUP_DEEPVARIANT.index
# R9_GUPPY_SUP_DEEPVARIANT.meta
# R9_GUPPY_SUP_PEPPER_HP.pkl
# R9_GUPPY_SUP_PEPPER_SNP.pkl

Alternatively, one can download the models here and then decompress the archive:

mkdir -p pmdv/r07/models/ratatosk_r9_guppy5_sup
tar -xvzf R9_GUPPY_SUP_MODELS.tar.gz -C pmdv/r07/models/ratatosk_r9_guppy5_sup
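
Either way, the model directory should end up containing the same five files listed above:

ls pmdv/r07/models/ratatosk_r9_guppy5_sup | wc -l
# 5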

Usage

IMPORTANT: See Cluster configuration below before running Emblask.nf to configure the pipeline to your cluster system.

nextflow run -profile cluster Emblask.nf \
--proband_lr_fq_in proband_long_reads.fastq.gz --proband_sr_fq_in proband_short_reads.fastq.gz \
--father_sr_fq_in paternal_short_reads.fastq.gz --mother_sr_fq_in maternal_short_reads.fastq.gz \
--out_dir /my/output/directory/
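
Emblask.nf runs many long jobs; if a run fails or is interrupted, Nextflow's standard -resume option (not specific to this pipeline) restarts it from the last successfully completed tasks:

nextflow run -profile cluster -resume Emblask.nf \
--proband_lr_fq_in proband_long_reads.fastq.gz --proband_sr_fq_in proband_short_reads.fastq.gz \
--father_sr_fq_in paternal_short_reads.fastq.gz --mother_sr_fq_in maternal_short_reads.fastq.gz \
--out_dir /my/output/directory/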

Pipeline arguments

Mandatory:

  • --proband_lr_fq_in or --proband_lr_bam_in: Corrected long reads (ONT R9.4) from the sample to assemble in FASTQ or BAM. Use Ratatosk.nf or Ratatosk to perform the correction.
  • --proband_sr_fq_in or --proband_sr_bam_in: Short reads from the sample to assemble in FASTQ or BAM. If in FASTQ format, it must be an interleaved FASTQ file! See the sketch after this list for one way to produce one.
  • --father_sr_fq_in or --father_sr_bam_in: Short reads from the father of the sample to assemble in FASTQ or BAM.
  • --mother_sr_fq_in or --mother_sr_bam_in: Short reads from the mother of the sample to assemble in FASTQ or BAM.
  • --out_dir: Output directory.
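
If the proband short reads come as separate R1/R2 FASTQ files, one way to interleave them is with seqtk mergepe; seqtk is not part of this pipeline and is only mentioned here as one possible tool, with placeholder file names:

seqtk mergepe proband_R1.fastq.gz proband_R2.fastq.gz | gzip > proband_short_reads.fastq.gz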

If the input proband long read coverage exceeds 50x, an estimate of the genome size (in bp) to assemble must be provided with --genome_size, e.g. --genome_size 3100000000 for a human sample.

Optional:

  • --max-lr-bq: Maximum base quality of the input long reads to assemble. Default is 40.

Alternatively, one can avoid using command line arguments by editing the parameter file params.yaml instead. Once the file is edited, the pipeline can be run with the following command:

nextflow run -profile cluster -params-file params.yaml Emblask.nf
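
As a sketch, a minimal params.yaml could look like the following; the key names mirror the command-line parameters above, the paths are placeholders, and the exact set of keys should follow the params.yaml file shipped with the repository:

proband_lr_fq_in: /path/to/proband_long_reads.fastq.gz
proband_sr_fq_in: /path/to/proband_short_reads.fastq.gz
father_sr_fq_in: /path/to/paternal_short_reads.fastq.gz
mother_sr_fq_in: /path/to/maternal_short_reads.fastq.gz
out_dir: /my/output/directory/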

Cluster configuration

By default, Emblask.nf will run jobs on SLURM. You can use the workload manager of your choice by either:

  • adding -process.executor=your_workload_manager on the command line, or
  • modifying the value of cluster.process.executor in nextflow.config

Nextflow supports a wide variety of workload managers and cloud systems: SLURM, SGE, LSF, AWS, Google Cloud, etc. See the Nextflow executor documentation for more information.
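
For example, to submit jobs through SGE instead of SLURM (sge being one of the executor names Nextflow recognizes):

nextflow run -profile cluster -params-file params.yaml -process.executor=sge Emblask.nf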

The pipeline uses 3 node profiles with default requirements:

  • small_node: 32 cores, 2GB of RAM per core. Used for manipulating FASTQ and BAM files.
  • medium_node: 32 cores, 4GB of RAM per core. Used for read mapping, variant calling, phasing, etc.
  • large_node: 64 cores, 6GB of RAM per core. Used by compute-intensive and memory-consuming jobs such as sequence assembly.

These profiles can be edited in nextflow.config to fit your cluster configuration. Keep in mind that jobs with the large_node profile are very CPU-, RAM- and IO-demanding, so it is important to give large_node your "best" node specs.
