Emblask.nf

Haplotype-resolved dual assembly of ONT and Illumina reads

Emblask.nf is a diploid genome assembly pipeline that produces a set of two haplotype assemblies, also called a dual assembly, from parent-offspring trio data. Unlike most haplotype-resolved dual assembly tools, which rely primarily on accurate long reads, Emblask is a hybrid approach using both noisy long reads from the offspring and accurate short reads from all three members of the trio, specifically ONT R9.4 and paired-end Illumina reads. Assemblies generated by Emblask have been shown to reach Q56 (about one error per 400,000 bp), more than 99.4% genome completeness, less than 0.05% switch error rate and a haplotig N50 of up to 19 Mbp.

Requirements

The pipeline has been designed exclusively for ONT R9.4 and Illumina data. ONT R10 will most likely work too but has not been thoroughly tested; the pipeline is currently being updated for this type of data.

Data

The pipeline takes as input parent-offspring trio data: long and short reads for the offspring genome to assemble plus short reads for the parents.

Minimum coverage:

  • Proband: 20x ONT R9.4 and 30x Illumina
  • Paternal: 20x Illumina
  • Maternal: 20x Illumina

Recommended coverage:

  • Proband: 40x ONT R9.4 and 40x Illumina
  • Paternal: 30x Illumina
  • Maternal: 30x Illumina

Output genome assembly quality, completeness, N50 and switch error rate are dependent on the input read coverage, N50 and error rate.
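
To check whether your input meets the coverage thresholds above, an approximate coverage can be computed as total sequenced bases divided by genome size. A minimal sketch using only standard shell tools, with a placeholder file name and a human genome size of 3.1 Gbp:

# Sequence lines are every 4th line of a FASTQ file, starting at line 2.
zcat proband_long_reads.fastq.gz \
  | awk 'NR % 4 == 2 { bases += length($0) } END { printf "%.1fx\n", bases / 3100000000 }'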

Software

Emblask has been implemented as a Nextflow pipeline and its software dependencies have been gathered within Singularity containers for ease of use.
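
Nextflow and Singularity (or Apptainer) therefore need to be available on the system; no specific versions are stated here, so a reasonably recent release of each is assumed. A quick check:

nextflow -version
singularity --version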

  1. Clone the repository
git clone https://github.com/DecodeGenetics/emblask.git
cd emblask
  2. Pull the containers

    First, pull the Pepper-Margin-DeepVariant containers:

mkdir -p containers

singularity pull --dir containers docker://kishwars/pepper_deepvariant:r0.8
singularity pull --dir containers docker://kishwars/pepper_deepvariant:r0.7

Second, pull the Weaver & Emblask container. This container is currently hosted on Sylabs Cloud, for which you will need to create an account:

  • Go to https://cloud.sylabs.io/
  • Click "Sign up".
  • Select your method to sign up (Google, GitHub, GitLab, or Microsoft) and choose a username.
  • Once signed in, click on your username (top right corner) and then "Access Tokens".
  • Enter a token name such as "SylabsCloud" and then click "Create access token".
  • Copy the generated access token.
  • Add Sylabs Cloud as a remote endpoint in your Singularity installation:
singularity remote add SylabsCloud cloud.sylabs.io

When prompted, paste your previously generated access token.

  • Pull the Weaver & Emblask container:
singularity pull --dir containers --arch amd64 library://guillaumeholley/weaver_emblask/weaver_emblask:latest

After this, you should have 3 containers:

ls -lh containers
# weaver_emblask_latest.sif
# pepper_deepvariant_r0.7.sif
# pepper_deepvariant_r0.8.sif
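
As an optional sanity check, singularity remote list should show the SylabsCloud endpoint added earlier, and singularity inspect should print metadata for each pulled image, for example:

singularity remote list
singularity inspect containers/weaver_emblask_latest.sif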
  3. Decompress the Ratatosk-specific models for PMDV r0.7

These files are already included in this repository and just need to be decompressed:

cat pmdv/r07/models/ratatosk_r9_guppy5_sup/R9_GUPPY_SUP.tar.gz.* | tar -xvzf - -C pmdv/r07/models/ratatosk_r9_guppy5_sup

The output should be 5 files:

ls -lh pmdv/r07/models/ratatosk_r9_guppy5_sup
# R9_GUPPY_SUP_DEEPVARIANT.data-00000-of-00001
# R9_GUPPY_SUP_DEEPVARIANT.index
# R9_GUPPY_SUP_DEEPVARIANT.meta
# R9_GUPPY_SUP_PEPPER_HP.pkl
# R9_GUPPY_SUP_PEPPER_SNP.pkl

Alternatively, one can download the models here and then decompress the archive:

mkdir -p pmdv/r07/models/ratatosk_r9_guppy5_sup
tar -xvzf R9_GUPPY_SUP_MODELS.tar.gz -C pmdv/r07/models/ratatosk_r9_guppy5_sup
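
Either way, the model directory should end up containing the same five files listed above:

ls pmdv/r07/models/ratatosk_r9_guppy5_sup | wc -l
# 5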

Usage

IMPORTANT: See Cluster configuration below before running Emblask.nf to configure the pipeline to your cluster system.

nextflow run -profile cluster Emblask.nf \
--proband_lr_fq_in proband_long_reads.fastq.gz --proband_sr_fq_in proband_short_reads.fastq.gz \
--father_sr_fq_in paternal_short_reads.fastq.gz --mother_sr_fq_in maternal_short_reads.fastq.gz \
--out_dir /my/output/directory/
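
Emblask.nf runs many long jobs; if a run fails or is interrupted, Nextflow's standard -resume option (not specific to this pipeline) restarts it from the last successfully completed tasks:

nextflow run -profile cluster -resume Emblask.nf \
--proband_lr_fq_in proband_long_reads.fastq.gz --proband_sr_fq_in proband_short_reads.fastq.gz \
--father_sr_fq_in paternal_short_reads.fastq.gz --mother_sr_fq_in maternal_short_reads.fastq.gz \
--out_dir /my/output/directory/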

Pipeline arguments

Mandatory:

  • --proband_lr_fq_in or --proband_lr_bam_in: Corrected long reads (ONT R9.4) from the sample to assemble in FASTQ or BAM. Use Ratatosk.nf or Ratatosk to perform the correction.
  • --proband_sr_fq_in or --proband_sr_bam_in: Short reads from the sample to assemble in FASTQ or BAM. If in FASTQ format, it must be an interleaved FASTQ file! See the sketch after this list for one way to produce one.
  • --father_sr_fq_in or --father_sr_bam_in: Short reads from the father of the sample to assemble in FASTQ or BAM.
  • --mother_sr_fq_in or --mother_sr_bam_in: Short reads from the mother of the sample to assemble in FASTQ or BAM.
  • --out_dir: Output directory.
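
If the proband short reads come as separate R1/R2 FASTQ files, one way to interleave them is with seqtk mergepe; seqtk is not part of this pipeline and is only mentioned here as one possible tool, with placeholder file names:

seqtk mergepe proband_R1.fastq.gz proband_R2.fastq.gz | gzip > proband_short_reads.fastq.gz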

If the input proband long read coverage exceeds 50x, an estimate of the genome size (in bp) to assemble must be provided with --genome_size, e.g. --genome_size 3100000000 for a human sample.

Optional:

  • --max-lr-bq: Maximum base quality of the input long reads to assemble. Default is 40.

Alternatively, one can avoid using command line arguments by editing the parameter file params.yaml instead. Once the file is edited, the pipeline can be run with the following command:

nextflow run -profile cluster -params-file params.yaml Emblask.nf
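
As a sketch, a minimal params.yaml could look like the following; the key names mirror the command-line parameters above, the paths are placeholders, and the exact set of keys should follow the params.yaml file shipped with the repository:

proband_lr_fq_in: /path/to/proband_long_reads.fastq.gz
proband_sr_fq_in: /path/to/proband_short_reads.fastq.gz
father_sr_fq_in: /path/to/paternal_short_reads.fastq.gz
mother_sr_fq_in: /path/to/maternal_short_reads.fastq.gz
out_dir: /my/output/directory/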

Cluster configuration

By default, Emblask.nf will run jobs on SLURM. You can use the workload manager of your choice by either:

  • adding -process.executor=your_workload_manager on the command line, or
  • modifying the value of cluster.process.executor in nextflow.config

Nextflow supports a wide variety of workload managers and cloud systems: SLURM, SGE, LSF, AWS, Google Cloud, etc. See the Nextflow executor documentation for more information.
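
For example, to submit jobs through SGE instead of SLURM (sge being one of the executor names Nextflow recognizes):

nextflow run -profile cluster -params-file params.yaml -process.executor=sge Emblask.nf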

The pipeline uses 3 node profiles with default requirements:

  • small_node: 32 cores, 2GB of RAM per core. Used for manipulating FASTQ and BAM files.
  • medium_node: 32 cores, 4GB of RAM per core. Used for read mapping, variant calling, phasing, etc.
  • large_node: 64 cores, 6GB of RAM per core. Used by compute-intensive and memory-consuming jobs such as sequence assembly.

These profiles can be edited in nextflow.config to fit your cluster configuration. Keep in mind that jobs with the large_node profile are very CPU-, RAM- and IO-demanding, so it is important to give large_node your "best" node specs.
