Emblask.nf is a diploid genome assembly pipeline that produces a set of two haplotype assemblies, also called a dual assembly, from parent-offspring trio data. Unlike most haplotype-resolved dual assembly tools, which rely primarily on accurate long reads, Emblask is a hybrid approach using both noisy long reads of the offspring and accurate short reads of all three members of the trio, specifically ONT R9.4 and paired-end Illumina reads. Assemblies generated by Emblask have been shown to reach Q56 (about one error per 300,000 bp), more than 99.4% genome completeness, less than 0.05% switch error rate and a haplotig N50 of up to 19 Mbp.
The pipeline has been designed exclusively for ONT R9.4 and Illumina data. ONT R10 will most likely work too but has not been thoroughly tested; the pipeline is currently being updated for this type of data.
The pipeline takes as input parent-offspring trio data: long and short reads for the offspring genome to assemble plus short reads for the parents.
Minimum coverage:
- Proband: 20x ONT R9.4 and 30x Illumina
- Paternal: 20x Illumina
- Maternal: 20x Illumina
Recommended coverage:
- Proband: 40x ONT R9.4 and 40x Illumina
- Paternal: 30x Illumina
- Maternal: 30x Illumina
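Coverage here means total sequenced bases divided by genome size. As a quick sanity check before launching the pipeline, the calculation can be sketched in plain shell (the `estimate_coverage` helper name is illustrative and not part of Emblask):

```shell
# Illustrative helper, not part of Emblask: estimate read coverage as
# total bases / genome size. Assumes an uncompressed FASTQ; pipe
# .fastq.gz input through `zcat` first.
estimate_coverage() {
    local fastq=$1 genome_size=$2
    awk -v g="$genome_size" \
        'NR % 4 == 2 { total += length($0) }   # line 2 of each 4-line record is the sequence
         END         { printf "%.1f\n", total / g }' "$fastq"
}
```

For example, a proband long-read FASTQ totalling ~124 Gbp against a 3.1 Gbp human genome would report 40.0, matching the recommended coverage.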
Output genome assembly quality, completeness, N50 and switch error rate are dependent on the input read coverage, N50 and error rate.
Emblask has been implemented as a Nextflow pipeline and its software dependencies have been gathered within Singularity containers for ease of use.
- Clone the repository
```
git clone https://github.com/DecodeGenetics/emblask.git
cd emblask
```
- Pull the containers
First, pull the Pepper-Margin-DeepVariant containers:
```
mkdir -p containers
singularity pull --dir containers docker://kishwars/pepper_deepvariant:r0.8
singularity pull --dir containers docker://kishwars/pepper_deepvariant:r0.7
```
Second, pull the Weaver & Emblask container. This container is currently hosted on Sylabs Cloud, for which you will need to create an account:
- Go to https://cloud.sylabs.io/ and click "Sign up".
- Select your method to sign up (Google, GitHub, GitLab, or Microsoft) and choose a username.
- Once signed in, click on your username (top right corner) and then "Access Tokens".
- Enter a token name such as "SylabsCloud" and then click "Create access token".
- Copy the generated access token.
- Add Sylabs Cloud as a remote endpoint in your Singularity installation:
```
singularity remote add SylabsCloud cloud.sylabs.io
```
When prompted, paste the previously generated access token.
- Pull the Weaver & Emblask container:
```
singularity pull --dir containers --arch amd64 library://guillaumeholley/weaver_emblask/weaver_emblask:latest
```
After this, you should have 3 containers:
```
ls -lh containers
# weaver_emblask_latest.sif
# pepper_deepvariant_r0.7.sif
# pepper_deepvariant_r0.8.sif
```
- Decompress the Ratatosk-specific models for PMDV r0.7
These files are already included in this repository and just need to be decompressed:
```
cat pmdv/r07/models/ratatosk_r9_guppy5_sup/R9_GUPPY_SUP.tar.gz.* | tar -xvzf - -C pmdv/r07/models/ratatosk_r9_guppy5_sup
```
The output should be 5 files:
```
ls -lh pmdv/r07/models/ratatosk_r9_guppy5_sup
# R9_GUPPY_SUP_DEEPVARIANT.data-00000-of-00001
# R9_GUPPY_SUP_DEEPVARIANT.index
# R9_GUPPY_SUP_DEEPVARIANT.meta
# R9_GUPPY_SUP_PEPPER_HP.pkl
# R9_GUPPY_SUP_PEPPER_SNP.pkl
```
Alternatively, one can download the models here and then decompress the archive:
```
mkdir -p pmdv/r07/models/ratatosk_r9_guppy5_sup
tar -xvzf R9_GUPPY_SUP_MODELS.tar.gz -C pmdv/r07/models/ratatosk_r9_guppy5_sup
```
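Either way, it is worth confirming that all five model files extracted correctly before launching the pipeline. A small sketch (the `check_models` helper name is illustrative, not part of Emblask):

```shell
# Illustrative check, not part of Emblask: confirm the 5 expected
# PMDV r0.7 Ratatosk model files are present and non-empty in the
# given directory. Returns non-zero if any file is missing.
check_models() {
    local dir=$1 f missing=0
    for f in R9_GUPPY_SUP_DEEPVARIANT.data-00000-of-00001 \
             R9_GUPPY_SUP_DEEPVARIANT.index \
             R9_GUPPY_SUP_DEEPVARIANT.meta \
             R9_GUPPY_SUP_PEPPER_HP.pkl \
             R9_GUPPY_SUP_PEPPER_SNP.pkl; do
        [ -s "$dir/$f" ] || { echo "missing or empty: $f" >&2; missing=1; }
    done
    return $missing
}
```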
IMPORTANT: See Cluster configuration below before running Emblask.nf to configure the pipeline for your cluster system.
```
nextflow run -profile cluster Emblask.nf \
    --proband_lr_fq_in proband_long_reads.fastq.gz --proband_sr_fq_in proband_short_reads.fastq.gz \
    --father_sr_fq_in paternal_short_reads.fastq.gz --mother_sr_fq_in maternal_short_reads.fastq.gz \
    --out_dir /my/output/directory/
```
Mandatory:
- --proband_lr_fq_in or --proband_lr_bam_in: Corrected long reads (ONT R9.4) from the sample to assemble, in FASTQ or BAM. Use Ratatosk.nf or Ratatosk to perform the correction.
- --proband_sr_fq_in or --proband_sr_bam_in: Short reads from the sample to assemble, in FASTQ or BAM. If in FASTQ format, it must be an interleaved FASTQ file!
- --father_sr_fq_in or --father_sr_bam_in: Short reads from the father of the sample to assemble, in FASTQ or BAM.
- --mother_sr_fq_in or --mother_sr_bam_in: Short reads from the mother of the sample to assemble, in FASTQ or BAM.
- --out_dir: Output directory.
If the input proband long read coverage exceeds 50x, an estimate of the genome size (in bp) to assemble must be provided with --genome_size, e.g. --genome_size 3100000000 for a human sample.
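Since the proband short reads must be interleaved when given as FASTQ, separate R1/R2 files need to be merged first. A minimal bash sketch (the `interleave_fastq` name is illustrative, not part of Emblask; it assumes both files list mates in the same order, and gzipped input would need to go through `zcat` first):

```shell
# Illustrative helper, not part of Emblask: interleave two synced
# paired-end FASTQ files. Each `paste - - - -` folds one 4-line FASTQ
# record onto a single tab-separated line; the outer paste pairs each
# R1 record with its R2 mate, and `tr` unfolds them back into 8 lines
# per read pair.
interleave_fastq() {
    paste <(paste - - - - < "$1") <(paste - - - - < "$2") | tr '\t' '\n'
}
```

Redirect the output to a file (or through gzip) and pass the result to --proband_sr_fq_in.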
Optional:
- --max-lr-bq: Maximum base quality of the input long reads to assemble. Default is 40.
Alternatively, one can avoid using command line arguments by editing the parameter file params.yaml instead. Once the file is edited, the pipeline can be run with the following command:
```
nextflow run -profile cluster -params-file params.yaml Emblask.nf
```
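As an illustration, the file could look like the sketch below. The keys mirror the command-line flags above, but the params.yaml shipped with the repository is authoritative for the exact set of accepted parameters; the paths are placeholders.

```yaml
# Illustrative params.yaml content; paths are placeholders and the key
# names are assumed to mirror the CLI flags documented above.
proband_lr_fq_in: /data/proband_long_reads.fastq.gz
proband_sr_fq_in: /data/proband_short_reads.fastq.gz
father_sr_fq_in: /data/paternal_short_reads.fastq.gz
mother_sr_fq_in: /data/maternal_short_reads.fastq.gz
out_dir: /my/output/directory/
genome_size: 3100000000   # only required above 50x proband long-read coverage
```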
By default, Emblask.nf will run jobs on SLURM. You can use the workload manager of your choice by:
- adding -process.executor=your_workload_manager on the command line
- modifying the value of cluster.process.executor in nextflow.config

Nextflow supports a wide variety of workload managers and cloud systems: SLURM, SGE, LSF, AWS, Google Cloud, etc. See the Nextflow executor documentation for more information.
The pipeline uses 3 node profiles with default requirements:
- small_node: 32 cores, 2GB of RAM per core. Used for manipulating FASTQ and BAM files.
- medium_node: 32 cores, 4GB of RAM per core. Used for read mapping, variant calling, phasing, etc.
- large_node: 64 cores, 6GB of RAM per core. Used by compute-intensive and memory-consuming jobs such as sequence assembly.
These profiles can be edited in nextflow.config to fit your cluster configuration. Keep in mind that jobs with the large_node profile are very CPU-, RAM- and IO-demanding, so it is important to give large_node your "best" node specs.
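As an illustration, overriding the large_node profile in nextflow.config could look like the sketch below, using Nextflow's process label selectors. The label name comes from the list above, but the resource values and queue name are placeholders to adapt to your own cluster:

```groovy
// Illustrative nextflow.config fragment; values are placeholders.
process {
    withLabel: 'large_node' {
        cpus   = 64
        memory = '384 GB'   // 64 cores x 6GB of RAM per core
        queue  = 'bigmem'   // hypothetical SLURM partition name
    }
}
```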