diff --git a/docs/index.rst b/docs/index.rst
index 4bb6a47..589b96f 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -1,32 +1,26 @@
-Metagenome Assembly Workflow (v1.0.7)
+Metagenome Assembly Workflow (v1.0.2)
 =====================================

 .. image:: workflow_assembly.png
    :scale: 60%
    :alt: Metagenome assembly workflow dependencies
-
+
 Workflow Overview
 -----------------

-This workflow takes in paired-end Illumina short reads or paired-end PacBio long reads in interleaved format and performs error correction, then reformats the interleaved file into two FASTQ files for downstream tasks using bbcms (BBTools). The corrected reads are assembled using metaSPAdes. After assembly, the reads are mapped back to contigs by bbmap (BBTools) for coverage information. The `.wdl` (Workflow Description Language) file includes five tasks: *bbcms*, *assy*, *create_agp*, *read_mapping_pairs*, and *make_output*.
+This workflow takes in paired-end Illumina reads in interleaved format and performs error correction, then reformats the interleaved file into two FASTQ files for downstream tasks using bbcms (BBTools). The corrected reads are assembled using metaSPAdes. After assembly, the reads are mapped back to contigs by bbmap (BBTools) for coverage information. The .wdl (Workflow Description Language) file includes five tasks, *bbcms*, *assy*, *create_agp*, *read_mapping_pairs*, and *make_output*.

-1. The *bbcms* task takes in interleaved FASTQ inputs, performs error correction, and reformats the interleaved FASTQ into two output FASTQ files for paired-end reads for the next tasks.
-2. The *assy* task performs metaSPAdes assembly.
-3. Contigs and Scaffolds (output of metaSPAdes) are processed by the *create_agp* task to rename the FASTA header and generate an `AGP format `_ which describes the assembly.
+1. The *bbcms* task takes in interleaved FASTQ inputs and performs error correction and reformats the interleaved fastq into two output FASTQ files for paired-end reads for the next tasks.
+2. The *assy* task performs metaSPAdes assembly
+3. Contigs and Scaffolds (output of metaSPAdes) are consumed by the *create_agp* task to rename the FASTA header and generate an `AGP format `_ which describes the assembly
 4. The *read_mapping_pairs* task maps reads back to the final assembly to generate coverage information.
-5. The final *make_output* task collects all output files into the specified directory.
+5. The final *make_output* task adds all output files into the specified directory.

 Workflow Availability
 ---------------------

-The workflow from GitHub uses all the listed Docker images to run all third-party tools.
-
-The workflow is available on GitHub: `https://github.com/microbiomedata/metaAssembly`
-
-The corresponding Docker images are available on DockerHub:
-
-- `https://hub.docker.com/r/microbiomedata/spades`
-- `https://hub.docker.com/r/microbiomedata/bbtools`
+The workflow from GitHub uses all the listed docker images to run all third-party tools.
+The workflow is available in GitHub: https://github.com/microbiomedata/metaAssembly; the corresponding Docker images are available in DockerHub: https://hub.docker.com/r/microbiomedata/spades and https://hub.docker.com/r/microbiomedata/bbtools

 Requirements for Execution
 --------------------------
@@ -39,135 +33,114 @@ Requirements for Execution
 Hardware Requirements
 ---------------------

-**Memory: >40 GB RAM**
+- Memory: >40 GB RAM

 The memory requirement depends on the input complexity.
 Here is a simple estimation equation for the memory required based on kmers in the input file::

     predicted_mem = (kmers * 2.962e-08 + 1.630e+01) * 1.1 (in GB)

 .. note::
+
+   The kmers variable for the equation above can be obtained using the kmercountmulti.sh script from BBTools.

-   The kmers variable for the equation above can be obtained using the `kmercountmulti.sh` script from BBTools.
+   kmercountmulti.sh -k=31 in=your.read.fq.gz

-   Example command:
-
-   ::
-
-       kmercountmulti.sh -k=31 in=your.read.fq.gz

 Workflow Dependencies
 ---------------------

-Third-party software: (This is included in the Docker image.)
+Third party software: (This is included in the Docker image.)
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-- `metaSPAdes v3.15.0 `_ (License: `GPLv2 `_)
-- `BBTools v38.94 `_ (License: `BSD-3-Clause-LBNL `_)
+- `metaSPades v3.15.0 `_ (License: `GPLv2 `_)
+- `BBTools:38.94 `_ (License: `BSD-3-Clause-LBNL `_)

 Sample dataset(s)
 -----------------

-- Small dataset: `Ecoli 10x (287M) `_ (Input/output included in tar.gz file)
-- Large dataset: `Zymobiomics mock-community DNA control (22G) `_ (Input/output included in tar.gz file)
+- small dataset: `Ecoli 10x (287M) `_ . You can find input/output in the downloaded tar gz file.

-Long reads dataset: `PacBio `_
+- large dataset: `Zymobiomics mock-community DNA control (22G) `_ . You can find input/output in the downloaded tar gz file.

-Zymobiomics mock-community DNA control (`SRR7877884 `_). The original dataset is ~4 GB.
+Zymobiomics mock-community DNA control (`SRR7877884 `_); this original dataset is ~4 GB.
+
+For testing purposes and for the following examples, we used a 10% sub-sampling of the above dataset: (`SRR7877884-int-0.1.fastq.gz `_). This dataset is already interleaved.

-For testing, a 10% subsample of the dataset is used: (`SRR7877884-int-0.1.fastq.gz `_). This dataset is already interleaved.

 Input
 -----

-A `JSON file `_ containing the following information:
-
-1. The path to the input FASTQ file (Illumina paired-end interleaved FASTQ or PacBio paired-end interleaved FASTQ) (recommended: output of the Reads QC workflow).
-2. Project name: nmdc:XXXXXX
-3. Memory (optional) e.g., `"jgi_metaAssembly.memory": "105G"`
-4. Threads (optional) e.g., `"jgi_metaAssembly.threads": "16"`
-5. Whether the input is short reads (boolean)
+A JSON file containing the following information:

-Example input JSON for short reads::
+1. the path to the input FASTQ file (Illumina paired-end interleaved FASTQ) (recommended the output of the Reads QC workflow.)
+2. the contig prefix for the FASTA header
+3. the output path
+4. input_interleaved (boolean)
+5. forwards reads fastq file (required value when input_interleaved is false, otherwise use [] )
+6. reverse reads fastq file (required value when input_interleaved is false, otherwise use [] )
+7. memory (optional) ex: “jgi_metaASM.memory”: “105G”
+8. threads (optional) ex: “jgi_metaASM.threads”: “16”

-    {
-      "jgi_metaAssembly.input_files": ["https://portal.nersc.gov/project/m3408/test_data/smalltest.int.fastq.gz"],
-      "jgi_metaAssembly.proj": "nmdc:XXXXXX",
-      "jgi_metaAssembly.memory": "105G",
-      "jgi_metaAssembly.threads": "16",
-      "jgi_metaAssembly.shortRead": true
-    }
-
-Example input JSON for long reads::
+An example input JSON file is shown below::

     {
-      "jgi_metaAssembly.input_files": ["/global/cfs/cdirs/m3408/www/test_data/SRR13128014.pacbio.subsample.ccs.fastq.gz"],
-      "jgi_metaAssembly.proj": "nmdc:XXXXXX",
-      "jgi_metaAssembly.memory": "105G",
-      "jgi_metaAssembly.threads": "16",
-      "jgi_metaAssembly.shortRead": false
+      "jgi_metaASM.input_file":["/path/to/SRR7877884-int-0.1.fastq.gz "],
+      "jgi_metaASM.rename_contig_prefix":"projectID",
+      "jgi_metaASM.outdir":"/path/to/ SRR7877884-int-0.1_assembly",
+      "jgi_metaASM.input_interleaved":true,
+      "jgi_metaASM.input_fq1":[],
+      "jgi_metaASM.input_fq2":[],
+      "jgi_metaASM.memory": "105G",
+      "jgi_metaASM.threads": "16"
     }

 Output
 ------

-The output directory will contain the following files for short reads::
-
-    output/
-    ├── nmdc_XXXXXX_metaAsm.info
-    ├── nmdc_XXXXXX_covstats.txt
-    ├── nmdc_XXXXXX_bbcms.fastq.gz
-    ├── nmdc_XXXXXX_scaffolds.fna
-    ├── nmdc_XXXXXX_assembly.agp
-    ├── stats.json
-    ├── nmdc_XXXXXX_pairedMapped.sam.gz
-    └── nmdc_XXXXXX_pairedMapped_sorted.bam
+The output directory will contain following files::

-The output directory will contain the following files for long reads::

     output/
-    ├── nmdc_XXXXXX_assembly.legend
-    ├── nmdc_XXXXXX_contigs.fna
-    ├── nmdc_XXXXXX_pairedMapped_sorted.bam
-    ├── nmdc_XXXXXX_read_count_report.txt
-    ├── nmdc_XXXXXX_metaAsm.info
-    ├── nmdc_XXXXXX_summary.stats
-    ├── nmdc_XXXXXX_scaffolds.fna
-    ├── nmdc_XXXXXX_pairedMapped.sam.gz
-    ├── stats.json
-    ├── nmdc_XXXXXX_contigs.sam.stats
-    ├── nmdc_XXXXXX_contigs.sorted.bam.pileup.basecov
-    ├── nmdc_XXXXXX_assembly.agp
-    └── nmdc_XXXXXX_contigs.sorted.bam.pileup.out
-
-Example output stats JSON file::
-
-    {
-      "scaffolds": 58,
-      "contigs": 58,
-      "scaf_bp": 28406,
-      "contig_bp": 28406,
-      "gap_pct": 0.00000,
-      "scaf_N50": 21,
-      "scaf_L50": 536,
-      "ctg_N50": 21,
-      "ctg_L50": 536,
-      "scaf_N90": 49,
-      "scaf_L90": 317,
-      "ctg_N90": 49,
-      "ctg_L90": 317,
-      "scaf_logsum": 22.158,
-      "scaf_powsum": 2.245,
-      "ctg_logsum": 22.158,
-      "ctg_powsum": 2.245,
-      "asm_score": 0.000,
-      "scaf_max": 1117,
-      "ctg_max": 1117,
-      "scaf_n_gt50K": 0,
-      "scaf_l_gt50K": 0,
-      "scaf_pct_gt50K": 0.0,
-      "gc_avg": 0.39129,
-      "gc_std": 0.03033
-    }
+    ├── assembly.agp
+    ├── assembly_contigs.fna
+    ├── assembly_scaffolds.fna
+    ├── covstats.txt
+    ├── pairedMapped.sam.gz
+    ├── pairedMapped_sorted.bam
+    └── stats.json
+
+Part of an example output stats JSON file is shown below:
+
+```
+{
+  "scaffolds": 58,
+  "contigs": 58,
+  "scaf_bp": 28406,
+  "contig_bp": 28406,
+  "gap_pct": 0.00000,
+  "scaf_N50": 21,
+  "scaf_L50": 536,
+  "ctg_N50": 21,
+  "ctg_L50": 536,
+  "scaf_N90": 49,
+  "scaf_L90": 317,
+  "ctg_N90": 49,
+  "ctg_L90": 317,
+  "scaf_logsum": 22.158,
+  "scaf_powsum": 2.245,
+  "ctg_logsum": 22.158,
+  "ctg_powsum": 2.245,
+  "asm_score": 0.000,
+  "scaf_max": 1117,
+  "ctg_max": 1117,
+  "scaf_n_gt50K": 0,
+  "scaf_l_gt50K": 0,
+  "scaf_pct_gt50K": 0.0,
+  "gc_avg": 0.39129,
+  "gc_std": 0.03033,
+  "filename": "/global/cfs/cdirs/m3408/aim2/metagenome/assembly/cromwell-executions/jgi_metaASM/3342a6e8-7f78-40e6-a831-364dd2a47baa/call-create_agp/execution/assembly_scaffolds.fna"
+}
+```

 The table provides all of the output directories, files, and their descriptions.

@@ -251,7 +224,7 @@ mapping/ stdout.background
 Version History
 ---------------

-- 1.0.7 (release date **11/12/24**; previous versions: 1.0.6)
+- 1.0.2 (release date **03/12/2021**; previous versions: 1.0.1)

 Point of contact
 ----------------
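
For reference, the memory estimate quoted in the Hardware Requirements hunk can be sketched in Python as below. This is only an illustration of the stated equation: the function name `predicted_mem_gb` and the example k-mer count are assumptions made here, not part of the workflow or of BBTools.

```python
def predicted_mem_gb(kmers: int) -> float:
    """Estimate the memory needed (in GB) from the k-mer count of the input.

    Implements the equation from the documentation:
        predicted_mem = (kmers * 2.962e-08 + 1.630e+01) * 1.1
    """
    return (kmers * 2.962e-08 + 1.630e+01) * 1.1


if __name__ == "__main__":
    # Hypothetical k-mer count, for illustration only; in practice the value
    # would come from running kmercountmulti.sh on the input reads.
    example_kmers = 1_000_000_000
    print(f"Estimated memory: {predicted_mem_gb(example_kmers):.1f} GB")
```

The result could then inform the optional memory setting in the input JSON, rounded up to leave some headroom.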