Revert index.rst
vlilanl authored Nov 14, 2024
1 parent 4676bf3 commit 34e83d1
Showing 1 changed file with 81 additions and 108 deletions.
189 changes: 81 additions & 108 deletions docs/index.rst
@@ -1,32 +1,26 @@
Metagenome Assembly Workflow (v1.0.7)
Metagenome Assembly Workflow (v1.0.2)
=====================================

.. image:: workflow_assembly.png
:scale: 60%
:alt: Metagenome assembly workflow dependencies

Workflow Overview
-----------------

This workflow takes in paired-end Illumina short reads or paired-end PacBio long reads in interleaved format and performs error correction, then reformats the interleaved file into two FASTQ files for downstream tasks using bbcms (BBTools). The corrected reads are assembled using metaSPAdes. After assembly, the reads are mapped back to contigs by bbmap (BBTools) for coverage information. The `.wdl` (Workflow Description Language) file includes five tasks: *bbcms*, *assy*, *create_agp*, *read_mapping_pairs*, and *make_output*.
This workflow takes in paired-end Illumina reads in interleaved format and performs error correction, then reformats the interleaved file into two FASTQ files for downstream tasks using bbcms (BBTools). The corrected reads are assembled using metaSPAdes. After assembly, the reads are mapped back to contigs by bbmap (BBTools) for coverage information. The .wdl (Workflow Description Language) file includes five tasks, *bbcms*, *assy*, *create_agp*, *read_mapping_pairs*, and *make_output*.

1. The *bbcms* task takes in interleaved FASTQ inputs, performs error correction, and reformats the interleaved FASTQ into two output FASTQ files for paired-end reads for the next tasks.
2. The *assy* task performs metaSPAdes assembly.
3. Contigs and Scaffolds (output of metaSPAdes) are processed by the *create_agp* task to rename the FASTA header and generate an `AGP format <https://www.ncbi.nlm.nih.gov/assembly/agp/AGP_Specification/>`_ which describes the assembly.
1. The *bbcms* task takes in interleaved FASTQ inputs and performs error correction and reformats the interleaved fastq into two output FASTQ files for paired-end reads for the next tasks.
2. The *assy* task performs metaSPAdes assembly
3. Contigs and Scaffolds (output of metaSPAdes) are consumed by the *create_agp* task to rename the FASTA header and generate an `AGP format <https://www.ncbi.nlm.nih.gov/assembly/agp/AGP_Specification/>`_ which describes the assembly
4. The *read_mapping_pairs* task maps reads back to the final assembly to generate coverage information.
5. The final *make_output* task collects all output files into the specified directory.
5. The final *make_output* task adds all output files into the specified directory.

Workflow Availability
---------------------

The workflow from GitHub uses all the listed Docker images to run all third-party tools.

The workflow is available on GitHub: `https://github.com/microbiomedata/metaAssembly`

The corresponding Docker images are available on DockerHub:

- `https://hub.docker.com/r/microbiomedata/spades`
- `https://hub.docker.com/r/microbiomedata/bbtools`
The workflow from GitHub uses all the listed docker images to run all third-party tools.
The workflow is available in GitHub: https://github.com/microbiomedata/metaAssembly; the corresponding Docker images are available in DockerHub: https://hub.docker.com/r/microbiomedata/spades and https://hub.docker.com/r/microbiomedata/bbtools

Requirements for Execution
--------------------------
@@ -39,135 +33,114 @@ Requirements for Execution
Hardware Requirements
---------------------

**Memory: >40 GB RAM**
- Memory: >40 GB RAM

The memory requirement depends on the input complexity. Here is a simple estimation equation for the memory required based on kmers in the input file::

predicted_mem = (kmers * 2.962e-08 + 1.630e+01) * 1.1 (in GB)

.. note::

The kmers variable for the equation above can be obtained using the kmercountmulti.sh script from BBTools.

The kmers variable for the equation above can be obtained using the `kmercountmulti.sh` script from BBTools.
kmercountmulti.sh -k=31 in=your.read.fq.gz

Example command:

::

kmercountmulti.sh -k=31 in=your.read.fq.gz
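
The memory estimate above can be turned into a small helper. The following is a minimal sketch, not part of the workflow itself: the function name ``predict_memory_gb`` and the example k-mer count are assumptions made for illustration, and the k-mer count is expected to come from a separate ``kmercountmulti.sh`` run.

```python
def predict_memory_gb(kmers: int) -> float:
    """Estimate the RAM (in GB) needed for assembly from a k-mer count.

    Implements the equation from the text:
        predicted_mem = (kmers * 2.962e-08 + 1.630e+01) * 1.1
    """
    if kmers < 0:
        raise ValueError("kmers must be non-negative")
    return (kmers * 2.962e-08 + 1.630e+01) * 1.1


if __name__ == "__main__":
    # Hypothetical k-mer count, chosen only to illustrate the arithmetic.
    example_kmers = 1_000_000_000
    print(f"Predicted memory: {predict_memory_gb(example_kmers):.1f} GB")
```

Because the equation has a constant term of 16.3 GB scaled by 1.1, the estimate never drops below roughly 18 GB, even for very small inputs.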

Workflow Dependencies
---------------------

Third-party software: (This is included in the Docker image.)
Third party software: (This is included in the Docker image.)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- `metaSPAdes v3.15.0 <https://cab.spbu.ru/software/spades/>`_ (License: `GPLv2 <https://github.com/ablab/spades/blob/spades_3.15.0/assembler/GPLv2.txt>`_)
- `BBTools v38.94 <https://jgi.doe.gov/data-and-tools/bbtools/>`_ (License: `BSD-3-Clause-LBNL <https://bitbucket.org/berkeleylab/jgi-bbtools/src/master/license.txt>`_)
- `metaSPades v3.15.0 <https://cab.spbu.ru/software/spades/>`_ (License: `GPLv2 <https://github.com/ablab/spades/blob/spades_3.15.0/assembler/GPLv2.txt>`_)
- `BBTools:38.94 <https://jgi.doe.gov/data-and-tools/bbtools/>`_ (License: `BSD-3-Clause-LBNL <https://bitbucket.org/berkeleylab/jgi-bbtools/src/master/license.txt>`_)

Sample dataset(s)
-----------------

- Small dataset: `Ecoli 10x (287M) <https://portal.nersc.gov/cfs/m3408/test_data/metaAssembly_small_test_data.tgz>`_ (Input/output included in tar.gz file)
- Large dataset: `Zymobiomics mock-community DNA control (22G) <https://portal.nersc.gov/cfs/m3408/test_data/metaAssembly_large_test_data.tgz>`_ (Input/output included in tar.gz file)
- small dataset: `Ecoli 10x (287M) <https://portal.nersc.gov/cfs/m3408/test_data/metaAssembly_small_test_data.tgz>`_. The input and output files are included in the downloaded tar.gz file.

Long reads dataset: `PacBio <https://portal.nersc.gov/project/m3408//test_data/SRR13128014.pacbio.subsample.ccs.fastq.gz>`_
- large dataset: `Zymobiomics mock-community DNA control (22G) <https://portal.nersc.gov/cfs/m3408/test_data/metaAssembly_large_test_data.tgz>`_. The input and output files are included in the downloaded tar.gz file.

Zymobiomics mock-community DNA control (`SRR7877884 <https://www.ebi.ac.uk/ena/browser/view/SRR7877884>`_). The original dataset is ~4 GB.
Zymobiomics mock-community DNA control (`SRR7877884 <https://www.ebi.ac.uk/ena/browser/view/SRR7877884>`_); this original dataset is ~4 GB.

For testing purposes and for the following examples, we used a 10% sub-sampling of the above dataset: (`SRR7877884-int-0.1.fastq.gz <https://portal.nersc.gov/cfs/m3408/test_data/SRR7877884-int-0.1.fastq.gz>`_). This dataset is already interleaved.

For testing, a 10% subsample of the dataset is used: (`SRR7877884-int-0.1.fastq.gz <https://portal.nersc.gov/cfs/m3408/test_data/SRR7877884-int-0.1.fastq.gz>`_). This dataset is already interleaved.

Input
-----

A `JSON file <https://github.com/microbiomedata/metaAssembly/blob/master/input.json>`_ containing the following information:

1. The path to the input FASTQ file (Illumina paired-end interleaved FASTQ or PacBio paired-end interleaved FASTQ) (recommended: output of the Reads QC workflow).
2. Project name: nmdc:XXXXXX
3. Memory (optional) e.g., `"jgi_metaAssembly.memory": "105G"`
4. Threads (optional) e.g., `"jgi_metaAssembly.threads": "16"`
5. Whether the input is short reads (boolean)
A JSON file containing the following information:

Example input JSON for short reads::
1. the path to the input FASTQ file (Illumina paired-end interleaved FASTQ; the output of the Reads QC workflow is recommended)
2. the contig prefix for the FASTA header
3. the output path
4. input_interleaved (boolean)
5. forward reads FASTQ file (required when input_interleaved is false; otherwise use [])
6. reverse reads FASTQ file (required when input_interleaved is false; otherwise use [])
7. memory (optional), e.g. ``"jgi_metaASM.memory": "105G"``
8. threads (optional), e.g. ``"jgi_metaASM.threads": "16"``

{
"jgi_metaAssembly.input_files": ["https://portal.nersc.gov/project/m3408/test_data/smalltest.int.fastq.gz"],
"jgi_metaAssembly.proj": "nmdc:XXXXXX",
"jgi_metaAssembly.memory": "105G",
"jgi_metaAssembly.threads": "16",
"jgi_metaAssembly.shortRead": true
}

Example input JSON for long reads::
An example input JSON file is shown below::

{
"jgi_metaAssembly.input_files": ["/global/cfs/cdirs/m3408/www/test_data/SRR13128014.pacbio.subsample.ccs.fastq.gz"],
"jgi_metaAssembly.proj": "nmdc:XXXXXX",
"jgi_metaAssembly.memory": "105G",
"jgi_metaAssembly.threads": "16",
"jgi_metaAssembly.shortRead": false
"jgi_metaASM.input_file":["/path/to/SRR7877884-int-0.1.fastq.gz"],
"jgi_metaASM.rename_contig_prefix":"projectID",
"jgi_metaASM.outdir":"/path/to/SRR7877884-int-0.1_assembly",
"jgi_metaASM.input_interleaved":true,
"jgi_metaASM.input_fq1":[],
"jgi_metaASM.input_fq2":[],
"jgi_metaASM.memory": "105G",
"jgi_metaASM.threads": "16"
}
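
An input file with the ``jgi_metaASM`` keys can also be generated programmatically, which helps avoid stray whitespace in paths. This is a minimal sketch for the interleaved case only: the helper name ``build_inputs`` and the example paths are hypothetical, while the keys are copied from the ``jgi_metaASM`` example.

```python
import json


def build_inputs(reads_path, contig_prefix, outdir, memory="105G", threads="16"):
    """Build the inputs dictionary for one interleaved FASTQ file.

    Paths are stripped of surrounding whitespace before use.
    """
    return {
        "jgi_metaASM.input_file": [reads_path.strip()],
        "jgi_metaASM.rename_contig_prefix": contig_prefix,
        "jgi_metaASM.outdir": outdir.strip(),
        "jgi_metaASM.input_interleaved": True,
        # Left empty because the reads are interleaved.
        "jgi_metaASM.input_fq1": [],
        "jgi_metaASM.input_fq2": [],
        "jgi_metaASM.memory": memory,
        "jgi_metaASM.threads": threads,
    }


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    inputs = build_inputs(
        "/path/to/SRR7877884-int-0.1.fastq.gz",
        "projectID",
        "/path/to/SRR7877884-int-0.1_assembly",
    )
    print(json.dumps(inputs, indent=2))
```

Redirecting the printed output to a file gives a JSON document that can be passed to the workflow engine as the inputs file.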

Output
------

The output directory will contain the following files for short reads::

output/
├── nmdc_XXXXXX_metaAsm.info
├── nmdc_XXXXXX_covstats.txt
├── nmdc_XXXXXX_bbcms.fastq.gz
├── nmdc_XXXXXX_scaffolds.fna
├── nmdc_XXXXXX_assembly.agp
├── stats.json
├── nmdc_XXXXXX_pairedMapped.sam.gz
└── nmdc_XXXXXX_pairedMapped_sorted.bam
The output directory will contain the following files::

The output directory will contain the following files for long reads::

output/
├── nmdc_XXXXXX_assembly.legend
├── nmdc_XXXXXX_contigs.fna
├── nmdc_XXXXXX_pairedMapped_sorted.bam
├── nmdc_XXXXXX_read_count_report.txt
├── nmdc_XXXXXX_metaAsm.info
├── nmdc_XXXXXX_summary.stats
├── nmdc_XXXXXX_scaffolds.fna
├── nmdc_XXXXXX_pairedMapped.sam.gz
├── stats.json
├── nmdc_XXXXXX_contigs.sam.stats
├── nmdc_XXXXXX_contigs.sorted.bam.pileup.basecov
├── nmdc_XXXXXX_assembly.agp
└── nmdc_XXXXXX_contigs.sorted.bam.pileup.out

Example output stats JSON file::

{
"scaffolds": 58,
"contigs": 58,
"scaf_bp": 28406,
"contig_bp": 28406,
"gap_pct": 0.00000,
"scaf_N50": 21,
"scaf_L50": 536,
"ctg_N50": 21,
"ctg_L50": 536,
"scaf_N90": 49,
"scaf_L90": 317,
"ctg_N90": 49,
"ctg_L90": 317,
"scaf_logsum": 22.158,
"scaf_powsum": 2.245,
"ctg_logsum": 22.158,
"ctg_powsum": 2.245,
"asm_score": 0.000,
"scaf_max": 1117,
"ctg_max": 1117,
"scaf_n_gt50K": 0,
"scaf_l_gt50K": 0,
"scaf_pct_gt50K": 0.0,
"gc_avg": 0.39129,
"gc_std": 0.03033
}
├── assembly.agp
├── assembly_contigs.fna
├── assembly_scaffolds.fna
├── covstats.txt
├── pairedMapped.sam.gz
├── pairedMapped_sorted.bam
└── stats.json

Part of an example output stats JSON file is shown below:

```
{
"scaffolds": 58,
"contigs": 58,
"scaf_bp": 28406,
"contig_bp": 28406,
"gap_pct": 0.00000,
"scaf_N50": 21,
"scaf_L50": 536,
"ctg_N50": 21,
"ctg_L50": 536,
"scaf_N90": 49,
"scaf_L90": 317,
"ctg_N90": 49,
"ctg_L90": 317,
"scaf_logsum": 22.158,
"scaf_powsum": 2.245,
"ctg_logsum": 22.158,
"ctg_powsum": 2.245,
"asm_score": 0.000,
"scaf_max": 1117,
"ctg_max": 1117,
"scaf_n_gt50K": 0,
"scaf_l_gt50K": 0,
"scaf_pct_gt50K": 0.0,
"gc_avg": 0.39129,
"gc_std": 0.03033,
"filename": "/global/cfs/cdirs/m3408/aim2/metagenome/assembly/cromwell-executions/jgi_metaASM/3342a6e8-7f78-40e6-a831-364dd2a47baa/call-create_agp/execution/assembly_scaffolds.fna"
}
```
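
The stats file is plain JSON, so it can be inspected with a few lines of Python. The sketch below is illustrative only: the helper name ``summarize_stats`` and the derived field names are assumptions, and only keys that appear in the example above are read. Note that in the example the ``N50`` fields hold small values (21) and the ``L50`` fields larger ones (536), so the N50 fields appear to be counts and the L50 fields lengths; check the BBTools documentation before relying on either.

```python
import json


def summarize_stats(stats: dict) -> dict:
    """Derive a few headline numbers from a parsed stats.json dictionary."""
    scaffolds = stats["scaffolds"]
    scaf_bp = stats["scaf_bp"]
    return {
        "scaffolds": scaffolds,
        "total_bp": scaf_bp,
        # Guard against an empty assembly to avoid division by zero.
        "mean_scaffold_bp": scaf_bp / scaffolds if scaffolds else 0.0,
        "longest_scaffold_bp": stats["scaf_max"],
        # gc_avg is stored as a fraction; convert it to a percentage.
        "gc_percent": stats["gc_avg"] * 100,
    }


if __name__ == "__main__":
    # Values copied from the example stats file above.
    example = json.loads(
        '{"scaffolds": 58, "scaf_bp": 28406, "scaf_max": 1117, "gc_avg": 0.39129}'
    )
    for key, value in summarize_stats(example).items():
        print(f"{key}: {value}")
```

To use it on a real run, load the file with ``json.load(open("stats.json"))`` and pass the result to the function.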


The table provides all of the output directories, files, and their descriptions.
@@ -251,7 +224,7 @@ mapping/ stdout.background
Version History
---------------

- 1.0.7 (release date **11/12/24**; previous versions: 1.0.6)
- 1.0.2 (release date **03/12/2021**; previous versions: 1.0.1)

Point of contact
----------------