From fec73651fcde335b5c39621fee0468862024a957 Mon Sep 17 00:00:00 2001
From: Sage Wright
Date: Wed, 4 Dec 2024 10:26:18 -0500
Subject: [PATCH] [Documentation] Various updates (#680)

* random documentation updates

* update readme
---
 README.md                                     |  8 +--
 docs/contributing/doc_contribution.md         |  2 +-
 docs/index.md                                 |  8 +--
 .../pangolin_update.md                        |  6 ++-
 .../genomic_characterization/theiacov.md      |  6 ++-
 .../genomic_characterization/theiameta.md     | 51 +++++++++++++++++--
 6 files changed, 66 insertions(+), 15 deletions(-)

diff --git a/README.md b/README.md
index 84bccba76..138a1868c 100644
--- a/README.md
+++ b/README.md
@@ -42,17 +42,17 @@ You can expect a careful review of every PR and feedback as needed before mergin
 
 ### Authorship
 
-(Ordered by contribution [# of lines changed] as of 2024-08-01)
+(Ordered by contribution [# of lines changed] as of 2024-12-04)
 
 * **Sage Wright** ([@sage-wright](https://github.com/sage-wright)) - Conceptualization, Software, Validation, Supervision
 * **Inês Mendes** ([@cimendes](https://github.com/cimendes)) - Software, Validation
 * **Curtis Kapsak** ([@kapsakcj](https://github.com/kapsakcj)) - Conceptualization, Software, Validation
-* **James Otieno** ([@jrotieno](https://github.com/jrotieno)) - Software, Validation
 * **Frank Ambrosio** ([@frankambrosio3](https://github.com/frankambrosio3)) - Conceptualization, Software, Validation
 * **Michelle Scribner** ([@michellescribner](https://github.com/michellescribner)) - Software, Validation
 * **Kevin Libuit** ([@kevinlibuit](https://github.com/kevinlibuit)) - Conceptualization, Project Administration, Software, Validation, Supervision
-* **Emma Doughty** ([@emmadoughty](https://github.com/emmadoughty)) - Software, Validation
+* **Fraser Combe** ([@fraser-combe](https://github.com/fraser-combe)) - Software, Validation
 * **Andrew Page** ([@andrewjpage](https://github.com/andrewjpage)) - Project Administration, Software, Supervision
+* **Michal Babinski** ([@Michal-Babins](https://github.com/Michal-Babins)) - Software, Validation
 * **Andrew Lang** ([@AndrewLangVt](https://github.com/AndrewLangVt)) - Software, Supervision
 * **Kelsey Kropp** ([@kelseykropp](https://github.com/kelseykropp)) - Validation
 * **Emily Smith** ([@emily-smith1](https://github.com/emily-smith1)) - Validation
@@ -62,7 +62,9 @@ You can expect a careful review of every PR and feedback as needed before mergin
 
 We would like to gratefully acknowledge the following individuals from the public health community for their contributions to the PHB repository:
 
+* **James Otieno** ([@jrotieno](https://github.com/jrotieno))
 * **Robert Petit** ([@rpetit3](https://github.com/rpetit3))
+* **Emma Doughty** ([@emmadoughty](https://github.com/emmadoughty))
 * **Ash O'Farrel** ([@aofarrel](https://github.com/aofarrel))
 * **Sam Baird** ([@sam-baird](https://github.com/sam-baird))
 * **Holly Halstead** ([@HNHalstead](https://github.com/HNHalstead))
diff --git a/docs/contributing/doc_contribution.md b/docs/contributing/doc_contribution.md
index 940468961..4ddb7e0de 100644
--- a/docs/contributing/doc_contribution.md
+++ b/docs/contributing/doc_contribution.md
@@ -147,7 +147,7 @@ A brief description of the documentation structure is as follows:
 - `assets/` - Contains images and other files used in the documentation.
     - `figures/` - Contains images, figures, and workflow diagrams used in the documentation. For workflows that contain many images (such as BaseSpace_Fetch), it is recommended to create a subdirectory for the workflow.
     - `files/` - Contains files that are used in the documentation. This may include example outputs or templates. For workflows that contain many files (such as TheiaValidate), it is recommended to create a subdirectory for the workflow.
-    - `logos/` - Contains Theiagen logos and symbols used int he documentation.
+    - `logos/` - Contains Theiagen logos and symbols used in the documentation.
     - `metadata_formatters/` - Contains the most up-to-date metadata formatters for our submission workflows.
     - `new_workflow_template.md` - A template for adding a new workflow page to the documentation. You can see this template [here](../assets/new_workflow_template.md)
 - `contributing/` - Contains the Markdown files for our contribution guides, such as this file
diff --git a/docs/index.md b/docs/index.md
index 058b2149d..ad825cfa3 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -60,17 +60,17 @@ You can expect a careful review of every PR and feedback as needed before mergin
 
 ### Authorship
 
-(Ordered by contribution [# of lines changed] as of 2024-08-01)
+(Ordered by contribution [# of lines changed] as of 2024-12-04)
 
 - **Sage Wright** ([@sage-wright](https://github.com/sage-wright)) - Conceptualization, Software, Validation, Supervision
 - **Inês Mendes** ([@cimendes](https://github.com/cimendes)) - Software, Validation
 - **Curtis Kapsak** ([@kapsakcj](https://github.com/kapsakcj)) - Conceptualization, Software, Validation
-- **James Otieno** ([@jrotieno](https://github.com/jrotieno)) - Software, Validation
 - **Frank Ambrosio** ([@frankambrosio3](https://github.com/frankambrosio3)) - Conceptualization, Software, Validation
 - **Michelle Scribner** ([@michellescribner](https://github.com/michellescribner)) - Software, Validation
 - **Kevin Libuit** ([@kevinlibuit](https://github.com/kevinlibuit)) - Conceptualization, Project Administration, Software, Validation, Supervision
-- **Emma Doughty** ([@emmadoughty](https://github.com/emmadoughty)) - Software, Validation
+- **Fraser Combe** ([@fraser-combe](https://github.com/fraser-combe)) - Software, Validation
 - **Andrew Page** ([@andrewjpage](https://github.com/andrewjpage)) - Project Administration, Software, Supervision
+- **Michal Babinski** ([@Michal-Babins](https://github.com/Michal-Babins)) - Software, Validation
 - **Andrew Lang** ([@AndrewLangVt](https://github.com/AndrewLangVt)) - Software, Supervision
 - **Kelsey Kropp** ([@kelseykropp](https://github.com/kelseykropp)) - Validation
 - **Emily Smith** ([@emily-smith1](https://github.com/emily-smith1)) - Validation
@@ -80,7 +80,9 @@ You can expect a careful review of every PR and feedback as needed before mergin
 
 We would like to gratefully acknowledge the following individuals from the public health community for their contributions to the PHB repository:
 
+- **James Otieno** ([@jrotieno](https://github.com/jrotieno))
 - **Robert Petit** ([@rpetit3](https://github.com/rpetit3))
+- **Emma Doughty** ([@emmadoughty](https://github.com/emmadoughty))
 - **Ash O'Farrel** ([@aofarrel](https://github.com/aofarrel))
 - **Sam Baird** ([@sam-baird](https://github.com/sam-baird))
 - **Holly Halstead** ([@HNHalstead](https://github.com/HNHalstead))
diff --git a/docs/workflows/genomic_characterization/pangolin_update.md b/docs/workflows/genomic_characterization/pangolin_update.md
index 988db4404..a05756888 100644
--- a/docs/workflows/genomic_characterization/pangolin_update.md
+++ b/docs/workflows/genomic_characterization/pangolin_update.md
@@ -65,4 +65,8 @@ This workflow runs on the sample level.
 | **pangolin_updates** | String | Result of Pangolin Update (lineage changed versus unchanged) with lineage assignment and date of analysis |
 | **pangolin_versions** | String | All Pangolin software and database versions |
 
-
\ No newline at end of file
+
+
+## References
+
+> **Pangolin**: Rambaut A, Holmes EC, O'Toole Á, Hill V, McCrone JT, Ruis C, du Plessis L, Pybus OG. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nat Microbiol. 2020 Nov;5(11):1403-1407. doi: 10.1038/s41564-020-0770-5. Epub 2020 Jul 15. PMID: 32669681; PMCID: PMC7610519.
diff --git a/docs/workflows/genomic_characterization/theiacov.md b/docs/workflows/genomic_characterization/theiacov.md
index 8e897e0d8..319a32ad8 100644
--- a/docs/workflows/genomic_characterization/theiacov.md
+++ b/docs/workflows/genomic_characterization/theiacov.md
@@ -899,6 +899,8 @@ All input reads are processed through "core tasks" in the TheiaCoV Illumina, ONT
         | Task | [task_pangolin.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/species_typing/betacoronavirus/task_pangolin.wdl) |
         | Software Source Code | [Pangolin on GitHub](https://github.com/cov-lineages/pangolin) |
         | Software Documentation | [Pangolin website](https://cov-lineages.org/resources/pangolin.html) |
+        | Original Publication(s) | [A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology](https://doi.org/10.1038/s41564-020-0770-5) |
+
 
 ??? task "`nextclade`"
 
@@ -1141,10 +1143,10 @@ All TheiaCoV Workflows (not TheiaCoV_FASTA_Batch)
 | nextclade_json_flu_ha | File | Nextclade output in JSON file format, specific to Flu HA segment | ONT, PE |
 | nextclade_json_flu_na | File | Nextclade output in JSON file format, specific to Flu NA segment | ONT, PE |
 | nextclade_lineage | String | Nextclade lineage designation | CL, FASTA, ONT, PE, SE |
-| nextclade_qc | String | QC metric as determined by Nextclade. (For Flu, this output will be specific to HA segment) | CL, FASTA, ONT, PE, SE |
+| nextclade_qc | String | QC metric as determined by Nextclade. Will be blank for Flu | CL, FASTA, ONT, PE, SE |
 | nextclade_qc_flu_ha | String | QC metric as determined by Nextclade, specific to Flu HA segment | ONT, PE |
 | nextclade_qc_flu_na | String | QC metric as determined by Nextclade, specific to Flu NA segment | ONT, PE |
-| nextclade_tsv | File | Nextclade output in TSV file format. (For Flu, this output will be specific to HA segment) | CL, FASTA, ONT, PE, SE |
+| nextclade_tsv | File | Nextclade output in TSV file format. Will be blank for Flu | CL, FASTA, ONT, PE, SE |
 | nextclade_tsv_flu_ha | File | Nextclade output in TSV file format, specific to Flu HA segment | ONT, PE |
 | nextclade_tsv_flu_na | File | Nextclade output in TSV file format, specific to Flu NA segment | ONT, PE |
 | nextclade_version | String | The version of Nextclade software used | CL, FASTA, ONT, PE, SE |
diff --git a/docs/workflows/genomic_characterization/theiameta.md b/docs/workflows/genomic_characterization/theiameta.md
index e166088aa..d6b55e80a 100644
--- a/docs/workflows/genomic_characterization/theiameta.md
+++ b/docs/workflows/genomic_characterization/theiameta.md
@@ -242,21 +242,62 @@ The TheiaMeta_Illumina_PE workflow processes Illumina  paired-end (PE) reads ge
 
 ??? task "`metaspades`: _De Novo_ Metagenomic Assembly"
 
-    While metagenomics has emerged as a technology of choice for analyzing bacterial populations, the assembly of metagenomic data remains challenging. A dedicated metagenomic assembly algorithm is necessary to circumvent the challenge of interpreting variation. metaSPAdes addresses various challenges of metagenomic assembly by capitalizing on computational ideas that proved to be useful in assemblies of single cells and highly polymorphic diploid genomes.
+    While metagenomics has emerged as a technology of choice for analyzing bacterial populations, the assembly of metagenomic data remains challenging. A dedicated metagenomic assembly algorithm is necessary to circumvent the challenge of interpreting variation. metaSPAdes addresses various challenges of metagenomic assembly by capitalizing on computational ideas that proved to be useful in assemblies of single cells and highly polymorphic diploid genomes.
+
+    `metaspades` is a _de novo_ assembler that first constructs a de Bruijn graph of all the reads using the SPAdes algorithm. Through various graph simplification procedures, paths in the assembly graph are reconstructed that correspond to long genomic fragments within the metagenome. For more details, please see the original publication.
 
     !!! techdetails "MetaSPAdes Technical Details"
 
         | | Links |
         | --- | --- |
         | Task | [task_metaspades.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/assembly/task_metaspades.wdl) |
-        | Software Source Code | [SPAdes on GitHub](https://github.com/ablab/spades) |
-        | Software Documentation | |
-        | Original Publication(s) | [metaSPAdes: a new versatile metagenomic assembler](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5411777/) |
+        | Software Source Code | [SPAdes on GitHub](https://github.com/ablab/spades) |
+        | Software Documentation | [SPAdes Manual](https://ablab.github.io/spades/index.html) |
+        | Original Publication(s) | [metaSPAdes: a new versatile metagenomic assembler](http://www.genome.org/cgi/doi/10.1101/gr.213959.116) |
 
-??? task "`minimap2`: Assembly Alignment and Contig Filtering (if a reference is provided)"
+??? task "`minimap2`: Assembly Alignment and Contig Filtering"
 
     If a reference genome is provided through the **`reference`** optional input, the assembly produced with `metaspades` will be mapped to the reference genome with `minimap2`. The contigs which align to the reference are retrieved and returned in the **`assembly_fasta`** output.
 
+    `minimap2` is a popular aligner that is used for correcting the assembly produced by metaSPAdes. This is done by aligning the reads back to the generated assembly or a reference genome.
+
+    In minimap2, "modes" are a group of preset options. Two different modes are used in this task depending on whether a reference genome is provided.
+
+    If a reference genome is _not_ provided, the only mode used in this task is `sr` which is intended for "short single-end reads without splicing". The `sr` mode indicates the following parameters should be used: `-k21 -w11 --sr --frag=yes -A2 -B8 -O12,32 -E2,1 -b0 -r100 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g100 -2K50m --heap-sort=yes --secondary=no`. The output file is in SAM format.
+
+    If a reference genome is provided, then after the draft assembly polishing with `pilon`, this task runs again with the mode set to `asm20` which is intended for "long assembly to reference mapping". The `asm20` mode indicates the following parameters should be used: `-k19 -w10 -U50,500 --rmq -r100k -g10k -A1 -B4 -O6,26 -E2,1 -s200 -z200 -N50`. The output file is in PAF format.
+
+    For more information, please see the [minimap2 manpage](https://lh3.github.io/minimap2/minimap2.html).
+
+    !!! techdetails "minimap2 Technical Details"
+        | | Links |
+        |---|---|
+        | Task | [task_minimap2.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/alignment/task_minimap2.wdl) |
+        | Software Source Code | [minimap2 on GitHub](https://github.com/lh3/minimap2) |
+        | Software Documentation | [minimap2](https://lh3.github.io/minimap2) |
+        | Original Publication(s) | [Minimap2: pairwise alignment for nucleotide sequences](https://academic.oup.com/bioinformatics/article/34/18/3094/4994778) |
+
+??? task "`samtools`: SAM File Conversion"
+    This task converts the SAM file output by minimap2 to a BAM file. It then sorts the BAM by read name and generates an index file.
+
+    !!! techdetails "samtools Technical Details"
+        | | Links |
+        |---|---|
+        | Task | [task_samtools.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/utilities/data_handling/task_parse_mapping.wdl) |
+        | Software Source Code | [samtools on GitHub](https://github.com/samtools/samtools) |
+        | Software Documentation | [samtools](https://www.htslib.org/doc/samtools.html) |
+        | Original Publication(s) | [The Sequence Alignment/Map format and SAMtools](https://doi.org/10.1093/bioinformatics/btp352)<br>[Twelve Years of SAMtools and BCFtools](https://doi.org/10.1093/gigascience/giab008) |
+
+??? task "`pilon`: Assembly Polishing"
+    `pilon` is a tool that uses read alignment to correct errors in an assembly. It is used to polish the assembly produced by metaSPAdes. The input to Pilon is the sorted BAM file produced by `samtools`, and the original draft assembly produced by `metaspades`.
+
+    !!! techdetails "pilon Technical Details"
+        | | Links |
+        |---|---|
+        | Task | [task_pilon.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/quality_control/read_filtering/task_pilon.wdl) |
+        | Software Source Code | [Pilon on GitHub](https://github.com/broadinstitute/pilon) |
+        | Software Documentation | [Pilon Wiki](https://github.com/broadinstitute/pilon/wiki) |
+        | Original Publication(s) | [Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement](https://doi.org/10.1371/journal.pone.0112963) |
 #### Assembly QC
 
 ??? task "`quast`: Assembly Quality Assessment"