finish documentation

theiagen · Oct 28, 2024 · fcc17d9
1 parent 48a26a2
Showing 4 changed files with 330 additions and 14 deletions.
diff --git a/docs/workflows/genomic_characterization/pangolin_update.md b/docs/workflows/genomic_characterization/pangolin_update.md
@@ -65,4 +65,8 @@ This workflow runs on the sample level.
 | **pangolin_updates** | String | Result of Pangolin Update (lineage changed versus unchanged) with lineage assignment and date of analysis |
 | **pangolin_versions** | String | All Pangolin software and database versions |
 
-</div>
+</div>
+
+## References
+
+> **Pangolin**: Rambaut A, Holmes EC, O'Toole Á, Hill V, McCrone JT, Ruis C, du Plessis L, Pybus OG. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nat Microbiol. 2020 Nov;5(11):1403-1407. doi: 10.1038/s41564-020-0770-5. Epub 2020 Jul 15. PMID: 32669681; PMCID: PMC7610519.
diff --git a/docs/workflows/genomic_characterization/theiacov.md b/docs/workflows/genomic_characterization/theiacov.md
@@ -900,6 +900,7 @@ All input reads are processed through "core tasks" in the TheiaCoV Illumina, ONT
         | Task | [task_pangolin.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/species_typing/betacoronavirus/task_pangolin.wdl) |
         | Software Source Code | [Pangolin on GitHub](https://github.com/cov-lineages/pangolin) |
         | Software Documentation | [Pangolin website](https://cov-lineages.org/resources/pangolin.html) |
+        | Original Publication(s) | [A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology](https://doi.org/10.1038/s41564-020-0770-5) |
 
 ??? task "`nextclade`"
 
@@ -1138,7 +1139,7 @@ All TheiaCoV Workflows (not TheiaCoV_FASTA_Batch)
 | nextclade_json_flu_ha | File | Nextclade output in JSON file format, specific to Flu HA segment | ONT, PE |
 | nextclade_json_flu_na | File | Nextclade output in JSON file format, specific to Flu NA segment | ONT, PE |
 | nextclade_lineage | String | Nextclade lineage designation | CL, FASTA, ONT, PE, SE |
-| nextclade_qc | String | QC metric as determined by Nextclade. (For Flu, this output will be specific to HA segment) | CL, FASTA, ONT, PE, SE |
+| nextclade_qc | String | QC metric as determined by Nextclade. Will be blank for Flu | CL, FASTA, ONT, PE, SE |
 | nextclade_qc_flu_ha | String | QC metric as determined by Nextclade, specific to Flu HA segment | ONT, PE |
 | nextclade_qc_flu_na | String | QC metric as determined by Nextclade, specific to Flu NA segment | ONT, PE |
 | nextclade_tsv | File | Nextclade output in TSV file format. (For Flu, this output will be specific to HA segment) | CL, FASTA, ONT, PE, SE |

diff --git a/docs/workflows/genomic_characterization/theiameta.md b/docs/workflows/genomic_characterization/theiameta.md
@@ -241,22 +241,62 @@ The TheiaMeta_Illumina_PE workflow processes Illumina paired-end (PE) reads ge
 #### Assembly
 
 ??? task "`metaspades`: _De Novo_ Metagenomic Assembly"
+    While metagenomics has emerged as a technology of choice for analyzing bacterial populations, the assembly of metagenomic data remains challenging. A dedicated metagenomic assembly algorithm is necessary to circumvent the challenge of interpreting variation. metaSPAdes addresses various challenges of metagenomic assembly by capitalizing on computational ideas that proved to be useful in assemblies of single cells and highly polymorphic diploid genomes.
 
-    While metagenomics has emerged as a technology of choice for analyzing bacterial populations, the assembly of metagenomic data remains challenging. A dedicated metagenomic assembly algorithm is necessary to circumvent the challenge of interpreting variation. metaSPAdes addresses various challenges of metagenomic assembly by capitalizing on computational ideas that proved to be useful in assemblies of single cells and highly polymorphic diploid genomes. 
+    `metaspades` is a _de novo_ assembler that first constructs a de Bruijn graph of all the reads using the SPAdes algorithm. Through various graph simplification procedures, paths in the assembly graph are reconstructed that correspond to long genomic fragments within the metagenome. For more details, please see the original publication.
 
     !!! techdetails "MetaSPAdes Technical Details"
-        
         |  | Links |
         | --- | --- |
         | Task | [task_metaspades.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/assembly/task_metaspades.wdl) |
         | Software Source Code | [SPAdes on GitHub](https://github.com/ablab/spades) |
-        | Software Documentation | <https://github.com/ablab/spades/blob/spades_3.15.5/README.md> |
-        | Original Publication(s) | [metaSPAdes: a new versatile metagenomic assembler](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5411777/) |
+        | Software Documentation | [SPAdes Manual](https://ablab.github.io/spades/index.html) |
+        | Original Publication(s) | [metaSPAdes: a new versatile metagenomic assembler](http://www.genome.org/cgi/doi/10.1101/gr.213959.116) |
 
-??? task "`minimap2`: Assembly Alignment and Contig Filtering (if a reference is provided)"
+??? task "`minimap2`: Assembly Alignment and Contig Filtering"
 
     If a reference genome is provided through the **`reference`** optional input, the assembly produced with `metaspades` will be mapped to the reference genome with `minimap2`. The contigs which align to the reference are retrieved and returned in the **`assembly_fasta`** output.
 
+    `minimap2` is a popular aligner that is used for correcting the assembly produced by metaSPAdes. This is done by aligning the reads back to the generated assembly or a reference genome.
+
+    In minimap2, "modes" are a group of preset options. Two different modes are used in this task depending on whether a reference genome is provided.
+
+    If a reference genome is _not_ provided, the only mode used in this task is `sr`, which is intended for "short single-end reads without splicing". The `sr` mode indicates the following parameters should be used: `-k21 -w11 --sr --frag=yes -A2 -B8 -O12,32 -E2,1 -b0 -r100 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g100 -2K50m --heap-sort=yes --secondary=no`. The output file is in SAM format.
+
+    If a reference genome is provided, then after the draft assembly is polished with `pilon`, this task runs again with the mode set to `asm20`, which is intended for "long assembly to reference mapping". The `asm20` mode indicates the following parameters should be used: `-k19 -w10 -U50,500 --rmq -r100k -g10k -A1 -B4 -O6,26 -E2,1 -s200 -z200 -N50`. The output file is in PAF format.
+
+    For more information, please see the [minimap2 manpage](https://lh3.github.io/minimap2/minimap2.html).
+
+    !!! techdetails "minimap2 Technical Details"
+        | | Links |
+        |---|---|
+        | Task | [task_minimap2.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/alignment/task_minimap2.wdl) |
+        | Software Source Code | [minimap2 on GitHub](https://github.com/lh3/minimap2) |
+        | Software Documentation | [minimap2](https://lh3.github.io/minimap2) |
+        | Original Publication(s) | [Minimap2: pairwise alignment for nucleotide sequences](https://academic.oup.com/bioinformatics/article/34/18/3094/4994778) |
+
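As a rough illustration of the two presets described above, the following Python sketch builds (but does not run) the corresponding `minimap2` command lines. It is not taken from `task_minimap2.wdl`: the function names and all file names are hypothetical, and only the `-x`, `-a`, and `-o` options of `minimap2` are used.

```python
import shlex

# Output format per preset, as described above: `sr` writes SAM, `asm20` writes PAF.
PRESET_OUTPUT = {"sr": "sam", "asm20": "paf"}


def minimap2_command(preset, target, queries, output):
    """Return a minimap2 argument list for one of the two presets."""
    if preset not in PRESET_OUTPUT:
        raise ValueError(f"unsupported preset: {preset!r}")
    cmd = ["minimap2", "-x", preset]
    if PRESET_OUTPUT[preset] == "sam":
        cmd.append("-a")  # minimap2 emits PAF by default; -a requests SAM
    return cmd + ["-o", output, target, *queries]


def planned_alignments(has_reference):
    """List the runs in order: `sr` always, `asm20` only when a reference is given."""
    # All file names below are placeholders, not names used by the workflow.
    runs = [
        minimap2_command(
            "sr",
            "draft_assembly.fasta",
            ["reads_1.fastq.gz", "reads_2.fastq.gz"],
            "reads_to_assembly.sam",
        )
    ]
    if has_reference:
        runs.append(
            minimap2_command(
                "asm20",
                "reference.fasta",
                ["polished_assembly.fasta"],
                "assembly_to_reference.paf",
            )
        )
    return runs


if __name__ == "__main__":
    for run in planned_alignments(has_reference=True):
        print(shlex.join(run))
```

Selecting a preset with `-x` lets `minimap2` expand it into the long parameter strings listed above, so the individual flags do not need to be passed explicitly.
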
+??? task "`samtools`: SAM File Conversion"
+    This task converts the SAM file output by minimap2 to a BAM file. It then sorts the BAM based on the read names and generates an index file.
+
+    !!! techdetails "samtools Technical Details"
+        | | Links |
+        |---|---|
+        | Task | [task_parse_mapping.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/utilities/data_handling/task_parse_mapping.wdl) |
+        | Software Source Code | [samtools on GitHub](https://github.com/samtools/samtools) |
+        | Software Documentation | [samtools](https://www.htslib.org/doc/samtools.html) |
+        | Original Publication(s) | [The Sequence Alignment/Map format and SAMtools](https://doi.org/10.1093/bioinformatics/btp352)<br>[Twelve Years of SAMtools and BCFtools](https://doi.org/10.1093/gigascience/giab008) |
+
+??? task "`pilon`: Assembly Polishing"
+    `pilon` is a tool that uses read alignment to correct errors in an assembly. It is used to polish the assembly produced by metaSPAdes. The inputs to Pilon are the sorted BAM file produced by `samtools` and the original draft assembly produced by `metaspades`.
+
+    !!! techdetails "pilon Technical Details"
+        | | Links |
+        |---|---|
+        | Task | [task_pilon.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/quality_control/read_filtering/task_pilon.wdl) |
+        | Software Source Code | [Pilon on GitHub](https://github.com/broadinstitute/pilon) |
+        | Software Documentation | [Pilon Wiki](https://github.com/broadinstitute/pilon/wiki) |
+        | Original Publication(s) | [Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement](https://doi.org/10.1371/journal.pone.0112963) |
+
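The hand-off from `samtools` to `pilon` described above can be sketched as a list of command lines. This is a minimal illustration under stated assumptions, not the contents of the WDL tasks: the function name, the `prefix` argument, and all file names are hypothetical; the `pilon` launcher name is assumed (some installations use `java -jar pilon.jar` instead); and the sort uses the `samtools` default coordinate order, because `samtools index` requires coordinate-sorted input.

```python
import shlex


def polishing_commands(sam, draft_assembly, prefix="sample"):
    """Return the samtools and pilon command lines, in order, without running them."""
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    return [
        # Convert the minimap2 SAM output to BAM.
        ["samtools", "view", "-b", "-o", bam, sam],
        # Sort the BAM (default coordinate order; an assumption, see the text above).
        ["samtools", "sort", "-o", sorted_bam, bam],
        # Index the sorted BAM.
        ["samtools", "index", sorted_bam],
        # Polish the draft assembly; --frags marks the BAM as paired-end fragments.
        [
            "pilon",
            "--genome", draft_assembly,
            "--frags", sorted_bam,
            "--output", f"{prefix}_polished",
        ],
    ]


if __name__ == "__main__":
    for cmd in polishing_commands("reads_to_assembly.sam", "draft_assembly.fasta"):
        print(shlex.join(cmd))
```

Returning argument lists rather than shell strings keeps the sketch safe to pass to `subprocess.run` without shell quoting concerns.
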
 #### Assembly QC
 
 ??? task  "`quast`: Assembly Quality Assessment"