Merge branch 'main' into smw-pipefail-dev

theiagen · Nov 8, 2024 · 5833b9d · 5833b9d
2 parents c425b04 + 4bcf3f2
commit 5833b9d
Show file tree

Hide file tree

Showing 26 changed files with 247 additions and 30 deletions.
diff --git a/docs/workflows/genomic_characterization/theiacov.md b/docs/workflows/genomic_characterization/theiacov.md
@@ -1157,6 +1157,7 @@ All TheiaCoV Workflows (not TheiaCoV_FASTA_Batch)
 | pangolin_notes | String | Lineage notes as determined by Pangolin | CL, FASTA, ONT, PE, SE |
 | pangolin_versions | String | All Pangolin software and database versions | CL, FASTA, ONT, PE, SE |
 | percent_reference_coverage | Float | Percent coverage of the reference genome after performing primer trimming; calculated as assembly_length_unambiguous / length of the reference genome (SC2: 29903) x 100 | CL, FASTA, ONT, PE, SE |
+| percentage_mapped_reads | String | Percentage of reads that successfully aligned to the reference genome. This value is calculated by number of mapped reads / total number of reads x 100. | ONT, PE, SE |
 | primer_bed_name | String | Name of the primer bed files used for primer trimming | CL, ONT, PE, SE |
 | primer_trimmed_read_percent | Float | Percentage of read data with primers trimmed as determined by iVar trim | PE, SE |
 | qc_check | String | The results of the QC Check task | CL, FASTA, ONT, PE, SE |

diff --git a/docs/workflows/genomic_characterization/theiaprok.md b/docs/workflows/genomic_characterization/theiaprok.md
@@ -686,7 +686,34 @@ All input reads are processed through "[core tasks](#core-tasks-performed-for-al
         - relative_abundance: estimated relative abundance of species in metagenome
         
         The value in the `midas_primary_genus` column is derived by ordering the rows in order of "relative_abundance" and identifying the genus of top species in the "species_id" column (Salmonella). The value in the `midas_secondary_genus` column is derived from the genus of the second-most prevalent genus in the "species_id" column (Citrobacter). The `midas_secondary_genus_abundance` column is the "relative_abundance" of the second-most prevalent genus (0.009477003). The `midas_secondary_genus_coverage` is the "coverage" of the second-most prevalent genus (0.995216227).
-        
+
+    **MIDAS Reference Database Overview**
+
+    The **MIDAS reference database** is a comprehensive tool for identifying bacterial species in metagenomic and bacterial isolate WGS data. It includes several layers of genomic data, helping detect species abundance and potential contaminants.
+
+    **Key Components of the MIDAS Database**
+
+    1. **Species Groups**: 
+    - MIDAS clusters bacterial genomes based on 96.5% sequence identity, forming over 5,950 species groups from 31,007 genomes. These groups align with the gold-standard species definition (95% ANI), ensuring highly accurate species identification.
+
+    2. **Genomic Data Structure**:
+    - **Marker Genes**: Contains 15 universal single-copy genes used to estimate species abundance.
+    - **Representative Genome**: Each species group has a selected representative genome, which minimizes genetic variation and aids in accurate SNP identification.
+    - **Pan-genome**: The database includes clusters of non-redundant genes, with options for multi-level clustering (e.g., 99%, 95%, 90% identity), enabling MIDAS to identify gene content within strains at various clustering thresholds.
+
+    3. **Taxonomic Annotation**: 
+    - Genomes are annotated based on consensus Latin names. Discrepancies in name assignments may occur due to factors like unclassified genomes or genus-level ambiguities.
+
+    ---
+
+    **Using the Default MIDAS Database**
+
+    TheiaProk uses the pre-loaded MIDAS database in Terra (see input table for current version) by default for bacterial species detection in metagenomic data, requiring no additional setup.
+
+    **How to Set Up the Default MIDAS Database**
+
+    Users can also build their own custom MIDAS database if they want to include specific genomes or configurations. This custom database can replace the default MIDAS database used in Terra. To build a custom MIDAS database, follow the [MIDAS GitHub guide on building a custom database](https://github.com/snayfach/MIDAS/blob/master/docs/build_db.md). Once the database is built, users can upload it to a Google Cloud Storage bucket or Terra workkspace and provide the link to the database in the `midas_db` input variable.
+
     Alternatively to `MIDAS`, the `Kraken2` task can also be turned on through setting the `call_kraken` input variable as `true` for the identification of reads to detect contamination with non-target taxa.
 
     Kraken2 is a bioinformatics tool originally designed for metagenomic applications. It has additionally proven valuable for validating taxonomic assignments and checking contamination of single-species (e.g. bacterial isolate) whole genome sequence data. A database must be provided if this optional module is activated, through the kraken_db optional input. A list of suggested databases can be found on [Kraken2 standalone documentation](../standalone/kraken2.md).

diff --git a/docs/workflows/phylogenetic_construction/czgenepi_prep.md b/docs/workflows/phylogenetic_construction/czgenepi_prep.md
@@ -26,7 +26,7 @@ This workflow runs on the set level.
 | czgenepi_prep | **terra_table_name** | String | The name of the Terra table where the data is hosted |  | Required |
 | czgenepi_prep | **terra_project_name** | String | The name of the Terra project where the data is hosted |  | Required |
 | czgenepi_prep | **terra_workspace_name** | String | The name of the Terra workspace where the data is hosted |  | Required |
-| download_terra_table | **memory** | Int | Amount of memory/RAM (in GB) to allocate to the task | 10 | Optional |
+| download_terra_table | **memory** | Int | Amount of memory/RAM (in GB) to allocate to the task | 2 | Optional |
 | download_terra_table | **docker** | String | The Docker container to use for the task | quay.io/theiagen/terra-tools:2023-06-21 | Optional |
 | download_terra_table | **disk_size** | String | The size of the disk used when running this task | 1 | Optional |
 | download_terra_table | **cpu** | Int | Number of CPUs to allocate to the task | 1 | Optional |

diff --git a/docs/workflows/phylogenetic_construction/snippy_streamline.md b/docs/workflows/phylogenetic_construction/snippy_streamline.md
@@ -173,6 +173,32 @@ For all cases:
 
     `Snippy_Variants` aligns reads for each sample against the reference genome. As part of `Snippy_Streamline`, the only output from this workflow is the `snippy_variants_outdir_tarball` which is provided in the set-level data table. Please see the full documentation for [Snippy_Variants](./snippy_variants.md) for more information.
 
+??? task "snippy_variants (qc_metrics output)"
+
+    ##### snippy_variants {#snippy_variants}
+
+    This task runs Snippy to perform SNP analysis on individual samples. It extracts QC metrics from the Snippy output for each sample and saves them in per-sample TSV files (`snippy_variants_qc_metrics`). These per-sample QC metrics include the following columns:
+
+    - **samplename**: The name of the sample.
+    - **reads_aligned_to_reference**: The number of reads that aligned to the reference genome.
+    - **total_reads**: The total number of reads in the sample.
+    - **percent_reads_aligned**: The percentage of reads that aligned to the reference genome.
+    - **variants_total**: The total number of variants detected between the sample and the reference genome.
+    - **percent_ref_coverage**: The percentage of the reference genome covered by reads with a depth greater than or equal to the `min_coverage` threshold (default is 10).
+    - **#rname**: Reference sequence name (e.g., chromosome or contig name).
+    - **startpos**: Starting position of the reference sequence.
+    - **endpos**: Ending position of the reference sequence.
+    - **numreads**: Number of reads covering the reference sequence.
+    - **covbases**: Number of bases with coverage.
+    - **coverage**: Percentage of the reference sequence covered (depth ≥ 1).
+    - **meandepth**: Mean depth of coverage over the reference sequence.
+    - **meanbaseq**: Mean base quality over the reference sequence.
+    - **meanmapq**: Mean mapping quality over the reference sequence.
+
+    These per-sample QC metrics are then combined into a single file (`snippy_combined_qc_metrics`) in the downstream `snippy_tree_wf` workflow. The combined QC metrics file includes the same columns as above for all samples. Note that the last set of columns (`#rname` to `meanmapq`) may repeat for each chromosome or contig in the reference genome.
+
+    **Note:** The per-sample QC metrics provide valuable insights into the quality and coverage of your sequencing data relative to the reference genome. Monitoring these metrics can help identify samples with low coverage, poor alignment, or potential issues that may affect downstream analyses.
+
 ??? task "Snippy_Tree workflow"
 
     ##### Snippy_Tree {#snippy_tree}
@@ -194,6 +220,7 @@ For all cases:
 | snippy_centroid_version | String | Centroid version used |
 | snippy_cg_snp_matrix | File | CSV file of core genome pairwise SNP distances between samples, calculated from the final alignment  |
 | snippy_concatenated_variants | File | The concatenated variants file |
+| snippy_combined_qc_metrics | File | Combined QC metrics file containing concatenated QC metrics from all samples. |
 | snippy_filtered_metadata | File | TSV recording the columns of the Terra data table that were used in the summarize_data task |
 | snippy_final_alignment | File | Final alignment (FASTA file) used to generate the tree (either after snippy alignment, gubbins recombination removal, and/or core site selection with SNP-sites) |
 | snippy_final_tree | File | Final phylogenetic tree produced by Snippy_Streamline |

diff --git a/docs/workflows/phylogenetic_construction/snippy_streamline_fasta.md b/docs/workflows/phylogenetic_construction/snippy_streamline_fasta.md
@@ -37,6 +37,34 @@ The `Snippy_Streamline_FASTA` workflow is an all-in-one approach to generating a
 
     **If reference genomes have multiple contigs, they will not be compatible with using Gubbins** to mask recombination in the phylogenetic tree. The automatic selection of a reference genome by the workflow may result in a reference with multiple contigs. In this case, an alternative reference genome should be sought.
 
+### Workflow Tasks
+
+??? task "snippy_variants (qc_metrics output)"
+
+    ##### snippy_variants {#snippy_variants}
+
+    This task runs Snippy to perform SNP analysis on individual samples. It extracts QC metrics from the Snippy output for each sample and saves them in per-sample TSV files (`snippy_variants_qc_metrics`). These per-sample QC metrics include the following columns:
+
+    - **samplename**: The name of the sample.
+    - **reads_aligned_to_reference**: The number of reads that aligned to the reference genome.
+    - **total_reads**: The total number of reads in the sample.
+    - **percent_reads_aligned**: The percentage of reads that aligned to the reference genome.
+    - **variants_total**: The total number of variants detected between the sample and the reference genome.
+    - **percent_ref_coverage**: The percentage of the reference genome covered by reads with a depth greater than or equal to the `min_coverage` threshold (default is 10).
+    - **#rname**: Reference sequence name (e.g., chromosome or contig name).
+    - **startpos**: Starting position of the reference sequence.
+    - **endpos**: Ending position of the reference sequence.
+    - **numreads**: Number of reads covering the reference sequence.
+    - **covbases**: Number of bases with coverage.
+    - **coverage**: Percentage of the reference sequence covered (depth ≥ 1).
+    - **meandepth**: Mean depth of coverage over the reference sequence.
+    - **meanbaseq**: Mean base quality over the reference sequence.
+    - **meanmapq**: Mean mapping quality over the reference sequence.
+
+    These per-sample QC metrics are then combined into a single file (`snippy_combined_qc_metrics`) in the downstream `snippy_tree_wf` workflow. The combined QC metrics file includes the same columns as above for all samples. Note that the last set of columns (`#rname` to `meanmapq`) may repeat for each chromosome or contig in the reference genome.
+
+    **Note:** The per-sample QC metrics provide valuable insights into the quality and coverage of your sequencing data relative to the reference genome. Monitoring these metrics can help identify samples with low coverage, poor alignment, or potential issues that may affect downstream analyses.
+
 ### Inputs
 
 <div class="searchable-table" markdown="1">
@@ -123,6 +151,7 @@ The `Snippy_Streamline_FASTA` workflow is an all-in-one approach to generating a
 | snippy_centroid_samplename | String | Name of the centroid sample |
 | snippy_centroid_version | String | Centroid version used |
 | snippy_cg_snp_matrix | File | CSV file of core genome pairwise SNP distances between samples, calculated from the final alignment  |
+| snippy_combined_qc_metrics | File | Combined QC metrics file containing concatenated QC metrics from all samples. |
 | snippy_concatenated_variants | File | The concatenated variants file |
 | snippy_filtered_metadata | File | TSV recording the columns of the Terra data table that were used in the summarize_data task |
 | snippy_final_alignment | File | Final alignment (FASTA file) used to generate the tree (either after snippy alignment, gubbins recombination removal, and/or core site selection with SNP-sites) |

diff --git a/docs/workflows/phylogenetic_construction/snippy_tree.md b/docs/workflows/phylogenetic_construction/snippy_tree.md
@@ -310,6 +310,32 @@ Sequencing data used in the Snippy_Tree workflow must:
         | Task | task_shared_variants.wdl |
         | Software Source Code | [task_shared_variants.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/phylogenetic_inference/utilities/task_shared_variants.wdl) |
 
+??? task "snippy_variants (qc_metrics output)"
+
+    ##### snippy_variants {#snippy_variants}
+
+    This task runs Snippy to perform SNP analysis on individual samples. It extracts QC metrics from the Snippy output for each sample and saves them in per-sample TSV files (`snippy_variants_qc_metrics`). These per-sample QC metrics include the following columns:
+
+    - **samplename**: The name of the sample.
+    - **reads_aligned_to_reference**: The number of reads that aligned to the reference genome.
+    - **total_reads**: The total number of reads in the sample.
+    - **percent_reads_aligned**: The percentage of reads that aligned to the reference genome.
+    - **variants_total**: The total number of variants detected between the sample and the reference genome.
+    - **percent_ref_coverage**: The percentage of the reference genome covered by reads with a depth greater than or equal to the `min_coverage` threshold (default is 10).
+    - **#rname**: Reference sequence name (e.g., chromosome or contig name).
+    - **startpos**: Starting position of the reference sequence.
+    - **endpos**: Ending position of the reference sequence.
+    - **numreads**: Number of reads covering the reference sequence.
+    - **covbases**: Number of bases with coverage.
+    - **coverage**: Percentage of the reference sequence covered (depth ≥ 1).
+    - **meandepth**: Mean depth of coverage over the reference sequence.
+    - **meanbaseq**: Mean base quality over the reference sequence.
+    - **meanmapq**: Mean mapping quality over the reference sequence.
+
+    These per-sample QC metrics are then combined into a single file (`snippy_combined_qc_metrics`) in the downstream `snippy_tree_wf` workflow. The combined QC metrics file includes the same columns as above for all samples. Note that the last set of columns (`#rname` to `meanmapq`) may repeat for each chromosome or contig in the reference genome.
+
+    **Note:** The per-sample QC metrics provide valuable insights into the quality and coverage of your sequencing data relative to the reference genome. Monitoring these metrics can help identify samples with low coverage, poor alignment, or potential issues that may affect downstream analyses.
+
 ### Outputs
 
 <div class="searchable-table" markdown="1">
@@ -318,6 +344,7 @@ Sequencing data used in the Snippy_Tree workflow must:
 |---|---|---|
 | snippy_cg_snp_matrix | File | CSV file of core genome pairwise SNP distances between samples, calculated from the final alignment  |
 | snippy_concatenated_variants | File | Concatenated snippy_results file across all samples in the set |
+| snippy_combined_qc_metrics | File | Combined QC metrics file containing concatenated QC metrics from all samples. |
 | snippy_filtered_metadata | File | TSV recording the columns of the Terra data table that were used in the summarize_data task |
 | snippy_final_alignment | File | Final alignment (FASTA file) used to generate the tree (either after snippy alignment, gubbins recombination removal, and/or core site selection with SNP-sites) |
 | snippy_final_tree | File | Newick tree produced from the final alignment. Depending on user input for core_genome, the tree could be a core genome tree (default when core_genome is true) or whole genome tree (if core_genome is false) |

diff --git a/docs/workflows/phylogenetic_construction/snippy_variants.md b/docs/workflows/phylogenetic_construction/snippy_variants.md
@@ -62,6 +62,13 @@ The `Snippy_Variants` workflow aligns single-end or paired-end reads (in FASTQ f
 
 `Snippy_Variants` uses the snippy tool to align reads to the reference and call SNPs, MNPs and INDELs according to optional input parameters. The output includes a file of variants that is then queried using the `grep` bash command to identify any mutations in specified genes or annotations of interest. The query string MUST match the gene name or annotation as specified in the GenBank file and provided in the output variant file in the `snippy_results` column.
 
+Additionally, `Snippy_Variants` extracts quality control (QC) metrics from the Snippy output for each sample. These per-sample QC metrics are saved in TSV files (`snippy_variants_qc_metrics`). The QC metrics include:
+
+- **Percentage of reads aligned to the reference genome** (`snippy_variants_percent_reads_aligned`).
+- **Percentage of the reference genome covered at or above the specified depth threshold** (`snippy_variants_percent_ref_coverage`).
+
+These per-sample QC metrics can be combined into a single file (`snippy_combined_qc_metrics`) in downstream workflows, such as `snippy_tree_wf`, providing an overview of QC metrics across all samples.
+
 ### Outputs
 
 !!! tip "Visualize your outputs in IGV"