From 4862f9475d1f4edbe7accc829841c2da5a573905 Mon Sep 17 00:00:00 2001
From: fraser-combe
Date: Wed, 6 Nov 2024 10:49:21 -0600
Subject: [PATCH] update streamline fasta docs

---
 .../snippy_streamline_fasta.md | 26 ++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/docs/workflows/phylogenetic_construction/snippy_streamline_fasta.md b/docs/workflows/phylogenetic_construction/snippy_streamline_fasta.md
index f03695d44..ca544c398 100644
--- a/docs/workflows/phylogenetic_construction/snippy_streamline_fasta.md
+++ b/docs/workflows/phylogenetic_construction/snippy_streamline_fasta.md
@@ -39,11 +39,31 @@ The `Snippy_Streamline_FASTA` workflow is an all-in-one approach to generating a
 
 ### Workflow Tasks
 
-??? task "snippy_variants" (qc_metrics output)
+??? task "snippy_variants (qc_metrics output)"
 
     ##### snippy_variants {#snippy_variants}
 
-    This task runs Snippy to perform SNP analysis on individual samples. It extracts QC metrics from the Snippy output for each sample and saves them in per-sample TSV files (`snippy_variants_qc_metrics`). These per-sample QC metrics are then combined into a single file (`snippy_combined_qc_metrics`) in the downstream `snippy_tree_wf` workflow.
+    This task runs Snippy to perform SNP analysis on individual samples. It extracts QC metrics from the Snippy output for each sample and saves them in per-sample TSV files (`snippy_variants_qc_metrics`). These per-sample QC metrics include the following columns:
+
+    - **samplename**: The name of the sample.
+    - **reads_aligned_to_reference**: The number of reads that aligned to the reference genome.
+    - **total_reads**: The total number of reads in the sample.
+    - **percent_reads_aligned**: The percentage of reads that aligned to the reference genome.
+    - **variants_total**: The total number of variants detected between the sample and the reference genome.
+    - **percent_ref_coverage**: The percentage of the reference genome covered by reads with a depth greater than or equal to the `min_coverage` threshold (default is 10).
+    - **#rname**: Reference sequence name (e.g., chromosome or contig name).
+    - **startpos**: Starting position of the reference sequence.
+    - **endpos**: Ending position of the reference sequence.
+    - **numreads**: Number of reads covering the reference sequence.
+    - **covbases**: Number of bases with coverage.
+    - **coverage**: Percentage of the reference sequence covered (depth ≥ 1).
+    - **meandepth**: Mean depth of coverage over the reference sequence.
+    - **meanbaseq**: Mean base quality over the reference sequence.
+    - **meanmapq**: Mean mapping quality over the reference sequence.
+
+    These per-sample QC metrics are then combined into a single file (`snippy_combined_qc_metrics`) in the downstream `snippy_tree_wf` workflow. The combined QC metrics file includes the same columns as above for all samples. Note that the last set of columns (`#rname` to `meanmapq`) may repeat for each chromosome or contig in the reference genome.
+
+    **Note:** The per-sample QC metrics provide valuable insights into the quality and coverage of your sequencing data relative to the reference genome. Monitoring these metrics can help identify samples with low coverage, poor alignment, or potential issues that may affect downstream analyses.
 
 ### Inputs
 
@@ -125,7 +145,7 @@ The `Snippy_Streamline_FASTA` workflow is an all-in-one approach to generating a
 | snippy_centroid_samplename | String | Name of the centroid sample |
 | snippy_centroid_version | String | Centroid version used |
 | snippy_cg_snp_matrix | File | CSV file of core genome pairwise SNP distances between samples, calculated from the final alignment |
-| snippy_combined_qc_metrics | File | Combined QC metrics file containing concatenated QC metrics from all samples. The file is a tab-separated values (TSV) file with the following columns:
- samplename
- reads_aligned_to_reference
- total_reads
- percent_reads_aligned
- variants_total
- percent_ref_coverage
- #rname
- startpos
- endpos
- numreads
- covbases
- coverage
- meandepth
- meanbaseq
- meanmapq

The last set of columns (`#rname` to `meanmapq`) may repeat for each chromosome or contig in the reference genome. |
+| snippy_combined_qc_metrics | File | Combined QC metrics file containing concatenated QC metrics from all samples. |
 | snippy_concatenated_variants | File | The concatenated variants file |
 | snippy_filtered_metadata | File | TSV recording the columns of the Terra data table that were used in the summarize_data task |
 | snippy_final_alignment | File | Final alignment (FASTA file) used to generate the tree (either after snippy alignment, gubbins recombination removal, and/or core site selection with SNP-sites) |
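
Reviewer note: the docs added in this patch say that the `#rname` to `meanmapq` columns may repeat once per chromosome or contig, which means a reader that keys rows by header name would silently keep only the last contig. The sketch below is one possible way to read such a file with Python's standard `csv` module. It assumes a single tab-separated header row in which the six sample-level columns come first and the nine per-contig columns repeat in blocks after them; that layout, the helper names `parse_qc_metrics` and `low_coverage_samples`, the 90% threshold, and the example values are all illustrative assumptions, not taken from the workflow.

```python
import csv
import io

# Column names as listed in the documentation added by this patch.
SAMPLE_COLS = ["samplename", "reads_aligned_to_reference", "total_reads",
               "percent_reads_aligned", "variants_total", "percent_ref_coverage"]
CONTIG_COLS = ["#rname", "startpos", "endpos", "numreads", "covbases",
               "coverage", "meandepth", "meanbaseq", "meanmapq"]


def parse_qc_metrics(handle):
    """Parse a combined QC metrics TSV (hypothetical helper).

    Assumes the sample-level columns come first and the per-contig
    columns repeat in fixed-size blocks after them.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    n = len(SAMPLE_COLS)
    if header[:n] != SAMPLE_COLS:
        raise ValueError("unexpected sample-level columns in header")
    step = len(CONTIG_COLS)
    records = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        record = dict(zip(SAMPLE_COLS, row[:n]))
        rest = row[n:]
        # Group the remaining fields into one dict per contig block.
        record["contigs"] = [
            dict(zip(CONTIG_COLS, rest[i:i + step]))
            for i in range(0, len(rest), step)
            if any(rest[i:i + step])
        ]
        records.append(record)
    return records


def low_coverage_samples(records, threshold=90.0):
    """Return sample names whose percent_ref_coverage is below threshold.

    The 90.0 default is an arbitrary illustrative cutoff.
    """
    return [r["samplename"] for r in records
            if float(r["percent_ref_coverage"]) < threshold]


# Made-up example data: two samples, two contigs each.
_contigs_a = ["contig1", "1", "5000", "600000", "4990", "99.8", "85.2", "36.1", "59.7",
              "contig2", "1", "3000", "380000", "2950", "98.3", "80.4", "35.9", "59.1"]
_contigs_b = ["contig1", "1", "5000", "400000", "3700", "74.0", "41.3", "35.2", "58.8",
              "contig2", "1", "3000", "210000", "2100", "70.0", "39.9", "35.0", "58.5"]
EXAMPLE_TSV = "\n".join([
    "\t".join(SAMPLE_COLS + CONTIG_COLS * 2),
    "\t".join(["sampleA", "980000", "1000000", "98.0", "152", "99.1"] + _contigs_a),
    "\t".join(["sampleB", "610000", "1000000", "61.0", "431", "72.5"] + _contigs_b),
]) + "\n"

records = parse_qc_metrics(io.StringIO(EXAMPLE_TSV))
print(low_coverage_samples(records))
```

Using `csv.reader` rather than `csv.DictReader` is deliberate here: `DictReader` keeps only one value per header name, so repeated per-contig columns would overwrite one another.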