diff --git a/docs/workflows/genomic_characterization/pangolin_update.md b/docs/workflows/genomic_characterization/pangolin_update.md index 988db4404..a05756888 100644 --- a/docs/workflows/genomic_characterization/pangolin_update.md +++ b/docs/workflows/genomic_characterization/pangolin_update.md @@ -65,4 +65,8 @@ This workflow runs on the sample level. | **pangolin_updates** | String | Result of Pangolin Update (lineage changed versus unchanged) with lineage assignment and date of analysis | | **pangolin_versions** | String | All Pangolin software and database versions | - \ No newline at end of file + + +## References + +> **Pangolin**: Rambaut A, Holmes EC, O'Toole Á, Hill V, McCrone JT, Ruis C, du Plessis L, Pybus OG. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nat Microbiol. 2020 Nov;5(11):1403-1407. doi: 10.1038/s41564-020-0770-5. Epub 2020 Jul 15. PMID: 32669681; PMCID: PMC7610519. diff --git a/docs/workflows/genomic_characterization/theiacov.md b/docs/workflows/genomic_characterization/theiacov.md index a21d46f89..b78c368ae 100644 --- a/docs/workflows/genomic_characterization/theiacov.md +++ b/docs/workflows/genomic_characterization/theiacov.md @@ -900,6 +900,7 @@ All input reads are processed through "core tasks" in the TheiaCoV Illumina, ONT | Task | [task_pangolin.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/species_typing/betacoronavirus/task_pangolin.wdl) | | Software Source Code | [Pangolin on GitHub](https://github.com/cov-lineages/pangolin) | | Software Documentation | [Pangolin website](https://cov-lineages.org/resources/pangolin.html) | + | Original Publication(s) | [A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology](https://doi.org/10.1038/s41564-020-0770-5) | ??? 
task "`nextclade`" @@ -1138,7 +1139,7 @@ All TheiaCoV Workflows (not TheiaCoV_FASTA_Batch) | nextclade_json_flu_ha | File | Nextclade output in JSON file format, specific to Flu HA segment | ONT, PE | | nextclade_json_flu_na | File | Nextclade output in JSON file format, specific to Flu NA segment | ONT, PE | | nextclade_lineage | String | Nextclade lineage designation | CL, FASTA, ONT, PE, SE | -| nextclade_qc | String | QC metric as determined by Nextclade. (For Flu, this output will be specific to HA segment) | CL, FASTA, ONT, PE, SE | +| nextclade_qc | String | QC metric as determined by Nextclade. Will be blank for Flu | CL, FASTA, ONT, PE, SE | | nextclade_qc_flu_ha | String | QC metric as determined by Nextclade, specific to Flu HA segment | ONT, PE | | nextclade_qc_flu_na | String | QC metric as determined by Nextclade, specific to Flu NA segment | ONT, PE | | nextclade_tsv | File | Nextclade output in TSV file format. (For Flu, this output will be specific to HA segment) | CL, FASTA, ONT, PE, SE | diff --git a/docs/workflows/genomic_characterization/theiameta.md b/docs/workflows/genomic_characterization/theiameta.md index 55c26d9a6..eb501b301 100644 --- a/docs/workflows/genomic_characterization/theiameta.md +++ b/docs/workflows/genomic_characterization/theiameta.md @@ -241,22 +241,62 @@ The TheiaMeta_Illumina_PE workflow processes Illumina paired-end (PE) reads ge #### Assembly ??? task "`metaspades`: _De Novo_ Metagenomic Assembly" + While metagenomics has emerged as a technology of choice for analyzing bacterial populations, the assembly of metagenomic data remains challenging. A dedicated metagenomic assembly algorithm is necessary to circumvent the challenge of interpreting variation. metaSPAdes addresses various challenges of metagenomic assembly by capitalizing on computational ideas that proved to be useful in assemblies of single cells and highly polymorphic diploid genomes. 
- While metagenomics has emerged as a technology of choice for analyzing bacterial populations, the assembly of metagenomic data remains challenging. A dedicated metagenomic assembly algorithm is necessary to circumvent the challenge of interpreting variation. metaSPAdes addresses various challenges of metagenomic assembly by capitalizing on computational ideas that proved to be useful in assemblies of single cells and highly polymorphic diploid genomes. + `metaspades` is a _de novo_ assembler that first constructs a de Bruijn graph of all the reads using the SPAdes algorithm. Through various graph simplification procedures, paths in the assembly graph are reconstructed that correspond to long genomic fragments within the metagenome. For more details, please see the original publication. !!! techdetails "MetaSPAdes Technical Details" - | | Links | | --- | --- | | Task | [task_metaspades.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/assembly/task_metaspades.wdl) | | Software Source Code | [SPAdes on GitHub](https://github.com/ablab/spades) | - | Software Documentation | | - | Original Publication(s) | [metaSPAdes: a new versatile metagenomic assembler](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5411777/) | + | Software Documentation | [SPAdes Manual](https://ablab.github.io/spades/index.html) | + | Original Publication(s) | [metaSPAdes: a new versatile metagenomic assembler](http://www.genome.org/cgi/doi/10.1101/gr.213959.116) | -??? task "`minimap2`: Assembly Alignment and Contig Filtering (if a reference is provided)" +??? task "`minimap2`: Assembly Alignment and Contig Filtering" If a reference genome is provided through the **`reference`** optional input, the assembly produced with `metaspades` will be mapped to the reference genome with `minimap2`. The contigs which align to the reference are retrieved and returned in the **`assembly_fasta`** output. 
+ `minimap2` is a popular aligner that is used for correcting the assembly produced by metaSPAdes. This is done by aligning the reads back to the generated assembly or a reference genome. + + In minimap2, "modes" are a group of preset options. Two different modes are used in this task depending on whether a reference genome is provided. + + If a reference genome is _not_ provided, the only mode used in this task is `sr` which is intended for "short single-end reads without splicing". The `sr` mode indicates the following parameters should be used: `-k21 -w11 --sr --frag=yes -A2 -B8 -O12,32 -E2,1 -b0 -r100 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g100 -2K50m --heap-sort=yes --secondary=no`. The output file is in SAM format. + + If a reference genome is provided, then after the draft assembly polishing with `pilon`, this task runs again with the mode set to `asm20` which is intended for "long assembly to reference mapping". The `asm20` mode indicates the following parameters should be used: `-k19 -w10 -U50,500 --rmq -r100k -g10k -A1 -B4 -O6,26 -E2,1 -s200 -z200 -N50`. The output file is in PAF format. + + For more information, please see the [minimap2 manpage](https://lh3.github.io/minimap2/minimap2.html) + + !!! techdetails "minimap2 Technical Details" + | | Links | + |---|---| + | Task | [task_minimap2.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/alignment/task_minimap2.wdl) | + | Software Source Code | [minimap2 on GitHub](https://github.com/lh3/minimap2) | + | Software Documentation | [minimap2](https://lh3.github.io/minimap2) | + | Original Publication(s) | [Minimap2: pairwise alignment for nucleotide sequences](https://academic.oup.com/bioinformatics/article/34/18/3094/4994778) | + +??? task "`samtools`: SAM File Conversion" + This task converts the output SAM file from minimap2 to a BAM file. It then sorts the BAM based on the read names and generates an index file. + + !!! 
techdetails "samtools Technical Details" + | | Links | + |---|---| + | Task | [task_samtools.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/utilities/data_handling/task_parse_mapping.wdl) | + | Software Source Code | [samtools on GitHub](https://github.com/samtools/samtools) | + | Software Documentation | [samtools](https://www.htslib.org/doc/samtools.html) | + | Original Publication(s) | [The Sequence Alignment/Map format and SAMtools](https://doi.org/10.1093/bioinformatics/btp352)
[Twelve Years of SAMtools and BCFtools](https://doi.org/10.1093/gigascience/giab008) | + +??? task "`pilon`: Assembly Polishing" + `pilon` is a tool that uses read alignment to correct errors in an assembly. It is used to polish the assembly produced by metaSPAdes. The input to Pilon is the sorted BAM file produced by `samtools`, and the original draft assembly produced by `metaspades`. + + !!! techdetails "pilon Technical Details" + | | Links | + |---|---| + | Task | [task_pilon.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/quality_control/read_filtering/task_pilon.wdl) | + | Software Source Code | [Pilon on GitHub](https://github.com/broadinstitute/pilon) | + | Software Documentation | [Pilon Wiki](https://github.com/broadinstitute/pilon/wiki) | + | Original Publication(s) | [Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement](https://doi.org/10.1371/journal.pone.0112963) | + #### Assembly QC ??? task "`quast`: Assembly Quality Assessment" diff --git a/docs/workflows/genomic_characterization/theiameta_panel.md b/docs/workflows/genomic_characterization/theiameta_panel.md index 31821973f..553b52856 100644 --- a/docs/workflows/genomic_characterization/theiameta_panel.md +++ b/docs/workflows/genomic_characterization/theiameta_panel.md @@ -304,7 +304,7 @@ TheiaMeta_Panel_Illumina_PE was created initially for the [Illumina Viral Survei ### Workflow Tasks ??? task "`read_QC_trim`: Read Quality Trimming, Adapter Removal, Quantification, and Identification" - + ##### Read Cleaning {#read_QC_trim} `read_QC_trim` is a sub-workflow within TheiaMeta that removes low-quality reads, low-quality regions of reads, and sequencing adapters to improve data quality. It uses a number of tasks, described below. 
**Read quality trimming** @@ -372,7 +372,7 @@ TheiaMeta_Panel_Illumina_PE was created initially for the [Illumina Viral Survei | Original Publication(s) | [Trimmomatic: a flexible trimmer for Illumina sequence data](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4103590/)
[fastp: an ultra-fast all-in-one FASTQ preprocessor](https://academic.oup.com/bioinformatics/article/34/17/i884/5093234?login=false)
[An integrated metagenomics pipeline for strain profiling reveals novel patterns of bacterial transmission and biogeography](https://pubmed.ncbi.nlm.nih.gov/27803195/) | ??? task "`kraken2`: Taxonomic Classification" - + ##### Kraken2 {#kraken2} Kraken2 is a bioinformatics tool originally designed for metagenomic applications. It has additionally proven valuable for validating taxonomic assignments and checking contamination of single-species (e.g. bacterial isolate, eukaryotic isolate, viral isolate, etc.) whole genome sequence data. Kraken2 is run on the clean reads that result from the `read_QC_trim` subworkflow. By default, the Kraken2 database is set to the `k2_viral_20240112` database, located at `"gs://theiagen-large-public-files-rp/terra/databases/kraken2/k2_viral_20240112.tar.gz"`. @@ -389,6 +389,7 @@ TheiaMeta_Panel_Illumina_PE was created initially for the [Illumina Viral Survei | Original Publication(s) | [Improved metagenomic analysis with Kraken 2](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1891-0) | ??? task "`extract_kraken_reads` from KrakenTools: Read Binning" + ##### KrakenTools {#extract_kraken_reads} KrakenTools is a collection of scripts that can be used to help downstream analysis of Kraken2 results. In particular, this task uses the `extract_kraken_reads` script, which extracts reads classified at any user-specified taxonomy IDs. All parent and children reads of the specified taxonomic ID are also extracted. !!! 
techdetails "KrakenTools Technical Details" @@ -397,7 +398,170 @@ TheiaMeta_Panel_Illumina_PE was created initially for the [Illumina Viral Survei | Task | [task_kraken_tools.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/taxon_id/task_krakentools.wdl) | Software Source Code | [KrakenTools on GitHub](https://github.com/jenniferlu717/KrakenTools) | | Software Documentation | [KrakenTools on GitHub](https://github.com/jenniferlu717/KrakenTools) | - | Original Publication | [Metagenome analysis using the Kraken software suite](https://doi.org/10.1038/s41596-022-00738-y) | + | Original Publication(s) | [Metagenome analysis using the Kraken software suite](https://doi.org/10.1038/s41596-022-00738-y) | + +??? task "`fastq_scan`: Summarizing Read Bins" + ##### FASTQ Scan {#fastq_scan} + `fastq_scan` is used to summarize the read bins generated by the `extract_kraken_reads` task. It provides basic statistics about the read bins, such as the number of reads in each bin and the number of read pairs. + + !!! techdetails "fastq_scan Technical Details" + | | Links | + | --- | --- | + | Task | [task_fastq_scan.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/quality_control/basic_statistics/task_fastq_scan.wdl) | + | Software Source Code | [fastq-scan](https://github.com/rpetit3/fastq-scan) | + | Software Documentation | [fastq-scan](https://github.com/rpetit3/fastq-scan) | + +??? task "`metaspades`: _De Novo_ Metagenomic Assembly" + ##### metaSPAdes {#metaspades} + While metagenomics has emerged as a technology of choice for analyzing bacterial populations, the assembly of metagenomic data remains challenging. A dedicated metagenomic assembly algorithm is necessary to circumvent the challenge of interpreting variation. 
metaSPAdes addresses various challenges of metagenomic assembly by capitalizing on computational ideas that proved to be useful in assemblies of single cells and highly polymorphic diploid genomes. + + `metaspades` is a _de novo_ assembler that first constructs a de Bruijn graph of all the reads using the SPAdes algorithm. Through various graph simplification procedures, paths in the assembly graph are reconstructed that correspond to long genomic fragments within the metagenome. For more details, please see the original publication. + + !!! techdetails "MetaSPAdes Technical Details" + | | Links | + | --- | --- | + | Task | [task_metaspades.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/assembly/task_metaspades.wdl) | + | Software Source Code | [SPAdes on GitHub](https://github.com/ablab/spades) | + | Software Documentation | [SPAdes Manual](https://ablab.github.io/spades/index.html) | + | Original Publication(s) | [metaSPAdes: a new versatile metagenomic assembler](http://www.genome.org/cgi/doi/10.1101/gr.213959.116) | + +??? task "`minimap2`: Assembly Alignment and Contig Filtering" + + ##### minimap2 {#minimap2} + + `minimap2` is a popular aligner that is used in TheiaMeta_Panel for correcting the assembly produced by metaSPAdes. This is done by aligning the reads back to the generated assembly. + + The default mode used in this task is `sr` which is intended for "short single-end reads without splicing". In minimap2, "modes" are a group of preset options; the `sr` mode indicates the following parameters should be used: `-k21 -w11 --sr --frag=yes -A2 -B8 -O12,32 -E2,1 -b0 -r100 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g100 -2K50m --heap-sort=yes --secondary=no`. + + For more information, please see the [minimap2 manpage](https://lh3.github.io/minimap2/minimap2.html) + + !!! 
techdetails "minimap2 Technical Details" + | | Links | + |---|---| + | Task | [task_minimap2.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/alignment/task_minimap2.wdl) | + | Software Source Code | [minimap2 on GitHub](https://github.com/lh3/minimap2) | + | Software Documentation | [minimap2](https://lh3.github.io/minimap2) | + | Original Publication(s) | [Minimap2: pairwise alignment for nucleotide sequences](https://academic.oup.com/bioinformatics/article/34/18/3094/4994778) | + +??? task "`samtools`: SAM File Conversion" + This task converts the output SAM file from minimap2 to a BAM file. It then sorts the BAM based on the read names and generates an index file. + + !!! techdetails "samtools Technical Details" + | | Links | + |---|---| + | Task | [task_samtools.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/utilities/data_handling/task_parse_mapping.wdl) | + | Software Source Code | [samtools on GitHub](https://github.com/samtools/samtools) | + | Software Documentation | [samtools](https://www.htslib.org/doc/samtools.html) | + | Original Publication(s) | [The Sequence Alignment/Map format and SAMtools](https://doi.org/10.1093/bioinformatics/btp352)
[Twelve Years of SAMtools and BCFtools](https://doi.org/10.1093/gigascience/giab008) | + +??? task "`pilon`: Assembly Polishing" + + ##### Pilon {#pilon} + + `pilon` is a tool that uses read alignment to correct errors in an assembly. It is used to polish the assembly produced by metaSPAdes. The input to Pilon is the sorted BAM file produced by `samtools`, and the original draft assembly produced by `metaspades`. + + !!! techdetails "pilon Technical Details" + | | Links | + |---|---| + | Task | [task_pilon.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/quality_control/read_filtering/task_pilon.wdl) | + | Software Source Code | [Pilon on GitHub](https://github.com/broadinstitute/pilon) | + | Software Documentation | [Pilon Wiki](https://github.com/broadinstitute/pilon/wiki) | + | Original Publication(s) | [Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement](https://doi.org/10.1371/journal.pone.0112963) | + +??? task "`quast`: Assembly Quality Assessment" + + ##### QUAST {#quast} + + QUAST stands for QUality ASsessment Tool. It evaluates genome/metagenome assemblies by computing various metrics without a reference being necessary. It includes useful metrics such as number of contigs, length of the largest contig and N50. + + !!! techdetails "QUAST Technical Details" + | | Links | + | --- | --- | + | Task | [task_quast.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/quality_control/task_quast.wdl) | + | Software Source Code | [QUAST on GitHub](https://github.com/ablab/quast) | + | Software Documentation | | + | Original Publication(s) | [QUAST: quality assessment tool for genome assemblies](https://academic.oup.com/bioinformatics/article/29/8/1072/228832) | + +??? 
task "`morgana_magic`: Genomic Characterization" + + ##### Morgana Magic {#morgana_magic} + + Morgana Magic is the viral equivalent of the `merlin_magic` subworkflow used in the TheiaProk workflows. This workflow launches several tasks to characterize the viral genome, including Pangolin4, Nextclade, and others. + + This subworkflow currently only supports the organisms that are natively supported by the [TheiaCoV workflows](./theiacov.md). + + The following tasks only run for the appropriate taxon ID if sufficient reads were extracted. The following table illustrates which characterization tools are run for the indicated organism. + + | | SARS-CoV-2 | MPXV | WNV | Influenza | RSV-A | RSV-B | + | --- | --- | --- | --- | --- | --- | --- | + | Pangolin | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | + | Nextclade | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | + | IRMA | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | + | Abricate | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | + | GenoFLU | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | + + ??? task "`pangolin`" + Pangolin designates SARS-CoV-2 lineage assignments. + + !!! techdetails "Pangolin Technical Details" + + | | Links | + | --- | --- | + | Task | [task_pangolin.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/species_typing/betacoronavirus/task_pangolin.wdl) | + | Software Source Code | [Pangolin on GitHub](https://github.com/cov-lineages/pangolin) | + | Software Documentation | [Pangolin website](https://cov-lineages.org/resources/pangolin.html) | + | Original Publication(s) | [A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology](https://doi.org/10.1038/s41564-020-0770-5) | + + ??? task "`nextclade`" + ["Nextclade is an open-source project for viral genome alignment, mutation calling, clade assignment, quality checks and phylogenetic placement."](https://docs.nextstrain.org/projects/nextclade/en/stable/) + + !!! 
techdetails "Nextclade Technical Details" + + | | Links | + | --- | --- | + | Task | [task_nextclade.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/taxon_id/task_nextclade.wdl#L63) | + | Software Source Code | | + | Software Documentation | [Nextclade](https://docs.nextstrain.org/projects/nextclade/en/stable/) | + | Original Publication(s) | [Nextclade: clade assignment, mutation calling and quality control for viral genomes.](https://doi.org/10.21105/joss.03773) | + + ??? task "`irma`" + Cleaned reads are re-assembled using `irma` which does not use a reference due to the rapid evolution and high variability of influenza. Assemblies produced by `irma` will be ordered from largest to smallest assembled flu segment. `irma` also performs typing and subtyping as part of the assembly process. + + General statistics about the assembly are generated with the `consensus_qc` task ([task_assembly_metrics.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/quality_control/basic_statistics/task_assembly_metrics.wdl)). + + !!! techdetails "IRMA Technical Details" + | | Links | + | --- | --- | + | Task | [task_irma.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/assembly/task_irma.wdl) | + | Software Documentation | [IRMA website](https://wonder.cdc.gov/amd/flu/irma/) | + | Original Publication(s) | [Viral deep sequencing needs an adaptive approach: IRMA, the iterative refinement meta-assembler](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-3030-6) | + + ??? task "`abricate`" + Abricate assigns types and subtypes/lineages for flu samples. + + !!! 
techdetails "Abricate Technical Details" + | | Links | + | --- | --- | + | Task | [task_abricate.wdl (abricate_flu subtask)](https://github.com/theiagen/public_health_bioinformatics/blob/2dff853defc6ea540a058873f6fe6a78cc2350c7/tasks/gene_typing/drug_resistance/task_abricate.wdl#L59) | + | Software Source Code | [ABRicate on GitHub](https://github.com/tseemann/abricate) | + | Software Documentation | [ABRicate on GitHub](https://github.com/tseemann/abricate) | + + ??? task "`genoflu`" + This sub-workflow determines the whole-genome genotype of an H5N1 flu sample. + + !!! techdetails "GenoFLU Technical Details" + | | Links | + | --- | --- | + | Task | [task_genoflu.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/species_typing/orthomyxoviridae/task_genoflu.wdl) | + | Software Source Code | [GenoFLU on GitHub](https://github.com/USDA-VS/GenoFLU) | + +??? task "`gather_scatter`: Generate Summary File" + The `gather_scatter` task generates a summary file with all the results for all taxon IDs with identified reads. Please see the [`results_by_taxon_tsv`](#results_by_taxon_tsv) section below for more information. + + !!! techdetails "gather_scatter Technical Details" + | | Links | + | --- | --- | + | Task | [task_gather_scatter.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/data_handling/task_gather_scatter.wdl) | ### Outputs @@ -411,15 +575,122 @@ TheiaMeta_Panel_Illumina_PE was created initially for the [Illumina Viral Survei | kraken2_docker | String | Docker image used to run kraken2 | | kraken2_report | File | Text document describing taxonomic prediction of every FASTQ record. 
This file can be very large and cumbersome to open and view | | kraken2_version | String | The version of Kraken2 used in the analysis | -| results_by_taxon_tsv | File | A TSV file that contains the results for every taxon ID provided in the taxon_ids input variable that had reads identified; characterization (if applicable) and basic statistics regarding read count, assembly generation (if applicable), and general quality, are also associated with each bin | +| results_by_taxon_tsv | File | A TSV file that contains the results for every taxon ID provided in the taxon_ids input variable that had reads identified; characterization (if applicable) and basic statistics regarding read count, assembly generation (if applicable), and general quality, are also associated with each bin; see below for more details. | | theiameta_panel_illumina_pe_analysis_date | String | Date the workflow was run | | theiameta_panel_illumina_pe_version | String | Version of PHB used to run the workflow | -#### The `results_by_taxon_tsv` Output File - -This file contains the +#### The `results_by_taxon_tsv` Output File {#results_by_taxon_tsv} + +This TSV file contains a summary of all of the taxon IDs provided in the `taxon_ids` input variable that had reads identified, with each row representing a taxon ID. + +If reads could be extracted for the taxon ID, the `organism` column will contain the name of the organism. This column will be blank if no reads were able to be extracted for the taxon ID in the sample. + +??? toggle "What columns are included?" 
+ The following columns are included in the `results_by_taxon_tsv` file: + + - `taxon_id`: The taxon ID used for the binning, generated for all taxon IDs provided in the `taxon_ids` input variable + - `organism`: The name of the organism associated with the taxon ID if reads were able to be extracted; the following columns are blank if no reads were able to be extracted for the taxon ID in the sample + - `extracted_read1`: The GSURI of the extracted read1 FASTQ file + - `extracted_read2`: The GSURI of the extracted read2 FASTQ file + - `krakentools_docker`: The Docker image used to run KrakenTools' `extract_kraken_reads` + - `fastq_scan_num_reads_binned1`: The number of reads in the extracted read1 FASTQ file + - `fastq_scan_num_reads_binned2`: The number of reads in the extracted read2 FASTQ file + - `fastq_scan_num_reads_binned_pairs`: The number of read pairs in the extracted read1 and read2 FASTQ files + - `fastq_scan_docker`: The Docker image used to run the `fastq_scan` task + - `fastq_scan_version`: The version of the `fastq_scan` tool used in the analysis + - `metaspades_warning`: A warning message if an empty assembly was produced for the taxon ID; blank if assembly was successful + - `pilon_warning`: A warning message if Pilon failed, blank if assembly polishing was successful + - `assembly_fasta`: A GSURI to the assembly FASTA file + - `quast_genome_length`: The length of the assembly + - `quast_number_contigs`: The number of contigs in the assembly + - `quast_n50`: The N50 value of the assembly + - `quast_gc_percent`: The GC content of the assembly + - `number_N`: The number of Ns in the assembly + - `number_ATCG`: The number of ATCGs in the assembly + - `number_Degenerate`: The number of degenerate bases in the assembly + - `number_Total`: The total number of bases in the assembly + - `percent_reference_coverage`: The percent of the reference genome covered by the assembly; only applicable if the taxon ID is already supported by TheiaCoV (additional 
assembly files may be added in the future) + + Any subsequent columns are specific to the identified organism and taxon ID; typically, values for these columns are only produced if the organism is natively supported by the TheiaCoV workflows. + + ??? toggle "SARS-CoV-2: _Pangolin_" + - `pango_lineage`: The Pango lineage of the assembly + - `pango_lineage_expanded`: The Pango lineage of the assembly without aliases + - `pangolin_conflicts`: The number of conflicts in the Pango lineage + - `pangolin_notes`: Any notes generated by Pangolin about the lineage + - `pangolin_assignment_version`: The version of the assignment module used to assign the Pango lineage + - `pangolin_version`: The version of Pangolin used to generate the Pango lineage + - `pangolin_docker`: The Docker image used to run Pangolin + + ??? toggle "Mpox, SARS-CoV-2, RSV-A, RSV-B: _Nextclade_" + - `nextclade_version`: The version of Nextclade used + - `nextclade_docker`: The Docker image used to run Nextclade + - `nextclade_ds_tag`: The dataset tag used to run Nextclade + - `nextclade_aa_subs`: Amino-acid substitutions as detected by Nextclade + - `nextclade_aa_dels`: Amino-acid deletions as detected by Nextclade + - `nextclade_clade`: Nextclade clade designation + - `nextclade_lineage`: Nextclade lineage designation + - `nextclade_qc`: QC metric as determined by Nextclade + + ??? 
toggle "Flu: _Nextclade_, _IRMA_, _GenoFLU_, _ABRicate_" + - `nextclade_version`: The version of Nextclade used + - `nextclade_docker`: The Docker image used to run Nextclade + - `nextclade_ds_tag_flu_ha`: The dataset tag used to run Nextclade for the HA segment + - `nextclade_aa_subs_flu_ha`: Amino-acid substitutions as detected by Nextclade for the HA segment + - `nextclade_aa_dels_flu_ha`: Amino-acid deletions as detected by Nextclade for the HA segment + - `nextclade_clade_flu_ha`: Nextclade clade designation for the HA segment + - `nextclade_lineage_flu_ha`: Nextclade lineage designation for the HA segment + - `nextclade_qc_flu_ha`: QC metric as determined by Nextclade for the HA segment + - `nextclade_ds_tag_flu_na`: The dataset tag used to run Nextclade for the NA segment + - `nextclade_aa_subs_na`: Amino-acid substitutions as detected by Nextclade for the NA segment + - `nextclade_aa_dels_na`: Amino-acid deletions as detected by Nextclade for the NA segment + - `nextclade_clade_flu_na`: Nextclade clade designation for the NA segment + - `nextclade_lineage_flu_na`: Nextclade lineage designation for the NA segment + - `nextclade_qc_flu_na`: QC metric as determined by Nextclade for the NA segment + - `irma_version`: The version of IRMA used + - `irma_docker`: The Docker image used to run IRMA + - `irma_type`: The flu type identified by IRMA + - `irma_subtype`: The flu subtype identified by IRMA + - `irma_subtype_notes`: Any notes generated by IRMA about the subtype + - `genoflu_version`: The version of GenoFLU used + - `genoflu_genotype`: The complete genotype of the flu sample + - `genoflu_all_segments`: The genotype of each flu segment in the sample + - `abricate_flu_type`: The flu type identified by ABRicate + - `abricate_flu_subtype`: The flu subtype identified by ABRicate + - `abricate_flu_database`: The flu database used by ABRicate + - `abricate_flu_version`: The version of ABRicate used + +This file can be downloaded and opened in Excel to view the 
full result summary for the sample. Due to the nature of the TheiaMeta_Panel workflow and Terra, displaying this information in the Terra table would be challenging to view, which is why we have generated this file. If you have any suggestions on formatting or additional outputs, please let us know at or by submitting an issue. ## References +> **Trimmomatic**: Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014 Aug 1;30(15):2114-20. doi: 10.1093/bioinformatics/btu170. Epub 2014 Apr 1. PMID: 24695404; PMCID: PMC4103590. + +> **fastp**: Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018 Sep 1;34(17):i884-i890. doi: 10.1093/bioinformatics/bty560. PMID: 30423086; PMCID: PMC6129281. + +> **MIDAS**: Nayfach S, Rodriguez-Mueller B, Garud N, Pollard KS. An integrated metagenomics pipeline for strain profiling reveals novel patterns of bacterial transmission and biogeography. Genome Res. 2016 Nov;26(11):1612-1625. doi: 10.1101/gr.201863.115. Epub 2016 Oct 18. PMID: 27803195; PMCID: PMC5088602. + +> **Kraken2**: Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biol. 2019 Nov 28;20(1):257. doi: 10.1186/s13059-019-1891-0. PMID: 31779668; PMCID: PMC6883579. + +> **KrakenTools**: Lu J, Rincon N, Wood DE, Breitwieser FP, Pockrandt C, Langmead B, Salzberg SL, Steinegger M. Metagenome analysis using the Kraken software suite. Nat Protoc. 2022 Dec;17(12):2815-2839. doi: 10.1038/s41596-022-00738-y. Epub 2022 Sep 28. Erratum in: Nat Protoc. 2024 Aug 29. doi: 10.1038/s41596-024-01064-1. PMID: 36171387; PMCID: PMC9725748. + +> **metaSPAdes**: Nurk S, Meleshko D, Korobeynikov A, Pevzner PA. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 2017 May;27(5):824-834. doi: 10.1101/gr.213959.116. Epub 2017 Mar 15. PMID: 28298430; PMCID: PMC5411777. + +> **minimap2**: Li H. Minimap2: pairwise alignment for nucleotide sequences. 
Bioinformatics. 2018 Sep 15;34(18):3094-3100. doi: 10.1093/bioinformatics/bty191. PMID: 29750242; PMCID: PMC6137996. + +> **SAMtools**: Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352. Epub 2009 Jun 8. PMID: 19505943; PMCID: PMC2723002. + +> **SAMtools**: Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, Li H. Twelve years of SAMtools and BCFtools. Gigascience. 2021 Feb 16;10(2):giab008. doi: 10.1093/gigascience/giab008. PMID: 33590861; PMCID: PMC7931819. + +> **Pilon**: Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, Cuomo CA, Zeng Q, Wortman J, Young SK, Earl AM. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One. 2014 Nov 19;9(11):e112963. doi: 10.1371/journal.pone.0112963. PMID: 25409509; PMCID: PMC4237348. + +> **QUAST**: Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013 Apr 15;29(8):1072-5. doi: 10.1093/bioinformatics/btt086. Epub 2013 Feb 19. PMID: 23422339; PMCID: PMC3624806. + +> **Pangolin**: Rambaut A, Holmes EC, O'Toole Á, Hill V, McCrone JT, Ruis C, du Plessis L, Pybus OG. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nat Microbiol. 2020 Nov;5(11):1403-1407. doi: 10.1038/s41564-020-0770-5. Epub 2020 Jul 15. PMID: 32669681; PMCID: PMC7610519. + +> **Nextclade**: Aksamentov et al., (2021). Nextclade: clade assignment, mutation calling and quality control for viral genomes. Journal of Open Source Software, 6(67), 3773, https://doi.org/10.21105/joss.03773 + +> **IRMA**: Shepard SS, Meno S, Bahl J, Wilson MM, Barnes J, Neuhaus E. 
Viral deep sequencing needs an adaptive approach: IRMA, the iterative refinement meta-assembler. BMC Genomics. 2016 Sep 5;17(1):708. doi: 10.1186/s12864-016-3030-6. Erratum in: BMC Genomics. 2016 Oct 13;17(1):801. doi: 10.1186/s12864-016-3138-8. PMID: 27595578; PMCID: PMC5011931. +