[Cauris_Cladetyper] Various improvements and removal of old TheiaCauris references #700

Merged
merged 11 commits into from
Dec 31, 2024
94 changes: 65 additions & 29 deletions docs/workflows/genomic_characterization/theiaeuk.md
@@ -4,7 +4,7 @@

| **Workflow Type** | **Applicable Kingdom** | **Last Known Changes** | **Command-line Compatibility** | **Workflow Level** |
|---|---|---|---|---|
| [Genomic Characterization](../../workflows_overview/workflows_type.md/#genomic-characterization) | [Mycotics](../../workflows_overview/workflows_kingdom.md/#mycotics) | PHB v2.3.0 | Yes | Sample-level |
| [Genomic Characterization](../../workflows_overview/workflows_type.md/#genomic-characterization) | [Mycotics](../../workflows_overview/workflows_kingdom.md/#mycotics) | PHB vX.X.X | Yes | Sample-level |

## TheiaEuk Workflows

@@ -598,64 +598,100 @@ All input reads are processed through "core tasks" in the TheiaEuk workflows. Th

| **Variable** | **Type** | **Description** |
|---|---|---|
| assembly_fasta | File | _De novo_ genome assembly in FASTA format |
| assembly_length | Int | Length of assembly (total number of nucleotides) as determined by QUAST |
| bbduk_docker| String | BBDuk docker image used |
| busco_database | String | BUSCO database used |
| busco_docker | String | BUSCO docker image used |
| busco_report | File | A plain text summary of the results in BUSCO notation |
| busco_results | String | BUSCO results (see above for explanation of BUSCO notation) |
| busco_version | String | BUSCO software version used |
| cg_pipeline_docker | String | Docker file used for running CG-Pipeline on cleaned reads |
| cg_pipeline_report | File | TSV file of read metrics from raw reads, including average read length, number of reads, and estimated genome coverage |
| est_coverage_clean | Float | Estimated coverage calculated from clean reads and genome length |
| est_coverage_raw | Float | Estimated coverage calculated from raw reads and genome length |
| cladetyper_annotated_reference | String | The annotated reference file for the identified clade, "None" if no clade was identified |
| cladetyper_clade | String | The clade assigned to the input assembly |
| cladetyper_docker_image | String | The Docker container used for the task |
| cladetyper_gambit_version | String | The version of GAMBIT used for the analysis |
| combined_mean_q_clean | Float | Mean quality score for the combined clean reads |
| combined_mean_q_raw | Float | Mean quality score for the combined raw reads |
| combined_mean_readlength_clean | Float | Mean read length for the combined clean reads |
| combined_mean_readlength_raw | Float | Mean read length for the combined raw reads |
| contigs_fastg | File | Assembly graph if megahit used for genome assembly |
| contigs_gfa | File | Assembly graph if spades used for genome assembly |
| contigs_lastgraph | File | Assembly graph if velvet used for genome assembly |
| est_coverage_clean | Float | Estimated coverage calculated from clean reads and genome length |
| est_coverage_raw | Float | Estimated coverage calculated from raw reads and genome length |
| fastp_html_report | File | The HTML report made with fastp |
| fastp_version | String | Version of fastp software used |
| fastq_scan_clean1_json | File | JSON file output from `fastq-scan` containing summary stats about clean forward read quality and length |
| fastq_scan_clean2_json | File | JSON file output from `fastq-scan` containing summary stats about clean reverse read quality and length |
| fastq_scan_num_reads_clean_pairs | String | Number of read pairs after cleaning as calculated by fastq_scan |
| fastq_scan_num_reads_clean1 | Int | Number of forward reads after cleaning as calculated by fastq_scan |
| fastq_scan_num_reads_clean2 | Int | Number of reverse reads after cleaning as calculated by fastq_scan |
| fastq_scan_num_reads_raw_pairs | String | Number of input read pairs calculated by fastq_scan |
| fastq_scan_num_reads_raw1 | Int | Number of input forward reads calculated by fastq_scan |
| fastq_scan_num_reads_raw2 | Int | Number of input reverse reads calculated by fastq_scan |
| fastq_scan_num_reads_raw_pairs | String | Number of input read pairs calculated by fastq_scan |
| fastq_scan_raw1_json | File | JSON file output from `fastq-scan` containing summary stats about raw forward read quality and length |
| fastq_scan_raw2_json | File | JSON file output from `fastq-scan` containing summary stats about raw reverse read quality and length |
| r1_mean_q_clean | Float | Mean quality score of clean forward reads |
| r1_mean_q_raw | Float | Mean quality score of raw forward reads |
| r2_mean_q_clean | Float | Mean quality score of clean reverse reads |
| r2_mean_q_raw | Float | Mean quality score of raw reverse reads |
| fastq_scan_version | String | Version of fastq-scan software used |
| fastqc_clean1_html | File | Graphical visualization of clean forward read quality from fastqc to open in an internet browser |
| fastqc_clean2_html | File | Graphical visualization of clean reverse read quality from fastqc to open in an internet browser |
| fastqc_docker | String | Docker container used with fastqc |
| fastqc_num_reads_clean1 | Int | Number of forward reads after cleaning by fastqc |
| fastqc_num_reads_clean2 | Int | Number of reverse reads after cleaning by fastqc |
| fastqc_num_reads_clean_pairs | String | Number of read pairs after cleaning by fastqc |
| fastqc_num_reads_raw1 | Int | Number of input forward reads by fastqc |
| fastqc_num_reads_raw2 | Int | Number of input reverse reads by fastqc |
| fastqc_num_reads_raw_pairs | String | Number of input read pairs by fastqc |
| fastqc_raw1_html | File | Graphical visualization of raw forward read quality from fastqc to open in an internet browser |
| fastqc_raw2_html | File | Graphical visualization of raw reverse read quality from fastqc to open in an internet browser |
| fastqc_version | String | Version of fastqc software used |
| gambit_closest_genomes | File | CSV file listing genomes in the GAMBIT database that are most similar to the query assembly |
| gambit_db_version | String | Version of the GAMBIT database used |
| gambit_docker | String | GAMBIT docker file used |
| gambit_predicted_taxon | String | Taxon predicted by GAMBIT |
| gambit_predicted_taxon_rank | String | Taxon rank of GAMBIT taxon prediction |
| gambit_report | File | GAMBIT report in a machine-readable format |
| gambit_version | String | Version of GAMBIT software used |
| assembly_length | Int | Length of assembly (total contig length) as determined by QUAST |
| n50_value | Int | N50 of assembly calculated by QUAST |
| number_contigs | Int | Total number of contigs in assembly |
| qc_check | String | A string that indicates whether or not the sample passes a set of pre-determined and user-provided QC thresholds |
| qc_standard | File | The user-provided file that contains the QC thresholds used for the QC check |
| quast_gc_percent | Float | The GC percent of your sample |
| quast_report | File | TSV report from QUAST |
| quast_version | String | Software version of QUAST used |
| r1_mean_q_raw | Float | Mean quality score of raw forward reads |
| r1_mean_readlength_raw | Float | Mean read length of raw forward reads |
| r2_mean_q_raw | Float | Mean quality score of raw reverse reads |
| r2_mean_readlength_clean | Float | Mean read length of clean reverse reads |
| rasusa_version | String | Version of rasusa used |
| read1_subsampled | File | Subsampled read1 file |
| read2_subsampled | File | Subsampled read2 file |
| bbduk_docker | String | BBDuk docker image used |
| fastp_version | String | Version of fastp software used |
| read1_clean | File | Clean forward reads file |
| read1_subsampled | File | Subsampled read1 file |
| read2_clean | File | Clean reverse reads file |
| num_reads_clean_pairs | String | Number of read pairs after cleaning |
| num_reads_clean1 | Int | Number of forward reads after cleaning |
| num_reads_clean2 | Int | Number of reverse reads after cleaning |
| num_reads_raw_pairs | String | Number of input read pairs |
| num_reads_raw1 | Int | Number of input forward reads |
| num_reads_raw2 | Int | Number of input reverse reads |
| trimmomatic_version | String | Version of trimmomatic used |
| clean_read_screen | String | PASS or FAIL result from clean read screening; FAIL accompanied by the reason for failure |
| raw_read_screen | String | PASS or FAIL result from raw read screening; FAIL accompanied by the reason for failure |
| assembly_fasta | File | <https://github.com/tseemann/shovill#contigsfa> |
| contigs_fastg | File | Assembly graph if megahit used for genome assembly |
| contigs_gfa | File | Assembly graph if spades used for genome assembly |
| contigs_lastgraph | File | Assembly graph if velvet used for genome assembly |
| read2_subsampled | File | Subsampled read2 file |
| read_screen_clean | String | PASS or FAIL result from clean read screening; FAIL accompanied by the reason for failure |
| read_screen_raw | String | PASS or FAIL result from raw read screening; FAIL accompanied by the reason for failure |
| seq_platform | String | Sequencing platform input by the user |
| shovill_pe_version | String | Shovill version used |
| theiaeuk_snippy_variants_bam | File | BAM file produced by the snippy module |
| theiaeuk_illumina_pe_analysis_date | String | Date of TheiaEuk PE workflow execution |
| theiaeuk_illumina_pe_version | String | TheiaEuk PE workflow version used |
| theiaeuk_snippy_variants_bai | String | BAI file produced by the snippy module |
| theiaeuk_snippy_variants_bam | String | BAM file produced by the snippy module |
| theiaeuk_snippy_variants_coverage_tsv | String | TSV file containing coverage information for each base in the reference genome |
| theiaeuk_snippy_variants_gene_query_results | File | File containing all lines from variants file matching gene query terms |
| theiaeuk_snippy_variants_hits | String | String of all variant file entries matching gene query term |
| theiaeuk_snippy_variants_num_reads_aligned | String | Number of reads aligned by snippy |
| theiaeuk_snippy_variants_num_variants | Int | Number of variants detected by snippy |
| theiaeuk_snippy_variants_outdir_tarball | File | Tar compressed file containing full snippy output directory |
| theiaeuk_snippy_variants_percent_ref_coverage | String | Percent of reference genome covered by snippy |
| theiaeuk_snippy_variants_query | String | The gene query term(s) used to search the variants file |
| theiaeuk_snippy_variants_query_check | String | Whether the gene query terms were present in the annotated reference genome file |
| theiaeuk_snippy_variants_reference_genome | File | The reference genome used in the alignment and variant calling |
| theiaeuk_snippy_variants_results | File | The variants file produced by snippy |
| theiaeuk_snippy_variants_summary | File | A file summarizing the variants detected by snippy |
| theiaeuk_snippy_variants_version | String | The version of the snippy_variants module being used |
| seq_platform | String | Sequencing platform input by the user |
| theiaeuk_illumina_pe_analysis_date | String | Date of TheiaEuk PE workflow execution |
| theiaeuk_illumina_pe_version | String | TheiaEuk PE workflow version used |
| trimmomatic_docker | String | Docker image used for trimmomatic |
| trimmomatic_version | String | Version of trimmomatic used |

</div>
76 changes: 70 additions & 6 deletions docs/workflows/standalone/cauris_cladetyper.md
@@ -1,22 +1,86 @@
# Cauris_CladeTyper

!!! warning "NEEDS WORK!!!!"
This page is under construction and will be updated soon.

## Quick Facts

| **Workflow Type** | **Applicable Kingdom** | **Last Known Changes** | **Command-line Compatibility** | **Workflow Level** |
|---|---|---|---|---|
| [Standalone](../../workflows_overview/workflows_type.md/#standalone) | [Mycotics](../../workflows_overview/workflows_kingdom.md#mycotics) | PHB v1.0.0 | Yes | Sample-level |
| [Standalone](../../workflows_overview/workflows_type.md/#standalone) | [Mycotics](../../workflows_overview/workflows_kingdom.md#mycotics) | PHB vX.X.X | Yes | Sample-level |

## Cauris_CladeTyper_PHB

The Cauris_CladeTyper_PHB Workflow is designed to assign clade to _Candida auris_ Whole Genome Sequencing assemblies based on their genomic sequence similarity to the five clade-specific reference files. Clade typing is essential for understanding the epidemiology and evolutionary dynamics of this emerging multidrug-resistant fungal pathogen.
The Cauris_CladeTyper_PHB Workflow is designed to assign the clade to _Candida auris_ (also known as _Candidozyma auris_) WGS assemblies based on their genomic sequence similarity to the five clade-specific reference files. Clade typing is essential for understanding the epidemiology and evolutionary dynamics of this emerging multidrug-resistant fungal pathogen.

### Inputs

<div class="searchable-table" markdown="1">

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| cauris_cladetyper | **assembly_fasta** | File | The input assembly file in FASTA format | | Required |
| cauris_cladetyper | **samplename** | String | The name of the sample being analyzed | | Required |
| cladetyper | **cpu** | Int | Number of CPUs to allocate to the task | 8 | Optional |
| cladetyper | **disk_size** | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional |
| cladetyper | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/biocontainers/hesslab-gambit:0.5.1--py37h8902056_0" | Optional |
| cladetyper | **kmer_size** | Int | The kmer size to use for generating the GAMBIT signatures file; see GAMBIT documentation for more details | 11 | Optional |
| cladetyper | **memory** | Int | Amount of memory/RAM (in GB) to allocate to the task | 16 | Optional |
| cladetyper | **ref_clade1** | File | The reference assembly for clade 1 | gs://theiagen-public-files/terra/candida_auris_refs/Cauris_Clade1_GCA_002759435.2_Cand_auris_B8441_V2_genomic.fasta | Optional |
| cladetyper | **ref_clade1_annotated** | String | The path to the annotated reference for clade 1 | "gs://theiagen-public-files/terra/candida_auris_refs/Cauris_Clade1_GCA_002759435_Cauris_B8441_V2_genomic.gbff" | Optional |
| cladetyper | **ref_clade2** | File | The reference assembly for clade 2 | gs://theiagen-public-files/terra/candida_auris_refs/Cauris_Clade2_GCA_003013715.2_ASM301371v2_genomic.fasta | Optional |
| cladetyper | **ref_clade2_annotated** | String | The path to the annotated reference for clade 2 | "gs://theiagen-public-files/terra/candida_auris_refs/Cauris_Clade2_GCA_003013715.2_ASM301371v2_genomic.gbff"| Optional |
| cladetyper | **ref_clade3** | File | The reference assembly for clade 3 | gs://theiagen-public-files/terra/candida_auris_refs/Cauris_Clade3_reference.fasta | Optional |
| cladetyper | **ref_clade3_annotated** | String | The path to the annotated reference for clade 3 | "gs://theiagen-public-files/terra/candida_auris_refs/Cauris_Clade3_GCF_002775015.1_Cand_auris_B11221_V1_genomic.gbff" | Optional |
| cladetyper | **ref_clade4** | File | The reference assembly for clade 4 | gs://theiagen-public-files/terra/candida_auris_refs/Cauris_Clade4_reference.fasta | Optional |
| cladetyper | **ref_clade4_annotated** | String | The path to the annotated reference for clade 4 | "gs://theiagen-public-files/terra/candida_auris_refs/Cauris_Clade4_GCA_003014415.1_Cand_auris_B11243_genomic.gbff" | Optional |
| cladetyper | **ref_clade5** | File | The reference assembly for clade 5 | gs://theiagen-public-files/terra/candida_auris_refs/Cauris_Clade5_GCA_016809505.1_ASM1680950v1_genomic.fasta | Optional |
| cladetyper | **ref_clade5_annotated** | String | The path to the annotated reference for clade 5 | "gs://theiagen-public-files/terra/candida_auris_refs/Cauris_Clade5_GCA_016809505.1_ASM1680950v1_genomic.gbff" | Optional |
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |

</div>
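
For command-line runs, the inputs in the table above are typically supplied as a JSON file. The following is a minimal sketch, assuming the workflow is named `cauris_cladetyper` (taken from the Terra task name column) and that optional task inputs are addressed as `cauris_cladetyper.cladetyper.<name>`. The helper `build_inputs`, the sample name, and the file path are hypothetical and should be checked against the WDL before use.

```python
import json


def build_inputs(assembly_fasta, samplename, kmer_size=None):
    """Build a WDL-style inputs mapping for Cauris_CladeTyper_PHB.

    Assumption: keys follow the `<workflow>.<input>` convention with the
    workflow named `cauris_cladetyper`; verify against the WDL before use.
    """
    inputs = {
        "cauris_cladetyper.assembly_fasta": assembly_fasta,
        "cauris_cladetyper.samplename": samplename,
    }
    # Optional task-level inputs are only written when overridden, so the
    # defaults listed in the inputs table stay in effect otherwise.
    if kmer_size is not None:
        inputs["cauris_cladetyper.cladetyper.kmer_size"] = kmer_size
    return inputs


if __name__ == "__main__":
    # Hypothetical sample name and path, for illustration only.
    example = build_inputs("sample01_contigs.fasta", "sample01")
    print(json.dumps(example, indent=2))
```

Writing only the overridden optional inputs keeps the JSON short and leaves the documented defaults (references, Docker image, resources) untouched.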

### Workflow Tasks

The Cauris_Cladetyper Workflow for _Candida auris_ employs GAMBIT for taxonomic identification, comparing whole genome sequencing data against reference databases to accurately classify _Candida auris_ isolates. A custom database featuring five clade-specific _Candida auris_ reference genomes facilitates clade typing. Sequences undergo genomic signature comparison against the custom database, enabling assignment to one of the five _Candida auris_ clades (Clade I to Clade V) based on sequence similarity and phylogenetic relationships. This integrated approach ensures precise clade assignments, crucial for understanding the genetic diversity and epidemiology of _Candida auris_.
??? task "Cauris_Cladetyper"
    The Cauris_Cladetyper Workflow for _Candida auris_ employs GAMBIT for taxonomic identification, comparing whole genome sequencing data against reference databases to accurately classify _Candida auris_ isolates.

    A custom GAMBIT database is created using five clade-specific _Candida auris_ reference genomes. Sequences undergo genomic signature comparison against this database, which then enables assignment to one of the five _Candida auris_ clades (Clade I to Clade V) based on sequence similarity and phylogenetic relationships. This integrated approach ensures precise clade assignments, crucial for understanding the genetic diversity and epidemiology of _Candida auris_.

    Reference information for the five clades is listed below:

    | Clade | Genome Accession | Assembly Name | Strain | BioSample Accession |
    |---|---|---|---|---|
    | Clade I | GCA_002759435.2 | Cand_auris_B8441_V2 | B8441 | SAMN05379624 |
    | Clade II | GCA_003013715.2 | ASM301371v2 | B11220 | SAMN05379608 |
    | Clade III | GCA_002775015.1 | Cand_auris_B11221_V1 | B11221 | SAMN05379609 |
    | Clade IV | GCA_003014415.1 | Cand_auris_B11243 | B11243 | SAMN05379619 |
    | Clade V | GCA_016809505.1 | ASM1680950v1 | IFRC2087 | SAMN11570381 |

    !!! techdetails "Cauris_Cladetyper Technical Details"

        |  | Links |
        | --- | --- |
        | Task | [task_cauris_cladetyper.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/species_typing/candida/task_cauris_cladetyper.wdl) |
        | Software Source Code | [GAMBIT on GitHub](https://github.com/jlumpe/gambit) |
        | Software Documentation | [GAMBIT Overview](https://theiagen.notion.site/GAMBIT-7c1376b861d0486abfbc316480046bdc?pvs=4) |
        | Original Publication(s) | [GAMBIT (Genomic Approximation Method for Bacterial Identification and Tracking): A methodology to rapidly leverage whole genome sequencing of bacterial isolates for clinical identification](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0277575) <br> [TheiaEuk: a species-agnostic bioinformatics workflow for fungal genomic characterization](https://doi.org/10.3389/fpubh.2023.1198213) |
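
The idea behind the clade assignment described in the Cauris_Cladetyper task can be illustrated with a toy sketch. It assumes plain k-mer sets and Jaccard distance as a stand-in for GAMBIT's signature comparison. It is not the GAMBIT API and not the task's actual implementation, and the function names and sequences are hypothetical.

```python
def kmer_set(sequence, k=11):
    """Return the set of all k-mers in a DNA sequence (case-insensitive)."""
    seq = sequence.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b):
    """Jaccard distance between two sets (1.0 when both are empty)."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


def assign_clade(query_seq, references, k=11):
    """Return (clade, distance) for the reference closest to the query."""
    query = kmer_set(query_seq, k)
    distances = {
        clade: jaccard_distance(query, kmer_set(ref_seq, k))
        for clade, ref_seq in references.items()
    }
    # The clade whose reference has the smallest distance is reported.
    best = min(distances, key=distances.get)
    return best, distances[best]


if __name__ == "__main__":
    # Toy sequences for illustration only; real references are whole genomes.
    toy_refs = {
        "Clade I": "ACGTACGTGGCCTTAA",
        "Clade II": "TTGACCATGGTACCAG",
    }
    clade, dist = assign_clade("ACGTACGTGGCCTTAC", toy_refs, k=4)
    print(clade, round(dist, 3))
```

The sketch always returns the nearest reference. The outputs table indicates the real task can report that no clade was identified, which would call for a distance cutoff that is not modelled here.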

### Outputs

<div class="searchable-table" markdown="1">

| **Variable** | **Type** | **Description** |
|---|---|---|
| cauris_cladetyper_wf_analysis_date | String | Date of analysis |
| cauris_cladetyper_wf_version | String | Version of PHB used for the analysis |
| cladetyper_annotated_reference | String | The annotated reference file for the identified clade, "None" if no clade was identified |
| cladetyper_clade | String | The clade assigned to the input assembly |
| cladetyper_docker_image | String | The Docker container used for the task |
| cladetyper_gambit_version | String | The version of GAMBIT used for the analysis |

</div>
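
When post-processing results, the `"None"` sentinel documented for `cladetyper_annotated_reference` needs explicit handling. The following is a minimal sketch, assuming the outputs have already been loaded into a dictionary keyed by the variable names in the table above. The helper `summarize_cladetyper` and the example clade label are hypothetical.

```python
def summarize_cladetyper(outputs):
    """Summarize clade-typing outputs from a dict keyed by output variable."""
    clade = outputs.get("cladetyper_clade", "")
    annotated_ref = outputs.get("cladetyper_annotated_reference", "None")
    # Per the outputs table, the annotated reference is the string "None"
    # when no clade was identified.
    clade_identified = annotated_ref != "None"
    return {
        "clade_identified": clade_identified,
        "clade": clade if clade_identified else None,
        "annotated_reference": annotated_ref if clade_identified else None,
    }


if __name__ == "__main__":
    # Hypothetical clade label; the path is the documented clade 1 default.
    example = {
        "cladetyper_clade": "Clade I",
        "cladetyper_annotated_reference": (
            "gs://theiagen-public-files/terra/candida_auris_refs/"
            "Cauris_Clade1_GCA_002759435_Cauris_B8441_V2_genomic.gbff"
        ),
    }
    print(summarize_cladetyper(example))
```

Converting the sentinel into a real `None` avoids accidentally passing the literal string `"None"` to a downstream step as a file path.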

## References

> Lumpe J, Gumbleton L, Gorzalski A, Libuit K, Varghese V, Lloyd T, et al. (2023) GAMBIT (Genomic Approximation Method for Bacterial Identification and Tracking): A methodology to rapidly leverage whole genome sequencing of bacterial isolates for clinical identification. PLoS ONE 18(2): e0277575. <https://doi.org/10.1371/journal.pone.0277575>
<!-- -->
> Ambrosio F, Scribner M, Wright S, Otieno J, Doughty E, Gorzalski A, Siao D, et al. (2023) TheiaEuk: A Species-Agnostic Bioinformatics Workflow for Fungal Genomic Characterization. Frontiers in Public Health 11. <https://doi.org/10.3389/fpubh.2023.1198213>