
v2.0.0

@sage-wright sage-wright released this 22 Apr 18:15
· 198 commits to main since this release
880a66c

Public Health Bioinformatics v2.0.0 Release Notes

This major release simplifies usage of the TheiaCoV workflows and restructures the inputs and outputs of several workflow series, including TheiaCoV, TheiaProk, TheiaEuk, and TheiaMeta. It also introduces three new workflows, improves several existing ones, and resolves various bugs.

Full release notes can be found here.

All inputs and outputs have been standardized across all of PHB. More information can be found here.

Find our documentation here!

🆕 New workflows:

  • Kraken2_ONT_PHB

  • TBProfiler_tNGS_PHB

    • This workflow is still in a beta state; development is currently ongoing.
    • It is used to process targeted next-generation sequencing (tNGS) Mycobacterium tuberculosis data for antimicrobial resistance (AMR) characterization with TBProfiler and tbp-parser. It includes quality assessment and control with Trimmomatic.
    • Import the workflow from Dockstore
  • Find_Shared_Variants_PHB

    • Find_Shared_Variants_PHB is a workflow for concatenating the variant results produced by the Snippy_Variants_PHB workflow across multiple samples and reshaping the data to illustrate variants that are shared among multiple samples.
    • Import this workflow from Dockstore

🚀 Changes to existing workflows:

  • TheiaCoV, TheiaProk, TheiaEuk and TheiaMeta workflows

    • All inputs and outputs have been standardized across all workflow series
  • TheiaCoV Workflow Series

    • The workflow_parameters sub-workflow now controls all taxa-specific optional inputs in TheiaCoV. The default value for the organism input is still "sars-cov-2".

    • VADR is now enabled for flu, rsv-a and rsv-b.

    • Nextclade has been updated to v3. Dataset tags older than the defaults provided are not compatible with this version. See the table below for the updated nextclade_dataset_tags.

    • Nextclade dataset names and their default values in the TheiaCoV workflows have also changed. For example, "sars-cov-2" is now called "nextstrain/sars-cov-2/wuhan-hu-1/orfs". The name "sars-cov-2" still works as an alias, but we recommend the full name because it is more descriptive and will be supported by Nextclade for the foreseeable future.

      | Organism | Old Dataset Name | New Dataset Name | New Dataset Tag |
      |---|---|---|---|
      | SARS-CoV-2 | "sars-cov-2" | "nextstrain/sars-cov-2/wuhan-hu-1/orfs" | 2024-04-15--15-08-22Z |
      | Mpox (specifically, Mpox lineage B.1 dataset) | "hMPXV_B1" | "nextstrain/mpox/lineage-b.1" | 2024-01-16--20-31-02Z |
      | Influenza A H1N1 HA | "flu_h1n1pdm_ha" | "nextstrain/flu/h1n1pdm/ha/MW626062" | 2024-01-16--20-31-02Z |
      | Influenza A H3N2 HA | "flu_h3n2_ha" | "nextstrain/flu/h3n2/ha/EPI1857216" | 2024-02-22--16-12-03Z |
      | Influenza B Victoria HA | "flu_vic_ha" | "nextstrain/flu/vic/ha/KX058884" | 2024-01-16--20-31-02Z |
      | Influenza B Yamagata HA | "flu_yam_ha" | "nextstrain/flu/yam/ha/JN993010" | 2024-01-30--16-34-55Z |
      | Influenza A H1N1 NA | "flu_h1n1pdm_na" | "nextstrain/flu/h1n1pdm/na/MW626056" | 2024-01-16--20-31-02Z |
      | Influenza A H3N2 NA | "flu_h3n2_na" | "nextstrain/flu/h3n2/na/EPI1857215" | 2024-01-16--20-31-02Z |
      | Influenza B Victoria NA | "flu_vic_na" | "nextstrain/flu/vic/na/CY073894" | 2024-01-16--20-31-02Z |
      | RSV-A | "rsv_a" | "nextstrain/rsv/a/EPI_ISL_412866" | 2024-01-29--10-29-43Z |
      | RSV-B | "rsv_b" | "nextstrain/rsv/b/EPI_ISL_1653999" | 2024-01-29--10-29-43Z |
  • TheiaCoV Flu Track

    • For the flu track:
      • Tamiflu-resistance determination has been removed in favor of the oseltamivir nomenclature. Additionally, amantadine and rimantadine were added.
        • We now check for antiviral resistance mutations against the following antiviral drugs: A_315675, amantadine, compound_367, favipiravir_resistanceflu_fludase, L_742_001, laninamivir, peramivir, pimodivir, rimantadine, oseltamivir, xofluza, zanamivir.
      • For TheiaCoV_Illumina_PE, assembly coverage is now computed for both HA and NA segments
      • Nextclade outputs are now computed for the NA segment as well as HA
  • TheiaProk Workflow Series

    • Plasmidfinder can now be toggled off through the call_plasmidfinder optional input
    • Trimmomatic quality encoding is now set to Phred+33 (-phred33) by default to avoid failures when processing SRA-Lite formatted FASTQ files
  • TheiaMeta

    • Automated binning has been integrated into TheiaMeta when a reference file is not provided. Binning is performed with SemiBin2
    • The assembly module optional inputs have been exposed, allowing the user to control the behavior of metaSPAdes and Pilon
  • SRA_Fetch

    • A new warning column now indicates whether the downloaded file is suspected to be in SRA-Lite format
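
The Nextclade dataset renames in the TheiaCoV table above can be captured in a small lookup, which may help when updating saved workflow inputs that still use the old names. This is a minimal sketch, not part of PHB: the names `NextcladeDataset`, `V3_DATASETS`, and `upgrade_dataset` are hypothetical, and the values are copied from the table.

```python
from typing import Dict, NamedTuple


class NextcladeDataset(NamedTuple):
    """A Nextclade v3 dataset name with the default tag listed in these notes."""
    name: str
    tag: str


# Old dataset name -> (new dataset name, new default tag), copied from the table.
V3_DATASETS: Dict[str, NextcladeDataset] = {
    "sars-cov-2": NextcladeDataset("nextstrain/sars-cov-2/wuhan-hu-1/orfs", "2024-04-15--15-08-22Z"),
    "hMPXV_B1": NextcladeDataset("nextstrain/mpox/lineage-b.1", "2024-01-16--20-31-02Z"),
    "flu_h1n1pdm_ha": NextcladeDataset("nextstrain/flu/h1n1pdm/ha/MW626062", "2024-01-16--20-31-02Z"),
    "flu_h3n2_ha": NextcladeDataset("nextstrain/flu/h3n2/ha/EPI1857216", "2024-02-22--16-12-03Z"),
    "flu_vic_ha": NextcladeDataset("nextstrain/flu/vic/ha/KX058884", "2024-01-16--20-31-02Z"),
    "flu_yam_ha": NextcladeDataset("nextstrain/flu/yam/ha/JN993010", "2024-01-30--16-34-55Z"),
    "flu_h1n1pdm_na": NextcladeDataset("nextstrain/flu/h1n1pdm/na/MW626056", "2024-01-16--20-31-02Z"),
    "flu_h3n2_na": NextcladeDataset("nextstrain/flu/h3n2/na/EPI1857215", "2024-01-16--20-31-02Z"),
    "flu_vic_na": NextcladeDataset("nextstrain/flu/vic/na/CY073894", "2024-01-16--20-31-02Z"),
    "rsv_a": NextcladeDataset("nextstrain/rsv/a/EPI_ISL_412866", "2024-01-29--10-29-43Z"),
    "rsv_b": NextcladeDataset("nextstrain/rsv/b/EPI_ISL_1653999", "2024-01-29--10-29-43Z"),
}


def upgrade_dataset(old_name: str) -> NextcladeDataset:
    """Return the v3 dataset name and default tag for an old dataset name."""
    try:
        return V3_DATASETS[old_name]
    except KeyError:
        # Names not listed in the release notes are rejected rather than guessed.
        raise ValueError(f"no Nextclade v3 mapping listed for {old_name!r}") from None


if __name__ == "__main__":
    dataset = upgrade_dataset("flu_h3n2_ha")
    print(dataset.name, dataset.tag)
```

Raising on unknown names, rather than passing them through, avoids silently pairing an unlisted dataset with an incompatible tag.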

Docker container updates:

  • Augur has been updated to commit hash cec4fa0ecd8612e4363d40662060a5a9c712d67e, from 2024-02-01
  • BUSCO has been updated to version v5.7.1. Due to memory issues when running eukaryotic assemblies, TheiaEuk was excluded from this update and still runs on version v5.3.2
  • pasty has been updated to version v1.0.3
  • tbp-parser has been updated to version v1.4.2
  • theiavalidate has been updated to version v0.1.0
  • ts_mlst database has been updated as of April 2024
  • VADR has been updated to version v1.6.3

🐛 Bug fixes and small improvements:

  • All workflows: Fastq_Scan outputs have been renamed (now prefixed with fastq_scan_*) to differentiate them from FastQC outputs. Several FastP and FastQC outputs are now exposed, such as their HTML reports.
  • TheiaCoV (all workflows): Edge-case bugs in QC_check and Pangolin have been resolved. The percent gene coverage task has been modularized.
  • TheiaCoV Illumina PE: read1_aligned, read1_unaligned, read2_aligned, read2_unaligned, sorted_bam_aligned, sorted_bam_aligned_bai, sorted_bam_unaligned, and sorted_bam_unaligned_bai are now output by the workflow.
  • TheiaProk (all workflows): midas_secondary_genus_coverage (the secondary genus absolute coverage) is now output.
  • TheiaEuk: Several outputs from the snippy_variants task have been exposed: snippy_variants_num_reads_aligned, snippy_variants_num_variants, snippy_variants_coverage_tsv, and snippy_variants_percent_ref_coverage.
  • BaseSpace_Fetch: A fix has been implemented that greatly speeds up the download of data from BaseSpace when using BaseSpace "Projects" to organize sequencing runs.
  • Snippy_Streamline: snippy_concatenated_variants and snippy_shared_variants are now exposed as Snippy_Streamline outputs. The snippy_snp_matrix output has been deprecated in favor of snippy_wg_snp_matrix and snippy_cg_snp_matrix.
  • kSNP3: ksnp3_number_snps, ksnp3_number_core_snps and ksnp3_core_snp_table have been added to the collection of outputs.
  • Kraken2 Standalone (all workflows): Uncompressed read files can now be processed by all Kraken2 workflows
  • Freyja_FASTQ: A new optional input, depth_cutoff, has been added, giving the user the option to exclude sites with coverage depth below the provided value (by default, no cutoff is applied). Two new outputs have been added: freyja_coverage and freyja_barcode_file
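
To illustrate what the Freyja_FASTQ depth_cutoff input described above means, here is a minimal sketch of a per-site depth filter. It does not reproduce Freyja's implementation; the function name `filter_sites_by_depth` and the position-to-depth dictionary are hypothetical, chosen only to show which sites a given cutoff would exclude.

```python
from typing import Dict


def filter_sites_by_depth(site_depths: Dict[int, int], depth_cutoff: int = 0) -> Dict[int, int]:
    """Keep only sites whose coverage depth is at least depth_cutoff.

    site_depths maps a genome position to its coverage depth (hypothetical layout).
    With the default cutoff of 0, no site is excluded.
    """
    return {
        position: depth
        for position, depth in site_depths.items()
        if depth >= depth_cutoff  # sites below the cutoff are excluded
    }


if __name__ == "__main__":
    depths = {100: 5, 101: 20, 102: 10}
    print(filter_sites_by_depth(depths, depth_cutoff=10))
```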

What's Changed

  • Adding assembly_mean_coverage metrics for flu in TheiaCoV_Illumina_PE_PHB by @jrotieno in #314
  • pangolin TMPDIR add and CI updates & improvements by @kapsakcj in #327
  • expose optional input parameter disk_size for kraken2 standalone wfs by @kapsakcj in #316
  • TheiaValidate: Compare file contents (#264) by @sage-wright in #335
  • Added Freyja coverage output to Terra table by @emmadoughty in #317
  • [TheiaMeta] Binning with SemiBin2 by @cimendes in #323
  • fix dockstore: add empty.json file and update .dockstore.yml with absolute paths to it by @kapsakcj in #347
  • [Freyja_FASTQ] output usher_barcodes file to table by @sage-wright in #338
  • theiacov_fasta_batch_PHB wf improvements by @kapsakcj in #319
  • [Kraken2] Add module to recalculate abundances based on fragment length - Kraken2_ont wf and TheiaCoV_ONT wf by @cimendes in #240
  • output unaligned FASTQ files TheiaCov_Illumina PE and SE by @kapsakcj in #275
  • [Augur_PHB] Update ncov repo commit and remove reference input to augur clades task by @sage-wright in #330
  • Expose midas secondary genus absolute coverage by @michellescribner in #257
  • Output whole genome SNP matrix for Snippy_Streamline and Snippy_Tree workflows even when core_genome used by @jrotieno in #351
  • [Kraken2_Standalone] Make task compatible with uncompressed FASTQ files by @cimendes in #331
  • TheiaCoV and TheiaProk workflows standardization & organism default subworkflow by @jrotieno in #310
  • bug fix on fastq-dl task/SRA_fetch workflow for cpu variable by @kapsakcj in #363
  • Additional Updates to Flu antiviral Calls in TheiaCoV Workflows by @jrotieno in #311
  • Fix Empty Mean Coverage for Flu HA and NA Assemblies by @jrotieno in #372
  • [TheiaProk] Enable plasmidfinder to be skipped by @sage-wright in #374
  • [FastP & FastQC] output FastP and FastQC reports & output name changes by @sage-wright in #378
  • Shared variants tasks and QC improvements for kSNP3 and Snippy by @michellescribner in #291
  • [Style Guide] Semantic adjustments throughout the PHB universe by @cimendes in #377
  • upgrade pasty to v1.0.3 by @kapsakcj in #379
  • Fix dockstore yml for Find Shared Variants wf by @michellescribner in #383
  • [TheiaMeta] Remove subworkflow for assembly with metaspades and pilon; remove exposed krakendb input from QC subworkflow by @cimendes in #380
  • bug fix to BaseSpace_Fetch_PHB by @kapsakcj in #385
  • adding option to remove reference sequence from alignment by @jrotieno in #382
  • [QC Check] Squish bug when vadr_num_alerts is a String by @sage-wright in #388
  • Percent Gene Coverage Task Modularization by @sage-wright in #341
  • Set trimmomatic_args "-phred33" as default by @kapsakcj in #389
  • [TheiaProk] update ts_mlst docker image to latest release available (2024-03-11) by @cimendes in #391
  • [Nextclade & TheiaCoV] remove tamiflu amino acid substitution detection since duplicated by @sage-wright in #393
  • adding optional arguments to freyja boot and demix calls by @jrotieno in #371
  • [SRA_Fetch] Add task to detect if a file is SRA-Lite by @cimendes in #387
  • Add VADR to flu and RSV TheiaCoV by @cimendes in #384
  • update default pangolin docker image to image with pangolin-data v1.26 by @kapsakcj in #394
  • Upgrade to nextclade v3 & update default dataset tags by @kapsakcj in #375
  • update BUSCO to v5.7.1 and small tweaks to WDL task by @kapsakcj in #401
  • [Mercury_Prep_N_Batch] add state to country by @sage-wright in #399
  • add Flu NA Nextclade outputs to theiacov_illumina_pe by @kapsakcj in #406
  • Ensure compliance with the PHA4GE Best Practices by @sage-wright in #408
  • [PHB v2.0.0] update CI by @cimendes in #411
  • TBProfiler_tNGS_PHB: Introduction of tNGS workflow for TB by @sage-wright in #272
  • update to latest nextclade_dataset release for SC2 2024-04-15--15-08-22Z by @kapsakcj in #414
  • pangolin_update wf bug fix and hiding one input param related to Flu and organism_parameters subwf by @kapsakcj in #415
  • theiacov_illumina_pe kraken output name change by @kapsakcj in #417
  • setting vadr_skip_length via organism_parameters subworkflow as this … by @kapsakcj in #418
  • pangolin_update wf: hide unused input params for vadr by @kapsakcj in #421
  • [TheiaEuk] revert busco container to busco:v5.3.2_cv1 for theiaeuk by @cimendes in #425
  • [tbp-parser] update version by @sage-wright in #428


Full Changelog: v1.3.0...v2.0.0