[New Workflow] Flye_denovo to replace DragonFlye #692

Open
wants to merge 58 commits into
base: main
fraser-combe
Contributor

@fraser-combe fraser-combe commented Dec 13, 2024

This PR closes #611, #585, and #565.

🗑️ This dev branch should be deleted after merging to main.

🧠 Summary

This PR introduces a new flye_denovo workflow to replace the Dragonflye workflow. The new workflow streamlines the assembly and polishing pipeline with a flexible, modular design and adds assembly visualization through Bandage plots.

Notable enhancements include:

New tasks, including optional read trimming with Porechop, enhanced assembly visualization with Bandage, and multiple polishing options. The workflow supports ONT data, hybrid assemblies with Illumina reads, and multiple assembly polishing tools (Medaka, Racon, and Polypolish).
Medaka polishing is set at 1 round as recommended by Rwick, and ONT

⚡ Impacted Workflows/Tasks

  • New flye_denovo workflow.
  • Replaces and enhances functionality previously offered by the Dragonflye workflow.
  • Tasks impacted:
    • task_porechop.wdl
    • task_flye.wdl
    • task_bandageplot.wdl
    • task_bwa.wdl
    • task_medaka.wdl
    • task_racon.wdl
    • task_dnaapler.wdl
    • task_polypolish.wdl
    • task_filtercontigs.wdl
    • task_dragonfly.wdl (removed)

This PR may lead to different results in pre-existing outputs: Yes

This PR uses an element that could cause duplicate runs to have different results: Yes

  • Due to the introduction of optional polishing tools and enhancements in assembly parameters, output may vary based on selected configurations.
  • This includes versions of the Medaka (with the most recent models), Polypolish, and Racon polishing tools that are updated relative to those used by Dragonflye.
  • Updated dnaapler for contig reorientation: faster run time, with similar results as tested by the authors of the tool.

🛠️ Changes

  • Added flye_denovo.wdl as a subworkflow to replace Dragonflye.
  • Enhanced modularity and task-level input definitions for flexibility.
  • Integrated multiple polishing and trimming options.
  • Introduced better documentation and metadata outputs for transparency and reproducibility.

⚙️ Algorithm

  1. Workflow Redesign: The flye_denovo workflow replaces the Dragonflye workflow, with a modular and flexible structure that separates tasks like trimming, assembly, polishing, and final orientation for clarity and maintainability.
  2. Polishing Enhancements:
    • Added support for Medaka, Racon, and Polypolish with configurable rounds of polishing and tool-specific parameters.
    • Support for hybrid assemblies using Illumina data with Polypolish.
  3. Medaka Model Selection:
    • Introduced automatic Medaka model selection based on the input reads or user-provided overrides.
    • Falls back to a default Medaka model if automatic selection fails; alternatively, the user can override the model.
    • Outputs the Medaka model used.
  4. Version Tracking:
    • Outputs versions of Flye, Porechop, Medaka, Racon, Polypolish, Bandage, and Dnaapler.
  5. Outputs:
    • Outputs now include:
      • Final polished assembly.
      • Bandage plots for graph visualization.
      • Assembly graphs in GFA format.
      • Metadata for task versions
  6. Docker Updates:
    • Updated Docker images for Flye, Medaka, Racon, dnaapler and other tasks to their latest stable versions
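
The model-selection precedence described in step 3 can be sketched as follows. This is only an illustration, not the workflow's code: the function name `resolve_medaka_model`, its arguments, and the placeholder model strings are hypothetical, and it assumes that a user-provided model takes precedence over automatic detection.

```python
from typing import Optional


def resolve_medaka_model(
    user_model: Optional[str],
    auto_detected_model: Optional[str],
    default_model: str,
) -> str:
    """Pick a Medaka model: user override, then auto-detected, then default."""
    if user_model:
        # An explicit user override wins (assumed precedence).
        return user_model
    if auto_detected_model:
        # Automatic selection from the input reads succeeded.
        return auto_detected_model
    # Automatic selection failed and no override was given: fall back.
    return default_model


# Placeholder strings for illustration, not real Medaka model names.
print(resolve_medaka_model(None, "auto-model", "default-model"))
print(resolve_medaka_model(None, None, "default-model"))
print(resolve_medaka_model("user-model", "auto-model", "default-model"))
```

Whichever branch is taken, the returned value is what would be reported as the "Medaka model used" output.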

➡️ Inputs

No

⬅️ Outputs

Bandage plot PNG output.
Version outputs for task-level software.
Medaka model used.
Assembly_fasta output from dnaapler for downstream analyses.

🧪 Testing

Scenarios were tested within TheiaProk. The TheiaProk workflow is expected to complete successfully for each task; for the flye_denovo workflow specifically, we expect an assembly FASTA to be created after any filtering or polishing.

  1. Default path: Flye > Medaka polish > filter contigs > dnaapler
    https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/586af547-03dd-4cb8-8877-8041d0064464
    [Screenshot: Medaka output model and version]

  2. Porechop run, i.e. skip_trim_reads = false
    https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/e5a645db-0e17-4c58-84ce-4e1f44ef9042

  3. Skip polishing: skip_polishing = true
    https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/110a4ffa-c208-4849-a3b2-11d88ffddc90

  4. Racon polishing pathway (polishing_rounds = 2)
    https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/95694506-946f-4ef7-9b3d-657522cc7809

  5. Hybrid assembly ONT data and Illumina (Polypolish and BWA)
    https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/4fa9c915-4572-43b9-bd9b-28bed40c75a4
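
The five scenarios above differ only in which tasks run. A rough sketch of that branching, under several assumptions, is shown here. It is not taken from flye_denovo.wdl: the `polisher` and `hybrid` arguments are hypothetical names, Bandage is assumed to run directly after Flye, filtering and dnaapler are assumed to run even when polishing is skipped, and hybrid mode is assumed to use BWA and Polypolish in place of long-read polishing.

```python
from typing import List


def planned_steps(
    skip_trim_reads: bool = True,
    skip_polishing: bool = False,
    polisher: str = "medaka",  # hypothetical selector between Medaka and Racon
    polishing_rounds: int = 1,
    hybrid: bool = False,  # hypothetical flag: Illumina reads were supplied
) -> List[str]:
    """Return the ordered task names for one configuration (illustrative only)."""
    steps: List[str] = []
    if not skip_trim_reads:
        steps.append("porechop")  # optional read trimming
    steps += ["flye", "bandage"]  # assembly, then graph visualization (assumed order)
    if not skip_polishing:
        if hybrid:
            steps += ["bwa", "polypolish"]  # short-read polishing (assumed to replace Medaka/Racon)
        elif polisher == "racon":
            steps += ["racon"] * polishing_rounds  # one entry per Racon round
        else:
            steps.append("medaka")  # a single Medaka round
    steps += ["filtercontigs", "dnaapler"]  # contig filtering, then reorientation
    return steps


print(planned_steps())  # scenario 1, default path
print(planned_steps(polisher="racon", polishing_rounds=2))  # scenario 4
```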

## Comparisons between Dragonflye and the new flye_denovo subworkflow
Here we are looking for similarities in assemblies, statistics, and downstream analyses. Eight bacterial samples were selected.

Both workflows produce assemblies of similar lengths for each sample, with minor variations (within ±1% for the eight samples tabulated below).
Both workflows achieve high BUSCO completeness scores, above 90% for six of the eight samples; the two Shigella sonnei samples score lower under both workflows (77.3–87.9%).
Both workflows consistently predict the same taxa for each sample.

Example data comparisons table

| Sample ID | Workflow | Assembly Length (bp) | BUSCO Completeness (%) | BUSCO Fragmentation (%) | BUSCO Missing (%) | Taxonomic Prediction |
|---|---|---|---|---|---|---|
| ERR8958704 | Dragonflye | 5,754,961 | 98.2 | 0.7 | 1.1 | Klebsiella pneumoniae |
| ERR8958704 | Flye_denovo | 5,732,385 | 98.0 | 0.9 | 1.1 | Klebsiella pneumoniae |
| ERR8958706 | Dragonflye | 5,727,717 | 98.0 | 0.2 | 1.8 | Klebsiella pneumoniae |
| ERR8958706 | Flye_denovo | 5,718,792 | 98.5 | 0.2 | 1.3 | Klebsiella pneumoniae |
| ERR8958833 | Dragonflye | 2,902,609 | 100.0 | 0.0 | 0.0 | Staphylococcus aureus |
| ERR8958833 | Flye_denovo | 2,902,617 | 100.0 | 0.0 | 0.0 | Staphylococcus aureus |
| ERR8958835 | Dragonflye | 2,902,603 | 99.8 | 0.2 | 0.0 | Staphylococcus aureus |
| ERR8958835 | Flye_denovo | 2,902,618 | 99.8 | 0.2 | 0.0 | Staphylococcus aureus |
| SAMN05250424 | Dragonflye | 4,774,436 | 93.8 | 3.9 | 2.3 | Salmonella enterica |
| SAMN05250424 | Flye_denovo | 4,778,142 | 92.0 | 6.1 | 1.9 | Salmonella enterica |
| SAMN05596277 | Dragonflye | 4,778,588 | 94.5 | 3.4 | 2.1 | Salmonella enterica |
| SAMN05596277 | Flye_denovo | 4,773,705 | 92.9 | 4.3 | 2.8 | Salmonella enterica |
| SAMN23569621 | Dragonflye | 5,281,055 | 83.6 | 13.6 | 2.8 | Shigella sonnei |
| SAMN23569621 | Flye_denovo | 5,248,069 | 87.9 | 9.8 | 2.3 | Shigella sonnei |
| SAMN23605158 | Dragonflye | 5,207,959 | 77.3 | 17.0 | 5.7 | Shigella sonnei |
| SAMN23605158 | Flye_denovo | 5,190,940 | 81.6 | 13.0 | 5.4 | Shigella sonnei |
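
The assembly-length claim can be checked directly from the table. The short script below is a convenience sketch rather than part of the PR; the helper name `percent_difference` is made up for illustration, and the lengths are copied from the rows above.

```python
# Assembly lengths (bp) copied from the comparison table: (Dragonflye, Flye_denovo).
lengths = {
    "ERR8958704": (5_754_961, 5_732_385),
    "ERR8958706": (5_727_717, 5_718_792),
    "ERR8958833": (2_902_609, 2_902_617),
    "ERR8958835": (2_902_603, 2_902_618),
    "SAMN05250424": (4_774_436, 4_778_142),
    "SAMN05596277": (4_778_588, 4_773_705),
    "SAMN23569621": (5_281_055, 5_248_069),
    "SAMN23605158": (5_207_959, 5_190_940),
}


def percent_difference(dragonflye_bp: int, flye_denovo_bp: int) -> float:
    """Signed length change of Flye_denovo relative to Dragonflye, in percent."""
    return 100.0 * (flye_denovo_bp - dragonflye_bp) / dragonflye_bp


diffs = {sample: percent_difference(*pair) for sample, pair in lengths.items()}

for sample, diff in diffs.items():
    print(f"{sample}: {diff:+.3f}%")

# Sample with the largest absolute change between the two workflows.
largest_sample = max(diffs, key=lambda s: abs(diffs[s]))
print("largest:", largest_sample)
```

For the eight samples listed, every difference is under 1% in magnitude; the largest is about 0.62% (SAMN23569621).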

Downstream analyses
- Gambit Taxon: Identical predictions across workflows for each sample.
Add in comparison results
- Both workflows produce identical results for most downstream analyses, ensuring reliable serotype predictions, taxonomic classifications, and virulence gene identifications.

Finally, the 44 validation ONT raw-data samples were run through flye_denovo and checked manually against previous Dragonflye submissions; the results were comparable.
https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/47b75bb4-2a1b-4e41-b5f3-2f421e4e38ed

Suggested Scenarios for Reviewer to Test

Parameters to test:
skip_trim_reads: true
skip_polishing: false
polishing_rounds: 1

Expected outputs: Final polished assembly in FASTA format.
Metadata output for versions used (e.g., Flye, Medaka).
No trimming or filtering applied.
Successful Bandage plot and GFA graph generation.
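
For reviewers running outside Terra, WDL runners such as miniwdl and Cromwell take a JSON inputs file keyed by `workflow_name.input_name`. The sketch below builds such a file for the three parameters listed above. The `flye_denovo` key prefix is an assumption (the real prefix depends on which workflow is launched), and the required read inputs are left out because their names are not given here.

```python
import json

# Assumed workflow name used as the key prefix; adjust to the workflow actually launched.
WORKFLOW_NAME = "flye_denovo"

# The three reviewer parameters from the list above.
reviewer_inputs = {
    f"{WORKFLOW_NAME}.skip_trim_reads": True,
    f"{WORKFLOW_NAME}.skip_polishing": False,
    f"{WORKFLOW_NAME}.polishing_rounds": 1,
}

# Serialize to the JSON text a runner would read from an inputs file.
inputs_json = json.dumps(reviewer_inputs, indent=2)
print(inputs_json)
```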

🔬 Final Developer Checklist

  • The workflow/task has been tested and results, including file contents, are as anticipated
  • The CI/CD has been adjusted and tests are passing (Theiagen developers)
  • Code changes follow the style guide
  • Documentation and/or workflow diagrams have been updated if applicable
    • You have updated the latest version for any affected workflows in the respective workflow documentation page and for every entry in the three workflows_overview tables.

🎯 Reviewer Checklist

  • All changed results have been confirmed
  • You have tested the PR appropriately (see the testing guide for more information)
  • All code adheres to the style guide
  • MD5 sums have been updated
  • The PR author has addressed all comments
  • The documentation has been updated

…der structure tasks and rename to skip polish and skip trim, medaka single polish
…n in racon, update flye param names passed miniwdl check
…rease maxRetries for Medaka task and capture selected model, update docs theiaprok
@fraser-combe fraser-combe marked this pull request as ready for review December 17, 2024 02:40
@fraser-combe fraser-combe requested a review from a team as a code owner December 17, 2024 02:40
@andrewjpage andrewjpage self-assigned this Dec 17, 2024
Development

Successfully merging this pull request may close these issues.

[TheiaProk] add Bandage plot visualization
3 participants