[New Workflow] Adding the TheiaMeta_Panel_Illumina_PE Workflow #656

sage-wright · 2024-10-21T17:54:47Z

This PR closes #553

🗑️ This dev branch should NOT be deleted after merging to main.

🧠 Summary

The Illumina VSP panel is one example of a panel-based sequencing assay. This particular panel contains over 200 target viruses. Data from it cannot be analyzed with the traditional TheiaMeta analysis pathway because the large number of target organisms makes the assemble-first, bin-later approach undesirable. TheiaMeta_Panel instead performs taxonomic binning first with Kraken2, then attempts to assemble each bin. If the organism to which a bin belongs is supported in TheiaCoV, we perform additional characterization steps, though the results of those characterizations are not always of high quality. Users should be cautious before reporting the characterization results.

⚡ Impacted Workflows/Tasks

TheiaMeta_Illumina_PE

This PR may lead to different results in pre-existing outputs: No

This PR uses an element that could cause duplicate runs to have different results: No

🛠️ Changes

⚙️ Algorithm

TheiaMeta_Panel takes the following approach:

Perform general quality control; the minimum read length may need to be lowered to account for the shorter reads generated using this panel.
Kraken2 is run using (by default) the Viral database built on RefSeq.
After scattering on a list of taxonomic IDs (NCBI), the reads associated with each taxon are extracted.
The binned reads are run through fastq_scan.
If the number of reads is greater than 1000, assembly is attempted.
Assembly is performed with metaspades, followed by minimap2 assembly correction, samtools SAM-to-BAM conversion, and pilon assembly polishing.
If an assembly was generated, the workflow runs morgana_magic which performs characterization depending on the organism.
Results for all shards are gathered and printed to a TSV file.
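
As an illustration only, the control flow of the steps above can be sketched in Python. This is not the workflow's WDL: `run_panel`, `to_tsv`, the `assemble` and `characterize` callables, and the column names are hypothetical stand-ins for the real tasks (metaspades/pilon and morgana_magic), and reads are modelled as a plain list per taxon ID. The only detail taken from the description is the threshold: assembly is attempted when the read count is strictly greater than 1000.

```python
import csv
import io
from typing import Callable, Dict, List, Optional

MINIMUM_READ_NUMBER = 1000  # threshold from the description above
COLUMNS = ["taxon_id", "read_count", "assembly", "characterization"]  # hypothetical columns


def run_panel(
    reads_by_taxon: Dict[int, List[str]],
    assemble: Callable[[List[str]], Optional[str]],
    characterize: Callable[[int, str], str],
    minimum_read_number: int = MINIMUM_READ_NUMBER,
) -> List[Dict[str, str]]:
    """Bin first, then assemble: one result row per taxon ID (one row per shard)."""
    rows = []
    for taxon_id, reads in sorted(reads_by_taxon.items()):
        row = {
            "taxon_id": str(taxon_id),
            "read_count": str(len(reads)),
            "assembly": "",
            "characterization": "",
        }
        # Assembly is only attempted when the bin holds more reads than the threshold.
        if len(reads) > minimum_read_number:
            assembly = assemble(reads)  # may return None when no assembly is produced
            if assembly is not None:
                row["assembly"] = assembly
                row["characterization"] = characterize(taxon_id, assembly)
        rows.append(row)
    return rows


def to_tsv(rows: List[Dict[str, str]]) -> str:
    """Gather all shard rows into a single tab-separated table."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

A bin with exactly 1000 reads is skipped under this reading of the threshold, and a bin whose assembly step returns nothing still gets a row with empty assembly and characterization fields.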

Assembly is fault resistant: if one shard fails, the workflow continues.
This also affects TheiaMeta: its workflows will no longer fail if assembly fails in either the metaspades or pilon task.
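
A minimal Python sketch of what "fault resistant" means for a scatter, under the assumption that a failed shard simply yields no result instead of aborting the run. How the WDL tasks achieve this is not shown in this description; `run_shards` and `gather` are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, Hashable, Iterable, Optional, TypeVar

T = TypeVar("T")


def run_shards(
    shards: Iterable[Hashable],
    task: Callable[[Hashable], T],
) -> Dict[Hashable, Optional[T]]:
    """Run the task on every shard; a failing shard records None and the loop continues."""
    outcomes: Dict[Hashable, Optional[T]] = {}
    for shard in shards:
        try:
            outcomes[shard] = task(shard)
        except Exception:
            # The failure is contained to this shard; the remaining shards still run.
            outcomes[shard] = None
    return outcomes


def gather(outcomes: Dict[Hashable, Optional[T]]) -> Dict[Hashable, T]:
    """Keep only the shards that produced a result."""
    return {shard: result for shard, result in outcomes.items() if result is not None}
```

The trade-off is that a failed assembly becomes indistinguishable from a bin that was never assembled unless the failure is recorded somewhere, so downstream steps have to treat a missing assembly as a normal outcome.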

➡️ Inputs

Many, please see documentation

⬅️ Outputs

Many, please see documentation

🧪 Testing

Tested regular TheiaMeta on 5 HAV samples here
Tested 21 Illumina VSP samples with TheiaMeta Panel here

Suggested Scenarios for Reviewer to Test

🔬 Final Developer Checklist

The workflow/task has been tested and results, including file contents, are as anticipated
The CI/CD has been adjusted and tests are passing (Theiagen developers)
Code changes follow the style guide
Documentation and/or workflow diagrams have been updated if applicable (Theiagen developers only)

🎯 Reviewer Checklist

All changed results have been confirmed
You have tested the PR appropriately (see the testing guide for more information)
All code adheres to the style guide
MD5 sums have been updated
The PR author has addressed all comments
The documentation has been updated

AndrewLangvt

This is some really great work @sage-wright & @cimendes. Just a few points for us to discuss, but I think they're pretty minor overall. I think this workflow has the potential to become really widely adopted by labs as multi-pathogen panels become more readily used for "bug screening."

tasks/assembly/task_metaspades.wdl

tasks/quality_control/read_filtering/task_pilon.wdl

tasks/taxon_id/contamination/task_kraken2.wdl

tasks/taxon_id/task_krakentools.wdl

tasks/utilities/data_handling/task_gather_scatter.wdl

workflows/theiameta/wf_theiameta_illumina_pe.wdl

workflows/theiameta/wf_theiameta_panel_illumina_pe.wdl

AndrewLangvt · 2024-11-04T13:53:56Z

workflows/theiameta/wf_theiameta_panel_illumina_pe.wdl

+          read1 = select_first([krakentools.extracted_read1]),
+          read2 = select_first([krakentools.extracted_read2])
+      }
+      if (fastq_scan_binned.read1_seq > minimum_read_number) {


Does the kraken read extraction retain only mate pairs, or is there a chance of singletons in the L & R read datasets? I.e., should this be checking that both read1 and read2 are above the threshold?
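
One way to answer this empirically is to compare the read IDs in the two extracted files. The Python sketch below makes several assumptions: the inputs are well-formed four-line FASTQ records, mates share an ID apart from an optional `/1` or `/2` suffix, and `read2_seq` is a hypothetical counterpart to the `read1_seq` value used in the condition above. It does not assert anything about what the extraction step actually emits.

```python
import io
from typing import Dict, List, TextIO, Union


def read_ids(fastq: TextIO) -> List[str]:
    """Collect read IDs from the header line of each four-line FASTQ record."""
    ids = []
    for index, line in enumerate(fastq):
        if index % 4 == 0 and line.strip():
            name = line.strip().split()[0]
            if name.startswith("@"):
                name = name[1:]
            # Drop an old-style mate suffix so both mates map to the same ID.
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            ids.append(name)
    return ids


def pair_summary(
    read1: TextIO, read2: TextIO, minimum_read_number: int
) -> Dict[str, Union[int, bool]]:
    """Count reads per file, shared mates, and singletons, then evaluate both checks."""
    ids1, ids2 = read_ids(read1), read_ids(read2)
    set1, set2 = set(ids1), set(ids2)
    return {
        "read1_seq": len(ids1),
        "read2_seq": len(ids2),  # hypothetical name, mirrors read1_seq
        "paired": len(set1 & set2),
        "singletons": len(set1 ^ set2),
        "passes_read1_only": len(ids1) > minimum_read_number,
        "passes_both": len(ids1) > minimum_read_number and len(ids2) > minimum_read_number,
    }


# Toy input: three reads in R1, only two of their mates in R2.
r1 = io.StringIO("@a/1\nACGT\n+\nIIII\n@b/1\nACGT\n+\nIIII\n@c/1\nACGT\n+\nIIII\n")
r2 = io.StringIO("@a/2\nACGT\n+\nIIII\n@b/2\nACGT\n+\nIIII\n")
summary = pair_summary(r1, r2, minimum_read_number=2)
```

If `singletons` is always zero on real extracted data, checking `read1_seq` alone is equivalent to checking both files; if it is not, the two conditions can disagree, as they do for the toy input here.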

workflows/utilities/wf_morgana_magic.wdl

workflows/utilities/wf_organism_parameters.wdl

sage-wright · 2025-01-16T21:28:18Z

closing this pr to avoid clutter since development is stalled for the foreseeable future; i can always reopen this pr when development resumes

sage-wright and others added 30 commits September 24, 2024 19:19

make theiameta_panel

7b265b2

rename taxon id vars in org param

b3bd529

language

593e5b5

progress

01c1223

notes

2c63022

finish

83e8add

does this work?

e8e757d

set required for now

8416a8e

correct terrible spelling

8ddf11b

add runtime

236230f

start documentation

3d23bce

add information on workflow tasks to documentation

75c7224

Merge branch 'main' into smw-theiameta-panel-dev

c177267

remove krona

e8312a5

add + to everything????

b5260e4

remove from one array

aad8ec4

also remove from that one too

991d540

trying something cRaZy

f56ca68

it doesn't work

a04af99

more crazy ideas?

498d07c

maybe basename is a good idea

688f89b

change to json

fce8abb

sort of works but is ugly

cc97d70

IT WORKS

9a7086a

clean up

bc96474

add dummy genome length & logic block consensus qc

93bb88b

remove null values from identified_organisms otuput

c4cf61b

add versioning

4e1c373

up to 1000

148cb9d

make theiameta_panel fault-resistant, has impacts on theiameta_illumi…

8c7de78

…na_pe

sage-wright added 14 commits October 23, 2024 19:01

hide some optional inputs

c04fa48

add inputs and outputs to docs

43e0efd

Merge branch 'main' into smw-theiameta-panel-dev

4c8d373

enable searchable

547a920

set default, expand docs

82695a7

update contributors

033cbc0

input explosion

c8b658a

make good

92acb24

document the explosion

a3c7c52

optionalize extracted reads

b0498ae

add flu outputs to gather scatter

48a26a2

finish documentation

fcc17d9

clean up docs

8bc64fb

update md5sums

931e815

sage-wright marked this pull request as ready for review October 28, 2024 19:10

sage-wright requested a review from a team as a code owner October 28, 2024 19:10

AndrewLangvt reviewed Nov 4, 2024

sage-wright added 4 commits November 4, 2024 15:24

move taxon_id conversion to its own file; remove comment cruft

42e5503

name things lol

d5bebaf

typing issues, again

cf40557

remove output

c6ff0c1

sage-wright closed this Nov 7, 2024

sage-wright deleted the smw-theiameta-panel-dev branch November 7, 2024 20:18

sage-wright restored the smw-theiameta-panel-dev branch November 7, 2024 20:26

fraser-combe reopened this Nov 7, 2024

Merge branch 'main' into smw-theiameta-panel-dev

522a016

sage-wright marked this pull request as draft December 6, 2024 17:05

Merge branch 'main' into smw-theiameta-panel-dev

53d00c9

sage-wright closed this Jan 16, 2025
