[New Workflow] Adding the TheiaMeta_Panel_Illumina_PE Workflow #656

Draft
wants to merge 56 commits into main

@sage-wright sage-wright commented Oct 21, 2024

This PR closes #553

🗑️ This dev branch should NOT be deleted after merging to main.

🧠 Summary

The Illumina VSP panel is one example of a panel-based sequencing assay. This particular panel contains over 200 target viruses. These data cannot be analyzed with the traditional TheiaMeta analysis pathway because the large number of target organisms makes the assemble-first, bin-later approach undesirable. TheiaMeta_Panel instead performs taxonomic binning first with Kraken2, then attempts to assemble each bin. If the organism to which a bin belongs is supported in TheiaCoV, we perform additional characterization steps, though the results of those characterizations are not always of high quality. Users should be cautious before reporting characterization results.

⚡ Impacted Workflows/Tasks

  • TheiaMeta_Illumina_PE

This PR may lead to different results in pre-existing outputs: No

This PR uses an element that could cause duplicate runs to have different results: No

🛠️ Changes

⚙️ Algorithm

TheiaMeta_Panel takes the following approach:

  1. Perform general quality control; the minimum read length may need to be lowered to account for the shorter reads generated using this panel.
  2. Run Kraken2 using (by default) the viral database built on RefSeq.
  3. Scatter on a list of NCBI taxonomic IDs and extract the reads associated with each taxon.
  4. Run the binned reads through fastq_scan.
  5. If the number of reads is greater than 1000, attempt assembly.
  6. Perform assembly with metaspades, followed by minimap2 assembly correction, samtools SAM-to-BAM conversion, and pilon assembly polishing.
  7. If an assembly was generated, run morgana_magic, which performs characterization depending on the organism.
  8. Gather the results for all shards and write them to a TSV file.
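
The gating in steps 3 to 8 can be sketched in plain Python. This is only an illustration of the control flow, not the WDL in this PR: `extract_reads`, `assemble`, and `characterize` are hypothetical callables standing in for the read extraction, assembly, and morgana_magic steps, and the column names are invented for the example. The threshold of 1000 comes from step 5.

```python
import csv
import io
from typing import Callable, Iterable, Optional

MIN_READS = 1000  # threshold from step 5; assumed here to be a tunable default

FIELDS = ["taxon_id", "reads", "assembly", "characterization"]  # hypothetical columns


def run_panel(
    taxon_ids: Iterable[int],
    extract_reads: Callable[[int], int],       # hypothetical: returns binned read count
    assemble: Callable[[int], Optional[str]],  # hypothetical: assembly path, or None
    characterize: Callable[[int, str], str],   # hypothetical: short summary string
) -> list[dict]:
    """Return one result row per taxon ID (one shard per taxon)."""
    rows = []
    for taxon_id in taxon_ids:
        read_count = extract_reads(taxon_id)
        row = {"taxon_id": taxon_id, "reads": read_count,
               "assembly": "", "characterization": ""}
        # Step 5: only attempt assembly with strictly more than MIN_READS reads.
        if read_count > MIN_READS:
            assembly = assemble(taxon_id)
            # Step 7: characterization runs only if an assembly was produced.
            if assembly is not None:
                row["assembly"] = assembly
                row["characterization"] = characterize(taxon_id, assembly)
        rows.append(row)
    return rows


def to_tsv(rows: list[dict]) -> str:
    """Step 8: gather all shard rows into a single TSV string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Taxa that fall at or below the threshold still get a row, with empty assembly and characterization fields, so the gathered table keeps one line per taxon ID.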

Assembly is fault tolerant: if one shard fails, the workflow continues.
This also affects TheiaMeta, as workflows will no longer fail if assembly fails in either the metaspades or pilon tasks.
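
The general idea of per-shard fault tolerance can be sketched as follows. How the WDL tasks achieve this is not described here, so this is an assumption-laden illustration in Python: `assemble` is a hypothetical callable that raises on failure, and a failed shard is recorded as `None` rather than stopping the run.

```python
from typing import Callable, Dict, Iterable, Optional


def try_assemble(taxon_id: int, assemble: Callable[[int], str]) -> Optional[str]:
    """Run a hypothetical assembler; return None instead of raising on failure."""
    try:
        return assemble(taxon_id)
    except Exception as error:  # broad on purpose: any shard failure is tolerated
        print(f"assembly failed for taxon {taxon_id}: {error}")
        return None


def assemble_all(
    taxon_ids: Iterable[int], assemble: Callable[[int], str]
) -> Dict[int, Optional[str]]:
    """Map each taxon ID to its assembly path, or None if that shard failed."""
    return {taxon_id: try_assemble(taxon_id, assemble) for taxon_id in taxon_ids}
```

Downstream steps then need to treat the assembly as optional, which matches step 7 above running only when an assembly was generated.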

➡️ Inputs

Many, please see documentation

⬅️ Outputs

Many, please see documentation

🧪 Testing

Tested regular TheiaMeta on 5 HAV samples here
Tested 21 Illumina VSP samples with TheiaMeta Panel here

Suggested Scenarios for Reviewer to Test

🔬 Final Developer Checklist

  • The workflow/task has been tested and results, including file contents, are as anticipated
  • The CI/CD has been adjusted and tests are passing (Theiagen developers)
  • Code changes follow the style guide
  • Documentation and/or workflow diagrams have been updated if applicable (Theiagen developers only)

🎯 Reviewer Checklist

  • All changed results have been confirmed
  • You have tested the PR appropriately (see the testing guide for more information)
  • All code adheres to the style guide
  • MD5 sums have been updated
  • The PR author has addressed all comments
  • The documentation has been updated

@sage-wright sage-wright marked this pull request as ready for review October 28, 2024 19:10
@sage-wright sage-wright requested a review from a team as a code owner October 28, 2024 19:10

@AndrewLangvt AndrewLangvt left a comment

This is some really great work, @sage-wright & @cimendes. Just a few points for us to discuss, but I think they're pretty minor overall. I think this workflow has the potential to become really widely adopted by labs as multi-pathogen panels become more readily used for "bug screening."

tasks/assembly/task_metaspades.wdl (resolved)
tasks/taxon_id/contamination/task_kraken2.wdl (resolved)
tasks/taxon_id/task_krakentools.wdl (outdated, resolved)
workflows/theiameta/wf_theiameta_illumina_pe.wdl (outdated, resolved)
read1 = select_first([krakentools.extracted_read1]),
read2 = select_first([krakentools.extracted_read2])
}
if (fastq_scan_binned.read1_seq > minimum_read_number) {

Does the Kraken read extraction retain only mate pairs, or is there a chance of singletons in the L & R read datasets? I.e., should this be checking whether both read1 and read2 are above the threshold?
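
One way to check this empirically on a test sample is to compare read IDs across the two extracted files. The sketch below is a hypothetical helper, not part of this PR: it assumes plain-text FASTQ with four-line records and optional `/1` and `/2` suffixes on read names, and it counts unique read IDs rather than raw records. `minimum_read_number` mirrors the variable in the snippet above.

```python
from typing import Dict, Set, Union


def read_ids(fastq_text: str) -> Set[str]:
    """Collect read IDs from FASTQ text, assuming four-line records."""
    ids = set()
    for header in fastq_text.splitlines()[0::4]:
        if not header.startswith("@"):
            continue  # skip blank or malformed header lines
        name = header[1:].split()[0]  # drop '@' and any trailing description
        if name.endswith(("/1", "/2")):
            name = name[:-2]  # strip the mate suffix so mates share one ID
        ids.add(name)
    return ids


def pairing_report(
    read1_text: str, read2_text: str, minimum_read_number: int
) -> Dict[str, Union[int, bool]]:
    """Count singletons on each side and test both mates against the threshold."""
    r1, r2 = read_ids(read1_text), read_ids(read2_text)
    return {
        "read1_only": len(r1 - r2),
        "read2_only": len(r2 - r1),
        "both_pass": len(r1) > minimum_read_number and len(r2) > minimum_read_number,
    }
```

If `read1_only` and `read2_only` are always zero on real extractions, checking `read1_seq` alone is equivalent to checking both mates; otherwise the stricter `both_pass` condition would be the safer gate.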

workflows/utilities/wf_morgana_magic.wdl (outdated, resolved)
workflows/utilities/wf_organism_parameters.wdl (outdated, resolved)
@sage-wright sage-wright closed this Nov 7, 2024
@sage-wright sage-wright deleted the smw-theiameta-panel-dev branch November 7, 2024 20:18
@sage-wright sage-wright restored the smw-theiameta-panel-dev branch November 7, 2024 20:26
@fraser-combe fraser-combe reopened this Nov 7, 2024
@sage-wright sage-wright marked this pull request as draft December 6, 2024 17:05

Successfully merging this pull request may close these issues.

Make new workflow compatible with Illumina VSP (Viral Surveillance Panel)
4 participants