Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TheiaCoV] Split database from Kraken2_TheiaCoV task #670

Merged
merged 24 commits into from
Dec 5, 2024

Conversation

cimendes
Copy link
Member

@cimendes cimendes commented Nov 7, 2024

This PR closes #563, closes #520

πŸ—‘οΈ This dev branch should be deleted after merging to main.

🧠 Summary

This PR addresses several issues that came up when the RSV target organism was set differently for RSV-A and RSV-B, i.e. rsv_a_kraken_target_organism = "Respiratory syncytial virus" and rsv_b_kraken_target_organism = "Human orthopneumovirus" due to the use of an old database. As the existing docker image had an embedded database, it meant that each time the Kraken database would need to be updated, it would have to be done inside the docker container, which would not be efficient.

This PR simply updates the existing kraken2_theiacov task to split the database from the container. The new database is custom built with human + refseq viral sequences as of 2024-08-28.

No inputs or outputs have been altered other than what is strictly required to make the new code changes functional.

Behaviour change:

  • the percentage of sc2 is now only reported for TheiaCoV is the target organism is indeed SC2. To account for this the output type has been changed from Float to String.
  • Freyja now has a kraken2_target_organism input that by default is SC2, but is user modifyable

⚑ Impacted Workflows/Tasks

Impacted tasks

  • kraken2_theiacov
  • kraken2_parse_classified

Impacted workflows

  • TheiaCoV
    • Illumina_PE
    • Illumina_SE
    • Clearlabs
    • ONT
  • Freyja_FASTQ
  • NCBI_Scrub
    • SE
    • PE

This PR may lead to different results in pre-existing outputs: Yes

  • Use of an updated database may reflect changes in how taxa are identified.

This PR uses an element that could cause duplicate runs to have different results: No

πŸ› οΈ Changes

βš™οΈ Algorithm

As explained above, the only change in algorithm is that the percentage of sc2 is now only reported if the target organism is SC2, otherwise this column is blank. To accomplish this the data type has been changed from Float to String. Impacts column sorting as it is now alpha-numeric.

➑️ Inputs

freyja_fastq new input:

  • kraken2_target_organism

⬅️ Outputs

Percent sc2 is now a string output, rather than a float output

πŸ§ͺ Testing

Suggested Scenarios for Reviewer to Test

πŸ”¬ Final Developer Checklist

  • The workflow/task has been tested and results, including file contents, are as anticipated
  • The CI/CD has been adjusted and tests are passing (Theiagen developers)
  • Code changes follow the style guide
  • Documentation and/or workflow diagrams have been updated if applicable (Theiagen developers only)

🎯 Reviewer Checklist

  • All changed results have been confirmed
  • You have tested the PR appropriately (see the testing guide for more information)
  • All code adheres to the style guide
  • MD5 sums have been updated
  • The PR author has addressed all comments
  • The documentation has been updated

@sage-wright sage-wright closed this Nov 7, 2024
@sage-wright sage-wright deleted the im-theiacov-kraken2-db-dev branch November 7, 2024 20:18
@fraser-combe fraser-combe restored the im-theiacov-kraken2-db-dev branch November 7, 2024 20:27
@fraser-combe fraser-combe reopened this Nov 7, 2024
@cimendes cimendes marked this pull request as ready for review November 8, 2024 09:35
@cimendes cimendes requested a review from a team as a code owner November 8, 2024 09:35
@sage-wright sage-wright self-assigned this Nov 14, 2024
Copy link
Contributor

@Michal-Babins Michal-Babins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For ncbi scrubber, please pass in workflow level variable and do not hard code on task level.

workflows/theiacov/updates/wf_ncbi_scrub_se.wdl Outdated Show resolved Hide resolved
@Michal-Babins
Copy link
Contributor

Testing illumina PE here for WNV, MPox, Flu, Sars-Cov-2.

@Michal-Babins
Copy link
Contributor

Everything looks good in the test and the output is properly identified by kraken from what we would expect. One of the flu samples returned nothing from kraken, but I ran it against PHBv2.2.1 and it was the same thing so something with that sample is just slightly off.

Run with isolated sample here

@cimendes
Copy link
Member Author

cimendes commented Dec 5, 2024

@Michal-Babins I've exposed the required input parameter, as requested! A little merge conflict has been solved and the CI has been updated.

@cimendes
Copy link
Member Author

cimendes commented Dec 5, 2024

Ops, forgot about the docs update! Now everything is done!

Copy link
Contributor

@Michal-Babins Michal-Babins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specified at the higher level now, looks good to me. Thank you.

@Michal-Babins
Copy link
Contributor

Tested ncbi scrubber PE here
Tested ncbi scrubber SE here

Both tests are passing. Ready to merge. 🌟 ⭐ 🌟

@Michal-Babins Michal-Babins merged commit be1247b into main Dec 5, 2024
28 checks passed
@Michal-Babins Michal-Babins deleted the im-theiacov-kraken2-db-dev branch December 5, 2024 14:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants