-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[TheiaProk] Adds stxtyper to merlin_magic and TheiaProk wfs #525
Conversation
…ESTED locally. it may not work
…prok_illumina_pe wf. Also added code to catch when stxtyper output TSV is one line meaning no hits were found
…uts: String stxtyper_hits and Int stxtyper_num_hits. updated merlin_magic and theiaprok_illumina_PE workflow and tested successfully w miniwdl
…nd Shigella (including sonnei) are run through stxtyper
…t columns; downsize disk_size to 50 gb default instead of 100 (it doesn't need much as it runs on 1 assembly
…ed; preemptible on since this task doesn't run for much more than 2 min; created outputs in case no hits are found; started re-writing parsing code for when hits are detected & finished the bit for complete operons. tested successfully with miniwdl
…eeds work to iron out bugs. want to save progress and push a commit
…d to user; major changes to stxtyper parsing again; added a new optional input & 1 output, removed 1 output; still needs work
…ith new stxtyper outputs; updated theiaprok_illumina_pe with new stxtyper outputs; still need to rework part of parsing for stxtyper but theiaprok ran successfully w miniwdl
…iwdl successfully
other TODOs:
|
Waiting on user feedback prior to making more code changes Plan as of 2024-07-10:
This will allow organisms of any genus/species to be screen for stx genes since they can occur in more genera/species other than E. coli & Shigella |
Also - adjust conditional in merlin_magic code to so that user can "opt-in" to running stxtyper, regardless of the taxa (i.e. gambit_predicted_taxon). That way stxtyper is run automatically on all E. coli and Shigella and user has the ability to run it on other taxa. |
…dded boolean for enabling built-in stxtyper debugging; removed many outputs and added stxtyper_all_hits output; also renamed a few outputs for clarity and consistency; commented out unused portions of code which will be deleted later
…eiaprok ilmn pe wf outputs; adjusted order outputs in stxtyper task for clarity
…la call block. added optional input call_stxtyper so user can run tool regardless of merlin_tag and GAMBIT_predicted_taxon. tested successfully w miniwdl, need to test in Terra
…l to the theiaprok documentation. also fixed a minor comment typo in stxtyper task file
reminder to update the workflow diagram with this new tool and add it into the PR |
…and Shigella spp specific tasks
…nual conflict resolution on theiaprok test ymls
I expect the illumina single end tests to pass, so I'm setting this as ready for review EDIT: they ran successfully |
gah....there are merge conflicts with the test ymls, hopefully my last commit fixes it but I may need another commit to resolve |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code changes look great -- one request: can you not add stxtyper_log
to the Terra table since it's not useful to users? We can still have it as a task output so we can find it in GC but if it's not going to be used, I'm not sure if we want it to be available.
Testing on some Klebsiella here |
good point, yes I'll remove it from workflow outputs |
…seful for debugging and not useful to end user. also removed mention in the docs
launched another test for theiaprok_illumina_pe now that the stxtyper_log has been removed: https://app.terra.bio/#workspaces/theiagen-validations/curtis-sandbox-Aug2024/job_history/af3bef09-b239-43c3-8134-2988a2098232 I probs will have to update CI again, so commits incoming EDIT these test workflows ran successfully and did not output the |
thank you @sage-wright for help on updating this diagram!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the quick turnaround!
⭐
## Please keep this PR as draft for now as stxtyper is still undergoing validation/peer review/publicationUsing this draft PR for tracking stxtyper development as well as amrfinderplus (that runs stxtyper under the hood).NOTE: ignore the amrfinderplus mention, that update is NOT included in this PROur partners are actively using this branch for testing purposes. Let's keep it available for now
I'll try to keep this branch up-to-date withmain
to incorporate other changes & resolve merge conflicts if they arise.This PR closes #443
🗑️ This dev branch should NOT be deleted after merging to main.
2024-09-23 update: I expect to hear feedback from our partner soon, but updating PR message now with tests and info🧠 Aim, Context and Functionality
This PR adds Stxtyper to the TheiaProk workflows. Stxtyper is used to detect and type shiga toxin genes in bacterial genome assemblies. It also attempts to detect novel shiga toxin subtypes in cases where the detected sequences diverge from the reference sequences.
These genes are usually found in E. coli (STEC), but can also be found in Shigella species as well as some other genera more rarely, like Klebsiella. It is developed by NCBI in collaboration with a number of different groups including CDC, FDA, SSI, and others. A publication to fully describe the tool and it's validation is in the works but a software release has been made so the community may test the software further and begin using the tool.
This tool queries genome assemblies for 2 genes or subunits involved in shiga toxin production, stxA and stxB. The A subunit is longer than the B subunit. Stxtyper attempts to detect these, compare them to a database of known sequences, and type them based on amino acid composition. The typing algorithm will be described in the publication when it is published.
More info & source code found here: https://github.com/ncbi/stxtyper
To learn more about shiga toxin subtypes and the description of the latest subtypes, Stx2n, Stx2j, Stx2m, and Stx2o, see this publication (shamless plug): https://www.mdpi.com/2076-2607/11/10/2561
Eventually this tool will be incorporated into AMRFinderPlus and will run behind-the-scenes when the user provides the
amrfinder --organism Escherichia
option, but we wanted the functionality now and the ability to run separate from AMRFinderPlus.🛠️ Impacted Workflows/Tasks & Changes Being Made
This will affect the behavior of the workflow(s) even if users don’t change any workflow inputs relative to the last version : Yes/No
Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing : Yes/No
📋 Workflow/Task Step Changes
🔄 Data Processing
New task. StxTyper is run on assemblies and output TSV is parsed to generate a few string output columns
Docker/software or software versions changed: New
Databases or database versions changed: New
Data processing/commands changed: New
File processing changed: New
Compute resources changed: 1cpu, 4GB RAM. Doesn't need much. Also enabled preemptible since it usually runs in less than 3 minutes
➡️ Inputs
new TheiaProk wf inputs:
gambit_predicted_taxon
output. Useful if you suspect a non-E.coli or non-Shigella sample contains stx genes.us-docker.pkg.dev/general-theiagen/staphb/stxtyper:1.0.24
$TMPDIR
are not removed after running⬅️ Outputs
New TheiaProk wf outputs:
stxtyper_report
TSV file🧪 Testing
Test Dataset
A large Illumina PE dataset of 161 samples known to be pcr-positive for stx were used to test. There are a few with 0 hits and these are likely due to false PCR positives, or where the sequence data was not good enough to assemble the stx gene regions of the genome.
ONT data was from random Shigella samples from SRA, so no expectations to have stx genes in those
Commandline Testing with MiniWDL or Cromwell (optional)
Tested fine locally with miniwdl, not including output/results here in PR message. Lots of Terra testing was done.
Terra Testing
Need to review wf outputs, but should be OK.Suggested Scenarios for Reviewer to Test
Test any of the TheiaProk workflows. E. coli and Shigella spp samples would be ideal, but you can run other bacterial taxon through and set
call_stxtyper
optional input totrue
to test functionality.Here's ~5500+ samples on NCBI pathogen detection browser that contain a hit to
stx
in their genome, and would be good to test with: https://www.ncbi.nlm.nih.gov/pathogens/isolates/#virulence_genotypes:stx* There's even some S. aureus and Salmonella that would be interesting to test withTheiagen Version Release Testing (optional)
theiaprok_illumina_pe
andtheiaprok_illumina_se
in this PHB validation template Terra workspace: https://app.terra.bio/#workspaces/theiagen-validations/PHB_Validation_v2_TEMPLATE/data . I also copied the FASTQs to our public bucket. 2 new sample names are:SRR16579222_Ecoli_stxPos2completeOperons
andSRR28017575_Ecoli_stxPos1completeOperon
🔬 Final Developer Checklist
🎯 Reviewer Checklist
🗂️ Associated Documentation (to be completed by Theiagen developer)