Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TheiaProk] Adds stxtyper to merlin_magic and TheiaProk wfs #525

Merged
merged 44 commits into from
Oct 24, 2024

Conversation

kapsakcj
Copy link
Contributor

@kapsakcj kapsakcj commented Jun 28, 2024

## Please keep this PR as draft for now as stxtyper is still undergoing validation/peer review/publication

Using this draft PR for tracking stxtyper development as well as amrfinderplus (that runs stxtyper under the hood). NOTE: ignore the amrfinderplus mention, that update is NOT included in this PR

Our partners are actively using this branch for testing purposes. Let's keep it available for now

I'll try to keep this branch up-to-date with main to incorporate other changes & resolve merge conflicts if they arise.

This PR closes #443

🗑️ This dev branch should NOT be deleted after merging to main.

2024-09-23 update: I expect to hear feedback from our partner soon, but updating PR message now with tests and info

🧠 Aim, Context and Functionality

This PR adds Stxtyper to the TheiaProk workflows. Stxtyper is used to detect and type shiga toxin genes in bacterial genome assemblies. It also attempts to detect novel shiga toxin subtypes in cases where the detected sequences diverge from the reference sequences.

These genes are usually found in E. coli (STEC), but can also be found in Shigella species as well as some other genera more rarely, like Klebsiella. It is developed by NCBI in collaboration with a number of different groups including CDC, FDA, SSI, and others. A publication to fully describe the tool and it's validation is in the works but a software release has been made so the community may test the software further and begin using the tool.

This tool queries genome assemblies for 2 genes or subunits involved in shiga toxin production, stxA and stxB. The A subunit is longer than the B subunit. Stxtyper attempts to detect these, compare them to a database of known sequences, and type them based on amino acid composition. The typing algorithm will be described in the publication when it is published.

More info & source code found here: https://github.com/ncbi/stxtyper

To learn more about shiga toxin subtypes and the description of the latest subtypes, Stx2n, Stx2j, Stx2m, and Stx2o, see this publication (shamless plug): https://www.mdpi.com/2076-2607/11/10/2561

Eventually this tool will be incorporated into AMRFinderPlus and will run behind-the-scenes when the user provides the amrfinder --organism Escherichia option, but we wanted the functionality now and the ability to run separate from AMRFinderPlus.

🛠️ Impacted Workflows/Tasks & Changes Being Made

This will affect the behavior of the workflow(s) even if users don’t change any workflow inputs relative to the last version : Yes/No

Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing : Yes/No

📋 Workflow/Task Step Changes

🔄 Data Processing

New task. StxTyper is run on assemblies and output TSV is parsed to generate a few string output columns

Docker/software or software versions changed: New

Databases or database versions changed: New

Data processing/commands changed: New

File processing changed: New

Compute resources changed: 1cpu, 4GB RAM. Doesn't need much. Also enabled preemptible since it usually runs in less than 3 minutes

➡️ Inputs

new TheiaProk wf inputs:

Terra Task name Variable Type Description Default value Terra Status Workflow
merlin_magic call_shigeifinder_reads_input Boolean If set to "true", the ShigEiFinder task will run again but using read files as input instead of the assembly file. Input is shown but not used for TheiaProk_FASTA. FALSE Optional FASTA, ONT, PE, SE
merlin_magic call_stxtyper Boolean If set to "true", the StxTyper task will run on all samples regardless of the gambit_predicted_taxon output. Useful if you suspect a non-E.coli or non-Shigella sample contains stx genes. FALSE Optional FASTA, ONT, PE, SE
merlin_magic stxtyper_cpu Int The number of CPU cores to allocate for the task. 1 Optional FASTA, ONT, PE, SE
merlin_magic stxtyper_disk_size Int Amount of storage (in GB) to allocate to the task 50 Optional FASTA, ONT, PE, SE
merlin_magic stxtyper_docker_image String The Docker container to use for the task us-docker.pkg.dev/general-theiagen/staphb/stxtyper:1.0.24 Optional FASTA, ONT, PE, SE
merlin_magic stxtyper_enable_debug Boolean When enabled, additional messages are printed and files in $TMPDIR are not removed after running FALSE Optional FASTA, ONT, PE, SE
merlin_magic stxtyper_memory Int Amount of memory (in GB) to allocate to the task 4 Optional FASTA, ONT, PE, SE

⬅️ Outputs

New TheiaProk wf outputs:

Variable Type Description Workflow
stxtyper_all_hits String Comma-separated list of matches of all types. Includes complete, partial, frameshift, internal stop, and novel hits. List is de-duplicated so multiple identical hits are only listed once. For example if 5 partial stx2 hits are detected in the genome, only 1 "stx2" will be listed in this field. To view the potential subtype for each partial hit, the user will need to view the stxtyper_report TSV file. FASTA, ONT, PE, SE
stxtyper_complete_operons String Comma-separated list of all COMPLETE operons detected by StxTyper. Show multiple hits if present in results. FASTA, ONT, PE, SE
stxtyper_docker String Name of docker image used by the stxtyper task. FASTA, ONT, PE, SE
stxtyper_novel_hits String Comma-separated list of matches that have the OPERON output of "COMPLETE_NOVEL". Possible outputs "stx1", "stx2", or "stx1,stx2" FASTA, ONT, PE, SE
stxtyper_num_hits Int Number of "hits" or rows present in the stxtyper_report TSV file FASTA, ONT, PE, SE
stxtyper_partial_hits String Possible outputs "stx1", "stx2", or "stx1,stx2". Tells the user that there was a partial hit to either the A or B subunit, but does not describe which subunit, only the possible types from the PARTIAL matches. FASTA, ONT, PE, SE
stxtyper_report File Raw results TSV file produced by StxTyper FASTA, ONT, PE, SE
stxtyper_stx_frameshifts_or_internal_stop_hits String Comma-separated list of matches that have the OPERON output of "FRAMESHIFT" or "INTERNAL_STOP". Possible outputs "stx1", "stx2", or "stx1,stx2" FASTA, ONT, PE, SE
stxtyper_version String Version of StxTyper used FASTA, ONT, PE, SE

🧪 Testing

Test Dataset

A large Illumina PE dataset of 161 samples known to be pcr-positive for stx were used to test. There are a few with 0 hits and these are likely due to false PCR positives, or where the sequence data was not good enough to assemble the stx gene regions of the genome.

ONT data was from random Shigella samples from SRA, so no expectations to have stx genes in those

Commandline Testing with MiniWDL or Cromwell (optional)

Tested fine locally with miniwdl, not including output/results here in PR message. Lots of Terra testing was done.

Terra Testing

Suggested Scenarios for Reviewer to Test

Test any of the TheiaProk workflows. E. coli and Shigella spp samples would be ideal, but you can run other bacterial taxon through and set call_stxtyper optional input to true to test functionality.

Here's ~5500+ samples on NCBI pathogen detection browser that contain a hit to stx in their genome, and would be good to test with: https://www.ncbi.nlm.nih.gov/pathogens/isolates/#virulence_genotypes:stx* There's even some S. aureus and Salmonella that would be interesting to test with

Theiagen Version Release Testing (optional)

  • Will changes require functional or validation testing (checking outputs etc) during the release? Yes, it would be good to add a couple STECs to our validation testing dataset. These data are public, so we could add a few to the theiaprok illumina pe validation dataset
  • Do new samples need to be added to validation datasets? If so, upload these to the appropriate validation workspace Google bucket (). Please describe the new samples here and why these have been chosen.
  • Are there any output files that should be checked after running the version release testing?

🔬 Final Developer Checklist

  • The workflow/task has been tested locally and results, including file contents, are as anticipated
  • The workflow/task has been tested on Terra and results, including file contents, are as anticipated
  • The CI/CD has been adjusted and tests are passing (to be completed by Theiagen developer)
  • Code changes follow the style guide

🎯 Reviewer Checklist

  • All impacted workflows/tasks have been tested on Terra with a different dataset than used for development
  • All reviewer-suggested scenarios have been tested and any additional
  • All changed results have been confirmed to be accurate
  • All workflows/tasks impacted by change/s have been tested using a standard validation dataset to ensure no unintended change of functionality
  • All code adheres to the style guide
  • MD5 sums have been updated
  • The PR author has addressed all comments

🗂️ Associated Documentation (to be completed by Theiagen developer)

  • Relevant documentation on the Public Health Resources "PHB Main" has been updated
  • Workflow diagrams have been updated to reflect changes

kapsakcj added 20 commits March 27, 2024 16:53
…prok_illumina_pe wf. Also added code to catch when stxtyper output TSV is one line meaning no hits were found
…uts: String stxtyper_hits and Int stxtyper_num_hits. updated merlin_magic and theiaprok_illumina_PE workflow and tested successfully w miniwdl
…nd Shigella (including sonnei) are run through stxtyper
…t columns; downsize disk_size to 50 gb default instead of 100 (it doesn't need much as it runs on 1 assembly
…ed; preemptible on since this task doesn't run for much more than 2 min; created outputs in case no hits are found; started re-writing parsing code for when hits are detected & finished the bit for complete operons. tested successfully with miniwdl
…eeds work to iron out bugs. want to save progress and push a commit
…d to user; major changes to stxtyper parsing again; added a new optional input & 1 output, removed 1 output; still needs work
…ith new stxtyper outputs; updated theiaprok_illumina_pe with new stxtyper outputs; still need to rework part of parsing for stxtyper but theiaprok ran successfully w miniwdl
@kapsakcj
Copy link
Contributor Author

kapsakcj commented Jul 1, 2024

other TODOs:

  • first - update outputs to theiaprok ILMN PE wf
  • second - update outputs to theiaprok FASTA wf
  • add new outputs to theiaprok ILMN SE wf
  • add new outputs to theiaprok ONT wf
  • LASTLY - fix CI

@kapsakcj
Copy link
Contributor Author

kapsakcj commented Jul 10, 2024

Waiting on user feedback prior to making more code changes

Plan as of 2024-07-10:

  • move the call block for stxtyper to the main TheiaProk workflow WDLs, instead of calling via merlin_magic sub workflow Nevermind, planning to keep in merlin_magic subwf
  • Keep stxtyper off by default, but user's can enable via boolean input call_stxtyper.

This will allow organisms of any genus/species to be screen for stx genes since they can occur in more genera/species other than E. coli & Shigella

@kapsakcj
Copy link
Contributor Author

Also - adjust conditional in merlin_magic code to so that user can "opt-in" to running stxtyper, regardless of the taxa (i.e. gambit_predicted_taxon).

That way stxtyper is run automatically on all E. coli and Shigella and user has the ability to run it on other taxa.

…dded boolean for enabling built-in stxtyper debugging; removed many outputs and added stxtyper_all_hits output; also renamed a few outputs for clarity and consistency; commented out unused portions of code which will be deleted later
…eiaprok ilmn pe wf outputs; adjusted order outputs in stxtyper task for clarity
…la call block. added optional input call_stxtyper so user can run tool regardless of merlin_tag and GAMBIT_predicted_taxon. tested successfully w miniwdl, need to test in Terra
@kapsakcj kapsakcj changed the title [experimental] Adds stxtyper to merlin_magic and TheiaProk wfs [TheiaProk] Adds stxtyper to merlin_magic and TheiaProk wfs Sep 13, 2024
…l to the theiaprok documentation. also fixed a minor comment typo in stxtyper task file
@kapsakcj
Copy link
Contributor Author

reminder to update the workflow diagram with this new tool and add it into the PR

@kapsakcj
Copy link
Contributor Author

kapsakcj commented Oct 21, 2024

I expect the illumina single end tests to pass, so I'm setting this as ready for review

EDIT: they ran successfully

@kapsakcj kapsakcj marked this pull request as ready for review October 21, 2024 15:36
@kapsakcj kapsakcj requested a review from a team as a code owner October 21, 2024 15:36
@kapsakcj
Copy link
Contributor Author

gah....there are merge conflicts with the test ymls, hopefully my last commit fixes it but I may need another commit to resolve

Copy link
Member

@sage-wright sage-wright left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code changes look great -- one request: can you not add stxtyper_log to the Terra table since it's not useful to users? We can still have it as a task output so we can find it in GC but if it's not going to be used, I'm not sure if we want it to be available.

@sage-wright
Copy link
Member

sage-wright commented Oct 23, 2024

Testing on some Klebsiella here

@kapsakcj
Copy link
Contributor Author

Code changes look great -- one request: can you not add stxtyper_log to the Terra table since it's not useful to users? We can still have it as a task output so we can find it in GC but if it's not going to be used, I'm not sure if we want it to be available.

good point, yes I'll remove it from workflow outputs

…seful for debugging and not useful to end user. also removed mention in the docs
@kapsakcj
Copy link
Contributor Author

kapsakcj commented Oct 23, 2024

launched another test for theiaprok_illumina_pe now that the stxtyper_log has been removed: https://app.terra.bio/#workspaces/theiagen-validations/curtis-sandbox-Aug2024/job_history/af3bef09-b239-43c3-8134-2988a2098232

I probs will have to update CI again, so commits incoming

EDIT these test workflows ran successfully and did not output the stxtyper_log file to terra data table (although it is saved in execution directory in case we ever need it)

Copy link
Member

@sage-wright sage-wright left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick turnaround!

@sage-wright sage-wright merged commit 3e8c447 into main Oct 24, 2024
12 checks passed
@sage-wright sage-wright deleted the cjk-stxtyper branch November 7, 2024 20:18
@kapsakcj kapsakcj restored the cjk-stxtyper branch November 7, 2024 20:34
@sage-wright sage-wright deleted the cjk-stxtyper branch January 13, 2025 14:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[experimental] Add StxTyper for Shiga toxin subtyping
2 participants