
Db splitting #154

Draft
wants to merge 12 commits into master
Conversation

sander-willems-bruker (Contributor)
With this PR we allow the database to be split. This is especially helpful for controlling memory in non-specific searches such as immunopeptidomics, i.e. when setting "cleave_at": "" or using many PTMs. The core concept is to iterate over FASTA chunks rather than creating a database from the whole FASTA at once. Within each chunk, only relevant peptides are retained. After quickly filtering the database like this, all potential peptide hits are collected into a final, severely filtered database that is used for a native Sage search without any code alterations.
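The loop below is a minimal, self-contained sketch of this idea. All names, types, and the scoring stub are illustrative stand-ins, not the PR's actual code, which reuses Sage's own database and scoring machinery:

    const TOP_HITS_PER_CHUNK: usize = 50; // hard-coded initial-hit cap per PSM per chunk

    #[derive(Clone)]
    struct Hit {
        peptide: String,
        hyperscore: f64,
    }

    // Stand-ins for Sage's digestion/indexing and scoring steps.
    fn build_chunk_db(chunk: &[String]) -> Vec<String> {
        chunk.to_vec()
    }

    fn score_spectrum(db: &[String], _spectrum_id: usize) -> Vec<Hit> {
        db.iter()
            .map(|p| Hit { peptide: p.clone(), hyperscore: 0.0 })
            .collect()
    }

    fn prefilter(fasta: &[String], n_spectra: usize, chunk_size: usize) -> Vec<String> {
        let mut retained = Vec::new();
        // Note: chunks() panics when chunk_size == 0, matching the
        // behavior described for the chunk-size parameter below.
        for chunk in fasta.chunks(chunk_size) {
            // Peak memory now scales with the chunk, not the full search space.
            let db = build_chunk_db(chunk);
            for spectrum_id in 0..n_spectra {
                let mut hits = score_spectrum(&db, spectrum_id);
                // Keep only the best preliminary hits per spectrum per chunk.
                hits.sort_by(|a, b| b.hyperscore.total_cmp(&a.hyperscore));
                hits.truncate(TOP_HITS_PER_CHUNK);
                retained.extend(hits.into_iter().map(|h| h.peptide));
            }
        }
        // The union of survivors becomes the final, severely filtered
        // database that is handed to an unmodified Sage search.
        retained.sort();
        retained.dedup();
        retained
    }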

Memory appears to be very well controlled with this approach. The identification results are almost identical to searches without chunking, but for any specific chunk size there are some minor (generally <1%) differences, as some scores (notably the Poisson score of features) depend on the unfiltered database.

New parameters (all optional) in the config are:

  "database": {
    "prefilter": true,
    "prefilter_chunk_size": 1000,
    "prefilter_low_memory": false,
   ...
  • Prefilter is optional and defaults to false. If false, the other prefilter parameters are ignored.
  • Chunk size (i.e. the number of FASTA sequences to process simultaneously) defaults to 0, which causes a panic when prefilter is set to true. Ideally, setting it to 0 would automatically pick a reasonable chunk size, e.g. chunks containing 1M-5M peptides, which can generally be processed even on a small machine with just 16 GB of RAM. This is not yet included in this PR, though.
  • The low-memory mode (defaults to false) filters even more stringently and keeps only the top N + 1 PSM ranks after filtering, rather than the default (hard-coded) top 50 initial hits per PSM per chunk; a sketch of this rank cut follows below.
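As a sketch of that low-memory rank cut (reusing the illustrative Hit struct from the sketch above; n stands for the configured number of reported PSMs, a name and wiring that are assumptions here, not the PR's exact code):

    // Keep only the top n + 1 hyperscore ranks for a spectrum; the extra
    // rank is presumably retained so next-rank features (e.g. delta scores)
    // can still be computed for the reported PSMs.
    fn low_memory_cut(mut hits: Vec<Hit>, n: usize) -> Vec<Hit> {
        hits.sort_by(|a, b| b.hyperscore.total_cmp(&a.hyperscore));
        hits.truncate(n + 1);
        hits.shrink_to_fit(); // actually release the trimmed capacity
        hits
    }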

NOTE: this PR is submitted as a draft, as we hope to receive community feedback before merging into master.

@grosenberger-bruker

Dear @lazear,

FASTA DB splitting for very large search spaces has recently become quite a challenge for us (and others, as suggested by the discussions here). We thus thought it would be great if a first workaround were available in Sage. While we are aware that this is not the most elegant or efficient solution, our focus was to minimize intrusiveness to Sage's code base, at the cost of speed. Nevertheless, we found the workaround to be practical, and we wanted to check in with you on whether you would be willing to consider this PR. Thanks for your time!

Best regards,
George

@lazear (Owner)

lazear commented Oct 3, 2024

Hi guys,

Sorry for the delayed response, and thanks for the contribution! I have this slated for review ASAP.

@lazear (Owner) left a comment:

This is a clever approach. I think the implementation looks solid, and I will run some tests over the next couple days.

I think the only concern I have is whether the low-memory filtering might significantly impact FDR calculations/entrapment under adverse conditions. This seems somewhat similar to what X!Tandem was doing with the dual search approach, and IIRC it violates some TDC constraints.

.collect::<Vec<_>>();

score_vector.sort_by(|a, b| b.0.hyperscore.total_cmp(&a.0.hyperscore));
score_vector
@lazear (Owner):

If we are just keeping the top N+1 hits, let's use bounded_min_heapify(&mut hits.preliminary, k). Draining and re-collecting the iterator is still a good idea to trim/reallocate memory.
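For illustration, here is a generic size-bounded top-k pass in the spirit of that suggestion; this is a sketch, not Sage's actual bounded_min_heapify. A min-heap capped at k elements keeps the k best scores in O(n log k) time and O(k) memory, instead of fully sorting the candidate vector:

    use std::cmp::{Ordering, Reverse};
    use std::collections::BinaryHeap;

    // f64 is not Ord, so wrap it with the same total order the PR already
    // uses (f64::total_cmp). The derived PartialEq is a sketch-level
    // stand-in; a production wrapper would route it through total_cmp too.
    #[derive(PartialEq)]
    struct Score(f64);
    impl Eq for Score {}
    impl PartialOrd for Score {
        fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
            Some(self.cmp(other))
        }
    }
    impl Ord for Score {
        fn cmp(&self, other: &Self) -> Ordering {
            self.0.total_cmp(&other.0)
        }
    }

    fn top_k(scores: impl IntoIterator<Item = f64>, k: usize) -> Vec<f64> {
        let mut heap: BinaryHeap<Reverse<Score>> = BinaryHeap::with_capacity(k + 1);
        for s in scores {
            heap.push(Reverse(Score(s)));
            if heap.len() > k {
                heap.pop(); // evict the current minimum, keeping the top k
            }
        }
        heap.into_iter().map(|Reverse(Score(s))| s).collect()
    }

Applied to the preliminary hits, this avoids the full descending sort when only the top N + 1 entries survive anyway.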
