Db splitting #154
base: master
Conversation
Dear @lazear, FASTA DB splitting for very large search spaces has recently become quite a challenge for us (and for others, as suggested by the discussions here). We thus thought it would be great if a first workaround were available in Sage. While we are aware that this is not the most elegant or efficient solution, our focus was to minimize the intrusiveness to Sage's code base, at the cost of speed. Nevertheless, we found the workaround to be practical, and we therefore wanted to check in with you on whether you would be willing to consider this PR. Thanks for your time! Best regards,
Hi guys, sorry for the delayed response, and thanks for the contribution! I have this slated for review ASAP.
This is a clever approach. I think the implementation looks solid, and I will run some tests over the next couple days.
I think the only concern I have is whether the low-memory filtering might significantly impact FDR calculations/entrapment under adverse conditions. This seems somewhat similar to what X!Tandem was doing with the dual search approach, and IIRC it violates some TDC constraints.
.collect::<Vec<_>>();
score_vector.sort_by(|a, b| b.0.hyperscore.total_cmp(&a.0.hyperscore));
score_vector
If we are just keeping the top N+1 hits, let's use bounded_min_heapify(&mut hits.preliminary, k). Draining and re-collecting the iterator is still a good idea to trim/reallocate memory.
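For illustration, here is a rough standard-library sketch of the "keep only the top k" idea, avoiding a full sort of every hit. The `Hit` struct and the `retain_top_k` name are hypothetical stand-ins for this example only; this is not Sage's `bounded_min_heapify` helper.

```rust
/// Hypothetical stand-in for the scored hits collected above.
struct Hit {
    hyperscore: f64,
}

/// Keep only the `k` highest-scoring hits, then sort just that small slice.
fn retain_top_k(hits: &mut Vec<Hit>, k: usize) {
    if hits.len() > k {
        // Partition so the k largest hyperscores occupy hits[..k];
        // O(n) on average, versus O(n log n) for sorting everything.
        hits.select_nth_unstable_by(k, |a, b| b.hyperscore.total_cmp(&a.hyperscore));
        // Drop the tail and release excess capacity, mirroring the
        // "drain and re-collect to trim memory" suggestion above.
        hits.truncate(k);
        hits.shrink_to_fit();
    }
    // The retained hits are not yet ordered; sort only the survivors.
    hits.sort_by(|a, b| b.hyperscore.total_cmp(&a.hyperscore));
}
```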
With this PR we allow the database to be split. This is especially helpful for controlling memory in non-specific searches such as immunopeptidomics (i.e. when setting "cleave_at": "") or searches with many PTMs. The core concept is to iterate over FASTA chunks rather than building a database from the whole FASTA at once. For each chunk, only the relevant peptides are retained. After this quick filtering pass, all potential peptide hits are collected into a final, heavily filtered database that is then used for a native Sage search without any further code alterations. Memory appears to be very well controlled with this approach. The (identification) results are almost identical to searches without chunking, but for any specific chunk size there are some minor (generally <1%) differences, since some scores (notably the Poisson score of features) rely on the unfiltered database.
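A minimal sketch of that flow is shown below. All names here are hypothetical, not Sage's real types or functions: the caller supplies the digestion step and the relevance filter, so only one chunk's peptides are in memory at a time.

```rust
/// Sketch of chunked filtering: digest one chunk at a time, keep only the
/// peptides deemed relevant, and return the retained set. That retained set
/// is what would then form the final, heavily filtered database handed to
/// the unmodified Sage search.
fn filter_peptides_by_chunk<C, P>(
    chunks: impl Iterator<Item = C>,
    digest: impl Fn(&C) -> Vec<P>,
    is_relevant: impl Fn(&P) -> bool,
) -> Vec<P> {
    let mut retained = Vec::new();
    for chunk in chunks {
        // Only this chunk's peptides are alive at once, which is what bounds
        // peak memory for non-specific digests or searches with many PTMs.
        retained.extend(digest(&chunk).into_iter().filter(|p| is_relevant(p)));
    }
    retained
}
```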
New parameters (all optional) in the config are:
NOTE: this PR is opened as a draft, as we hope to receive community feedback before merging into master.