-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Notched search #120
base: master
Are you sure you want to change the base?
Notched search #120
Conversation
Are there mzMLs in the wild that have multiple precursors annotated like this? What is the performance impact of this for "normal" searches? I suspect it is non-zero (allocations in For your use case - why not duplicate the entire spectrum and assign unique precursors to each? My hunch is that it be more efficient search-time and FDR wise. |
LMK what you think! If you feel like it is not within the scope of the project I can maintain a fork with the feature for my needs (I try to be good at maintaining my PRs but ultimately you are the maintainer of sage) |
I'm not opposed to adding features to support homebrewed mzMLs, but they should be "zero-cost" with respect to running standard searches - e.g. those features shouldn't impose a non-trivial cost on the other 95% of searches. Fussing over every byte is how you get fast :) |
I changed the implementation and now it uses a run-length encoded approach to store the precursor information (precursors are stored in order, the number of hits per precursor are stored). This currently makes it ~2% slower on open search using my system (+- 100da, human proteome) but closed search more than that (~8 in my tests) ... there might be some additional optimization calculating the precursor masses. |
// Match lengths is pre cumulative sum of the number of hits for each precursor | ||
let mut cum_match_lengths = Vec::with_capacity(query.precursors.len()); | ||
|
||
for precursor in &query.precursors { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an alternate approach to this loop that might be simpler?
I am thinking along the lines of:
- Run preliminary search for each precursor
- Build features for each precursor
- Combine all features across precursor, sorting by hyperscore
- Keep
report_psms
features
I think this should approximate what you have here, but with some reduced maintenance burden. Is there something this approach would substantially miss that is implemented here?
Made this guy a draft for now! I don't think its worth the current overhead (clear in open search, pretty negligible in closed). Might revisit it later if I see a strong reason for it. |
This PR adds support for multiple precursor isolation windows.
MORE ACCURATELY, right now if the spectrum has multiple precursors, sage uses the first one for the search and disregards the rest. With this PR, it will score candidates in all of the annotated precursors.
In my experiments (as expected) it has no effect on the results for files that have a single isolation window.
The main use I have in mind for this feature is for pseudo-generated spectra from DIA, where the assigned precursor might be ambiguous. In that case I could just annotate it as having both precursors and let them fight it out within the search engine.
TODO: add testing to make sure all of them are used + make sure it does not screw up open/wide window search, there might be the need to simplify the overlapping ranges.