Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Notched search #120

Draft
wants to merge 15 commits into
base: master
Choose a base branch
from
Draft

Notched search #120

wants to merge 15 commits into from

Conversation

jspaezp
Copy link
Contributor

@jspaezp jspaezp commented Feb 7, 2024

This PR adds support for multiple precursor isolation windows.

MORE ACCURATELY, right now if the spectrum has multiple precursors, sage uses the first one for the search and disregards the rest. With this PR, it will score candidates in all of the annotated precursors.

In my experiments (as expected) it has no effect on the results for files that have a single isolation window.

The main use I have in mind for this feature is for pseudo-generated spectra from DIA, where the assigned precursor might be ambiguous. In that case I could just annotate it as having both precursors and let them fight it out within the search engine.

TODO: add testing to make sure all of them are used + make sure it does not screw up open/wide window search, there might be the need to simplify the overlapping ranges.

@lazear
Copy link
Owner

lazear commented Feb 7, 2024

Are there mzMLs in the wild that have multiple precursors annotated like this?

What is the performance impact of this for "normal" searches? I suspect it is non-zero (allocations in initial_hits, doubling of PreScore memory requirements), but it should be benchmarked.

For your use case - why not duplicate the entire spectrum and assign unique precursors to each? My hunch is that it be more efficient search-time and FDR wise.

@jspaezp
Copy link
Contributor Author

jspaezp commented Feb 7, 2024

  1. I am not sure! I know that for some MS3-TMT experiments the precursor is sometimes from multiple MS2 peaks and therefore the MS3 will have multiple precursors annotated (but in those cases, the MS3 is only used for the reporter ions and not for search). I can imagine some hacky methods that might make use of it but definitely in the experimental realm. Having said that, based on some of the issues I see in the repo, a non-negligible amount of people are using the project with self-packed mzml and mgf files, so I can imagine that someone else might make use of this feature if present.

  2. On my system using a human proteome, closed search and a random .d file it is pretty negligible. I believe it would increase the allocations of initial_hits and prescore BUT it would increase it by 1 per thread, since they are aggregated per spectrum.

/usr/bin/time -lh ./target/release/sage --write-pin sageconfig.json
# Master
        21.50s real             1m20.74s user           7.95s sys
          3602317312  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
             1210225  page reclaims
                  33  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                  14  messages sent
                  18  messages received
                   0  signals received
               16047  voluntary context switches
              456758  involuntary context switches
        412891948273  instructions retired
        257530296274  cycles elapsed
          4162711744  peak memory footprint
          
# feature/notched_search
        21.64s real             1m21.62s user           7.93s sys
          3588833280  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
             1200829  page reclaims
                  20  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                  14  messages sent
                  19  messages received
                   0  signals received
               16104  voluntary context switches
              469975  involuntary context switches
        413070797351  instructions retired
        259062984671  cycles elapsed
          4160171904  peak memory footprint
  1. I believe it would lead to some FDR issues I am not totally satisfied with ... since the poisson distribution and the number of scored candidates would represent the candidates scored for that notch and not the total number for that spectrum. (I have not done an entrapment to make sure it has an undesired effect, but 'it feels right').

LMK what you think! If you feel like it is not within the scope of the project I can maintain a fork with the feature for my needs (I try to be good at maintaining my PRs but ultimately you are the maintainer of sage)

crates/sage/src/scoring.rs Outdated Show resolved Hide resolved
crates/sage/src/scoring.rs Outdated Show resolved Hide resolved
crates/sage/src/scoring.rs Outdated Show resolved Hide resolved
@lazear
Copy link
Owner

lazear commented Feb 7, 2024

I'm not opposed to adding features to support homebrewed mzMLs, but they should be "zero-cost" with respect to running standard searches - e.g. those features shouldn't impose a non-trivial cost on the other 95% of searches. Fussing over every byte is how you get fast :)

crates/sage/src/scoring.rs Outdated Show resolved Hide resolved
@jspaezp jspaezp marked this pull request as ready for review March 30, 2024 18:15
@jspaezp
Copy link
Contributor Author

jspaezp commented Mar 30, 2024

I changed the implementation and now it uses a run-length encoded approach to store the precursor information (precursors are stored in order, the number of hits per precursor are stored). This currently makes it ~2% slower on open search using my system (+- 100da, human proteome) but closed search more than that (~8 in my tests) ... there might be some additional optimization calculating the precursor masses.
I guess another alternative is to have an option to disable the feature, which should be actually 0-cost (binary might be a bit larger, if I "understand" what the compiler will do).

@jspaezp jspaezp marked this pull request as draft March 31, 2024 03:49
@jspaezp jspaezp marked this pull request as ready for review April 2, 2024 07:11
@jspaezp jspaezp changed the title [WIP] Notched search Notched search Apr 4, 2024
// Match lengths is pre cumulative sum of the number of hits for each precursor
let mut cum_match_lengths = Vec::with_capacity(query.precursors.len());

for precursor in &query.precursors {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an alternate approach to this loop that might be simpler?

I am thinking along the lines of:

  • Run preliminary search for each precursor
  • Build features for each precursor
  • Combine all features across precursor, sorting by hyperscore
  • Keep report_psms features

I think this should approximate what you have here, but with some reduced maintenance burden. Is there something this approach would substantially miss that is implemented here?

@jspaezp
Copy link
Contributor Author

jspaezp commented Sep 8, 2024

Made this guy a draft for now! I don't think its worth the current overhead (clear in open search, pretty negligible in closed). Might revisit it later if I see a strong reason for it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants