Notched search #120

jspaezp · 2024-02-07T17:43:08Z

This PR adds support for multiple precursor isolation windows.

More precisely: right now, if a spectrum has multiple precursors, sage uses the first one for the search and disregards the rest. With this PR, it scores candidates against all of the annotated precursors.

In my experiments (as expected) it has no effect on the results for files that have a single isolation window.

The main use I have in mind for this feature is pseudo-spectra generated from DIA data, where the assigned precursor might be ambiguous. In that case I could just annotate the spectrum as having both precursors and let them fight it out within the search engine.

TODO: add tests to make sure all of the precursors are used, and make sure this does not screw up open/wide-window search; the overlapping ranges might need to be simplified.
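
One way the overlapping ranges from the TODO could be simplified is to merge the per-precursor mass windows before looking up candidates, so that a candidate falling in two windows is only scored once. The following is a minimal sketch under that assumption; `merge_windows` and the `(low, high)` tuple representation are hypothetical and are not taken from the sage codebase.

```rust
/// Hypothetical helper: merge overlapping (or touching) mass windows.
/// Each window is a `(low, high)` pair in Da, with `low <= high`.
pub fn merge_windows(mut windows: Vec<(f32, f32)>) -> Vec<(f32, f32)> {
    // Sort by lower bound; `total_cmp` gives a total order over f32.
    windows.sort_by(|a, b| a.0.total_cmp(&b.0));

    let mut merged: Vec<(f32, f32)> = Vec::with_capacity(windows.len());
    for (lo, hi) in windows {
        if let Some(last) = merged.last_mut() {
            if lo <= last.1 {
                // Overlaps the previous window: extend it instead of adding a new one.
                last.1 = last.1.max(hi);
                continue;
            }
        }
        merged.push((lo, hi));
    }
    merged
}
```

In a closed search the windows around distinct precursors rarely overlap, so the output usually equals the sorted input. In an open search with wide tolerances, neighbouring windows collapse into a single range.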

lazear · 2024-02-07T18:02:28Z

Are there mzMLs in the wild that have multiple precursors annotated like this?

What is the performance impact of this for "normal" searches? I suspect it is non-zero (allocations in initial_hits, doubling of PreScore memory requirements), but it should be benchmarked.

For your use case - why not duplicate the entire spectrum and assign a unique precursor to each copy? My hunch is that it would be more efficient, both in search time and FDR-wise.
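
The duplication idea can be sketched as a preprocessing step that runs before the search. The `Spectrum` and `Precursor` types below are simplified, hypothetical stand-ins (not sage's actual structs), and the `_precursorN` id suffix is an arbitrary naming choice for illustration.

```rust
/// Hypothetical, simplified precursor annotation.
#[derive(Clone, Debug, PartialEq)]
pub struct Precursor {
    pub mz: f32,
    pub charge: Option<u8>,
}

/// Hypothetical, simplified spectrum with any number of annotated precursors.
#[derive(Clone, Debug, PartialEq)]
pub struct Spectrum {
    pub id: String,
    pub precursors: Vec<Precursor>,
    pub mz: Vec<f32>,
    pub intensity: Vec<f32>,
}

/// Emit one copy of the spectrum per annotated precursor.
/// Every copy carries the same peaks but exactly one precursor.
pub fn split_by_precursor(spectrum: &Spectrum) -> Vec<Spectrum> {
    spectrum
        .precursors
        .iter()
        .enumerate()
        .map(|(i, precursor)| Spectrum {
            // Suffix the id so the copies stay distinguishable downstream.
            id: format!("{}_precursor{}", spectrum.id, i),
            precursors: vec![precursor.clone()],
            mz: spectrum.mz.clone(),
            intensity: spectrum.intensity.clone(),
        })
        .collect()
}
```

The trade-off is that the peak lists are cloned once per precursor, and each copy is then scored and reported as an independent spectrum.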

jspaezp · 2024-02-07T19:28:41Z

I am not sure! I know that in some MS3-TMT experiments the MS3 precursor sometimes comes from multiple MS2 peaks, and therefore the MS3 will have multiple precursors annotated (but in those cases the MS3 is only used for the reporter ions and not for search). I can imagine some hacky methods that might make use of it, but definitely in the experimental realm. Having said that, based on some of the issues I see in the repo, a non-negligible number of people are using the project with self-packed mzML and MGF files, so I can imagine that someone else might make use of this feature if present.
On my system, using a human proteome, closed search, and a random .d file, the performance impact is pretty negligible. I believe it would increase the allocations of initial_hits and PreScore, BUT only by 1 per thread, since they are aggregated per spectrum.

/usr/bin/time -lh ./target/release/sage --write-pin sageconfig.json
# Master
        21.50s real             1m20.74s user           7.95s sys
          3602317312  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
             1210225  page reclaims
                  33  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                  14  messages sent
                  18  messages received
                   0  signals received
               16047  voluntary context switches
              456758  involuntary context switches
        412891948273  instructions retired
        257530296274  cycles elapsed
          4162711744  peak memory footprint
          
# feature/notched_search
        21.64s real             1m21.62s user           7.93s sys
          3588833280  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
             1200829  page reclaims
                  20  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                  14  messages sent
                  19  messages received
                   0  signals received
               16104  voluntary context switches
              469975  involuntary context switches
        413070797351  instructions retired
        259062984671  cycles elapsed
          4160171904  peak memory footprint

As for duplicating the spectrum: I believe it would lead to some FDR issues that I am not totally satisfied with, since the Poisson distribution and the number of scored candidates would represent the candidates scored for that notch and not the total number for that spectrum. (I have not done an entrapment experiment to make sure it has an undesired effect, but 'it feels right'.)

LMK what you think! If you feel like it is not within the scope of the project I can maintain a fork with the feature for my needs (I try to be good at maintaining my PRs but ultimately you are the maintainer of sage)


lazear · 2024-02-07T19:56:14Z

I'm not opposed to adding features to support homebrewed mzMLs, but they should be "zero-cost" with respect to running standard searches - e.g. those features shouldn't impose a non-trivial cost on the other 95% of searches. Fussing over every byte is how you get fast :)


jspaezp · 2024-03-30T18:23:55Z

I changed the implementation, and it now uses a run-length-encoded approach to store the precursor information (precursors are stored in order, and the number of hits per precursor is stored). This currently makes it ~2% slower on open search on my system (±100 Da, human proteome), but closed search slows down more than that (~8% in my tests) ... there might be some additional optimization available in calculating the precursor masses.
I guess another alternative is to have an option to disable the feature, which should actually be 0-cost (the binary might be a bit larger, if I "understand" what the compiler will do).
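
To illustrate the run-length idea described above, here is a minimal sketch, not the code in this PR: hits from all precursors are stored back to back in one vector, and a cumulative count per precursor is enough to recover which precursor produced a given hit. The `NotchedHits` type and its methods are hypothetical names, and the `u32` hit payload is an assumption.

```rust
/// Hypothetical container for hits coming from several precursors.
/// `hits` holds all hits back to back, grouped by precursor in order;
/// `cum_lengths[i]` is the number of hits belonging to precursors `0..=i`.
#[derive(Debug, Default)]
pub struct NotchedHits {
    pub hits: Vec<u32>,
    pub cum_lengths: Vec<usize>,
}

impl NotchedHits {
    /// Append the hits found for the next precursor (may be empty).
    pub fn push_precursor(&mut self, hits: &[u32]) {
        self.hits.extend_from_slice(hits);
        self.cum_lengths.push(self.hits.len());
    }

    /// Index of the precursor that produced `hits[hit_idx]`,
    /// or `None` if `hit_idx` is out of range.
    pub fn precursor_of(&self, hit_idx: usize) -> Option<usize> {
        if hit_idx >= self.hits.len() {
            return None;
        }
        // Binary search for the first precursor whose cumulative count exceeds hit_idx.
        Some(self.cum_lengths.partition_point(|&end| end <= hit_idx))
    }
}
```

Compared with tagging every hit with its precursor, this stores one extra `usize` per precursor rather than per hit, at the cost of a binary search on lookup.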

lazear · 2024-04-10T22:38:14Z

crates/sage/src/scoring.rs

+        // Match lengths is pre cumulative sum of the number of hits for each precursor
+        let mut cum_match_lengths = Vec::with_capacity(query.precursors.len());
+
+        for precursor in &query.precursors {


Is there an alternate approach to this loop that might be simpler?

I am thinking along the lines of:

1. Run a preliminary search for each precursor
2. Build features for each precursor
3. Combine all features across precursors, sorting by hyperscore
4. Keep report_psms features

I think this should approximate what you have here, but with some reduced maintenance burden. Is there something this approach would substantially miss that is implemented here?
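
The last two steps of that outline (combine, sort by hyperscore, keep `report_psms`) can be sketched as follows. This is an illustration only: the `Feature` struct here is a cut-down, hypothetical stand-in with just the fields needed for the example, and `combine_features` is not an existing sage function.

```rust
/// Hypothetical, cut-down PSM feature record.
#[derive(Clone, Debug, PartialEq)]
pub struct Feature {
    pub peptide_idx: u32,
    pub precursor_idx: usize,
    pub hyperscore: f64,
}

/// `per_precursor[i]` holds the features already built for precursor `i`.
/// Returns the `report_psms` best features across all precursors.
pub fn combine_features(per_precursor: Vec<Vec<Feature>>, report_psms: usize) -> Vec<Feature> {
    // Flatten the per-precursor lists into one pool.
    let mut all: Vec<Feature> = per_precursor.into_iter().flatten().collect();
    // Best hyperscore first; `total_cmp` gives a total order over f64.
    all.sort_by(|a, b| b.hyperscore.total_cmp(&a.hyperscore));
    // Keep only the top `report_psms` entries.
    all.truncate(report_psms);
    all
}
```

Because each precursor is searched and featurized independently, the per-precursor code path stays the same as for a single-precursor spectrum, and only this final merge is new.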

jspaezp · 2024-09-08T06:06:53Z

Made this guy a draft for now! I don't think it's worth the current overhead (clear in open search, pretty negligible in closed). Might revisit it later if I see a strong reason for it.

first branch commit

840efab

lazear reviewed Feb 7, 2024


jspaezp and others added 10 commits March 30, 2024 01:23

fix

c7a94fd

added test

86c6061

fix: add missing parquet columns (lazear#118)

f35f0f0

chore: update deps

5ff21de

fix: mgf paths were being lowercased (lazear#124)

82e8dbf

chore: release v0.14.7

79fabdf

partial implementation of pr review

5a4120b

changed precursor tracing

2a32f58

added multi-precursor read test

f7c01d6

Merge branch 'lazear:master' into feature/notched_search

fb549cf

reorganized notched initial hits function

1da2f6b

jspaezp marked this pull request as ready for review March 30, 2024 18:15

jspaezp marked this pull request as draft March 31, 2024 03:49

jspaezp added 3 commits March 30, 2024 21:25

(experimental) lazy sort

040e6c1

(experimental) changed operation orders

3478b3d

(experimental) expanded map to loop+zip

6fbbcf4

jspaezp marked this pull request as ready for review April 2, 2024 07:11

jspaezp changed the title ~~[WIP] Notched search~~ Notched search Apr 4, 2024

lazear reviewed Apr 10, 2024


lazear mentioned this pull request May 24, 2024

Random questions regarding splitting mgfs, specifying charge states, multiple possible precursors.... #138

Closed

jspaezp marked this pull request as draft September 8, 2024 06:05
