Db splitting #154
base: master
Conversation
Dear @lazear, FASTA DB splitting for very large search spaces has recently become quite a challenge for us (and for others, as suggested by the discussions here). We thus thought it would be great if a first workaround were available in Sage. While we are aware that this is not the most elegant or efficient solution, our focus was to minimize the intrusiveness to Sage's code base, at the cost of speed. Nevertheless, we found the workaround to be practical, and we therefore wanted to check in with you on whether you would be willing to consider this PR. Thanks for your time! Best regards,
Hi guys, sorry for the delayed response, and thanks for the contribution! I have this slated for review ASAP.
This is a clever approach. I think the implementation looks solid, and I will run some tests over the next couple days.
I think the only concern I have is whether the low-memory filtering might significantly impact FDR calculations/entrapment under adverse conditions. This seems somewhat similar to what X!Tandem was doing with the dual search approach, and IIRC it violates some TDC constraints.
.collect::<Vec<_>>();
score_vector.sort_by(|a, b| b.0.hyperscore.total_cmp(&a.0.hyperscore));
score_vector
If we are just keeping the top N+1 hits, let's use bounded_min_heapify(&mut hits.preliminary, k). Draining and re-collecting the iterator is still a good idea to trim/reallocate memory.
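For illustration, here is a rough standard-library sketch of the "keep only the top k" idea, avoiding a full sort of every hit. The `Hit` struct and the `retain_top_k` name are hypothetical stand-ins for this example only; this is not Sage's `bounded_min_heapify` helper.

```rust
/// Hypothetical stand-in for the scored hits collected above.
struct Hit {
    hyperscore: f64,
}

/// Keep only the `k` highest-scoring hits, then sort just that small slice.
fn retain_top_k(hits: &mut Vec<Hit>, k: usize) {
    if hits.len() > k {
        // Partition so the k largest hyperscores occupy hits[..k];
        // O(n) on average, versus O(n log n) for sorting everything.
        hits.select_nth_unstable_by(k, |a, b| b.hyperscore.total_cmp(&a.hyperscore));
        // Drop the tail and release excess capacity, mirroring the
        // "drain and re-collect to trim memory" suggestion above.
        hits.truncate(k);
        hits.shrink_to_fit();
    }
    // The retained hits are not yet ordered; sort only the survivors.
    hits.sort_by(|a, b| b.hyperscore.total_cmp(&a.hyperscore));
}
```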
With this PR we allow the database to be split. This is especially helpful for controlling memory in non-specific searches such as immunopeptidomics (i.e. when setting "cleave_at": "") or searches with many PTMs. The core concept is to iterate over FASTA chunks rather than building a database from the whole FASTA at once. For each chunk, only the relevant peptides are retained. After this quick filtering pass, all potential peptide hits are collected into a final, heavily filtered database that is then used for a native Sage search without any further code alterations. Memory appears to be very well controlled with this approach. The (identification) results are almost identical to searches without chunking, but for any specific chunk size there are some minor (generally <1%) differences, since some scores (notably the Poisson score of features) rely on the unfiltered database.
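A minimal sketch of that flow is shown below. All names here are hypothetical, not Sage's real types or functions: the caller supplies the digestion step and the relevance filter, so only one chunk's peptides are in memory at a time.

```rust
/// Sketch of chunked filtering: digest one chunk at a time, keep only the
/// peptides deemed relevant, and return the retained set. That retained set
/// is what would then form the final, heavily filtered database handed to
/// the unmodified Sage search.
fn filter_peptides_by_chunk<C, P>(
    chunks: impl Iterator<Item = C>,
    digest: impl Fn(&C) -> Vec<P>,
    is_relevant: impl Fn(&P) -> bool,
) -> Vec<P> {
    let mut retained = Vec::new();
    for chunk in chunks {
        // Only this chunk's peptides are alive at once, which is what bounds
        // peak memory for non-specific digests or searches with many PTMs.
        retained.extend(digest(&chunk).into_iter().filter(|p| is_relevant(p)));
    }
    retained
}
```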
New parameters (all optional) in the config are:
NOTE: this PR is opened as a draft, as we hope to receive community feedback before merging into master.