Initial implementation of manual ngram-based search in MongoDB #993

ml-evs · 2024-11-06T19:07:34Z

Closes #679. MongoDB text indexes tokenize on whitespace and punctuation only, so a substring of a token does not match. This PR investigates whether we can build a manual ngram index so that, when searching for the refcode ABCDEF, you get results even if you only ask for ABC, BCD, etc.

This is done by creating a separate collection, `item_fts`, that is used only for this kind of search. It stores the `immutable_id`, `type` and `ngrams` of every item, with an index over `ngrams`. Lookup then splits the query string into ngrams, matches them against the stored `ngrams` arrays, and orders the results by the number of matching ngrams.
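As a rough illustration of the approach described above (not the code in this PR), the following sketch builds `item_fts`-style documents in memory and ranks items by the number of shared ngrams. The function names (`ngrams`, `build_fts_docs`, `search`), the choice of N=3, lowercasing, and the use of `refcode` as the only indexed field are all assumptions made for the example; a plain list stands in for the MongoDB collection.

```python
from collections import Counter


def ngrams(text: str, n: int = 3) -> list[str]:
    """Return the distinct lowercase character n-grams of `text`, in order of first appearance."""
    text = text.lower()
    return list(dict.fromkeys(text[i : i + n] for i in range(len(text) - n + 1)))


def build_fts_docs(items: list[dict], n: int = 3) -> list[dict]:
    """Build one search document per item, mirroring the fields stored in `item_fts`."""
    return [
        {
            "immutable_id": item["immutable_id"],
            "type": item["type"],
            # Assumption: only the refcode is ngrammified in this sketch.
            "ngrams": ngrams(item["refcode"], n),
        }
        for item in items
    ]


def search(fts_docs: list[dict], query: str, n: int = 3) -> list[str]:
    """Return the immutable_ids of matching items, ordered by number of shared n-grams."""
    query_ngrams = set(ngrams(query, n))
    scores: Counter = Counter()
    for doc in fts_docs:
        matches = len(query_ngrams.intersection(doc["ngrams"]))
        if matches:
            scores[doc["immutable_id"]] = matches
    return [immutable_id for immutable_id, _ in scores.most_common()]


items = [
    {"immutable_id": "1", "type": "samples", "refcode": "ABCDEF"},
    {"immutable_id": "2", "type": "samples", "refcode": "ABCXYZ"},
    {"immutable_id": "3", "type": "cells", "refcode": "QRSTUV"},
]
fts_docs = build_fts_docs(items)
results = search(fts_docs, "ABCD")
```

Note that a query shorter than N produces no ngrams at all and therefore no results, which is one reason the choice of N matters.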

Will have to fiddle around to see:

- what the optimal value of N is, and whether we need to do all N+1-grams up to a fixed range
- how expensive this is for realistic deployment sizes
- whether it might be better to try an edit-distance based approach for some fields
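To make the open questions above more concrete, here is a small sketch of two of the alternatives: generating ngrams over a range of sizes (which shows how the number of stored ngrams grows per string), and a plain Levenshtein edit distance. The names `ngram_range` and `edit_distance` and the default sizes are hypothetical and chosen only for this example; neither is taken from the PR.

```python
def ngram_range(text: str, n_min: int = 3, n_max: int = 4) -> list[str]:
    """Return the distinct lowercase n-grams of `text` for every size from n_min to n_max."""
    text = text.lower()
    grams: list[str] = []
    for n in range(n_min, n_max + 1):
        grams.extend(text[i : i + n] for i in range(len(text) - n + 1))
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(grams))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between `a` and `b`, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + (char_a != char_b),  # substitution
                )
            )
        previous = current
    return previous[-1]


stored = ngram_range("ABCDEF", 3, 4)
distance = edit_distance("ABCDEF", "ABCDEG")
```

For a six-character string, storing both 3-grams and 4-grams takes seven entries rather than four, so each extra size adds noticeably to the index. Edit distance, by contrast, needs no extra storage but cannot use an index in the same way.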

cypress · 2024-11-06T19:19:27Z

datalab Run #2808

Run properties

| Property | Value |
| --- | --- |
| Project | `datalab` |
| Branch Review | `ml-evs/mongo-fts-ngram` |
| Run status | `Passed #2808` |
| Run duration | `06m 30s` |
| Commit | `2e4be020a6 ℹ️: Merge 4fc2c6e21a891fe6b78f13a356646433a87517b4 into ae09e24acc67e9795fe1ee542485...` |
| Committer | `Matthew Evans` |

Test results

| Result | Count |
| --- | --- |
| Failures | `0` |
| Flaky | `0` |
| Pending | `0` |
| Skipped | `0` |
| Passing | `405` |

codecov · 2024-11-06T19:23:09Z

Codecov Report

Attention: Patch coverage is 93.42105% with 5 lines in your changes missing coverage. Please review.

Project coverage is 68.92%. Comparing base (ae09e24) to head (4fc2c6e).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pydatalab/src/pydatalab/mongo.py | 91.66% | 4 Missing ⚠️ |
| pydatalab/src/pydatalab/routes/v0_1/items.py | 96.29% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #993      +/-   ##
==========================================
+ Coverage   68.49%   68.92%   +0.43%     
==========================================
  Files          62       62              
  Lines        3955     4026      +71     
==========================================
+ Hits         2709     2775      +66     
- Misses       1246     1251       +5

| Files with missing lines | Coverage Δ | |
| --- | --- | --- |
| pydatalab/src/pydatalab/main.py | `64.82% <100.00%> (+0.24%)` | ⬆️ |
| pydatalab/src/pydatalab/routes/v0_1/items.py | `83.61% <96.29%> (+0.96%)` | ⬆️ |
| pydatalab/src/pydatalab/mongo.py | `84.34% <91.66%> (+5.24%)` | ⬆️ |


ml-evs · 2024-11-10T15:58:34Z

I think this is ready for review now, though we should not make it the default (yet) until we can test scaling and quality of search results.

ml-evs · 2025-02-07T11:39:20Z

pydatalab/src/pydatalab/main.py

@@ -206,6 +206,7 @@ def create_app(
        extension.init_app(app)

    pydatalab.mongo.create_default_indices()
+    pydatalab.mongo.create_ngram_item_index()


This should only be run on one of the API processes, or it should be protected by a lock.
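One possible way to address the review comment above is sketched here. It assumes all API processes run on the same host and share a filesystem, and uses an exclusively created lock file so that only one process runs the index build at a time. The helper names (`single_process_lock`, `create_index_once`) and the lock path are hypothetical; a multi-host deployment would need a different mechanism, for example a lock stored in the database itself.

```python
import os
from contextlib import contextmanager


@contextmanager
def single_process_lock(path: str):
    """Yield True if this process created the lock file at `path`, otherwise False."""
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        yield False
        return
    try:
        yield True
    finally:
        os.close(fd)
        os.remove(path)


def create_index_once(build, lock_path: str) -> bool:
    """Run `build()` only if the lock could be acquired; return whether it ran."""
    with single_process_lock(lock_path) as acquired:
        if acquired:
            build()
        return acquired
```

In `create_app`, the call to `pydatalab.mongo.create_ngram_item_index()` could then be passed as `build`. Two caveats with this sketch: a process that crashes while holding the lock leaves a stale file behind, and because the lock is released once the build finishes, a process that starts later will run the build again, so the build itself should be safe to repeat.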

ml-evs added the server label Nov 6, 2024

ml-evs requested review from jdbocarsly and BenjaminCharmes as code owners November 6, 2024 19:07

ml-evs changed the title ~~[WIP] Initial noodling with manual ngram index in MongoDB~~ [WIP] Initial noodling with manual ngram-based search in MongoDB Nov 6, 2024

ml-evs mentioned this pull request Nov 7, 2024

Fix and refactor FTS field generation #998

Merged

ml-evs force-pushed the ml-evs/mongo-fts-ngram branch from 79b2bcc to dbf241f Compare November 10, 2024 15:43

ml-evs changed the title ~~[WIP] Initial noodling with manual ngram-based search in MongoDB~~ Initial implementation of manual ngram-based search in MongoDB Nov 10, 2024

ml-evs force-pushed the ml-evs/mongo-fts-ngram branch from 63be944 to d4969ae Compare November 10, 2024 15:58

ml-evs added the enhancement New feature or request label Nov 10, 2024

ml-evs added 5 commits November 25, 2024 13:44

Initial noodling with manual ngram index in MongoDB

2945624

Add working tests

7cee59e

Implement rudimentary ngram-based search with item updates and add tests

bb82c64

Rebase

6b32637

Make sure find_one_and_update returns the updated doc

4fc2c6e

ml-evs force-pushed the ml-evs/mongo-fts-ngram branch from d4969ae to 4fc2c6e Compare November 25, 2024 13:44

ml-evs mentioned this pull request Nov 28, 2024

Secondary indices for vector search #1017

Open

ml-evs commented Feb 7, 2025


Add a temp. view to compare searches

34c7a7e
