Initial implementation of manual ngram-based search in MongoDB #993
base: main
datalab
Project | datalab |
Branch Review | ml-evs/mongo-fts-ngram |
Run duration | 06m 30s |
Committer | Matthew Evans |
Test results | 0 / 0 / 0 / 0 / 405 |
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #993 +/- ##
==========================================
+ Coverage 68.49% 68.92% +0.43%
==========================================
Files 62 62
Lines 3955 4026 +71
==========================================
+ Hits 2709 2775 +66
- Misses 1246 1251 +5
79b2bcc to dbf241f
63be944 to d4969ae
I think this is ready for review now, though we should not make it the default (yet) until we can test scaling and quality of search results.
d4969ae to 4fc2c6e
@@ -206,6 +206,7 @@ def create_app(
     extension.init_app(app)

     pydatalab.mongo.create_default_indices()
+    pydatalab.mongo.create_ngram_item_index()
This should only be run on one of the API processes, or there should be a lock.
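One way to read the locking suggestion is sketched below. It is only an illustration, not what the PR does: `run_once` and the lock path are hypothetical names, and it assumes all API processes share a filesystem. The first process to atomically create the lock file runs the index build; the others skip it.

```python
import os
from typing import Callable


def run_once(lock_path: str, task: Callable[[], None]) -> bool:
    """Run `task` only if this process is the first to create `lock_path`.

    Returns True if the task ran here, False if another process had
    already claimed the lock. (Hypothetical helper, not part of pydatalab.)
    """
    try:
        # O_CREAT | O_EXCL makes file creation atomic: exactly one caller wins.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    task()
    return True
```

In `create_app` this could wrap the call, e.g. `run_once("/tmp/ngram-index.lock", pydatalab.mongo.create_ngram_item_index)`, with the path being an assumption. Note that the lock file is never removed in this sketch, so it acts as a "has been run" marker and would need cleaning up before a rebuild.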
Closes #679 -- MongoDB text indexes tokenize using whitespace and punctuation only. This PR investigates whether we can build a manual ngram index, so when searching for refcode `ABCDEF`, you get results if you only ask for `ABC`, `BCD` etc.

This is done by making a separate collection called `item_fts` that is used only for this kind of search, by storing `immutable_id`, `type` and `ngrams` for all items, with an index over `ngrams`. Lookup is then done by ngrammifying the query string and doing array lookup and ordering by the number of matches.

Will have to fiddle around to see:
`N` is, and whether we need to do all N+1-grams up to a fixed range
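The ngram approach described above can be sketched in plain Python. This is a minimal in-memory illustration, not the PR's implementation: the dict stands in for the `item_fts` collection, the function names are hypothetical, and lowercasing and the default `n_min`/`n_max` of 3 are assumptions. Setting `n_max` above `n_min` also stores the longer grams, which is one way to explore the "N+1-grams up to a fixed range" question.

```python
from __future__ import annotations


def ngrams(text: str, n_min: int = 3, n_max: int = 3) -> set[str]:
    """Return all lowercase character n-grams of `text` for n_min <= n <= n_max."""
    text = text.lower()  # assumption: search is case-insensitive
    return {
        text[i : i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    }


def build_index(items: dict[str, str], n_min: int = 3, n_max: int = 3) -> dict[str, set[str]]:
    """Map each immutable_id to the n-grams of its searchable text."""
    return {iid: ngrams(text, n_min, n_max) for iid, text in items.items()}


def search(index: dict[str, set[str]], query: str, n_min: int = 3, n_max: int = 3) -> list[tuple[str, int]]:
    """Rank items by how many query n-grams they contain, best match first."""
    query_grams = ngrams(query, n_min, n_max)
    scores = {iid: len(query_grams & grams) for iid, grams in index.items()}
    hits = [(iid, count) for iid, count in scores.items() if count > 0]
    # Sort by descending match count, then by id for a stable order.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


index = build_index({"a1": "ABCDEF", "b2": "XYZABC", "c3": "QQQQ"})
print(search(index, "ABCD"))  # [('a1', 2), ('b2', 1)]
```

A query shorter than `n_min` produces no n-grams and so returns no hits, which is one of the trade-offs to weigh when choosing `N`.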