
OptimisticKnnVectorQuery #14226

Open

msokolov wants to merge 3 commits into main from optimistic-knn-query

Conversation

@msokolov (Contributor) commented Feb 12, 2025

Description

This is a WIP patch to work out an idea for Knn hit collection that is deterministic and efficient in the sense that the number of hits collected per leaf scales with the size of the leaf. The idea is:

  1. Run a pro-rated search in each leaf, optimistically assuming a uniform random distribution of the top-K documents among the leaves, by scaling k down to kLeaf in proportion to the leaf's share of the documents.
  2. Examine the list of per-leaf results. Any leaf whose minimum score among its kLeaf results is >= the global minimum score (of the top K merged across all leaves) is resubmitted for further exploration, using a seeded search that starts from its previous best results (see the sketch at the end of this description).
  3. Repeat step 2 until every leaf's minimum score is worse than the global minimum score, or some other limiting condition is reached.

When the presumption of uniform distribution is valid, no leaf needs further exploration after the first pass, so we should get performance similar to the simple pro-rated algorithm. Otherwise we pay the cost of re-entering the search, but we are guaranteed to get the global (approximate) top K in a deterministic way (without cross-thread communication during search), and hopefully the cost is minimized by seeding with the best results so far.

The key point is that we make an optimistic assumption initially, but we check it, and if it proves to be incorrect we make up for it with some more work. In many circumstances the original assumption will hold, and in any case the cost of the further searching shouldn't be too high because of the seeding. I plan to do some perf tests; no results yet beyond unit testing. I did run TestOptimisticKnnVectorQuery with 1000 repeats and it didn't fail, including the new consistency test from #14180.

This is a WIP only because the class design isn't really great. It re-uses a bunch of stuff from SeededKnnVectorQuery, and it handles floats only as it stands. My thinking is that a better way might be to make this the default and merge it directly into AbstractKnnVectorQuery? Or it could possibly be triggered by a KnnSearchStrategy?
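For illustration only, here is a rough sketch of the loop described above. The names are hypothetical: `searchLeaf` stands in for the per-leaf graph search (seeded when its third argument is non-null), and `maxPasses` and the `kLeaf < k` check are examples of the "other limiting conditions"; the actual patch reuses the plumbing in SeededKnnVectorQuery rather than a standalone class like this.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

// Illustrative sketch only; not the patch itself.
abstract class OptimisticKnnSketch {

  /** Hypothetical per-leaf graph search; when seeds != null it is a seeded search. */
  protected abstract TopDocs searchLeaf(LeafReaderContext ctx, int kLeaf, TopDocs seeds)
      throws IOException;

  TopDocs search(List<LeafReaderContext> leaves, int k, int indexMaxDoc, int maxPasses)
      throws IOException {
    TopDocs[] perLeaf = new TopDocs[leaves.size()];
    int[] kLeaf = new int[leaves.size()];
    // Step 1: pro-rate k by each leaf's share of the index, never letting it reach 0.
    for (LeafReaderContext ctx : leaves) {
      double proportion = ctx.reader().maxDoc() / (double) indexMaxDoc;
      kLeaf[ctx.ord] = Math.max(1, (int) Math.ceil(k * proportion));
      perLeaf[ctx.ord] = searchLeaf(ctx, kLeaf[ctx.ord], null);
    }
    for (int pass = 0; pass < maxPasses; pass++) {
      TopDocs merged = TopDocs.merge(k, perLeaf);
      if (merged.scoreDocs.length == 0) {
        return merged;
      }
      float globalMinScore = merged.scoreDocs[merged.scoreDocs.length - 1].score;
      // Step 2: a leaf whose worst kept hit is still globally competitive may have more to offer.
      List<LeafReaderContext> revisit = new ArrayList<>();
      for (LeafReaderContext ctx : leaves) {
        ScoreDoc[] hits = perLeaf[ctx.ord].scoreDocs;
        // kLeaf < k here is one possible "other limiting condition"
        if (hits.length > 0 && hits[hits.length - 1].score >= globalMinScore && kLeaf[ctx.ord] < k) {
          revisit.add(ctx);
        }
      }
      if (revisit.isEmpty()) {
        return merged; // the optimistic assumption held, or there is nothing more to gain
      }
      // Step 3: widen the per-leaf search and re-enter, seeded with the previous best hits.
      for (LeafReaderContext ctx : revisit) {
        kLeaf[ctx.ord] = Math.min(k, kLeaf[ctx.ord] * 2);
        perLeaf[ctx.ord] = searchLeaf(ctx, kLeaf[ctx.ord], perLeaf[ctx.ord]);
      }
    }
    return TopDocs.merge(k, perLeaf);
  }
}
```

When the uniform-distribution assumption holds, `revisit` comes back empty after the first merge and the loop exits immediately, which is why the common-case cost should match the plain pro-rated search.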

@benwtrent (Member)

This is indeed interesting. I eagerly await some of the perf numbers ;).

But this re-entering of the graph makes me think that collectors will need to track visitation, so that the same vector path isn't visited multiple times. It's possible that once you enter the lower level, you accidentally "back track" to previously hit vectors.

> My thinking is that a better way might be to make this the default and merge it directly into AbstractKnnVectorQuery? Or it could possibly be triggered by a KnnSearchStrategy?

I do think that if this proves fruitful, it should go straight into the default behavior. As for a strategy or not, it depends on how extreme the performance & recall differences are :/.

The reason I am thinking of a search strategy for the filter work I have been doing in #14160 is that the recall behavior is significantly different for certain kinds of filters. Users might want to turn it off (if they run particular kinds of filters that behave in a wonky way).

@msokolov (Contributor Author)

Hmm, GitHub's test run failed with:

gradlew :lucene:core:test --tests "org.apache.lucene.search.TestByteVectorSimilarityQuery.testFallbackToExact" -Ptests.jvms=1 -Ptests.jvmargs= -Ptests.seed=9F91AA4B44864767 -Ptests.useSecurityManager=true -Ptests.gui=true -Ptests.file.encoding=ISO-8859-1 -Ptests.vectorsize=default -Ptests.forceintegervectors=false

However, this did not repro locally. I'm going to kick off another run.

@msokolov (Contributor Author) commented Feb 12, 2025

> But this re-entering of the graph makes me think that collectors will need to track visitation, so that the same vector path isn't visited multiple times. It's possible that once you enter the lower level, you accidentally "back track" to previously hit vectors.

That might not be needed? My fingers are crossed, but I'm thinking we skip most of the revisitation by seeding with the best hits, and then we would quickly terminate if nothing better is found? I was hoping to manage this without the searcher changes that would be needed for retaining the visited lists.

@msokolov (Contributor Author)

Well, I ran some tests and, surprisingly, I saw a significant difference in both recall and latency (decreases in both). This surprised me: I expected to see more-or-less similar results, so I want to dig a little further.

@benwtrent (Member)

OK, I ran it on an 8M data set with 128 segments.

Indeed it visits way fewer vectors (seemingly), and is consistent across multiple threads.

recall  latency(ms)     nDoc  topK  fanout  visited  num segments  selectivity
 0.719       36.100  8000000   100     100    28015           128        1.000

8 threaded qps

recall  latency(ms)     nDoc  topK  fanout  visited  num segments  selectivity
 0.719        5.700  8000000   100     100    28015           128        1.000

The current baseline (which isn't multi-threaded consistent, so I am just showing the single thread perf):

recall  latency(ms)     nDoc  topK  fanout  visited  num segments  selectivity
 0.949       59.900  8000000   100     100    87299           128        1.000

Given the visited & latency, I wonder if the "visited" actually represents what this query is visiting?

But, maybe the answer is allowing more exploration per segment?

@benwtrent (Member) commented Feb 13, 2025

Ah, one other thought is that we are definitely scoring every seeded entry point twice: once when they are gathered during the initial query phase, and then again later through the seeded provisioning. We could adjust the internal API to allow cached scores to be provided (if we find that we can get the recall up).

@msokolov (Contributor Author)

You can tune it by changing the magic number 3 to a bigger number. I found that with 15 I get slightly better recall and slightly lower latencies than the baseline for my test case.

@msokolov (Contributor Author)

Also, we are not really tracking "visited" properly, I think.

@msokolov (Contributor Author)

I think maybe what happens here is that the K controls not only how many hits are returned from each segment, but also what the "beam width" is during search, so we could have gotten better hits even from the segments that appeared to have been exhausted (because their worst hit in the top K was below the min competitive score).

@benwtrent (Member)

I am not 100% sure what's up with the behavior. However, I switched to 16 (which also happens to be the graph conn) instead of 3.

It's interesting how visited is lower, but recall is on par with the current non-consistent baseline. I don't like the magic "16" here without knowing exactly why it's working.

I also fixed visited in luceneutil (accounting for the multiple TopDocs merges).

Here is NOT reusing the entry point scores

recall  latency(ms)     nDoc  topK  fanout  visited  num segments  selectivity
 0.942       10.400  8000000   100     100    68605           128        1.000

Here IS reusing the entry point scores (I have a local refactor). So, visited drops just a little:

recall  latency(ms)     nDoc  topK  fanout  visited  num segments  selectivity
 0.942        7.400  8000000   100     100    68477           128        1.000

here is reusing entry point, single threaded:

recall  latency(ms)     nDoc  topK  fanout  visited  num segments  selectivity
 0.942       49.300  8000000   100     100    68477           128        1.000

Note, the visited count for a SINGLE graph is around 3000.

Also, as a further baseline, here is no information sharing (i.e. what it was before our inconsistent multi-threaded results), just fully exploring every graph:

recall  latency(ms)     nDoc  topK  fanout  visited  num segments  selectivity
 1.000      213.400  8000000   100     100   412280           128        1.000

@msokolov (Contributor Author)

The latency improvement from re-using scores is surprisingly large. I would have expected costs to be dominated by newly-explored nodes, but this is cool.

@benwtrent (Member)

> The latency improvement from re-using scores is surprisingly large. I would have expected costs to be dominated by newly-explored nodes, but this is cool.

@msokolov this was only over 10 different query iterations on my laptop 😅 so, I wouldn't necessarily trust my latency numbers :D.

But, the visited count is attractive. The magic numbers are just weird to me. Why is 16 so special?

FWIW, I haven't tested this on any other datasets yet.

If we want to provide entry-point scores, I can do that refactor in a separate PR. It isn't so bad, and is generally useful (we score entry point twice on every search, which is a minimal cost, but we could prevent doing that).

@msokolov (Contributor Author)

I don't believe 16 is "special" except in the sense that it happens to be a sweet spot in this context. We expect that as we increase that per-segment factor we will get increased recall, because it is the equivalent of the old "fanout" parameter we used to have: it expands the beam of the search while keeping the number of results returned fixed. It is different from "fanout" in that fanout was a global setting, while this one scales with the relative size of the segment.

What I'd like to understand is whether there is a fixed value for this sweet spot, or whether it changes with {data, graph construction parameters, something else}. It seems clear that my idea that it is a probabilistic thing is incorrect, since 16 standard deviations is just crazy. But we can at least look at it empirically.
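For illustration, here is one concrete way to read the probabilistic interpretation under discussion (a hypothetical helper, not the code in this PR); the magic 3/15/16 factor above would play the role of `numStdDevs`:

```java
// Hypothetical sketch of a per-leaf topK under the "standard deviations" reading:
// model the number of global top-k hits landing in a leaf as Binomial(k, leafProportion)
// and pad the expected count by numStdDevs standard deviations.
static int perLeafTopK(int k, double leafProportion, double numStdDevs) {
  double expected = k * leafProportion;
  double stdDev = Math.sqrt(k * leafProportion * (1 - leafProportion));
  int padded = (int) Math.ceil(expected + numStdDevs * stdDev);
  return Math.max(1, padded); // never 0, even for tiny segments
}
```

Under that reading the factor controls how much extra beam width each segment gets beyond its pro-rated share, which is also why increasing it behaves like the old fanout parameter.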

@msokolov (Contributor Author)

@benwtrent I'm curious how you managed to re-use the scores. I poked around a little and I guess it requires some new plumbing / class casts since the API isn't really designed for it? Do you have a patch you wouldn't mind posting?

@benwtrent (Member)

@msokolov I can provide a quick patch tomorrow against main. As I said, it would work for stuff as it is now.

while (ctxIter.hasNext()) {
  LeafReaderContext ctx = ctxIter.next();
  TopDocs perLeaf = perLeafResults.get(ctx.ord);
  if (perLeaf.scoreDocs.length > 0
Contributor

Can we publish the metrics for each case here to see how often/how many passes we usually need to take?

Contributor Author

Yes, that would be critical to understanding what's going on here! Maybe we can extend TopK to include some metrics? Not sure what the best API is here.

Contributor

Maybe we can define it as an abstract method and let the consumer override it with their own metrics system?

Contributor Author

Yes, OK, pass in some kind of abstract Counter? But maybe a simple int[] is enough? The consumer can always convert to their own metrics system.
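A minimal sketch of the int[] idea, purely as an assumption of what the plumbing could look like (the extra counter argument and the metrics call are hypothetical stand-ins):

```java
// Hypothetical: a single-element int[] serves as a cheap mutable counter for re-entry passes.
int[] reentryPasses = new int[1];
TopDocs topDocs = optimisticSearch(leaves, k, indexMaxDoc, reentryPasses); // hypothetical overload
myMetrics.record("knn.optimistic.reentry_passes", reentryPasses[0]); // consumer's own metrics system
```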


assert perLeafTopK > 0; // if we divided by zero above, leafProportion can be NaN and then this would be 0
return new TopKnnCollector(perLeafTopK, visitedLimit);
Contributor

Would it make sense to cap perLeafTopK by the original k? I think k is doubled on every iteration, and perLeafTopK can theoretically go over the original k, which is excessive.

Contributor Author

Yes, we can do it in perLeafTopkCalculation.
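For illustration, the cap could be a single line inside that calculation (hypothetical variable names):

```java
perLeafTopK = Math.min(k, perLeafTopK); // never ask one leaf for more than the global k
```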

@dungba88 (Contributor) commented Feb 22, 2025

I ran some benchmarks with the Cohere 768d dataset for 3 different algorithms: (1) the baseline "greedy", (2) this PR's "optimistic", and (3) "pro-rata" only. (2) and (3) converge with fanout >= 20, which kind of makes sense because as we increase the ef parameter, it already looks deeper into the graph. I'll try different values of stdev as well.

(2) still missed some matches (about 8%) when compared to (1).

Other params: ndoc=500K, topK=100

recall  latency_(ms)  fanout  quantized  greediness  index_s  index_docs/s  force_merge_s  num_segments  index_size_(MB)  vec_disk_(MB)  vec_RAM_(MB)   algorithm 
 0.936        13.285       0         no        0.00   117.24       4264.94           0.00             7          1490.05       1464.844      1464.844  greediness 
 0.933        12.766       0         no        0.10   116.45       4293.84           0.00             7          1490.14       1464.844      1464.844  greediness 
 0.927        12.045       0         no        0.30   116.69       4285.04           0.00             7          1490.12       1464.844      1464.844  greediness 
 0.923        11.709       0         no        0.50   117.74       4246.79           0.00             7          1490.06       1464.844      1464.844  greediness 
 0.917        10.950       0         no        0.70   117.68       4248.67           0.00             7          1490.06       1464.844      1464.844  greediness 
 0.907        10.996       0         no        0.90   117.40       4258.80           0.00             7          1489.98       1464.844      1464.844  greediness 
 0.836        10.349       0         no        1.00   117.07       4271.13           0.00             7          1490.03       1464.844      1464.844  greediness 
 0.846         6.682       0         no       -1.00   116.50       4291.77           0.00             7          1490.09       1464.844      1464.844  optimistic
 0.862         6.699       5         no       -1.00   118.33       4225.44           0.00             7          1489.97       1464.844      1464.844  optimistic
 0.871         7.172      10         no       -1.00   116.10       4306.63           0.00             7          1490.05       1464.844      1464.844  optimistic
 0.897         7.962      20         no       -1.00   116.11       4306.15           0.00             7          1490.06       1464.844      1464.844  optimistic
 0.925        10.975      50         no       -1.00   116.71       4284.23           0.00             7          1490.10       1464.844      1464.844  optimistic
 0.941        15.635     100         no       -1.00   116.58       4289.08           0.00             7          1490.13       1464.844      1464.844  optimistic
 0.786         5.727       0         no       -1.00   117.86       4242.25           0.00             7          1490.10       1464.844      1464.844    prorata
 0.831         6.176       5         no       -1.00   117.27       4263.70           0.00             7          1490.07       1464.844      1464.844    prorata
 0.857         6.746      10         no       -1.00   116.79       4281.04           0.00             7          1490.08       1464.844      1464.844    prorata
 0.891         7.763      20         no       -1.00   116.21       4302.59           0.00             7          1490.09       1464.844      1464.844    prorata
 0.921        11.083      50         no       -1.00   118.02       4236.46           0.00             7          1490.06       1464.844      1464.844    prorata
 0.940        15.694     100         no       -1.00   116.63       4286.91           0.00             7          1490.00       1464.844      1464.844    prorata
(screenshot: chart of the results above)

@dungba88 (Contributor) commented Feb 23, 2025

Here is the graph with various values of stdev (3, 9, 16, 25). Again, (2) and (3) converge at a stdev value of 9, and all three converge at a stdev value of 16 and greediness of 0.7. With this dataset and these settings, using 9 seems to be the most efficient, but then there is virtually no difference between (2) and (3). I guess (2) would shine when the data is really skewed?

(screenshot: recall/latency chart for stdev values 3, 9, 16, 25)

@msokolov (Contributor Author) commented Feb 24, 2025

I pushed a version that re-uses scores and limits the per-leaf topK to the global topK. The former didn't make much difference, but the latter change did improve things quite a bit. Here are some numbers from cohere/768d:

mainline

recall  latency (ms)    nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  index docs/s  num segments  index size (MB)  vec disk (MB)  vec RAM (MB)
0.954        12.919  500000    50       0       64        250         no    13786     0.00      Infinity             8          1501.70       1464.844      1464.844
0.981        18.488  500000    50      50       64        250         no    20371     0.00      Infinity             8          1501.70       1464.844      1464.844
0.989        22.948  500000    50     100       64        250         no    24963     0.00      Infinity             8          1501.70       1464.844      1464.844

with reused scores and limiting perLeafK <= K:

recall  latency (ms)    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index s  index docs/s  num segments  index size (MB)  vec disk (MB)  vec RAM (MB)
 0.959        11.375  500000    50       0       64        250         no    12086   308.23       1622.15             8          1501.70       1464.844      1464.844
 0.979        14.926  500000    50      50       64        250         no    16724     0.00      Infinity             8          1501.70       1464.844      1464.844
 0.987        17.858  500000    50     100       64        250         no    20277     0.00      Infinity             8          1501.70       1464.844      1464.844

It would be awesome if you could produce similar comparisons for this version, @dungba88!

msokolov force-pushed the optimistic-knn-query branch from 8d16223 to d20f704 on February 24, 2025 at 16:12
@msokolov (Contributor Author)

There were some conflicts with other recent changes, so I rebased and removed the reuse-scores part of this since it was conflicting and doesn't seem worth preserving. Then I had to force-push, sorry.

msokolov force-pushed the optimistic-knn-query branch from d20f704 to 5b06e16 on February 24, 2025 at 16:21
@benwtrent (Member)

> so I rebased and removed the reuse-scores part of this since it was conflicting and doesn't seem worth preserving

@msokolov thanks for confirming and digging into that :). It was worth a shot! But it's not super surprising, as the dominant cost will be the further exploration.

@dungba88 (Contributor)

@msokolov I didn't see the change to cap perLeafTopK in the latest commit, maybe it has not been published yet?

@msokolov (Contributor Author)

oh darn, in all my rebasing and force-pushing I seem to have ended up with the wrong version, grr. I will check reflog and recover...
