
OptimisticKnnVectorQuery #14226

Open

msokolov wants to merge 3 commits into main from optimistic-knn-query

Conversation

@msokolov (Contributor) commented Feb 12, 2025

Description

This is a WIP patch to work out an idea for Knn hit collection that is deterministic and efficient in the sense that the number of hits collected per leaf scales with the size of the leaf. The idea is:

  1. Run a pro-rated search in each leaf, optimistically assuming a uniform random distribution of the top-K documents among the leaves, by scaling k down to kLeaf in proportion to the leaf's share of the documents.
  2. Examine the list of per-leaf results. Any leaf whose minimum score among its kLeaf results is >= the global minimum score (of the top K merged across all leaves) is resubmitted for further exploration, using a seeded search that starts from its previous best results (see the sketch at the end of this description).
  3. Repeat step 2 until every leaf's minimum score is worse than the global minimum score, or some other limiting condition is reached.

When the presumption of uniform distribution is valid, no leaf needs further exploration after the first pass, so we should get performance similar to the simple pro-rated algorithm. Otherwise we pay the cost of re-entering the search, but we are guaranteed to get the global (approximate) top K in a deterministic way (without cross-thread communication during search), and hopefully the cost is minimized by seeding with the best results so far.

The key point is that we make an optimistic assumption initially, but we check it, and if it proves to be incorrect we make up for it with some more work. In many circumstances the original assumption will hold, and in any case the cost of the further searching shouldn't be too high because of the seeding. I plan to do some perf tests; no results yet beyond unit testing. I did run TestOptimisticKnnVectorQuery with 1000 repeats and it didn't fail, including the new consistency test from #14180.

This is a WIP only because the class design isn't really great. It re-uses a bunch of stuff from SeededKnnVectorQuery, and it handles floats only as it stands. My thinking is that a better way might be to make this the default and merge it directly into AbstractKnnVectorQuery? Or it could possibly be triggered by a KnnSearchStrategy?
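For illustration only, here is a rough sketch of the loop described above. The names are hypothetical: `searchLeaf` stands in for the per-leaf graph search (seeded when its third argument is non-null), and `maxPasses` and the `kLeaf < k` check are examples of the "other limiting conditions"; the actual patch reuses the plumbing in SeededKnnVectorQuery rather than a standalone class like this.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

// Illustrative sketch only; not the patch itself.
abstract class OptimisticKnnSketch {

  /** Hypothetical per-leaf graph search; when seeds != null it is a seeded search. */
  protected abstract TopDocs searchLeaf(LeafReaderContext ctx, int kLeaf, TopDocs seeds)
      throws IOException;

  TopDocs search(List<LeafReaderContext> leaves, int k, int indexMaxDoc, int maxPasses)
      throws IOException {
    TopDocs[] perLeaf = new TopDocs[leaves.size()];
    int[] kLeaf = new int[leaves.size()];
    // Step 1: pro-rate k by each leaf's share of the index, never letting it reach 0.
    for (LeafReaderContext ctx : leaves) {
      double proportion = ctx.reader().maxDoc() / (double) indexMaxDoc;
      kLeaf[ctx.ord] = Math.max(1, (int) Math.ceil(k * proportion));
      perLeaf[ctx.ord] = searchLeaf(ctx, kLeaf[ctx.ord], null);
    }
    for (int pass = 0; pass < maxPasses; pass++) {
      TopDocs merged = TopDocs.merge(k, perLeaf);
      if (merged.scoreDocs.length == 0) {
        return merged;
      }
      float globalMinScore = merged.scoreDocs[merged.scoreDocs.length - 1].score;
      // Step 2: a leaf whose worst kept hit is still globally competitive may have more to offer.
      List<LeafReaderContext> revisit = new ArrayList<>();
      for (LeafReaderContext ctx : leaves) {
        ScoreDoc[] hits = perLeaf[ctx.ord].scoreDocs;
        // kLeaf < k here is one possible "other limiting condition"
        if (hits.length > 0 && hits[hits.length - 1].score >= globalMinScore && kLeaf[ctx.ord] < k) {
          revisit.add(ctx);
        }
      }
      if (revisit.isEmpty()) {
        return merged; // the optimistic assumption held, or there is nothing more to gain
      }
      // Step 3: widen the per-leaf search and re-enter, seeded with the previous best hits.
      for (LeafReaderContext ctx : revisit) {
        kLeaf[ctx.ord] = Math.min(k, kLeaf[ctx.ord] * 2);
        perLeaf[ctx.ord] = searchLeaf(ctx, kLeaf[ctx.ord], perLeaf[ctx.ord]);
      }
    }
    return TopDocs.merge(k, perLeaf);
  }
}
```

When the uniform-distribution assumption holds, `revisit` comes back empty after the first merge and the loop exits immediately, which is why the common-case cost should match the plain pro-rated search.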

@benwtrent (Member)

This is indeed interesting. I eagerly await some of the perf numbers ;).

But this re-entering of the graph makes me think that collectors will need to track visitation, so that the same vector path isn't visited multiple times. It's possible that once you enter the lower level, you accidentally "back track" to previously hit vectors.

> My thinking is that a better way might be to make this the default and merge it directly into AbstractKnnVectorQuery? Or it could possibly be triggered by a KnnSearchStrategy?

I do think that if this proves fruitful, it should go straight into the default behavior. As for a strategy or not, it depends on how extreme the performance & recall differences are :/.

The reason I am thinking of a search strategy for the filter work I have been doing in #14160 is that the recall behavior is significantly different for certain kinds of filters. Users might want to turn it off (if they run particular kinds of filters that behave in a wonky way).

@msokolov (Contributor Author)

Hmm, GitHub's test run failed with:

gradlew :lucene:core:test --tests "org.apache.lucene.search.TestByteVectorSimilarityQuery.testFallbackToExact" -Ptests.jvms=1 -Ptests.jvmargs= -Ptests.seed=9F91AA4B44864767 -Ptests.useSecurityManager=true -Ptests.gui=true -Ptests.file.encoding=ISO-8859-1 -Ptests.vectorsize=default -Ptests.forceintegervectors=false

However, this did not repro locally. I'm going to kick off another run.

@msokolov (Contributor Author) commented Feb 12, 2025

> But this re-entering of the graph makes me think that collectors will need to track visitation, so that the same vector path isn't visited multiple times. It's possible that once you enter the lower level, you accidentally "back track" to previously hit vectors.

That might not be needed? My fingers are crossed, but I'm thinking we skip most of the revisitation by seeding with the best hits, and then we would quickly terminate if nothing better is found? I was hoping to manage this without the searcher changes that would be needed for retaining the visited lists.

@msokolov (Contributor Author)

Well, I ran some tests and, surprisingly, I saw a significant difference in both recall and latency (decreases in both). This surprised me: I expected to see more-or-less similar results, so I want to dig a little further.

@benwtrent (Member)

OK, I ran it on an 8M data set with 128 segments.

Indeed it visits way fewer vectors (seemingly), and is consistent across multiple threads.

recall  latency(ms)     nDoc  topK  fanout  visited  num segments  selectivity
 0.719       36.100  8000000   100     100    28015           128        1.000

8 threaded qps

recall  latency(ms)     nDoc  topK  fanout  visited  num segments  selectivity
 0.719        5.700  8000000   100     100    28015           128        1.000

The current baseline (which isn't multi-threaded consistent, so I am just showing the single thread perf):

recall  latency(ms)     nDoc  topK  fanout  visited  num segments  selectivity
 0.949       59.900  8000000   100     100    87299           128        1.000

Given the visited & latency, I wonder if the "visited" actually represents what this query is visiting?

But, maybe the answer is allowing more exploration per segment?

@benwtrent (Member) commented Feb 13, 2025

Ah, one other thought is that we are definitely scoring every seeded entry point twice: once when they are gathered during the initial query phase, and then again later through the seeded provisioning. We could adjust the internal API to allow cached scores to be provided (if we find that we can get the recall up).

@msokolov (Contributor Author)

You can tune it by changing the magic number 3 to a bigger number. I found that with 15 I get slightly better recall and slightly lower latencies than the baseline for my test case.

@msokolov (Contributor Author)

Also, we are not really tracking "visited" properly, I think.

@msokolov (Contributor Author)

I think maybe what happens here is that the K controls not only how many hits are returned from each segment, but also what the "beam width" is during search, so we could have gotten better hits even from the segments that appeared to have been exhausted (because their worst hit in the top K was below the min competitive score).

@benwtrent (Member)

I am not 100% sure what's up with the behavior. However, I switched to 16 (which also happens to be the graph conn) instead of 3.

It's interesting how visited is lower, but recall is on par with the current non-consistent baseline. I don't like the magic "16" here without knowing exactly why it's working.

I also fixed visited in luceneutil (accounting for the multiple TopDocs merges).

Here is NOT reusing the entry point scores

recall  latency(ms)     nDoc  topK  fanout  visited  num segments  selectivity
 0.942       10.400  8000000   100     100    68605           128        1.000

Here IS reusing the entry point scores (I have a local refactor). So, visited drops just a little:

recall  latency(ms)     nDoc  topK  fanout  visited  num segments  selectivity
 0.942        7.400  8000000   100     100    68477           128        1.000

here is reusing entry point, single threaded:

recall  latency(ms)     nDoc  topK  fanout  visited  num segments  selectivity
 0.942       49.300  8000000   100     100    68477           128        1.000

Note, the visited count for a SINGLE graph is around 3000.

Also, as a further baseline, here is no information sharing (i.e. what it was before our inconsistent multi-threaded results), just fully exploring every graph:

recall  latency(ms)     nDoc  topK  fanout  visited  num segments  selectivity
 1.000      213.400  8000000   100     100   412280           128        1.000

@msokolov (Contributor Author)

The latency improvement from re-using scores is surprisingly large. I would have expected costs to be dominated by newly-explored nodes, but this is cool.

@benwtrent (Member)

> The latency improvement from re-using scores is surprisingly large. I would have expected costs to be dominated by newly-explored nodes, but this is cool.

@msokolov this was only over 10 different query iterations on my laptop 😅 so, I wouldn't necessarily trust my latency numbers :D.

But, the visited count is attractive. The magic numbers are just weird to me. Why is 16 so special?

FWIW, I haven't tested this on any other datasets yet.

If we want to provide entry-point scores, I can do that refactor in a separate PR. It isn't so bad, and is generally useful (we score entry point twice on every search, which is a minimal cost, but we could prevent doing that).

@msokolov (Contributor Author)

I don't believe 16 is "special" except in the sense that it happens to be a sweet spot in this context. We expect that as we increase that per-segment factor we will get increased recall, because it is the equivalent of the old "fanout" parameter we used to have: it expands the beam of the search while keeping the number of results returned fixed. It is different from "fanout" in that fanout was a global setting, while this one scales with the relative size of the segment.

What I'd like to understand is whether there is a fixed value for this sweet spot, or whether it changes with {data, graph construction parameters, something else}. It seems clear that my idea that it is a probabilistic thing is incorrect, since 16 standard deviations is just crazy. But we can at least look at it empirically.
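For illustration, here is one concrete way to read the probabilistic interpretation under discussion (a hypothetical helper, not the code in this PR); the magic 3/15/16 factor above would play the role of `numStdDevs`:

```java
// Hypothetical sketch of a per-leaf topK under the "standard deviations" reading:
// model the number of global top-k hits landing in a leaf as Binomial(k, leafProportion)
// and pad the expected count by numStdDevs standard deviations.
static int perLeafTopK(int k, double leafProportion, double numStdDevs) {
  double expected = k * leafProportion;
  double stdDev = Math.sqrt(k * leafProportion * (1 - leafProportion));
  int padded = (int) Math.ceil(expected + numStdDevs * stdDev);
  return Math.max(1, padded); // never 0, even for tiny segments
}
```

Under that reading the factor controls how much extra beam width each segment gets beyond its pro-rated share, which is also why increasing it behaves like the old fanout parameter.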

@msokolov (Contributor Author)

@benwtrent I'm curious how you managed to re-use the scores. I poked around a little and I guess it requires some new plumbing / class casts since the API isn't really designed for it? Do you have a patch you wouldn't mind posting?

@benwtrent (Member)

@msokolov I can provide a quick patch tomorrow against main. As I said, it would work for stuff as it is now.

while (ctxIter.hasNext()) {
  LeafReaderContext ctx = ctxIter.next();
  TopDocs perLeaf = perLeafResults.get(ctx.ord);
  if (perLeaf.scoreDocs.length > 0
Contributor

Can we publish the metrics for each case here to see how often/how many passes we usually need to take?

Contributor Author

Yes, that would be critical to understanding what's going on here! Maybe we can extend TopK to include some metrics? Not sure what the best API is here.

Contributor

Maybe we can define it as an abstract method and let the consumer override it with their own metrics system?

Contributor Author

Yes, OK, pass in some kind of abstract Counter? But maybe a simple int[] is enough? The consumer can always convert to their own metrics system.
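A minimal sketch of the int[] idea, purely as an assumption of what the plumbing could look like (the extra counter argument and the metrics call are hypothetical stand-ins):

```java
// Hypothetical: a single-element int[] serves as a cheap mutable counter for re-entry passes.
int[] reentryPasses = new int[1];
TopDocs topDocs = optimisticSearch(leaves, k, indexMaxDoc, reentryPasses); // hypothetical overload
myMetrics.record("knn.optimistic.reentry_passes", reentryPasses[0]); // consumer's own metrics system
```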


assert perLeafTopK > 0; // if we divided by zero above, leafProportion can be NaN and then this would be 0
return new TopKnnCollector(perLeafTopK, visitedLimit);
Contributor

Would it make sense to cap perLeafTopK by the original k? I think k is doubled on every iteration, and perLeafTopK can theoretically go over the original k, which is excessive.

Contributor Author

Yes, we can do it in perLeafTopkCalculation.
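For illustration, the cap could be a single line inside that calculation (hypothetical variable names):

```java
perLeafTopK = Math.min(k, perLeafTopK); // never ask one leaf for more than the global k
```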

@dungba88 (Contributor) commented Feb 22, 2025

I ran some benchmarks with the Cohere 768d dataset for 3 different algorithms: (1) the baseline "greedy", (2) this PR's "optimistic", and (3) "pro-rata" only. (2) and (3) converge with fanout >= 20, which kind of makes sense because as we increase the ef parameter, it already looks deeper into the graph. I'll try different values of stdev as well.

(2) still missed some matches (about 8%) when compared to (1).

Other params: ndoc=500K, topK=100

recall  latency_(ms)  fanout  quantized  greediness  index_s  index_docs/s  force_merge_s  num_segments  index_size_(MB)  vec_disk_(MB)  vec_RAM_(MB)   algorithm 
 0.936        13.285       0         no        0.00   117.24       4264.94           0.00             7          1490.05       1464.844      1464.844  greediness 
 0.933        12.766       0         no        0.10   116.45       4293.84           0.00             7          1490.14       1464.844      1464.844  greediness 
 0.927        12.045       0         no        0.30   116.69       4285.04           0.00             7          1490.12       1464.844      1464.844  greediness 
 0.923        11.709       0         no        0.50   117.74       4246.79           0.00             7          1490.06       1464.844      1464.844  greediness 
 0.917        10.950       0         no        0.70   117.68       4248.67           0.00             7          1490.06       1464.844      1464.844  greediness 
 0.907        10.996       0         no        0.90   117.40       4258.80           0.00             7          1489.98       1464.844      1464.844  greediness 
 0.836        10.349       0         no        1.00   117.07       4271.13           0.00             7          1490.03       1464.844      1464.844  greediness 
 0.846         6.682       0         no       -1.00   116.50       4291.77           0.00             7          1490.09       1464.844      1464.844  optimistic
 0.862         6.699       5         no       -1.00   118.33       4225.44           0.00             7          1489.97       1464.844      1464.844  optimistic
 0.871         7.172      10         no       -1.00   116.10       4306.63           0.00             7          1490.05       1464.844      1464.844  optimistic
 0.897         7.962      20         no       -1.00   116.11       4306.15           0.00             7          1490.06       1464.844      1464.844  optimistic
 0.925        10.975      50         no       -1.00   116.71       4284.23           0.00             7          1490.10       1464.844      1464.844  optimistic
 0.941        15.635     100         no       -1.00   116.58       4289.08           0.00             7          1490.13       1464.844      1464.844  optimistic
 0.786         5.727       0         no       -1.00   117.86       4242.25           0.00             7          1490.10       1464.844      1464.844    prorata
 0.831         6.176       5         no       -1.00   117.27       4263.70           0.00             7          1490.07       1464.844      1464.844    prorata
 0.857         6.746      10         no       -1.00   116.79       4281.04           0.00             7          1490.08       1464.844      1464.844    prorata
 0.891         7.763      20         no       -1.00   116.21       4302.59           0.00             7          1490.09       1464.844      1464.844    prorata
 0.921        11.083      50         no       -1.00   118.02       4236.46           0.00             7          1490.06       1464.844      1464.844    prorata
 0.940        15.694     100         no       -1.00   116.63       4286.91           0.00             7          1490.00       1464.844      1464.844    prorata
(screenshot: chart of the results above)

@dungba88 (Contributor) commented Feb 23, 2025

Here is the graph with various values of stdev (3, 9, 16, 25). Again, (2) and (3) converge at a stdev value of 9, and all three converge at a stdev value of 16 and greediness of 0.7. With this dataset and these settings, using 9 seems to be the most efficient, but then there is virtually no difference between (2) and (3). I guess (2) would shine when the data is really skewed?

(screenshot: recall/latency chart for stdev values 3, 9, 16, 25)

@msokolov (Contributor Author) commented Feb 24, 2025

I pushed a version that re-uses scores and limits the per-leaf topK to the global topK. The former didn't make much difference, but the latter change did improve things quite a bit. Here are some numbers from cohere/768d:

mainline

recall  latency (ms)    nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  index docs/s  num segments  index size (MB)  vec disk (MB)  vec RAM (MB)
0.954        12.919  500000    50       0       64        250         no    13786     0.00      Infinity             8          1501.70       1464.844      1464.844
0.981        18.488  500000    50      50       64        250         no    20371     0.00      Infinity             8          1501.70       1464.844      1464.844
0.989        22.948  500000    50     100       64        250         no    24963     0.00      Infinity             8          1501.70       1464.844      1464.844

with reused scores and limiting perLeafK <= K:

recall  latency (ms)    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index s  index docs/s  num segments  index size (MB)  vec disk (MB)  vec RAM (MB)
 0.959        11.375  500000    50       0       64        250         no    12086   308.23       1622.15             8          1501.70       1464.844      1464.844
 0.979        14.926  500000    50      50       64        250         no    16724     0.00      Infinity             8          1501.70       1464.844      1464.844
 0.987        17.858  500000    50     100       64        250         no    20277     0.00      Infinity             8          1501.70       1464.844      1464.844

It would be awesome if you could produce similar comparisons for this version, @dungba88!

msokolov force-pushed the optimistic-knn-query branch from 8d16223 to d20f704 on February 24, 2025 at 16:12
@msokolov (Contributor Author)

There were some conflicts with other recent changes, so I rebased and removed the reuse-scores part of this since it was conflicting and doesn't seem worth preserving. Then I had to force-push, sorry.

msokolov force-pushed the optimistic-knn-query branch from d20f704 to 5b06e16 on February 24, 2025 at 16:21
@benwtrent (Member)

> so I rebased and removed the reuse-scores part of this since it was conflicting and doesn't seem worth preserving

@msokolov thanks for confirming and digging into that :). It was worth a shot! But it's not super surprising, as the dominant cost will be the further exploration.

@dungba88 (Contributor)

@msokolov I didn't see the change to cap perLeafTopK in the latest commit, maybe it has not been published yet?

@msokolov (Contributor Author)

oh darn, in all my rebasing and force-pushing I seem to have ended up with the wrong version, grr. I will check reflog and recover...
