OptimisticKnnVectorQuery #14226
base: main
Conversation
This is indeed interesting. I eagerly await some of the perf numbers ;). But this re-entering of the graph makes me think that collectors will need to track visitation, so the same vector path isn't visited multiple times. It's possible that once you enter the lower level, you accidentally "back track" to previously hit vectors.

I do think that if this proves fruitful, it should go straight into the default behavior. As for a strategy or not, it depends on how extreme the performance & recall differences are :/. The reason I am thinking of a search strategy for the filter stuff I have been doing in #14160 is that the recall behavior is significantly different for certain kinds of filters. Users might want to turn it off (if they have particular kinds of filters that behave wonky).
Hmm, GitHub's test run failed with:

However, this did not repro locally. I'm going to kick off another run.
That might not be needed? My fingers are crossed, but I'm thinking we skip most of the revisitation by seeding with the best hits, and then we would quickly terminate if nothing better is found? I was hoping to manage this without any changes to the searcher that would be needed for retaining the visited lists.
Well, I ran some tests, and surprisingly, I saw a significant difference in both recall and latency (decreases in both). This surprised me: I expected to see more-or-less similar results, so I want to dig a little further.
OK, I ran it on an 8M data set with 128 segments. Indeed it visits way fewer vectors (seemingly), and is consistent across multiple threads.

8-threaded QPS:

The current baseline (which isn't multi-threaded consistent, so I am just showing the single-thread perf):

Given the visited & latency numbers, I wonder whether "visited" actually represents what this query is visiting? But maybe the answer is allowing more exploration per segment?
Ah, one other thought is that we are definitely scoring every seeded entry point twice: once when they are gathered during the initial query phase, and then later through the seeded provisioning. We could adjust the internal API to allow cached scores to be provided (if we find that we can get the recall up).
You can tune it by changing the magic number 3 to a bigger number. I found that with 15 I get slightly better recall and slightly lower latencies than the baseline for my test case.
Also - we are not really tracking "visited" properly, I think.
I think maybe what happens here is that K controls not only how many hits are returned from each segment, but also what the "beam width" is during search, so we could have gotten better hits even from the segments that appeared to have been exhausted (because their worst hit in the top K was below the min competitive score).
I am not 100% sure what's up with the behavior. However, I switched to 16. It's interesting how visited is lower, but recall is on par with the current non-consistent baseline. I don't like the magic "16" here, without knowing exactly why it's working. I also fixed visited in luceneutil (accounting for the multiple TopDocs merges).

Here is NOT reusing the entry point scores:

Here IS reusing the entry point scores (I have a local refactor). So visited drops just a little.

Here is reusing entry points, single threaded:

Note, the visited count for a SINGLE graph is around 3000. Also, as a further baseline, here is no information sharing (e.g. what it was before our inconsistent multi-threaded results), i.e. just fully exploring every graph:
The latency improvement from re-using scores is surprisingly large. I would have expected costs to be dominated by newly-explored nodes, but this is cool.
@msokolov this was only over 10 different query iterations on my laptop 😅 so I wouldn't necessarily trust my latency numbers :D. But the visited count is attractive. The magic numbers are just weird to me. Why is 16 special?

FWIW, I haven't tested this on any other datasets yet. If we want to provide entry-point scores, I can do that refactor in a separate PR. It isn't so bad, and is generally useful (we score entry points twice on every search, which is a minimal cost, but we could prevent doing that).
I don't believe 16 is "special" except in the sense that it happens to be a sweet spot in this context. We expect that as we increase that per-segment factor we will get increased recall, because it is the equivalent of the old "fanout" parameter we used to have: it expands the beam of the search while keeping the number of results returned fixed. This is different from "fanout" because that was a global setting, and this one scales with the relative size of the segment. What I'd like to understand is whether there is a fixed value for this sweet spot, or whether it changes with {data, graph construction parameters, something else}. It seems clear that my idea that it is a probabilistic thing is incorrect, since 16 standard deviations is just crazy. But we can at least look at it empirically.
@benwtrent I'm curious how you managed to re-use the scores. I poked around a little, and I guess it requires some new plumbing / class casts since the API isn't really designed for it? Do you have a patch you wouldn't mind posting?

@msokolov I can provide a quick patch tomorrow against main. As I said, it would work for stuff as it is now.
```java
while (ctxIter.hasNext()) {
  LeafReaderContext ctx = ctxIter.next();
  TopDocs perLeaf = perLeafResults.get(ctx.ord);
  if (perLeaf.scoreDocs.length > 0
```
Can we publish the metrics for each case here to see how often/how many passes we usually need to take?
yes that would be critical to understanding what's going on here! Maybe we can extend TopK to include some metrics? Not sure what the best API is here
Maybe we can define it as an abstract method, and let the consumer override it with their own metrics system?
Yes, OK, pass in some kind of abstract Counter? But maybe a simple int[] is enough? The consumer can always convert to their metric system.
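A minimal sketch of the "simple int[]" idea, assuming a hypothetical PassMetrics holder (this is not a committed API, just one shape the counter could take):

```java
// Hypothetical holder for the int[] counter discussed above. The query
// would bump the counters whenever a leaf needs another pass; the consumer
// converts them into its own metrics system after the search finishes.
final class PassMetrics {
  // metrics[0] = leaves that needed at least one extra (seeded) pass
  // metrics[1] = total extra passes taken across all leaves
  final int[] metrics = new int[2];

  void recordExtraPass(boolean firstExtraPassForLeaf) {
    if (firstExtraPassForLeaf) {
      metrics[0]++;
    }
    metrics[1]++;
  }
}
```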
```java
assert perLeafTopK > 0; // if we divided by zero above, leafProportion can be NaN and then this would be 0
return new TopKnnCollector(perLeafTopK, visitedLimit);
```
Would it make sense to cap perLeafTopK by the original k? I think k is doubled on every iteration, and perLeafTopK can theoretically go over the original k, which is excessive.
yes, we can do it in perLeafTopkCalculation
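For concreteness, here is a minimal sketch of what that capped calculation could look like. The method name follows the comment above, but the pro-rating formula is an assumption, not the PR's exact code:

```java
// Hypothetical version of perLeafTopkCalculation: pro-rate k by the leaf's
// share of the total doc count, scale by the per-segment factor discussed
// above (3 / 15 / 16), and cap at the global k as suggested in this thread.
static int perLeafTopkCalculation(int k, int perSegmentFactor, int leafDocCount, int totalDocCount) {
  double leafProportion = leafDocCount / (double) totalDocCount;
  int perLeafTopK = (int) Math.ceil(k * leafProportion * perSegmentFactor);
  perLeafTopK = Math.min(perLeafTopK, k); // the cap: never ask a leaf for more than the global k
  // if totalDocCount were 0, leafProportion would be NaN and the ceil would yield 0
  assert perLeafTopK > 0;
  return perLeafTopK;
}
```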
I pushed a version that re-uses scores and limits per-leaf topK to the global topK. The former didn't make very much difference, but the latter change did improve things quite a bit. Here are some numbers from cohere/768d:

mainline:

with reused scores and limiting perLeafK <= K:

It would be awesome if you could produce similar comparisons for this version, @dungba88!
(force-pushed from 8d16223 to d20f704)
There were some conflicts with other recent changes, so I rebased. I also removed the reuse-scores part of this, since it was conflicting with other changes and doesn't seem worth preserving. Then I had to force-push, sorry.
(force-pushed from d20f704 to 5b06e16)
@msokolov thanks for confirming and digging into that :). It was worth a shot! But it's not super surprising, as the dominant cost will be the further exploration.
@msokolov I didn't see the change to cap perLeafTopK in the latest push.
Oh darn, in all my rebasing and force-pushing I seem to have ended up with the wrong version, grr. I will check reflog and recover...
Description
This is a WIP patch to work out an idea for KNN hit collection that is deterministic and efficient in the sense that the number of hits collected per leaf scales with the size of the leaf. The idea is:

1. Pro-rate the global top K among the leaves, so that each leaf gets a budget kLeaf proportional to its size.
2. Collect the top kLeaf hits from each leaf and merge them into a global top K, establishing a global minimum score.
3. Any leaf whose minimum score among its kLeaf results is >= global minimum score (in top K across all leaves, merged) is submitted for further exploration using seeded search starting with the previous best results.
4. Merge the new per-leaf results, repeating the check with a doubled k if needed.

When the presumption of uniform distribution is valid, we would be able to skip steps 3 and 4, so we should get similar performance as we do with the simple pro-rated algorithm. Otherwise we pay the cost of re-entering search, but we are guaranteed to get the global (approximate) top K in a deterministic way (without cross-thread communication during search), and hopefully the cost is minimized by seeding with the best results so far.
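A minimal sketch of the check-and-re-search step (steps 3 and 4), assuming hypothetical helpers seededSearch and kLeaf; it shows the intended control flow, not the PR's actual code:

```java
import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.TopDocs;

// Second, "pessimistic" phase: re-enter only the leaves whose results
// suggest they may still hold globally competitive hits.
// seededSearch and kLeaf are hypothetical helpers, named for illustration.
TopDocs optimisticSecondPass(List<LeafReaderContext> leaves, TopDocs[] perLeafResults, int k)
    throws IOException {
  TopDocs merged = TopDocs.merge(k, perLeafResults);
  float globalMinScore = merged.scoreDocs[merged.scoreDocs.length - 1].score;
  for (LeafReaderContext ctx : leaves) {
    TopDocs perLeaf = perLeafResults[ctx.ord];
    if (perLeaf.scoreDocs.length > 0
        && perLeaf.scoreDocs[perLeaf.scoreDocs.length - 1].score >= globalMinScore) {
      // The optimistic assumption failed for this leaf: its worst kept hit
      // is still globally competitive, so better hits may remain. Re-enter
      // its graph seeded with the previous best results and a doubled budget.
      perLeafResults[ctx.ord] = seededSearch(ctx, perLeaf, 2 * kLeaf(ctx, k));
    }
  }
  return TopDocs.merge(k, perLeafResults);
}
```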
The key point is that we make an optimistic assumption initially, but we check, and if it proves to be incorrect, we make up for it with some more work. In many circumstances the original assumption will hold, and in any case the cost of further searching shouldn't be too much because of the seeding. I plan to do some perf tests; no results yet beyond unit testing. I did run TestOptimisticKnnVectorQuery with 1000 repeats and it didn't fail, including the new consistency test added in #14180.
This is a WIP only because the class design isn't really great: it re-uses a bunch of stuff from SeededKnnVectorQuery, and it's for floats only as it is. My thinking is that a better way might be to make this the default and merge it directly into AbstractKnnVectorQuery? Or it could possibly be triggered by a KnnSearchStrategy?