Try applying bipartite graph reordering to KNN graph node ids #13565
Comments
This is what got me thinking of BP for HNSW search: intuitively, it could help a lot when the dataset size exceeds the size of the page cache?
Yes, and as we keep compressing the vectors themselves, the graph takes up more and more space, relatively speaking.
This paper from SIGIR'24 seems to do exactly this as a first step in their Block Max Pruning technique.
I think the gains might be astounding? Similar vectors would be stored near each other in the …
Thinking about the implementation a bit, I realized that when we reorder the vector storage for the benefit of HNSW we will still need a way to iterate over vector values in docid order, and we need to map from vector ord to docid when searching. None of the existing vector formats handles this: they are optimized for vectors that are stored in docid order. To make some progress, I'd start with an initial implementation that stores these mappings in a naive way, e.g. as fully-populated arrays, and we can use that to measure how much improvement we see in graph storage size and search performance. Then we could revisit and use some more efficient data structure for the ord/doc mapping. Since the ordinals would no longer be increasing with docid, we can't use DirectMonotonicReader/Writer any more; it needs to be something more like what SortedNumericDocValues does. I'm not super familiar with what we have - I wonder if there is some reusable code that would help.
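To make the naive approach concrete, here is a minimal sketch of the fully-populated-array mapping, assuming we materialize both directions from an ord-to-doc array at open time; the class and method names are illustrative, not an existing Lucene API.

```java
// Naive ord <-> docid mapping, stored as fully-populated arrays.
// Illustrative sketch only; this is not an existing Lucene class.
final class NaiveOrdDocMap {
  private final int[] ordToDoc; // index: vector ordinal (graph node id), value: docid
  private final int[] docToOrd; // index: docid, value: vector ordinal, or -1 if the doc has no vector

  NaiveOrdDocMap(int[] ordToDoc, int maxDoc) {
    this.ordToDoc = ordToDoc;
    this.docToOrd = new int[maxDoc];
    java.util.Arrays.fill(docToOrd, -1);
    for (int ord = 0; ord < ordToDoc.length; ord++) {
      docToOrd[ordToDoc[ord]] = ord;
    }
  }

  int docId(int ord) {      // needed when mapping search results back to docids
    return ordToDoc[ord];
  }

  int ordinal(int docId) {  // needed when iterating vector values in docid order
    return docToOrd[docId];
  }
}
```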
This sounds to me like you are considering renumbering node IDs in the HNSW graph without renumbering doc IDs. I'm curious if you considered renumbering doc IDs like BPIndexReorderer?
Yes, I'd like this to be compatible with a different index sort for docids. I think it makes sense when you have hybrid search; keywords + semantic in the same query. But I suppose if you have a purely semantic index you would ideally want the sorting to be congruent. Let me think about it - maybe we can achieve both somehow?
One thing that confused me about the …
OK, I'm beginning to see that it would be easier to start with a tool that rewrites the index, as you did with the …
I made this tool; while testing it I ran into some unexpected wrinkles relating to our vector format. I created a new index from an existing one, with a new docid order by:
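Roughly, a rewrite along these lines can be sketched as follows, assuming the same addIndexes pattern that BPIndexReorderer supports; the reorder() call and its signature are stated from memory and may differ, and the real tool would derive the new order from vector similarity rather than from postings.

```java
import java.io.IOException;
import org.apache.lucene.index.*;
import org.apache.lucene.misc.index.BPIndexReorderer;
import org.apache.lucene.store.Directory;

// Hedged sketch, not the exact steps used: read each segment of the source index,
// compute a new doc order, and add the reordered view to a fresh index.
static void rewrite(Directory src, Directory dst, Directory tmp) throws IOException {
  try (DirectoryReader reader = DirectoryReader.open(src);
       IndexWriter writer = new IndexWriter(dst, new IndexWriterConfig())) {
    BPIndexReorderer reorderer = new BPIndexReorderer();
    for (LeafReaderContext ctx : reader.leaves()) {
      // reorder() here stands in for a vector-similarity-driven ordering
      CodecReader reordered = reorderer.reorder((CodecReader) ctx.reader(), tmp);
      writer.addIndexes(reordered); // re-encodes everything, including rebuilding the HNSW graph
    }
    writer.forceMerge(1);
  }
}
```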
But this recreates the HNSW graph, doing unnecessary work, when all we really want to do is renumber it. And the new graph it creates ignores the parameters that were used to create the original graph, substituting defaults such as M=16. It makes me wonder if we ought to write M to the index as per-field metadata so it can be reused by tools such as this that may not have access to a schema, or in general when merging the field in the future. I guess for my preliminary tests I can simply hack the … Update: I found we set this …
Could we at least factor this out so they share a single …
Yes, that seems incontrovertibly correct. I was planning to include it along with some other changes, but let's fix the easy stuff first. #13674
Another thing I stumbled on that surprised me is that SortingCodecReader now requires loading vectors into RAM in order to do its work. It used to rely on random access to avoid this, but we stopped doing that in #11964. I guess I was fully aware at the time, but now I think we need to find a way to provide random access when possible. In that issue it was stated that we can't rely on readers supporting random access, but in fact I think they always do; certainly in the main path random access is supported. Perhaps we can handle this with a class cast in SortingCodecReader so it can use random access when available. But I'm still not sure why we don't just make it always available.
Sorry that this change is being annoying for BP; the goal of this change was to simplify the API, which I found a bit messy at the time with both a sequential and a random-access API. Since the random-access API was only used internally by the file format, it felt cleaner to only expose the sequential-access API to users, and also consistent with doc values. If this is limiting for more interesting things we want to do, we could look into replacing the sequential-access API of vectors with a random-access API in Lucene 10.
Hi, thanks for that @jpountz, no worries; this was something we all agreed on. I'm able to continue with the "research" part of this by simply increasing heap size - it's not a blocker. At the same time I think we might want to reintroduce random-access vector readers as a first-class API for other reasons. Even the current case of merging multiple large segments containing vectors would be affected by this, wouldn't it? Since SortingCodecReader is used by IndexWriter when merging sorted indexes, it means that in that case all vector data of segments being merged is held on heap, potentially requiring quite a lot of heap allocation when instead we could read from "disk" at the cost of some random accesses. I guess disk random accesses are generally to be avoided, but given that the alternative is to "page in" every vector page to the heap, I would think we would prefer to let the OS do the paging as usual for our index data.
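To make the heap-versus-page-cache trade-off concrete, here is a purely hypothetical sketch; these interfaces are invented for illustration and are not the actual Lucene vector-reader API.

```java
// Invented interfaces, for illustration only.
interface SequentialVectors { boolean advance() throws java.io.IOException; float[] vector(); }
interface RandomAccessVectors { float[] vectorValue(int ord) throws java.io.IOException; }

// Today (sequential access): SortingCodecReader has to buffer every vector on heap
// before it can emit them in the new doc order.
static float[][] bufferAll(SequentialVectors in, int size) throws java.io.IOException {
  float[][] onHeap = new float[size][];
  int ord = 0;
  while (in.advance()) {
    onHeap[ord++] = in.vector().clone();
  }
  return onHeap;
}

// With random access: vectors can be fetched lazily in the new order, letting the OS
// page cache do the buffering at the cost of some random reads.
static float[] vectorForNewOrd(RandomAccessVectors in, int[] newToOldOrd, int newOrd)
    throws java.io.IOException {
  return in.vectorValue(newToOldOrd[newOrd]);
}
```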
In the meantime, just to let you know, I do have a dirt-path implementation of this (multithreading not yet working, it totally recomputes centroids on every iteration, etc.), but it isn't yet yielding the hoped-for improvements in HNSW graph (.vex file) size. I augmented KnnGraphTester to print out the average delta between node ids, and this is being cut in half, as we would expect from BP, but it doesn't yield much reduction in index size. It might just be that the indexes are too small for VInt encoding to be impacted much if node/docid deltas were previously averaging around 55000 and are now around 25000 (that still takes 3 bytes per delta on average). In this case I saw the .vex file size go from 21048922 to 20624822; only a few percent reduction. I'm continuing to test with larger indexes and different vector datasets.
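For reference, here is the arithmetic behind "still takes 3 bytes per delta": a quick sketch of VInt sizing, assuming the standard encoding of 7 payload bits per byte.

```java
// Bytes needed to VInt-encode a value: 7 payload bits per byte.
static int vIntBytes(int value) {
  int bytes = 1;
  while ((value & ~0x7F) != 0) {
    bytes++;
    value >>>= 7;
  }
  return bytes;
}

// vIntBytes(55_000) == 3 (55,000 needs 16 bits -> 3 VInt bytes)
// vIntBytes(25_000) == 3 (25,000 needs 15 bits -> still 3 VInt bytes)
// Deltas only drop to 2 bytes once they fall below 2^14 = 16,384.
```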
This change should also give a big improvement to cold indices (where the Lucene …
We should use group-varint, like for tail postings?
Does this choose a single bit-width for a group of postings? That sounds like it would produce savings here, yes. Also it would be faster to decode, right? More data: on a 2M-document index over a different dataset I did see a larger size reduction, from 88970639 to 60536909, a 32% reduction, and it's looking like some good latency reduction too, even after the index is warm, possibly due to the VInt decoding being cheaper? Again, this seems to vary quite a bit with the vector dataset and index size, but it is looking more promising now.
No, each posting can still have a different byte width, but it does the decoding in a way that doesn't have unpredictable conditionals.
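A rough sketch of the group-varint layout being described, assuming the classic format of one tag byte carrying four 2-bit length codes followed by the four values; Lucene's own group-varint implementation details may differ.

```java
import java.nio.ByteBuffer;

// Encode 4 ints with one tag byte: two bits per value give its byte length (1-4).
// Each value keeps its own width, but the decoder only branches on the tag byte,
// not on every payload byte as plain VInt does.
static void encodeGroup(ByteBuffer out, int a, int b, int c, int d) {
  int[] values = {a, b, c, d};
  int tagPos = out.position();
  out.put((byte) 0); // reserve the tag byte, patched below
  int tag = 0;
  for (int i = 0; i < 4; i++) {
    int len = bytesFor(values[i]); // 1..4
    tag |= (len - 1) << (i * 2);
    for (int j = 0; j < len; j++) {
      out.put((byte) (values[i] >>> (8 * j)));
    }
  }
  out.put(tagPos, (byte) tag);
}

static int bytesFor(int v) {
  if ((v >>> 8) == 0) return 1;
  if ((v >>> 16) == 0) return 2;
  if ((v >>> 24) == 0) return 3;
  return 4;
}
```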
Description
We recently added BPIndexReorderer to enable sorting indexed docids using the BP approach to cluster documents with similar postings in some chosen set of fields. The same approach can also be used to sort node ids in a graph; see e.g. https://dl.acm.org/doi/10.1145/2939672.2939862, and https://arxiv.org/pdf/2403.01797 applies similar ideas to shards. I think we could apply the algorithm we have already implemented to HNSW node ids, using document/document dot products as the distance metric that drives the partitioning, and we might get some compression benefits since nearby nodes will tend to be linked together, which should reduce the overall node id deltas. We might also get memory locality benefits; possibly the memory access patterns could get a bit less random.
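To illustrate the idea (not an existing implementation), here is a rough sketch of one swap pass of BP-style bisection where the move gain is computed from dot products against the two halves' centroids; all names are illustrative, and a real version would recurse on each half and then assign node ids by in-order traversal of the leaves.

```java
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only: one swap pass where a node's "gain" is how much more
// similar its vector is to the other half's centroid than to its own half's centroid.
static void swapPass(List<Integer> left, List<Integer> right, float[][] vectors) {
  float[] cl = centroid(left, vectors);
  float[] cr = centroid(right, vectors);
  // Sort each half so the nodes that most "want" to move come first.
  left.sort(Comparator.comparingDouble((Integer ord) -> -(dot(vectors[ord], cr) - dot(vectors[ord], cl))));
  right.sort(Comparator.comparingDouble((Integer ord) -> -(dot(vectors[ord], cl) - dot(vectors[ord], cr))));
  for (int i = 0; i < Math.min(left.size(), right.size()); i++) {
    int l = left.get(i), r = right.get(i);
    double gain = (dot(vectors[l], cr) - dot(vectors[l], cl))
                + (dot(vectors[r], cl) - dot(vectors[r], cr));
    if (gain <= 0) break; // no more profitable swaps this pass
    left.set(i, r);
    right.set(i, l);
  }
  // Recursing on each half and numbering nodes in traversal order gives similar
  // vectors nearby ids, and hence smaller neighbor-list deltas.
}

static double dot(float[] a, float[] b) {
  double s = 0;
  for (int i = 0; i < a.length; i++) s += a[i] * b[i];
  return s;
}

static float[] centroid(List<Integer> ords, float[][] vectors) {
  float[] c = new float[vectors[0].length];
  for (int ord : ords) {
    for (int i = 0; i < c.length; i++) c[i] += vectors[ord][i];
  }
  for (int i = 0; i < c.length; i++) c[i] /= ords.size();
  return c;
}
```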