-
Notifications
You must be signed in to change notification settings - Fork 115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deletes #117
Conversation
Bits parameter to search() is not required to be non-null Add Bits.ALL and convenience methods No cleanup yet
- NN path can include self, so exclude it from candidates array
- NN path can include self, so exclude it from candidates array
Not yet done:
|
PR is a bit of a mess, largely because I started writing it on my desktop, wrote a different part on my macbook, and then had to merge both parts with the big threadlocal refactor, so apologies in advance for that. |
I think there's actually two different use cases. The first is Cassandra-like, where the caller wants to make the vector ordinals match some other data source (C* rowIds). For that we'll have the caller pass a a mapping function. But if you're building, say, vector embeddings of the files in an intellj project, you don't care about that, you just need to map vectors to files or methods. And there isn't really a state where the index is "done," you're going to keep modifying it as edits are made to the files. In that case it's a better fit to save the index, "holes" and all, and be able to reload it in the same state to continue modifying it. The most recent commit here adds save and load methods to handle this. |
jvector-base/src/main/java/io/github/jbellis/jvector/graph/ConcurrentNeighborSet.java
Show resolved
Hide resolved
jvector-base/src/main/java/io/github/jbellis/jvector/graph/ConcurrentNeighborSet.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, clicked approve on the wrong PR tab. Still reviewing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approach and general implementation make sense to me. I left a few more comments inline about things I noticed on my last review pass. I think resolving these with words or code would conclude my review.
jvector-base/src/main/java/io/github/jbellis/jvector/graph/GraphIndexBuilder.java
Outdated
Show resolved
Hide resolved
jvector-base/src/main/java/io/github/jbellis/jvector/graph/GraphIndexBuilder.java
Show resolved
Hide resolved
jvector-base/src/main/java/io/github/jbellis/jvector/disk/OnDiskGraphIndex.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one tiny nit and one small question, but LGTM. Thanks!
jvector-base/src/main/java/io/github/jbellis/jvector/disk/OnDiskGraphIndex.java
Show resolved
Hide resolved
} | ||
|
||
@Test | ||
public void testReordingReumbering() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename? Not sure how much of this is a typo but at least Reumbering -> Renumbering
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wow, two typos in the same method name. I don't think I was drunk when I wrote this ... :)
Since jbellis#117, acceptOrds should not be null. Instead, Bits.ALL should be used. Also updated GraphSearcher#search javadoc to match implementation.
Since #117, acceptOrds should not be null. Instead, Bits.ALL should be used. Also updated GraphSearcher#search javadoc to match implementation.
does this use the same algorithm for deletes as in https://arxiv.org/abs/2105.09613 or something different? looking at it, it seems to be doing something different. could you explain? |
JVector's approach is essentially "pretend each node that loses connections due to the deletion is newly added to the graph and rebuild its connections that way." FreshDiskANN's approach is definitely less expensive, if it actually maintains link quality I'd be happy to switch to that. |
i tried deleting nodes from the graph. the cleanup operation takes more time than rebuilding the entire index. with 1M vectors, building the index took 3 minutes. then i deleted 20% of the vectors and the cleanup took 17 minutes. |
That does sound excessive. |
Adds markNodeDeleted to GraphIndexBuilder, and adds functionality to cleanup() that removes deleted nodes and repairs the connections of their neighbors.
Fixes #115