Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deletes #117

Merged
merged 42 commits into from
Oct 19, 2023
Merged

Deletes #117

merged 42 commits into from
Oct 19, 2023

Conversation

jbellis
Copy link
Owner

@jbellis jbellis commented Oct 9, 2023

Adds markNodeDeleted to GraphIndexBuilder, and adds functionality to cleanup() that removes deleted nodes and repairs the connections of their neighbors.

Fixes #115

@jbellis
Copy link
Owner Author

jbellis commented Oct 9, 2023

Not yet done:

  • Tests that exercise RenumberingGraphIndex
  • A way to communicate to the caller how their vectors got renumbered
  • (Optional) integrate this with other renumbering the caller may want to perform (e.g. Cassandra prefers to have vector ids match row ids when there is a 1:1 correspondence)

@jbellis
Copy link
Owner Author

jbellis commented Oct 9, 2023

PR is a bit of a mess, largely because I started writing it on my desktop, wrote a different part on my macbook, and then had to merge both parts with the big threadlocal refactor, so apologies in advance for that.

@jbellis jbellis self-assigned this Oct 12, 2023
@jbellis
Copy link
Owner Author

jbellis commented Oct 12, 2023

A way to communicate to the caller how their vectors got renumbered

I think there's actually two different use cases.

The first is Cassandra-like, where the caller wants to make the vector ordinals match some other data source (C* rowIds). For that we'll have the caller pass a a mapping function.

But if you're building, say, vector embeddings of the files in an intellj project, you don't care about that, you just need to map vectors to files or methods. And there isn't really a state where the index is "done," you're going to keep modifying it as edits are made to the files. In that case it's a better fit to save the index, "holes" and all, and be able to reload it in the same state to continue modifying it.

The most recent commit here adds save and load methods to handle this.

Copy link
Collaborator

@jkni jkni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, clicked approve on the wrong PR tab. Still reviewing.

Copy link
Collaborator

@jkni jkni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approach and general implementation make sense to me. I left a few more comments inline about things I noticed on my last review pass. I think resolving these with words or code would conclude my review.

Copy link
Collaborator

@jkni jkni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one tiny nit and one small question, but LGTM. Thanks!

}

@Test
public void testReordingReumbering() throws IOException {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename? Not sure how much of this is a typo but at least Reumbering -> Renumbering

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow, two typos in the same method name. I don't think I was drunk when I wrote this ... :)

@jbellis jbellis merged commit a3edcbe into main Oct 19, 2023
8 checks passed
@jbellis jbellis deleted the deletes branch October 19, 2023 23:15
vbekiaris pushed a commit to vbekiaris/jvector that referenced this pull request Nov 14, 2023
Since jbellis#117, acceptOrds should not be null. Instead, Bits.ALL should be used.
Also updated GraphSearcher#search javadoc to match implementation.
jbellis pushed a commit that referenced this pull request Nov 15, 2023
Since #117, acceptOrds should not be null. Instead, Bits.ALL should be used.
Also updated GraphSearcher#search javadoc to match implementation.
@siddhsql
Copy link

does this use the same algorithm for deletes as in https://arxiv.org/abs/2105.09613 or something different? looking at it, it seems to be doing something different. could you explain?

@jbellis
Copy link
Owner Author

jbellis commented Dec 16, 2023

JVector's approach is essentially "pretend each node that loses connections due to the deletion is newly added to the graph and rebuild its connections that way."

FreshDiskANN's approach is definitely less expensive, if it actually maintains link quality I'd be happy to switch to that.

@siddhsql
Copy link

i tried deleting nodes from the graph. the cleanup operation takes more time than rebuilding the entire index. with 1M vectors, building the index took 3 minutes. then i deleted 20% of the vectors and the cleanup took 17 minutes.

@jbellis
Copy link
Owner Author

jbellis commented Dec 20, 2023

That does sound excessive.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

How to remove vector from the index?
3 participants