Duplicate vector handling #39
Comments
Yes, since version 0.1.7. The HNSW algorithm doesn't work efficiently on duplicate vectors. We attributed this to the heuristic neighbor selection algorithm focusing only on navigation. With heuristic neighbor selection, duplicate or near-duplicate vectors are hidden. To solve this, we modified the heuristic neighbor selection algorithm so that it keeps some nearest neighbors without degrading navigation performance. Below is one of the benchmarks measured for the 0.1.7 release; GIST contains duplicate vectors (about 2% of the train vectors are duplicated).
Handling duplicate vectors trades off against navigation performance, and the way we handled it may not be optimal, so we are continuing to look for a better approach.
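To illustrate why the heuristic hides duplicates, here is a minimal pure-Python sketch of the standard HNSW heuristic neighbor selection rule: a candidate is kept only if it is closer to the query than to every neighbor already selected. The `keep_nearest` parameter is a hypothetical knob added purely for illustration; it reserves the first few slots for plain nearest neighbors. It is an assumption about what "keeps some nearest neighbors" could look like, not n2's actual implementation.

```python
import math


def select_neighbors_heuristic(query, candidates, m, keep_nearest=0):
    """Return indices of up to m candidates chosen as graph neighbors.

    keep_nearest is a hypothetical parameter (not part of n2): the first
    keep_nearest picks skip the diversity check and are taken purely by
    distance to the query.
    """
    # Visit candidates from nearest to farthest (sorted() is stable, so
    # ties such as exact duplicates keep their input order).
    order = sorted(range(len(candidates)),
                   key=lambda i: math.dist(query, candidates[i]))
    selected = []
    for i in order:
        if len(selected) >= m:
            break
        c = candidates[i]
        if len(selected) < keep_nearest:
            # Reserved slot: accept the nearest candidates unconditionally.
            selected.append(i)
            continue
        d_query = math.dist(query, c)
        # Standard rule: keep c only if it is closer to the query than to
        # every already selected neighbor. A duplicate of a selected
        # neighbor is at distance 0 from it, so it always fails this test.
        if all(d_query < math.dist(c, candidates[j]) for j in selected):
            selected.append(i)
    return selected


query = (0.0, 0.0)
# Candidates 0 and 1 are exact duplicates.
candidates = [(1.0, 0.0), (1.0, 0.0), (0.0, 2.0), (-3.0, 0.0)]

print(select_neighbors_heuristic(query, candidates, m=3))
print(select_neighbors_heuristic(query, candidates, m=3, keep_nearest=2))
```

With the plain rule, the duplicate at index 1 is never linked, so a search that relies on graph edges can miss it. Reserving slots brings it back, at the cost of one fewer diverse edge for navigation, which is the tradeoff described above.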
It's awesome that you're tackling this problem, thank you very much for the detailed response. Please feel free to close the issue if you wish.
If we make further progress on this problem, I'll comment here.
Hi, thanks for open sourcing this great library!
This is more of a question than an actual issue: does n2 handle duplicate vectors without performance degradation or recall issues?
Other ANN libraries tend to suffer with that. See: