Duplicate vector handling #39

Closed
fjsj opened this issue Dec 31, 2020 · 3 comments

@fjsj

fjsj commented Dec 31, 2020

Hi, thanks for open sourcing this great library!
This is more of a question than an actual issue: does n2 handle duplicate vectors without performance degradation or recall issues?

Other ANN libraries tend to suffer from that. See:

@gony-noreply
Member

does n2 handle duplicate vectors without performance degradation or recall issues?

Yes, since version 0.1.7.

The HNSW algorithm doesn't work efficiently on duplicate vectors. We attributed this to the heuristic neighbor selection algorithm focusing only on navigation. With the heuristic neighbor selection, duplicate or near-duplicate vectors are hidden, so search becomes difficult and recall drops.

To solve this, we modified the heuristic neighbor selection algorithm so that it keeps some nearest neighbors without degrading navigation performance.
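
To make the effect concrete, here is a minimal sketch in plain Python. It is not N2's code: `select_heuristic` follows the general shape of the HNSW diversity rule (keep a candidate only if it is closer to the base point than to every neighbor already selected), and `select_with_nearest` is a hypothetical variant that reserves a few slots for the plain nearest candidates. The function names, the `n_nearest` parameter, and the toy data are all assumptions made for illustration; the thread does not describe the actual change in 0.1.7 at this level of detail.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_heuristic(points, base, candidates, m):
    """Diversity rule in the style of the HNSW heuristic (simplified).

    A candidate is kept only if it is strictly closer to `base` than to
    every neighbor selected so far. Returns at most `m` candidate ids.
    """
    ordered = sorted(candidates, key=lambda c: dist(points[base], points[c]))
    selected = []
    for cand in ordered:
        if len(selected) >= m:
            break
        d_base = dist(points[base], points[cand])
        if all(d_base < dist(points[cand], points[s]) for s in selected):
            selected.append(cand)
    return selected


def select_with_nearest(points, base, candidates, m, n_nearest):
    """Hypothetical variant, NOT N2's actual implementation.

    The `n_nearest` closest candidates are kept unconditionally; the
    remaining slots are filled with the same diversity rule as above.
    """
    ordered = sorted(candidates, key=lambda c: dist(points[base], points[c]))
    selected = ordered[:min(n_nearest, m)]
    for cand in ordered[len(selected):]:
        if len(selected) >= m:
            break
        d_base = dist(points[base], points[cand])
        if all(d_base < dist(points[cand], points[s]) for s in selected):
            selected.append(cand)
    return selected


# Toy data: "a" and "a_dup" are exact duplicates.
points = {
    "q": (0.0, 0.0),
    "a": (1.0, 0.0),
    "a_dup": (1.0, 0.0),
    "b": (0.0, 2.0),
    "c": (-3.0, 0.0),
}
cands = ["a", "a_dup", "b", "c"]

print(select_heuristic(points, "q", cands, 3))        # ['a', 'b', 'c']
print(select_with_nearest(points, "q", cands, 3, 2))  # ['a', 'a_dup', 'b']
```

In the first call the duplicate `a_dup` is rejected because its distance to the already selected `a` is zero, so it never becomes a neighbor of `q`. In the second call it is kept, at the cost of one slot that would otherwise go to a more distant, navigation-friendly neighbor.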

Below is one of the benchmarks measured for the 0.1.7 release. The GIST dataset contains duplicate vectors (about 2% of the train vectors are duplicated).

[image: benchmark plot from the 0.1.7 release]
You can see higher recall compared to N2 version 0.1.6.
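
For readers unfamiliar with the metric, recall in ANN benchmarks is commonly the share of returned results that are as close to the query as the true k-th nearest neighbor. The sketch below is one way to compute it with a distance threshold, which treats a vector and its duplicate as interchangeable. How the benchmark above computed recall is not stated in this thread; the function names and the `eps` tolerance are assumptions, and the returned ids are assumed to be unique.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_knn(data, query, k):
    """Exact k nearest neighbors of `query`, as indices into `data`."""
    order = sorted(range(len(data)), key=lambda i: euclidean(data[i], query))
    return order[:k]


def recall_at_k(data, query, returned_ids, k, eps=1e-9):
    """Distance-threshold recall for a single query.

    A returned id counts as a hit if its distance to the query is no
    larger than the distance of the true k-th nearest neighbor (plus a
    small tolerance), so either copy of a duplicated vector is accepted.
    """
    true_ids = brute_force_knn(data, query, k)
    threshold = euclidean(data[true_ids[-1]], query) + eps
    hits = sum(
        1 for i in returned_ids[:k] if euclidean(data[i], query) <= threshold
    )
    return hits / k
```

For example, with `data = [(0, 0), (0, 0), (1, 0), (5, 5)]`, a query at `(0, 0)`, and `k = 2`, an index that returns ids `[0, 2]` has found only one of the two duplicates and scores 0.5.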

Handling duplicate vectors involves a tradeoff with navigation performance, and the way we handled it may not be optimal, so we are continuing to look for a better way.

@fjsj
Author

fjsj commented Dec 31, 2020

It's awesome that you're tackling this problem, thank you very much for the detailed response. Please feel free to close the issue if you wish.

@gony-noreply
Member

If we make further progress on that problem, I'll comment here.
