Query with another index/graph? #79

Open · ivirshup opened this issue Sep 11, 2019 · 5 comments

@ivirshup (Contributor)

This is a question / feature request.

My (cursory) understanding of the NNDescent algorithm makes me think it should be possible to speed up querying for a test set if we already have an index built on that test set. Is this the case?

If so, would this query be in scope for this project?

# a: NNDescent, b: NNDescent
a.query(b)
@lmcinnes (Owner)

There is a paper on that and related problems (arXiv:1908.00814v3); it is currently on my list of things to do, but I'm not sure when, or if, that would happen.

@ivirshup (Contributor, Author)

In general, I think I'd be interested in contributing an implementation here. I've had some time to look over the paper (thanks for recommending it!), and had a couple of questions, if you don't mind. First, I'd like to make sure I'm on the same page about what I'd like to be able to do versus what's described in the paper; then I have some questions about potential modifications.

The problem I would like to address is finding, for each point in one dataset, its k nearest neighbors in another dataset, essentially creating a bipartite graph. My understanding of this paper is that it only deals with building a joint graph containing all samples, where every node has out-degree k. Is this right?

The main idea of using the predefined structure for faster search, as in P-merge, is definitely shared. I had a couple of naive ideas for modifications towards my ends:

1. Change the final step of P-merge. Instead of merging G⁻ and H⁻ into U, take the between-dataset neighbors of U (i.e. U ∩ (G⁺ ∪ H⁺)), replace the within-dataset neighbors with sampled points, then optimize until convergence again.
2. Change the initialization so that each point starts with at least k samples from the dataset to be searched, then remove all intra-dataset edges and take the top k remaining edges.

Do you think these approaches might be worth pursuing?
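For concreteness, here is a rough numpy sketch of the "drop intra-dataset edges and keep the top k" step common to both ideas, assuming a joint graph has already been built over the concatenated data; all names (`joint_indices`, `joint_distances`, `n_a`) are hypothetical:

import numpy as np

# Hypothetical inputs: a joint k'-NN graph (k' > k) over the concatenation
# of dataset A (n_a rows) and dataset B, given as (n, k') arrays of
# neighbor indices and distances, each row sorted by distance (as
# pynndescent's neighbor graphs are). Points 0..n_a-1 come from A.
def bipartite_from_joint(joint_indices, joint_distances, n_a, k):
    n = joint_indices.shape[0]
    source_in_a = (np.arange(n) < n_a)[:, None]   # (n, 1)
    target_in_a = joint_indices < n_a             # (n, k')
    cross = source_in_a != target_in_a            # edge crosses datasets

    neighbors = np.full((n, k), -1, dtype=np.int64)
    distances = np.full((n, k), np.inf)
    for i in range(n):
        idx = joint_indices[i][cross[i]][:k]
        neighbors[i, :len(idx)] = idx
        distances[i, :len(idx)] = joint_distances[i][cross[i]][:k]
    # Rows with trailing -1s had fewer than k cross-dataset neighbors and
    # would need the extra refinement pass described in idea 1.
    return neighbors, distances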

@lmcinnes (Owner)

I would be happy to see any efforts you want to make on this. For your specific problem you could, in principle, build an index on each dataset and then query each dataset against the index of the other; that may well be faster than the merge-style operations from the paper. Of course, that assumes you've built an index on each; perhaps you want to do the search during index construction to save time? Either way your idea has some merit, and I would be interested to see how it performs.
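That cross-query approach might look something like the following untested sketch, using pynndescent's public NNDescent/query API; the arrays A and B are placeholders:

import numpy as np
from pynndescent import NNDescent

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 32)).astype(np.float32)  # placeholder data
B = rng.normal(size=(1200, 32)).astype(np.float32)

index_a = NNDescent(A)  # an index built on each dataset separately
index_b = NNDescent(B)

k = 15
# For each point of A, its (approximate) k nearest neighbors in B,
# and vice versa: together, the directed edges of the bipartite k-NN graph.
b_for_a, dist_a_to_b = index_b.query(A, k=k)
a_for_b, dist_b_to_a = index_a.query(B, k=k)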

For my own purposes I have a somewhat related problem: suppose I have two datasets, potentially with (graph-based) index structures built on each of them; find the pair of points, one from each dataset, with the shortest distance, ideally as quickly as possible. That would essentially be the minimum-distance edge of the bipartite graph you want to construct, I believe. But given that I only need the one shortest edge, is there anything that can be done to speed that up further?
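A naive baseline for that, reusing the hypothetical index_b and A from the sketch above, is to query every point of one dataset for its single nearest neighbor in the other and take the global argmin; the open question is whether the graph structures allow pruning most of those per-point queries:

# Nearest point of B for every point of A, then the global minimum.
nn_in_b, d = index_b.query(A, k=1)
i = int(np.argmin(d[:, 0]))   # index into A
j = int(nn_in_b[i, 0])        # index into B
# (A[i], B[j]) is then the (approximate) closest cross-dataset pair,
# at distance d[i, 0].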

@ljmartin

ljmartin commented Jun 19, 2020

Hi lmcinnes,
Do you have any more thoughts or ideas on querying for the nearest neighbor within a subset of the index? Do you think it could be done through some combination of clever numpy indexing of the NN graph?

e.g. if one wants to find the nearest neighbor within 50 instances of a pynndescent index of 200 items, then you might be able to mask out the 150 unwanted items from the neighbor graph and re-index it somehow.
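Something along those lines might look like this numpy sketch (hypothetical names; the input arrays are the (indices, distances) pair that a fitted index exposes as neighbor_graph). Note it inherits the problem discussed in the next comment: a point may have few, or no, neighbors that survive the mask:

import numpy as np

# graph_indices, graph_distances: the (n, k) arrays from a fitted index,
# e.g. graph_indices, graph_distances = index.neighbor_graph
def mask_to_subset(graph_indices, graph_distances, subset):
    keep = np.zeros(graph_indices.shape[0], dtype=bool)
    keep[subset] = True
    valid = keep[graph_indices]                   # neighbor is in the subset
    masked = np.where(valid, graph_distances, np.inf)
    order = np.argsort(masked, axis=1)            # surviving neighbors first
    rows = np.arange(graph_indices.shape[0])[:, None]
    return graph_indices[rows, order], masked[rows, order]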

I'm happy to have a crack at this if you have a pointer, but I don't expect to make progress quickly. Right now I've just been brute-forcing the problem using an enormous memory-mapped pairwise distance matrix. Pulling samples from this is obviously very slow, so I'd like to speed it up by fitting a pynndescent object and then selecting a subset of the instances.
cheers

edit - fitting multiple pynndescent indices won't work in this case: the subsets change tens or hundreds of times, since they arise from random sampling in cross-validation.

@lmcinnes (Owner)

I think this is possible, but tricky. The catch is that you want to restrict the potential neighbors to be only those in the subset. It is hard to do that and still end up with an adequate number of neighbors for each sample by simply subselecting from an existing index. I don't actually see any easy way to enable this.
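To make the difficulty concrete, here is a quick check on the masked graph from the earlier sketch (hypothetical names again). If neighbors were spread roughly uniformly, a 50-of-200 subset would mask out on the order of three quarters of each point's neighbor list, so most points would retain far fewer than k usable neighbors:

# Count how many of each kept point's k neighbors survive the mask.
sub_idx, sub_dist = mask_to_subset(graph_indices, graph_distances, subset)
survivors = np.isfinite(sub_dist[subset]).sum(axis=1)
print(survivors.mean(), int((survivors == 0).sum()))
# A low mean, or any isolated points, is exactly the failure mode:
# subselection alone cannot guarantee k usable neighbors per sample.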
