Query with another index/graph? #79

Open · ivirshup opened this issue Sep 11, 2019 · 5 comments

@ivirshup (Contributor)

This is a question / feature request.

My (cursory) understanding of the NNDescent algorithm makes me think it should be possible to speed up querying for a test set if we already have an index built on that test set. Is this the case?

If so, would this query be in scope for this project?

# a: NNDescent, b: NNDescent
a.query(b)
@lmcinnes (Owner)

There is a paper on that and related problems (arXiv:1908.00814v3); it is currently on my list of things to do, but I'm not sure when, or if, that would happen.

@ivirshup (Contributor, Author)

In general, I think I'd be interested in contributing an implementation here. I've had some time to look over the paper (thanks for recommending it!), and had a couple of questions, if you don't mind. First, I'd like to make sure I'm on the same page about what I'd like to be able to do versus what's described in the paper; then I have some questions about potential modifications.

The problem I would like to address is finding, for each point in one dataset, its k nearest neighbors in another dataset, essentially creating a bipartite graph. My understanding of this paper is that it only deals with building a joint graph containing all samples, where every node has out-degree k. Is this right?

The main idea of using the predefined structure for faster search, as in P-merge, is definitely shared. I had a couple of naive ideas for modifications towards my ends:

1. Change the final step of P-merge. Instead of merging G⁻ and H⁻ into U, take the between-dataset neighbors of U (i.e. U ∩ (G⁺ ∪ H⁺)), replace the within-dataset neighbors with sampled points, then optimize until convergence again.
2. Change the initialization so that each point starts with at least k samples from the dataset to be searched, then remove all intra-dataset edges and take the top k remaining edges.

Do you think these approaches might be worth pursuing?
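For concreteness, here is a rough numpy sketch of the "drop intra-dataset edges and keep the top k" step common to both ideas, assuming a joint graph has already been built over the concatenated data; all names (`joint_indices`, `joint_distances`, `n_a`) are hypothetical:

import numpy as np

# Hypothetical inputs: a joint k'-NN graph (k' > k) over the concatenation
# of dataset A (n_a rows) and dataset B, given as (n, k') arrays of
# neighbor indices and distances, each row sorted by distance (as
# pynndescent's neighbor graphs are). Points 0..n_a-1 come from A.
def bipartite_from_joint(joint_indices, joint_distances, n_a, k):
    n = joint_indices.shape[0]
    source_in_a = (np.arange(n) < n_a)[:, None]   # (n, 1)
    target_in_a = joint_indices < n_a             # (n, k')
    cross = source_in_a != target_in_a            # edge crosses datasets

    neighbors = np.full((n, k), -1, dtype=np.int64)
    distances = np.full((n, k), np.inf)
    for i in range(n):
        idx = joint_indices[i][cross[i]][:k]
        neighbors[i, :len(idx)] = idx
        distances[i, :len(idx)] = joint_distances[i][cross[i]][:k]
    # Rows with trailing -1s had fewer than k cross-dataset neighbors and
    # would need the extra refinement pass described in idea 1.
    return neighbors, distances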

@lmcinnes (Owner)

I would be happy to see any efforts you want to make on this. For your specific problem you could, in principle, build an index on each dataset and then query each dataset against the index of the other; that may well be faster than the merge-style operations from the paper. Of course, that assumes you've built an index on each; perhaps you want to do the search during index construction to save time? Either way your idea has some merit, and I would be interested to see how it performs.
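That cross-query approach might look something like the following untested sketch, using pynndescent's public NNDescent/query API; the arrays A and B are placeholders:

import numpy as np
from pynndescent import NNDescent

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 32)).astype(np.float32)  # placeholder data
B = rng.normal(size=(1200, 32)).astype(np.float32)

index_a = NNDescent(A)  # an index built on each dataset separately
index_b = NNDescent(B)

k = 15
# For each point of A, its (approximate) k nearest neighbors in B,
# and vice versa: together, the directed edges of the bipartite k-NN graph.
b_for_a, dist_a_to_b = index_b.query(A, k=k)
a_for_b, dist_b_to_a = index_a.query(B, k=k)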

For my own purposes I have a somewhat related problem: suppose I have two datasets, potentially with (graph-based) index structures built on each of them; find the pair of points, one from each dataset, with the shortest distance, ideally as quickly as possible. That would essentially be the minimum-distance edge of the bipartite graph you want to construct, I believe. But given that I only need the one shortest edge, is there anything that can be done to speed that up further?
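A naive baseline for that, reusing the hypothetical index_b and A from the sketch above, is to query every point of one dataset for its single nearest neighbor in the other and take the global argmin; the open question is whether the graph structures allow pruning most of those per-point queries:

# Nearest point of B for every point of A, then the global minimum.
nn_in_b, d = index_b.query(A, k=1)
i = int(np.argmin(d[:, 0]))   # index into A
j = int(nn_in_b[i, 0])        # index into B
# (A[i], B[j]) is then the (approximate) closest cross-dataset pair,
# at distance d[i, 0].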

@ljmartin

ljmartin commented Jun 19, 2020

Hi lmcinnes,
Do you have any more thoughts or ideas on querying for the nearest neighbor within a subset of the index? Do you think it could be done through some combination of clever numpy indexing of the NN graph?

e.g. if one wants to find the nearest neighbor within 50 instances of a pynndescent index of 200 items, then you might be able to mask out the 150 unwanted items from the neighbor graph and re-index it somehow.
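Something along those lines might look like this numpy sketch (hypothetical names; the input arrays are the (indices, distances) pair that a fitted index exposes as neighbor_graph). Note it inherits the problem discussed in the next comment: a point may have few, or no, neighbors that survive the mask:

import numpy as np

# graph_indices, graph_distances: the (n, k) arrays from a fitted index,
# e.g. graph_indices, graph_distances = index.neighbor_graph
def mask_to_subset(graph_indices, graph_distances, subset):
    keep = np.zeros(graph_indices.shape[0], dtype=bool)
    keep[subset] = True
    valid = keep[graph_indices]                   # neighbor is in the subset
    masked = np.where(valid, graph_distances, np.inf)
    order = np.argsort(masked, axis=1)            # surviving neighbors first
    rows = np.arange(graph_indices.shape[0])[:, None]
    return graph_indices[rows, order], masked[rows, order]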

I'm happy to have a crack at this if you have a pointer, but I don't expect to make progress quickly. Right now I've just been brute-forcing the problem using an enormous memory-mapped pairwise distance matrix. Pulling samples from this is obviously very slow, so I'd like to speed it up by fitting a pynndescent object and then selecting a subset of the instances.
cheers

edit - fitting multiple pynndescent indices won't work in this case: the subsets change tens or hundreds of times, since they arise from random sampling in cross-validation.

@lmcinnes (Owner)

I think this is possible, but tricky. The catch is that you want to restrict the potential neighbors to be only those in the subset. It is hard to do that and still end up with an adequate number of neighbors for each sample by simply subselecting from an existing index. I don't actually see any easy way to enable this.
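To make the difficulty concrete, here is a quick check on the masked graph from the earlier sketch (hypothetical names again). If neighbors were spread roughly uniformly, a 50-of-200 subset would mask out on the order of three quarters of each point's neighbor list, so most points would retain far fewer than k usable neighbors:

# Count how many of each kept point's k neighbors survive the mask.
sub_idx, sub_dist = mask_to_subset(graph_indices, graph_distances, subset)
survivors = np.isfinite(sub_dist[subset]).sum(axis=1)
print(survivors.mean(), int((survivors == 0).sum()))
# A low mean, or any isolated points, is exactly the failure mode:
# subselection alone cannot guarantee k usable neighbors per sample.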
