-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initialize queryable index with init_graph? #110
Comments
I certainly agree that the use case makes a lot of sense. A question is how best to actually handle this. Queries will be slower without a forest / search tree as the initialization for the graph search will be slower. In general one doesn't just have trees lying around to pass in as you did in your example, so one would need to generate them. Potentially we could just make a single search tree if |
Short answer: basic support is easy, so I added it. In principle the current master branch should work for you now. That leaves the harder questions about validation. I think "user beware" is reasonable to a degree here. Actually I'm surprised things worked as well as they did for you -- the whole On the other hand you do present some pretty reasonable use-cases, so it is not unlikely that others will have similar uses. That means making things a little more user friendly is probably a good idea. Some answers to specific questions:
|
Thanks for the quick addition!
For sure. My initial test of this was just to change this: pynndescent/pynndescent/pynndescent_.py Lines 715 to 718 in a0ca840
to if tree_init or n_trees != 0:
self.tree_init = True
else:
self.tree_init = False so a tree could be constructed even if the I guess I see there being multiple cases for passing an initial graph. I think I'd actually view this as making multiple ways to construct a
|
Yes, it is about the alternative metrics. The catch being that if and when any get added it is good to have as few places to make changes as possible (and ideally obvious ones). Potentially I can just add an extra parameter |
Could this be solved by having a Also, I've been looking at #79 again, and had a question related to both of these. How accurate is are the nearest neighbors found by nndecent? While this must vary with number of neighbors and iterations, I was wondering if there was some benchmarking that had been done for accuracy of graph construction (as opposed to querying, like the ANN benchmarks). My concern here was: if you're providing poor guesses for initial indices, how bad can the results get? |
Yes, I'm more concerned about the long term tail on maintaining such things for new distances, and not having something fall out of sync somewhere and creating hard to diagnose errors (I had similar concerns over adding the alternative distances initially, but there were some very significant performance wins for many standard use cases). Potentially this is a very niche use case, so it's a matter of balancing the amount of extra utility with long term overhead.
That's hard to answer because so much is dataset dependent. It can be very bad for certain metrics and data distributions, but largely it is pretty good. Anecdotally I generally expect 90%+ accuracy. I'm not aware of standard benchmarks for it however. Typically the way to check is to get the true nearest neighbors for a (small) subsample of points and get an estimate of the accuracy that way.
Pure NNDescent assumes you get an initial graph with random edges and does well, so you can provide quite poor guesses and still do well, it will just take longer to converge. The greater risk is in providing an initial graph with suitable pathological properties (e.g. distinct connected components that don't relate to neighbor structure). You are unlikely to get that with a generic noisy guess at nearest neighbors, but it is certainly possible to generate such graphs (take, say, the leaves of a single rp-tree for example). Again, this is subtle in the ways it can go astray, so a degree of user expertise is required. |
@lmcinnes this is tangentially related to the current discussion here about accuracy due to nearest neighbor descent, so can I just confirm that the current way pynndescent works is: To build the index:
To query the index:
Specifically, It seems like the The context for this question is that I continue to sporadically prod at creating a nearest neighbor descent package for R, and my implementation does not use any forest initialization. For very high-dimensional data, results are poor (#18 is still relevant here), in line with the accuracy determined by just using the I want to make sure I am not missing anything clever being done in the query routine that would work with a naive "use the NND results as initialization" approach. Unfortunately, it doesn't seem easy to replace the forest-initialized |
Rather than open an issue I guess I'll ask here: what format should Really I can figure out the answer to this myself, but it would be nice if there some hint in the documentation! 🙂 |
@jamestwebber : The docs are definitely lacking on that front. In reality I hacked something in for my own needs for some experiments, and never got around to making a user friendly version. The short answer is that, right now, it takes a 2D array of shape (n_samples, n_neighbors) such that the entry at (i, j) gives the index of the jth neighbour of the ith data sample. I would happily accept a PR for documentation on that, or perhaps even a user friendlier version (accepting, say, sparse adjacency matrices and networkX graphs as well maybe?), if you cared to try. @jlmelville : Wow, sorry, this totally slipped through the cracks on me! I apologise for not getting back to you. I think the major thing is that the |
Ah, so exactly what the output looks like! That's handy.
As you have no doubt noticed by now, I have a problem of saying I'll contribute PRs and then not getting to them, so I will demur this time. 😂 |
I would like to be able to more quickly construct an NNDescent object from a precomputed distance matrix, and potentially RP Forest. This is sorta possible right now with the
init_graph
argument, but theNNDescent
object constructed cannot be queried since there's no RP Forest (#103).Example of this failing
I think this would be fairly straightforward to make work. Here is a rough proof of concept:
Hacky example of this working
Use case
The use case is in single cell analysis, where we are storing the neighbor graph for our dataset (as we use it for multiple purposes) and wanting to be able speed up querying the dataset using reconstructed graphs.
Construction benchmark
Using fmnist data from ann benchmark
Questions
I'm not sure how much validation should/ could be done to make sure initialization values are valid. To some extent, I think it's also alright to say "user beware" here. Some thoughts on checks that could be made:
k
.Could values like
rp_forest
,_search_function
, be created on demand and cached via a property? That might make state easier to manage, and I believe all necessary parameters are already being stored in theNNDescent
object.This is more of an extension, but does the
init_graph
have to have the correct values ofk
? For instance, it would ostensibly speed up index creation even if you could only provide an initial graph for a smaller value of K.(ping @Koncopd)
The text was updated successfully, but these errors were encountered: