Poor performance on ann-benchmarks #78
It seems a recent round of ann-benchmarks has seen pynndescent fail quite badly. It is not so much a performance degradation as being somewhat broken. I'm looking into it, but any help would be appreciated.

Comments
Have you got a link to where it's failing?
The latest round has pynndescent not looking good, while prior rounds had it doing reasonably well. Not sure about the dates on all this, except that the most recent run was from June 10 of this year.
I have identified a lot of places for improvement and am working on it now. I think some of the major issues were that the benchmark pulled directly from master and may have gotten a slightly broken build, plus memory usage issues (which have since been fixed). I will try to get a pull request in for the various improvements soon.
I have recently put in a bit more work to a C++ version of nearest neighbor descent, for eventual use with […].

What I'm most interested in is the high memory setup. What is the purpose of the […]? I think there is an opportunity to re-use the […]. But why are only vertices that successfully make it into […]? Also, I see that […]
Thanks for all the questions. The […]

As for the lack of rho -- there were actually some (subtle) bugs in the original […].

At this point I am getting pretty good parallelism (not ideal, but decent) for both the trees and the nndescent. Ultimately a lot of the work is really going into the search phase (which is probably of less interest to you, as UMAP doesn't use it, except for the transform). This has actually gone surprisingly well, and the ann-benchmark results I see now are quite strong (at least on my local machine).

The one remaining big holdout/difference is the distance computations themselves. It seems that HNSW and friends use an AVX2 (or SSE) specific distance computation that takes advantage of 256-bit wide registers for SIMD parallelism within the actual distance computation. I have so far not been able to coax numba into doing the same. Since you are in straight C++ this is certainly something you can (and should) take advantage of. I would point you to https://github.com/nmslib/hnswlib/blob/master/hnswlib/space_l2.h as a reference for what they are doing. In my current experiments pynndescent actually outperforms hnswlib for some datasets when I turn the SIMD distance computation off, so this really is the remaining difference.
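For reference, a minimal sketch of the kind of vectorized inner loop hnswlib's space_l2.h uses for squared Euclidean distance. The function name and signature here are illustrative, not hnswlib's actual API, and it assumes the dimension is a multiple of 8; a real implementation would add a scalar tail loop and an FMA variant.

```cpp
#include <immintrin.h>
#include <cstddef>

// Squared L2 distance using 256-bit AVX registers: 8 floats per iteration.
// Assumes dim is a multiple of 8; compile with -mavx2 (or -mavx).
float l2_sqr_avx(const float *x, const float *y, std::size_t dim) {
    __m256 acc = _mm256_setzero_ps();
    for (std::size_t i = 0; i < dim; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        __m256 diff = _mm256_sub_ps(vx, vy);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(diff, diff));
    }
    // Horizontal sum of the 8 accumulator lanes.
    alignas(32) float buf[8];
    _mm256_store_ps(buf, acc);
    return buf[0] + buf[1] + buf[2] + buf[3] +
           buf[4] + buf[5] + buf[6] + buf[7];
}
```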
Thank you for all these details, especially about the memory use. From the sound of things, I may have to abandon the set-based high memory approach entirely for now.

In rnndescent, I have split the R-interfacing code from the main nearest neighbor routines, and I would like to eventually have the C++ code be a standalone project. So actually all the work going into pynndescent is of (eventual) interest to me, although improving uwot's nearest neighbor calculations is the highest priority. Ironically, the reason there are those […]

In terms of parallelism, copying your approach has definitely been faster than anything I was trying. One thing I've noticed is that although the local join approach is faster than a straight neighbor-of-a-neighbor search, it makes writing to shared data extremely difficult, because it's effectively impossible to organize the graph updates into thread-safe windows. So there are always bottlenecks where you have to be single-threaded, at least in C++/RcppParallel, where the cost of thread-related code and of locking data structures for safe writes doesn't save any time. The use of a shared set-like structure compounds the issue. I may have to experiment with creating a mutex pool to reduce locking, but I am bound to do this badly. This is one of the times that giving up and moving to Python and Numba looks very tempting 😄

One final strategy I had for parallelism with a high memory mode was abandoning the local join approach for neighbor-of-a-neighbor search (while retaining the incremental search strategy). Although you do more and more indirect array reading, you are always writing to the same part of the graph, so you can be multi-threaded without locking, and you can get some advantages of a high-memory mode by storing a set of the currently seen indices in the neighbors of neighbors. To control memory usage this can be made local to the current iteration, so that there would never be more than […]
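A minimal sketch of the mutex-pool (lock striping) idea mentioned above; the class name, the pool size, and the heap_push helper are all illustrative assumptions, not anything taken from rnndescent or pynndescent:

```cpp
#include <mutex>
#include <vector>
#include <cstddef>

// Lock striping: guard many graph rows with a small fixed pool of
// mutexes, so concurrent local-join updates to different rows rarely
// contend. A pool size of 64 is an arbitrary illustrative choice.
class MutexPool {
public:
    explicit MutexPool(std::size_t n_locks = 64) : locks_(n_locks) {}
    std::mutex &for_row(std::size_t row) {
        return locks_[row % locks_.size()];
    }
private:
    std::vector<std::mutex> locks_;
};

// Hypothetical use inside a local-join update:
// {
//     std::lock_guard<std::mutex> guard(pool.for_row(p));
//     heap_push(graph, p, q, dist);  // assumed candidate-heap update
// }
```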
I may have spoken too soon; it is possible numba is magically doing all the AVX craziness for me, and my bottlenecks are elsewhere. In the meantime, if you are going to explore the alternative approach (instead of local join), you might want to look at https://arxiv.org/abs/1804.03032, which does essentially something like that. I can't speak for the results quoted in the paper, but I do believe you can get something in a pretty decent ballpark with that sort of approach.
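And a rough sketch of the lock-free neighbor-of-a-neighbor expansion described in the previous comment, done with double buffering: reads go against a frozen snapshot of the graph and each thread writes only to its own rows, so no locking is needed. All names here are assumptions for illustration, and the candidate heap is left abstract:

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

using Graph = std::vector<std::vector<std::size_t>>;

// Each thread handles a disjoint range [begin, end) of points: 'current'
// is read-only during the iteration, and writes go only into this
// thread's rows of 'candidates', so no locks are required.
void expand_range(const Graph &current, Graph &candidates,
                  std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) {
        // Seen-set local to this iteration: bounds extra memory to
        // roughly k*k ids per point instead of a global shared set.
        std::unordered_set<std::size_t> seen(current[i].begin(),
                                             current[i].end());
        for (std::size_t j : current[i]) {        // neighbors
            for (std::size_t k : current[j]) {    // neighbors of neighbors
                if (k != i && seen.insert(k).second) {
                    candidates[i].push_back(k);   // writes only to row i
                }
            }
        }
    }
}
```

Distances to the collected candidates would then be computed and the best merged back into row i's neighbor heap, again touching only thread-owned data.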