Poor performance on ann-benchmarks #78
It seems a recent round of ann-benchmarks has seen pynndescent fail quite badly. It is not so much a performance degradation as being somewhat broken. I'm looking into it, but any help would be appreciated.

Comments
Have you got a link to where it's failing?
The latest round has pynndescent not looking good, while prior rounds had it doing reasonably well. Not sure about the dates on all this, except that the most recent run was from June 10 of this year.
I have identified a lot of places for improvement and am working on it now. I think some of the major issues were that the benchmark pulled directly from master and may have gotten a slightly broken build, plus memory usage issues (which have since been fixed). I will try to get a pull request in for the various improvements soon.
I have recently put in a bit more work to a C++ version of nearest neighbor descent, for eventual use with […].

What I'm most interested in is the high memory setup. What is the purpose of the […]? I think there is an opportunity to re-use the […]. But why are only vertices that successfully make it into […]? Also, I see that […]
Thanks for all the questions. The […]

As for the lack of rho -- there were actually some (subtle) bugs in the original […].

At this point I am getting pretty good parallelism (not ideal, but decent) for both the trees and the nndescent. Ultimately a lot of the work is really going into the search phase (which is probably of less interest to you, as UMAP doesn't use it, except for the transform). This has actually gone surprisingly well, and the ann-benchmark results I see now are quite strong (at least on my local machine).

The one remaining big holdout/difference is the distance computations themselves. It seems that HNSW and friends use an AVX2 (or SSE) specific distance computation that takes advantage of 256-bit wide registers for SIMD parallelism within the actual distance computation. I have so far not been able to coax numba into doing the same. Since you are in straight C++ this is certainly something you can (and should) take advantage of. I would point you to https://github.com/nmslib/hnswlib/blob/master/hnswlib/space_l2.h as a reference for what they are doing. In my current experiments pynndescent actually outperforms hnswlib for some datasets when I turn the SIMD distance computation off, so this really is the remaining difference.
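For reference, a minimal sketch of the kind of vectorized inner loop hnswlib's space_l2.h uses for squared Euclidean distance. The function name and signature here are illustrative, not hnswlib's actual API, and it assumes the dimension is a multiple of 8; a real implementation would add a scalar tail loop and an FMA variant.

```cpp
#include <immintrin.h>
#include <cstddef>

// Squared L2 distance using 256-bit AVX registers: 8 floats per iteration.
// Assumes dim is a multiple of 8; compile with -mavx2 (or -mavx).
float l2_sqr_avx(const float *x, const float *y, std::size_t dim) {
    __m256 acc = _mm256_setzero_ps();
    for (std::size_t i = 0; i < dim; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        __m256 diff = _mm256_sub_ps(vx, vy);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(diff, diff));
    }
    // Horizontal sum of the 8 accumulator lanes.
    alignas(32) float buf[8];
    _mm256_store_ps(buf, acc);
    return buf[0] + buf[1] + buf[2] + buf[3] +
           buf[4] + buf[5] + buf[6] + buf[7];
}
```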
Thank you for all these details, especially about the memory use. From the sound of things, I may have to abandon the set-based high memory approach entirely for now.

In rnndescent, I have split the R-interfacing code from the main nearest neighbor routines, and I would like to eventually have the C++ code be a standalone project. So actually all the work going into pynndescent is of (eventual) interest to me, although improving uwot's nearest neighbor calculations is the highest priority. Ironically, the reason there are those […]

In terms of parallelism, copying your approach has definitely been faster than anything I was trying. One thing I've noticed is that although the local join approach is faster than a straight neighbor-of-a-neighbor search, it makes writing to shared data extremely difficult, because it's effectively impossible to organize the graph updates into thread-safe windows. So there are always bottlenecks where you have to be single-threaded, at least in C++/RcppParallel, where the cost of thread-related code and of locking data structures for safe writes doesn't save any time. The use of a shared set-like structure compounds the issue. I may have to experiment with creating a mutex pool to reduce locking, but I am bound to do this badly. This is one of the times that giving up and moving to Python and Numba looks very tempting 😄

One final strategy I had for parallelism with a high memory mode was abandoning the local join approach for neighbor-of-a-neighbor search (while retaining the incremental search strategy). Although you do more and more indirect array reading, you are always writing to the same part of the graph, so you can be multi-threaded without locking, and you can get some advantages of a high-memory mode by storing a set of the currently seen indices in the neighbors of neighbors. To control memory usage this can be made local to the current iteration, so that there would never be more than […]
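A minimal sketch of the mutex-pool (lock striping) idea mentioned above; the class name, the pool size, and the heap_push helper are all illustrative assumptions, not anything taken from rnndescent or pynndescent:

```cpp
#include <mutex>
#include <vector>
#include <cstddef>

// Lock striping: guard many graph rows with a small fixed pool of
// mutexes, so concurrent local-join updates to different rows rarely
// contend. A pool size of 64 is an arbitrary illustrative choice.
class MutexPool {
public:
    explicit MutexPool(std::size_t n_locks = 64) : locks_(n_locks) {}
    std::mutex &for_row(std::size_t row) {
        return locks_[row % locks_.size()];
    }
private:
    std::vector<std::mutex> locks_;
};

// Hypothetical use inside a local-join update:
// {
//     std::lock_guard<std::mutex> guard(pool.for_row(p));
//     heap_push(graph, p, q, dist);  // assumed candidate-heap update
// }
```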
I may have spoken too soon; it is possible numba is magically doing all the AVX craziness for me, and my bottlenecks are elsewhere. In the meantime, if you are going to explore the alternative approach (instead of local join), you might want to look at https://arxiv.org/abs/1804.03032, which does essentially something like that. I can't speak for the results quoted in the paper, but I do believe you can get something in a pretty decent ballpark with that sort of approach.
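And a rough sketch of the lock-free neighbor-of-a-neighbor expansion described in the previous comment, done with double buffering: reads go against a frozen snapshot of the graph and each thread writes only to its own rows, so no locking is needed. All names here are assumptions for illustration, and the candidate heap is left abstract:

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

using Graph = std::vector<std::vector<std::size_t>>;

// Each thread handles a disjoint range [begin, end) of points: 'current'
// is read-only during the iteration, and writes go only into this
// thread's rows of 'candidates', so no locks are required.
void expand_range(const Graph &current, Graph &candidates,
                  std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) {
        // Seen-set local to this iteration: bounds extra memory to
        // roughly k*k ids per point instead of a global shared set.
        std::unordered_set<std::size_t> seen(current[i].begin(),
                                             current[i].end());
        for (std::size_t j : current[i]) {        // neighbors
            for (std::size_t k : current[j]) {    // neighbors of neighbors
                if (k != i && seen.insert(k).second) {
                    candidates[i].push_back(k);   // writes only to row i
                }
            }
        }
    }
}
```

Distances to the collected candidates would then be computed and the best merged back into row i's neighbor heap, again touching only thread-owned data.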