Vp tree hash index #321
Conversation
Just some quick initial comments. I'll have to get to your later commits another day.
Testing question: Did you get to testing the performance differences between the recursive, iterative and iterative-heapq variants?
Additionally, my benchmark for performance has been on the 1, 10 and 100 million entry scale. Any chance you can run your performance plotting up to those scales?
python/smqtk/utils/bit_utils.py
    :rtype: int
    """
    pad = '\x00' * (v.dtype.itemsize - 1)
    if (
Some odd white-spacing here.
Roger that. This was standard on one of my recent projects, so you'll probably find a bunch more of these. I'll run through and change them all.
python/smqtk/utils/bit_utils.py
            c = (c * 2L) + int(b)
        return c
    else:
        return _bit_vector_to_int_fast(v)
Can numba (python module) jit compile this function?
It can, but in investigating this more closely, I noticed that my earlier benchmarking hadn't taken the numba compilation time into account. After redoing my tests and making sure numba had already finished compiling before the benchmarking started, I see that the numba-compiled for-loop outperforms _bit_vector_to_int_fast for this case (bit vectors <64 bits). I'll update this accordingly.
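(For reference, a rough sketch of the kind of numba-compiled for-loop I mean; this is illustrative, not the actual bit_utils code. The explicit warm-up call is what keeps compilation time out of the timings, and the fixed-width accumulator is why this path is only valid for vectors shorter than 64 bits.)

import numpy as np
from numba import jit

@jit(nopython=True)
def _bit_vector_to_int_loop(v):
    # Per-bit accumulation; only exact for <64-bit vectors, since the
    # accumulator is a fixed-width integer under numba.
    c = 0
    for i in range(v.shape[0]):
        c = c * 2 + v[i]
    return c

# Warm-up call so jit compilation happens before any timing starts.
_bit_vector_to_int_loop(np.array([1, 0, 1], dtype=np.uint8))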
            integer >>= 1

        return v
    return _int_to_bit_vector_fast(integer, bits)
Same as above, can numba jit compile this?
It can, but numba actually imposes a performance penalty here. I also redid the benchmarking for this, and _int_to_bit_vector_fast still outperforms the numba-compiled original version, even with compilation time properly handled.
Thanks very much for the feedback so far! I did test the knn variants and didn't see a substantial performance difference among them. vp_knn_iterative was slightly faster than the recursive variant for vp trees, but vps_knn_recursive was slightly faster for vps trees. That being said, those tests were for 1000-10000 stored hashes, so I'll retest them at the scales you mentioned. I'll also be happy to run those plots for up to 100 million entries. I'll post them here as soon as I've got them.
Awesome. New plots at the higher scales would be helpful. Thanks again for your work here. I just wanted to clarify: so you found that numba jit compilation of _bit_vector_to_int_fast helps for the <64-bit case, but compiling _int_to_bit_vector_fast does not?
Ah, no, sorry, I wasn't very clear there. I thought you were asking about the wrapper functions that called the "_fast" versions and had the jit decorator already.

To clarify: compiling either _bit_vector_to_int_fast or _int_to_bit_vector_fast with numba doesn't offer any performance improvement on either one (since neither function does the sort of thing that numba is good at optimizing).

_int_to_bit_vector_fast is already faster than the previously implemented solution, even when that solution is sped up with numba. So there we can just use _int_to_bit_vector_fast for any size vector. I've kept int_to_bit_vector and int_to_bit_vector_large separate to avoid breaking anything else that may depend on those functions, but they could indeed just be merged.

_bit_vector_to_int_fast is not faster than the numba-compiled for-loop that was previously used for vectors of dimension <64. So, for dimension <64, we should continue to use that numba-compiled for-loop, but for higher dimension, _bit_vector_to_int_fast performs much better than the previous solution. There's also the additional complication that _bit_vector_to_int_fast cannot operate on non-contiguous numpy arrays (since it works by directly accessing the array's underlying buffer), which is why there's the fallback here for dimension >64.

I'll push a commit in just a little bit that should show you exactly what I mean if that's still not clear.
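(To illustrate the dispatch I'm describing, here's a rough sketch. The function bodies are stand-ins, not the actual bit_utils implementations; in particular, the real fast path reads the array's raw buffer rather than going through np.packbits.)

import numpy as np

def _bit_vector_to_int_fast(v):
    # Byte-wise conversion. The real version reads v's underlying buffer
    # directly, which is why it requires a C-contiguous array.
    packed = np.packbits(v.astype(np.uint8))
    c = 0
    for byte in packed:
        c = (c << 8) | int(byte)
    # np.packbits zero-pads the final byte on the right; shift the pad off.
    return c >> (-v.size % 8)

def _bit_vector_to_int_loop(v):
    # Per-bit loop; this is the variant worth numba-compiling for <64 bits.
    c = 0
    for b in v:
        c = c * 2 + int(b)
    return c

def bit_vector_to_int(v):
    # <64 bits: the (numba-compiled) loop wins. Larger contiguous vectors:
    # the byte-wise fast path wins. Non-contiguous arrays can't use a path
    # that reads the raw buffer, hence the loop fallback.
    if v.size < 64 or not v.flags['C_CONTIGUOUS']:
        return _bit_vector_to_int_loop(v)
    return _bit_vector_to_int_fast(v)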
OK. I'm all for there not being separate numba/large-int variations on those functions. I've never in practice used numba or those <64-bit flavors. I will probably leave that removal for a separate PR, though. If you didn't already have those optimized functions implemented, I would have said this should be a separate PR, because it's a separate feature/enhancement.
I'd be happy to strip those out into a separate PR if you'd like. That change is isolated to a single commit, so it would be very easy to rebase. If that's what you'd prefer, I could eliminate the specialized <64-bit flavors as part of the same PR. Even in the worst-case scenario for _bit_vector_to_int_fast, it's only about 2-3 times slower than the numba-compiled version, so it would probably be worth it to keep things simple.
I ran into a bit of a hitch in testing. 1 million entries seems to be about the upper limit of what my hardware can handle before running out of memory for both BTs and VPTs of all flavors. Up to that limit, I can confirm that there isn't a huge performance difference among the knn functions for VPTs, but none of my other tests successfully completed. I can push some cleaner benchmarks that someone with access to better hardware could run. Is there a benchmarking framework that you'd prefer me to base those on? I've used pytest-benchmark in the past, but I imagine you don't want to introduce a dependency on pytest since everything is already based on nose.
At this point, I'd say leave the added optimizations in and create a separate PR/branch for the conversion function simplification (i.e. removal of the <64-bit version). I haven't used pytest yet (I know, shame on me) and so don't have any experience with pytest-benchmark. However, if this is an optional script to be run by developers/engineers, I think it's OK to add a README or comment stating that your tool requires one or more additional dependencies. How much RAM does your machine have? I may be able to run the benchmark here on my workstation (~50GB RAM).
Sounds good, thanks! I'll clean up my benchmarks with pytest-benchmark and pass them your way (though it will probably have to wait till this weekend). The machine I'm running these on is a laptop with 8GB RAM, so I imagine you'll have more luck.
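(As a rough idea of the shape these will take; the names and the linear-scan stand-in below are illustrative, not the actual scripts, which would build and query the VPT/BT hash indices instead. Requires pytest-benchmark to be installed for the benchmark fixture.)

import numpy as np

def linear_nn(stored, query, n=10):
    # Trivial linear-scan stand-in so the example is self-contained; the
    # real benchmark would call the index's nearest-neighbor query instead.
    dists = np.count_nonzero(stored != query, axis=1)
    return np.argsort(dists)[:n]

def test_nn_query_benchmark(benchmark):
    # 'benchmark' is the pytest-benchmark fixture; only the call passed to
    # it is timed, so data/index setup stays outside the measurement.
    rng = np.random.RandomState(0)
    stored = rng.randint(0, 2, size=(10000, 256)).astype(bool)
    query = rng.randint(0, 2, size=256).astype(bool)
    result = benchmark(linear_nn, stored, query, 10)
    assert len(result) == 10

Running pytest with something like --benchmark-json=out.json then dumps the timings to a json file that can be plotted separately.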
I've just pushed my benchmarking scripts here. If it seems useful to retain these and build upon them for benchmarking other features, I'll work that into a separate PR. For now, if you'd like to try running these on your machine, do the following:
That will generate a json file in a
This should produce 16 png plots similar to what I've already posted, with trees of up to 100 million hashes. Alternatively, since it now appears that I'll also have access to a machine that can run these in the next few weeks, I'll be happy to take care of it myself if you're not in any particular hurry on this.
Sorry for the slow response. I hope to take a pass at running this by the weekend.
This branch also seems to need a rebase.
Properly handle mu selection via partitioning of a single-item list
Provide methods to store VP trees as and reconstruct them from numpy arrays
Use 63 rather than 64 bit test vectors for conversion between bit vectors and integers
Force-pushed from dbc3e5b to 066cc28.
Right! I keep forgetting. Please update the pending release notes file:
There we go! Thanks for the reminder. On an unrelated note, whenever you get around to running those benchmarks, let's be sure to take a close look at the results for randomized query vectors. Over the weekend, I noticed a mistake (re-seeding when I shouldn't have) in my previous benchmarking script for that test. I don't know how much of a difference it will make, but the benchmarks I gave you (based on pytest-benchmark) should be correct, so if there is any disagreement, we should trust your results.
Trying out your benchmarking now. I'll let you know if I see anything out of the ordinary.
Ah, now that I'm looking at your benchmarking script, I think we had a misunderstanding earlier. When you brought up dimensionality, I thought you were talking about base ML descriptor dimension, not the hash vector dimensionality. So far with LSH we have been using 256 or 512 bit hashes (i.e. 256/512 dimension boolean hash vectors).
Oh whoops! My fault: looking back on it, I did say "descriptor vectors." My apologies! Well, in that case, let me take another whack at this and investigate what's going on in the proper regime. I think my machine should be able to handle it, but if not, I'll pass an updated benchmarking script back to you.
No problem. I'm still running the benchmarking script from last night (sklearn's ball trees take forever to build at large sizes, 1M+ entries). My first couple of runs, just to make sure I could run the benchmarking at all, did not show very good results for vp-tree random-vector queries. I'll post what I generate from this full run, but I hypothesize that I'm seeing a brute-force degradation; that's only a hunch right now, based on past experience with my implementation.
This PR integrates @Purg's VP tree implementations as a hash index (issue #177), which performs better (in terms of nearest-neighbor query runtime) than the existing Ball Tree index at almost any descriptor vector dimension and for almost any tree size, as shown below:
These plots compare runtime for searching BT and VP indices with a query that differs from a stored descriptor vector by only a single bit (a worst-case scenario for VP trees). Similar plots for random queries (or queries that exactly match a stored descriptor vector) show even greater performance gains. Nearest neighbor results for these tests were also verified against a linear search to ensure the VP tree index searches are accurate.
The performance story for building VP vs. BT indices is a bit more complicated, but BTs seem to build faster than VPTs at lower dimensionality (< 4096) or with fewer stored descriptors (< 10000):
There are parts of the VP tree implementations that could be further "numpyized," but at this point, profiling shows that the biggest bottleneck is calculating hamming distance between Python long ints. I'm pretty sure that improving on that will require a C implementation of popcount or something similar.
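(For concreteness, the hot operation is essentially the following; a C-level alternative would be something along the lines of gmpy2.popcount, or a popcount over the packed hash bytes.)

def hamming_distance(a, b):
    # XOR the two (arbitrary-precision) hash ints, then count the set bits.
    # bin() builds a new string on every call, which is the cost profiling
    # keeps pointing at; a C popcount would avoid that entirely.
    return bin(a ^ b).count('1')

# e.g. 0b1011 and 0b0010 differ in two bit positions
assert hamming_distance(0b1011, 0b0010) == 2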