
Vp tree hash index #321

Closed · wants to merge 18 commits
Conversation

@wphicks (Contributor) commented Oct 16, 2017

This PR integrates @Purg's VP tree implementations as a hash index (issue #177). The VP tree index outperforms the existing Ball Tree index (in terms of nearest-neighbor query runtime) at almost any descriptor vector dimension and for almost any size tree, as shown below:

[Plots: near_neighbors_1-runtime-10000-dimension, near_neighbors_1-runtime-4096-n]

These plots compare runtime for searching BT and VP indices with a query that differs from a stored descriptor vector by only a single bit (a worst-case scenario for VP trees). Similar plots for random queries (or queries that exactly match a stored descriptor vector) show even greater performance gains. Nearest neighbor results for these tests were also verified against a linear search to ensure the VP tree index searches are accurate.

The performance story for building VP vs BT indices is a bit more complicated, but BTs seem to build faster than VPs at lower dimension (< 4096) or smaller numbers (< 10000) of stored descriptors:

[Plots: build-runtime-10000-dimension, build-runtime-4096-n]

There are parts of the VP tree implementations that could be further "numpyized," but at this point profiling shows that the biggest bottleneck is calculating the Hamming distance between Python long ints. I'm pretty sure that improving on that will require a C implementation of popcount or something similar.
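For illustration, a minimal sketch (not the SMQTK code itself) of the hot path described above: Hamming distance between two hash codes stored as arbitrary-precision Python ints, where `bin(...).count("1")` plays the role of a pure-Python popcount and a C popcount (or a library such as gmpy2) is the kind of faster replacement meant here.

```python
# Hedged sketch of the bottleneck: popcount of the XOR of two Python ints.
def hamming_distance(a, b):
    # bin() renders the XOR as a string of bits; counting "1"s is the
    # pure-Python popcount that profiling flags as slow.
    return bin(a ^ b).count("1")

# Example: 0b1011 and 0b0110 differ in three bit positions.
assert hamming_distance(0b1011, 0b0110) == 3
```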

@Purg (Member) left a comment

Just some quick initial comments. I'll have to get to your later commits another day.

Testing question: Did you get to testing the performance differences between the recursive, iterative, and iterative-heapq variants?

Additionally, my benchmark for performance has been on the 1, 10 and 100 million entry scale. Any chance you can run your performance plotting up to those scales?

:rtype: int
"""
pad = '\x00' * (v.dtype.itemsize - 1)
if (
@Purg (Member):

Some odd white-spacing here.

@wphicks (Contributor, Author):

Roger that. This was standard on one of my recent projects, so you'll probably find a bunch more of these. I'll run through and change them all.

        c = (c * 2L) + int(b)
    return c
else:
    return _bit_vector_to_int_fast(v)
@Purg (Member):

Can numba (the Python module) JIT-compile this function?

@wphicks (Contributor, Author):

It can, but in investigating this more closely, I noticed that my earlier benchmarking hadn't taken the numba compilation time into account. Redoing my tests and making sure numba had already compiled before the benchmarking started, I see that the numba-compiled for-loop outperforms _bit_vector_to_int_fast for this case (bit vectors < 64 bits). I'll update this accordingly.
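For reference, a minimal sketch of the kind of numba-jitted for-loop being compared here (hypothetical name, not necessarily the exact SMQTK function); it is only valid for bit vectors shorter than 64 bits, since numba works with fixed-width machine integers:

```python
import numpy as np
from numba import jit

@jit(nopython=True)
def bit_vector_to_int_loop(v):
    # Accumulate the bits of a boolean vector into one machine integer.
    c = 0
    for b in v:
        c <<= 1
        if b:
            c |= 1
    return c

# The first call triggers compilation; only time subsequent calls when benchmarking.
print(bit_vector_to_int_loop(np.array([1, 0, 1, 1], dtype=np.bool_)))  # 11
```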

integer >>= 1

return v
return _int_to_bit_vector_fast(integer, bits)
@Purg (Member):

Same as above, can numba jit compile this?

@wphicks (Contributor, Author):

No, you're absolutely right. numba actually imposes a performance penalty here. I also redid the benchmarking for this, and _int_to_bit_vector_fast still outperforms the numba-compiled original version, even with compilation time properly handled.
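As a rough illustration of the bytes-based direction being discussed, here is a hedged sketch of an int-to-bit-vector conversion (hypothetical helper name; the actual _int_to_bit_vector_fast may differ in detail):

```python
import binascii
import numpy as np

def int_to_bit_vector_sketch(integer, bits):
    # Hex-encode the integer, left-pad to whole bytes, then let numpy
    # unpack each byte into its 8 constituent bits and keep the last `bits`.
    h = ('%x' % integer).zfill(((bits + 7) // 8) * 2)
    as_bytes = np.frombuffer(binascii.unhexlify(h), dtype=np.uint8)
    return np.unpackbits(as_bytes)[-bits:].astype(bool)

print(int_to_bit_vector_sketch(11, 4))  # [ True False  True  True]
```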

@wphicks (Contributor, Author) commented Oct 20, 2017

Thanks very much for the feedback so far!

I did test the knn variants and didn't see a substantial performance difference among them. vp_knn_iterative was slightly faster than the recursive variant for vp trees, but vps_knn_recursive was slightly faster for vps trees. That being said, those tests were for 1000-10000 stored hashes, so I'll retest them at the scales you mentioned.

I'll also be happy to run those plots for up to 100 million entries. I'll post them here as soon as I've got them.

@Purg (Member) commented Oct 20, 2017

Awesome. New plots at the higher scales would be helpful. Thanks again for your work here.

I just wanted to clarify: so you found that numba jit compilation of _bit_vector_to_int_fast did improve performance but jit compilation of _int_to_bit_vector_fast did not? If your method is showing faster performance than the numba jit version (which I basically have never used because of the 64-bit limit), I would say just remove the numba-jit version and have a single function.

@wphicks (Contributor, Author) commented Oct 20, 2017

Ah, no, sorry, I wasn't very clear there. I thought you were asking about the wrapper functions that called the "_fast" versions and already had the jit decorator.

To clarify, compiling either _bit_vector_to_int_fast or _int_to_bit_vector_fast with numba doesn't offer any performance improvement on either one (since neither function does the sort of thing that numba is good at optimizing). _int_to_bit_vector_fast is already faster than the previously implemented solution, even when that solution is sped up with numba. So, there we can just use _int_to_bit_vector_fast for any size vector. I've kept int_to_bit_vector and int_to_bit_vector_large separate to avoid breaking anything else that may depend on those functions, but they could indeed just be merged.

_bit_vector_to_int_fast is not faster than the numba-compiled for-loop that was previously used for vectors of dimension <64. So, for dimension <64, we should continue to use that numba-compiled for-loop, but for higher dimension, _bit_vector_to_int_fast performs much better than the previous solution.

There's also the additional complication that _bit_vector_to_int_fast cannot operate on non-contiguous numpy arrays (since it works by directly accessing the array's underlying buffer). So, that's why there's the fallback here for dimension >64.
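To make the contiguity point concrete, here is a hedged sketch of the bytes-to-big-int idea (the real _bit_vector_to_int_fast reads the array's underlying buffer directly, which is where the contiguity requirement comes from; this sketch uses np.packbits just to illustrate the same conversion):

```python
import binascii
import numpy as np

def bit_vector_to_int_sketch(v):
    # Pack the boolean vector into bytes, read those bytes as one big
    # integer, then shift off the zero padding packbits adds to the last byte.
    packed = np.packbits(v.astype(np.uint8))
    return int(binascii.hexlify(packed.tobytes()), 16) >> ((-len(v)) % 8)

# Example: the 10-bit vector 1100000001 -> 769
print(bit_vector_to_int_sketch(np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 1], dtype=bool)))
```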

I'll push a commit in just a little bit that should show you exactly what I mean if that's still not clear.

@Purg (Member) commented Oct 20, 2017

Ok. I'm all for there not being separate numba/large-int variations on those functions; I've never used numba or those <64-bit flavors in practice. I'll probably leave that removal for a separate PR. If you didn't already have those optimized functions implemented, I would have said they should be a separate PR, since they're a separate feature/enhancement.

@wphicks (Contributor, Author) commented Oct 21, 2017

I'd be happy to strip those out into a separate PR if you'd like. That change is isolated to a single commit, so it would be very easy to rebase. If that's what you'd prefer, I could eliminate the specialized <64bit flavors as part of the same PR. Even in the worst-case scenario for _bit_vector_to_int_fast, it's only about 2-3 times slower than the numba-compiled version, so it would probably be worth it to keep things simple.

@wphicks (Contributor, Author) commented Oct 22, 2017

I ran into a bit of a hitch in testing. 1 million entries seems to be about the upper limit of what my hardware can handle before running out of memory for both BTs and VPTs of all flavors. Up to that limit, I can confirm that there isn't a huge performance difference among the knn functions for VPTs, but none of my other tests successfully completed.

I can push some cleaner benchmarks that someone with access to better hardware could run. Is there a benchmarking framework that you'd prefer me to base those on? I've used pytest-benchmark in the past, but I imagine you don't want to introduce a dependency on pytest since everything is already based on nose.

@Purg (Member) commented Oct 24, 2017

At this point, I'd say leave the added optimizations and create a separate PR/branch for the conversion function simplification (i.e. removal of <64bit version).

I haven't used pytest yet (I know, shame on me), so I don't have any experience with pytest-benchmark. However, if this is an optional script meant for developers/engineers, I think it's OK to add a README or comment stating that your tool requires one or more additional dependencies. How much RAM does your machine have? I may be able to run the benchmark here on my workstation (~50 GB RAM).

@wphicks (Contributor, Author) commented Oct 25, 2017

Sounds good, thanks! I'll clean up my benchmarks with pytest-benchmark and pass them your way (though it will probably have to wait till this weekend). The machine I'm running these on is a laptop with 8GB RAM, so I imagine you'll have more luck.

@wphicks (Contributor, Author) commented Oct 30, 2017

I've just pushed my benchmarking scripts here. If it seems useful to retain these and build upon them for benchmarking other features, I'll work that into a separate PR. For now, if you'd like to try running these on your machine, do the following:

pip install -r requirements.benchmark.txt
pytest python/smqtk/benchmarks/algorithms/nn_index/hash_index/test_hash_indices.py --benchmark-warmup --benchmark-autosave --benchmark-sort=mean --benchmark-group-by=group,param:sample_size,param:dimension

That will generate a json file in a .benchmarks directory. You can now run the following, where $FILE is the name of that generated json file:

python python/smqtk/benchmarks/plot_benchmarks.py $FILE

This should produce 16 png plots similar to what I've already posted, with trees of up to 100 million hashes.

Alternatively, since it now appears that I'll also have access to a machine that can run these in the next few weeks, I'll be happy to take care of it myself if you're not in any particular hurry on this.

@Purg (Member) commented Nov 2, 2017

Sorry for the slow response. I hope to take a pass at running this by the weekend.

@Purg (Member) commented Nov 2, 2017

This branch also seems to need a rebase.

@wphicks force-pushed the vp-tree-integration branch from dbc3e5b to 066cc28 on November 2, 2017 at 20:27
@Purg (Member) commented Nov 2, 2017

Right! I keep forgetting. Please update the pending release notes file: docs/release_notes/v0.7.0.

@wphicks (Contributor, Author) commented Nov 6, 2017

There we go! Thanks for the reminder.

On an unrelated note, whenever you get around to running those benchmarks, let's be sure to take a close look at the results for randomized query vectors. Over the weekend, I noticed a mistake (re-seeding when I shouldn't have) in my previous benchmarking script for that test. I don't know how much of a difference it will make, but the benchmarks I gave to you (based on pytest-benchmark) should be correct. So if there is disagreement, we should trust your results.

@Purg (Member) commented Nov 6, 2017

Trying out your benchmarking now. I'll let you know if I see anything out of the ordinary.

@Purg (Member) commented Nov 6, 2017

Ah, now that I'm looking at your benchmarking script I think we had a misunderstanding earlier. When you brought up dimensionality, I thought you were talking about base ML descriptor dimension, not the hash vector dimensionality. So far with LSH we have been using 256 or 512 bit hashes (i.e. 256/512 dimension boolean hash vectors).

@wphicks (Contributor, Author) commented Nov 7, 2017

Oh, whoops! My fault; looking back on it, I did say "descriptor vectors." My apologies!

Well, in that case, let me take another whack at this and investigate what's going on at the proper regime. I think my machine should be able to handle it, but if not, I'll pass an updated benchmarking script back to you.

@Purg (Member) commented Nov 7, 2017

No problem. I'm still running the benchmarking script from last night (sklearn's ball trees take forever to build at large sizes like 1M+). My first couple of runs to make sure I could run the benchmarking did not show very good results for vp-tree random-vector queries. I'll post what I generate from this full run, but I hypothesize that I'm seeing a brute-force degradation; that's only a hunch right now, based on past experience with my implementation.

@wphicks closed this Oct 29, 2020