
Vp tree hash index #321

Closed · wants to merge 18 commits
Conversation

@wphicks (Contributor) commented Oct 16, 2017

This PR integrates @Purg's VP tree implementations as a hash index (issue #177). The VP tree index outperforms the existing Ball Tree index (in terms of nearest-neighbor query runtime) at almost any descriptor vector dimension and for almost any size tree, as shown below:

[Plots: near_neighbors_1-runtime-10000-dimension, near_neighbors_1-runtime-4096-n]

These plots compare runtime for searching BT and VP indices with a query that differs from a stored descriptor vector by only a single bit (a worst-case scenario for VP trees). Similar plots for random queries (or queries that exactly match a stored descriptor vector) show even greater performance gains. Nearest neighbor results for these tests were also verified against a linear search to ensure the VP tree index searches are accurate.

The performance story for building VP vs BT indices is a bit more complicated, but BTs seem to build faster than VPs at lower dimension (< 4096) or smaller numbers (< 10000) of stored descriptors:

[Plots: build-runtime-10000-dimension, build-runtime-4096-n]

There are parts of the VP tree implementations that could be further "numpyized," but at this point profiling shows that the biggest bottleneck is calculating the Hamming distance between Python long ints. I'm pretty sure that improving on that will require a C implementation of popcount or something similar.
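For illustration, a minimal sketch (not the SMQTK code itself) of the hot path described above: Hamming distance between two hash codes stored as arbitrary-precision Python ints, where `bin(...).count("1")` plays the role of a pure-Python popcount and a C popcount (or a library such as gmpy2) is the kind of faster replacement meant here.

```python
# Hedged sketch of the bottleneck: popcount of the XOR of two Python ints.
def hamming_distance(a, b):
    # bin() renders the XOR as a string of bits; counting "1"s is the
    # pure-Python popcount that profiling flags as slow.
    return bin(a ^ b).count("1")

# Example: 0b1011 and 0b0110 differ in three bit positions.
assert hamming_distance(0b1011, 0b0110) == 3
```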

@Purg (Member) left a comment

Just some quick initial comments. I'll have to get to your later commits another day.

Testing question: Did you get to testing the performance differences between the recursive, iterative, and iterative-heapq variants?

Additionally, my benchmark for performance has been on the 1, 10 and 100 million entry scale. Any chance you can run your performance plotting up to those scales?

:rtype: int
"""
pad = '\x00' * (v.dtype.itemsize - 1)
if (
@Purg (Member):

Some odd white-spacing here.

@wphicks (Contributor, Author):

Roger that. This was standard on one of my recent projects, so you'll probably find a bunch more of these. I'll run through and change them all.

        c = (c * 2L) + int(b)
    return c
else:
    return _bit_vector_to_int_fast(v)
@Purg (Member):

Can numba (the Python module) JIT-compile this function?

@wphicks (Contributor, Author):

It can, but in investigating this more closely, I noticed that my earlier benchmarking hadn't taken the numba compilation time into account. Redoing my tests and making sure numba had already compiled before the benchmarking started, I see that the numba-compiled for-loop outperforms _bit_vector_to_int_fast for this case (bit vectors < 64 bits). I'll update this accordingly.
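For reference, a minimal sketch of the kind of numba-jitted for-loop being compared here (hypothetical name, not necessarily the exact SMQTK function); it is only valid for bit vectors shorter than 64 bits, since numba works with fixed-width machine integers:

```python
import numpy as np
from numba import jit

@jit(nopython=True)
def bit_vector_to_int_loop(v):
    # Accumulate the bits of a boolean vector into one machine integer.
    c = 0
    for b in v:
        c <<= 1
        if b:
            c |= 1
    return c

# The first call triggers compilation; only time subsequent calls when benchmarking.
print(bit_vector_to_int_loop(np.array([1, 0, 1, 1], dtype=np.bool_)))  # 11
```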

integer >>= 1

return v
return _int_to_bit_vector_fast(integer, bits)
@Purg (Member):

Same as above, can numba jit compile this?

@wphicks (Contributor, Author):

No, you're absolutely right. numba actually imposes a performance penalty here. I also redid the benchmarking for this, and _int_to_bit_vector_fast still outperforms the numba-compiled original version, even with compilation time properly handled.
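As a rough illustration of the bytes-based direction being discussed, here is a hedged sketch of an int-to-bit-vector conversion (hypothetical helper name; the actual _int_to_bit_vector_fast may differ in detail):

```python
import binascii
import numpy as np

def int_to_bit_vector_sketch(integer, bits):
    # Hex-encode the integer, left-pad to whole bytes, then let numpy
    # unpack each byte into its 8 constituent bits and keep the last `bits`.
    h = ('%x' % integer).zfill(((bits + 7) // 8) * 2)
    as_bytes = np.frombuffer(binascii.unhexlify(h), dtype=np.uint8)
    return np.unpackbits(as_bytes)[-bits:].astype(bool)

print(int_to_bit_vector_sketch(11, 4))  # [ True False  True  True]
```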

@wphicks (Contributor, Author) commented Oct 20, 2017

Thanks very much for the feedback so far!

I did test the knn variants and didn't see a substantial performance difference among them. vp_knn_iterative was slightly faster than the recursive variant for vp trees, but vps_knn_recursive was slightly faster for vps trees. That being said, those tests were for 1000-10000 stored hashes, so I'll retest them at the scales you mentioned.

I'll also be happy to run those plots for up to 100 million entries. I'll post them here as soon as I've got them.

@Purg (Member) commented Oct 20, 2017

Awesome. New plots at the higher scales would be helpful. Thanks again for your work here.

I just wanted to clarify: so you found that numba jit compilation of _bit_vector_to_int_fast did improve performance but jit compilation of _int_to_bit_vector_fast did not? If your method is showing faster performance than the numba jit version (which I basically have never used because of the 64-bit limit), I would say just remove the numba-jit version and have a single function.

@wphicks (Contributor, Author) commented Oct 20, 2017

Ah, no, sorry, I wasn't very clear there. I thought you were asking about the wrapper functions that called the "_fast" versions and already had the jit decorator.

To clarify, compiling either _bit_vector_to_int_fast or _int_to_bit_vector_fast with numba doesn't offer any performance improvement on either one (since neither function does the sort of thing that numba is good at optimizing). _int_to_bit_vector_fast is already faster than the previously implemented solution, even when that solution is sped up with numba. So, there we can just use _int_to_bit_vector_fast for any size vector. I've kept int_to_bit_vector and int_to_bit_vector_large separate to avoid breaking anything else that may depend on those functions, but they could indeed just be merged.

_bit_vector_to_int_fast is not faster than the numba-compiled for-loop that was previously used for vectors of dimension <64. So, for dimension <64, we should continue to use that numba-compiled for-loop, but for higher dimension, _bit_vector_to_int_fast performs much better than the previous solution.

There's also the additional complication that _bit_vector_to_int_fast cannot operate on non-contiguous numpy arrays (since it works by directly accessing the array's underlying buffer). So, that's why there's the fallback here for dimension >64.
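To make the contiguity point concrete, here is a hedged sketch of the bytes-to-big-int idea (the real _bit_vector_to_int_fast reads the array's underlying buffer directly, which is where the contiguity requirement comes from; this sketch uses np.packbits just to illustrate the same conversion):

```python
import binascii
import numpy as np

def bit_vector_to_int_sketch(v):
    # Pack the boolean vector into bytes, read those bytes as one big
    # integer, then shift off the zero padding packbits adds to the last byte.
    packed = np.packbits(v.astype(np.uint8))
    return int(binascii.hexlify(packed.tobytes()), 16) >> ((-len(v)) % 8)

# Example: the 10-bit vector 1100000001 -> 769
print(bit_vector_to_int_sketch(np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 1], dtype=bool)))
```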

I'll push a commit in just a little bit that should show you exactly what I mean if that's still not clear.

@Purg (Member) commented Oct 20, 2017

Ok. I'm all for there not being separate numba/large-int variations on those functions; I've never used numba or those <64-bit flavors in practice. I'll probably leave that removal for a separate PR. If you didn't already have those optimized functions implemented, I would have said they should be a separate PR, since they're a separate feature/enhancement.

@wphicks (Contributor, Author) commented Oct 21, 2017

I'd be happy to strip those out into a separate PR if you'd like. That change is isolated to a single commit, so it would be very easy to rebase. If that's what you'd prefer, I could eliminate the specialized <64bit flavors as part of the same PR. Even in the worst-case scenario for _bit_vector_to_int_fast, it's only about 2-3 times slower than the numba-compiled version, so it would probably be worth it to keep things simple.

@wphicks (Contributor, Author) commented Oct 22, 2017

I ran into a bit of a hitch in testing. 1 million entries seems to be about the upper limit of what my hardware can handle before running out of memory for both BTs and VPTs of all flavors. Up to that limit, I can confirm that there isn't a huge performance difference among the knn functions for VPTs, but none of my other tests successfully completed.

I can push some cleaner benchmarks that someone with access to better hardware could run. Is there a benchmarking framework that you'd prefer me to base those on? I've used pytest-benchmark in the past, but I imagine you don't want to introduce a dependency on pytest since everything is already based on nose.

@Purg (Member) commented Oct 24, 2017

At this point, I'd say leave the added optimizations and create a separate PR/branch for the conversion function simplification (i.e. removal of <64bit version).

I haven't used pytest yet (I know, shame on me), so I don't have any experience with pytest-benchmark. However, if this is an optional script meant for developers/engineers, I think it's OK to add a README or comment stating that your tool requires one or more additional dependencies. How much RAM does your machine have? I may be able to run the benchmark here on my workstation (~50 GB RAM).

@wphicks (Contributor, Author) commented Oct 25, 2017

Sounds good, thanks! I'll clean up my benchmarks with pytest-benchmark and pass them your way (though it will probably have to wait till this weekend). The machine I'm running these on is a laptop with 8GB RAM, so I imagine you'll have more luck.

@wphicks (Contributor, Author) commented Oct 30, 2017

I've just pushed my benchmarking scripts here. If it seems useful to retain these and build upon them for benchmarking other features, I'll work that into a separate PR. For now, if you'd like to try running these on your machine, do the following:

pip install -r requirements.benchmark.txt
pytest python/smqtk/benchmarks/algorithms/nn_index/hash_index/test_hash_indices.py --benchmark-warmup --benchmark-autosave --benchmark-sort=mean --benchmark-group-by=group,param:sample_size,param:dimension

That will generate a json file in a .benchmarks directory. You can now run the following, where $FILE is the name of that generated json file:

python python/smqtk/benchmarks/plot_benchmarks.py $FILE

This should produce 16 png plots similar to what I've already posted, with trees of up to 100 million hashes.

Alternatively, since it now appears that I'll also have access to a machine that can run these in the next few weeks, I'll be happy to take care of it myself if you're not in any particular hurry on this.

@Purg (Member) commented Nov 2, 2017

Sorry for the slow response. I hope to take a pass at running this by the weekend.

@Purg (Member) commented Nov 2, 2017

This branch also seems to need a rebase.

@wphicks force-pushed the vp-tree-integration branch from dbc3e5b to 066cc28 on November 2, 2017 at 20:27
@Purg (Member) commented Nov 2, 2017

Right! I keep forgetting. Please update the pending release notes file: docs/release_notes/v0.7.0.

@wphicks (Contributor, Author) commented Nov 6, 2017

There we go! Thanks for the reminder.

On an unrelated note, whenever you get around to running those benchmarks, let's be sure to take a close look at the results for randomized query vectors. Over the weekend, I noticed a mistake (re-seeding when I shouldn't have) in my previous benchmarking script for that test. I don't know how much of a difference it will make, but the benchmarks I gave to you (based on pytest-benchmark) should be correct. So if there is disagreement, we should trust your results.

@Purg (Member) commented Nov 6, 2017

Trying out your benchmarking now. I'll let you know if I see anything out of the ordinary.

@Purg (Member) commented Nov 6, 2017

Ah, now that I'm looking at your benchmarking script I think we had a misunderstanding earlier. When you brought up dimensionality, I thought you were talking about base ML descriptor dimension, not the hash vector dimensionality. So far with LSH we have been using 256 or 512 bit hashes (i.e. 256/512 dimension boolean hash vectors).

@wphicks (Contributor, Author) commented Nov 7, 2017

Oh, whoops! My fault; looking back on it, I did say "descriptor vectors." My apologies!

Well, in that case, let me take another whack at this and investigate what's going on at the proper regime. I think my machine should be able to handle it, but if not, I'll pass an updated benchmarking script back to you.

@Purg (Member) commented Nov 7, 2017

No problem. I'm still running the benchmarking script from last night (sklearn's ball trees take forever to build at large sizes like 1M+). My first couple of runs to make sure I could run the benchmarking did not show very good results for vp-tree random-vector queries. I'll post what I generate from this full run, but I hypothesize that I'm seeing a brute-force degradation; that's only a hunch right now, based on past experience with my implementation.

@wphicks closed this Oct 29, 2020