-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix incorrect results for large numbers of points #135
Conversation
- add `index_bits` optional argument to KDTree - default value is 32: preserves existing behaviour & performance - user can specify 64 instead to use 64-bit integers - this ensures correct results when n_points * k > 2^32 - uses approx 50% more RAM than 32 bit option - resolves storpipfugl#38 - in 32-bit int mode add checks to avoid returning incorrect results - KDTree checks that `n_points < 2^32` - KDTree.query checks that `n_points * k < 2^32` - update tests - parametrize all existing tests to test 32-bit and 64-bit int mode - add a query test with n_points * k too large - didn't add a test with n_points too large as it would require 16GB RAM to run
Here are the benchmarks from #121 using this PR with index_bits = 32 and 64 (running locally, not on colab): Full notebook output here: https://gist.github.com/lkeegan/d9b5174298b6321dff90e8b05cb09ed2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, really nice job. This is way less gross than I thought it was going to be.
So the largest issue for me is the index_bits
being an integer. I suppose this leaves us with the possibility of allowing 128 or even 16 bit (to save space I guess) integer types in the future. But it seems odd to have an arbitrary integer for this value when it is used as a boolean.
My other thought was "oh why can't you determine it automatically", but now I see there is the self.n > UINT32_MAX
case in __init__
and the self.n * k > UINT32_MAX
case in the query
. Is it possible these are two separate cases in the C code? Like, would it theoretically work to have the tree be constructed as 32-bit, but due to a high k
value the resulting query operation would need to be 64-bit and the necessary C types would "just work"?
I know people don't like them but this kind of seems like the mako template should be rewritten in C++ and use templates. I'm not suggesting you throw this work away, I just wanted to mention it.
Happy to change this to a boolean if preferred (assuming another option is unlikely to be added - as you probably want avoid having to change the API if another type is added later)
This is a nice idea, initially I thought it wouldn't work without adding a lot more code, but I think this should be possible if we make the
I agree, but I would keep a c++ rewrite as a separate issue to this one which is fixing a known bug. |
- add `_use_int32_t` boolean to KDTree - `true` if number of points < `UINT32_MAX` - int32 used in calculations - results returned as `np.uint32` - otherwise `false` - int64 used - results returned as `np.uint64` - make `i`, `j` int64 in `search_tree_*_int32_t` functions - avoids expressions like `i*k` overflowing if `i` is close to `UINT32_MAX` - remove `index_bits` argument - add tests - skip test with n>2^32 points by default due to ram/runtime requirements
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lkeegan Could you update the description and title of the PR so it describes the new implementation? You could either remove what you had in the description before or put it under an "### Old Implementation" section.
Otherwise this looks pretty good to me. I'm a little scared about how many changes there are, but theoretically that should just be binary code, not runtime execution. @mraspaud what are your thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks a lot for tackling this. It’s a lot of changes indeed, but all in order from what I can see.
index_bits
option to support large datasets
_use_int32_t
boolean to KDTreetrue
if number of points <UINT32_MAX
np.uint32
false
np.uint64
i
,j
int64 insearch_tree_*_int32_t
functionsi*k
overflowing ifi
is close toUINT32_MAX
Old implementation
index_bits
optional argument to KDTreen_points < 2^32
n_points * k < 2^32