
Fix incorrect results for large numbers of points #135

Merged · 2 commits into storpipfugl:master · Jan 22, 2025

Conversation

@lkeegan (Contributor) commented on Jan 17, 2025

  • automatically use 64bit ints for large datasets
  • make i, j int64 in search_tree_*_int32_t functions
    • avoids expressions like i*k overflowing if i is close to UINT32_MAX
  • add tests
    • skip the test with n > 2^32 points by default due to RAM/runtime requirements
### Old implementation

- add `index_bits` optional argument to KDTree
  - default value is 32: preserves existing behaviour & performance
  - user can specify 64 instead to use 64-bit integers
    - this ensures correct results when n_points * k > 2^32
    - uses approx. 50% more RAM than the 32-bit option
    - resolves storpipfugl#38
- in 32-bit int mode add checks to avoid returning incorrect results
  - KDTree checks that `n_points < 2^32`
  - KDTree.query checks that `n_points * k < 2^32`
- update tests
  - parametrize all existing tests to test 32-bit and 64-bit int mode
  - add a query test with n_points * k too large
  - didn't add a test with n_points too large as it would require 16GB RAM to run
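The 32-bit guard logic described above can be sketched in plain Python. This is an illustrative helper with a hypothetical name, not pykdtree's actual code; in the merged PR the checks live inside the `KDTree` constructor and its `query` method:

```python
UINT32_MAX_COUNT = 2**32  # 32-bit indices must stay strictly below 2^32


def check_fits_uint32(n_points: int, k: int = 1) -> None:
    """Raise instead of letting 32-bit indices silently overflow.

    Hypothetical helper mirroring the checks described above: the tree
    itself requires n_points < 2^32, and a k-nearest query additionally
    requires n_points * k < 2^32 because result indices are flattened
    into a single array of length n_points * k.
    """
    if n_points >= UINT32_MAX_COUNT:
        raise OverflowError("n_points too large for 32-bit indices")
    if n_points * k >= UINT32_MAX_COUNT:
        raise OverflowError("n_points * k too large for 32-bit indices")
```

For example, `check_fits_uint32(2**30, 8)` would raise, because the flattened result array needs 2^33 index values even though the point count alone fits in 32 bits.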
@lkeegan (Contributor, author) commented on Jan 17, 2025

Here are the benchmarks from #121 using this PR with index_bits = 32 and 64 (run locally, not on Colab):

[4 benchmark plot images]

Full notebook output here: https://gist.github.com/lkeegan/d9b5174298b6321dff90e8b05cb09ed2

@djhoese (Collaborator) left a comment

Wow, really nice job. This is way less gross than I thought it was going to be.

So the largest issue for me is the index_bits being an integer. I suppose this leaves us with the possibility of allowing 128 or even 16 bit (to save space I guess) integer types in the future. But it seems odd to have an arbitrary integer for this value when it is used as a boolean.

My other thought was "oh why can't you determine it automatically", but now I see there is the self.n > UINT32_MAX case in __init__ and the self.n * k > UINT32_MAX case in the query. Is it possible these are two separate cases in the C code? Like, would it theoretically work to have the tree be constructed as 32-bit, but due to a high k value the resulting query operation would need to be 64-bit and the necessary C types would "just work"?

I know people don't like them, but this kind of seems like the Mako template should be rewritten in C++ and use C++ templates. I'm not suggesting you throw this work away, I just wanted to mention it.

@lkeegan (Contributor, author) commented on Jan 20, 2025

> Wow, really nice job. This is way less gross than I thought it was going to be.
>
> So the largest issue for me is the index_bits being an integer. I suppose this leaves us with the possibility of allowing 128 or even 16 bit (to save space I guess) integer types in the future. But it seems odd to have an arbitrary integer for this value when it is used as a boolean.

Happy to change this to a boolean if preferred (assuming another option is unlikely to be added, since you probably want to avoid having to change the API if another type is added later).

> My other thought was "oh why can't you determine it automatically", but now I see there is the self.n > UINT32_MAX case in __init__ and the self.n * k > UINT32_MAX case in the query. Is it possible these are two separate cases in the C code? Like, would it theoretically work to have the tree be constructed as 32-bit, but due to a high k value the resulting query operation would need to be 64-bit and the necessary C types would "just work"?

This is a nice idea. Initially I thought it wouldn't work without adding a lot more code, but it should be possible if we make the i, j, k variables always int64 in the search_tree function to avoid them overflowing (they're only used as loop counters, so this wouldn't affect RAM use or performance). From a quick look through, the rest of the code should be fine. I'll see if this works, as it would be a nicer user interface if the index size could be determined automatically (then index_bits could even be removed completely).
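The overflow being discussed can be reproduced with plain integer arithmetic (the numbers here are illustrative, not taken from the PR):

```python
UINT32_MAX = 2**32 - 1

# A flat result index i * k wraps modulo 2^32 once it exceeds UINT32_MAX,
# so results for point i would be written to, or read from, the wrong slot.
i = 3_000_000_000  # point index that still fits in a uint32 on its own
k = 4              # neighbours requested per query point

true_flat = i * k                       # 12_000_000_000: the intended offset
wrapped_flat = true_flat & UINT32_MAX   # 3_410_065_408: what a uint32 stores

# Widening the loop counters to int64, as proposed here, keeps the product
# exact: int64 can represent 12_000_000_000 directly, so no wrap occurs.
```

This is why the tree can be built with 32-bit indices yet still return wrong results for a large k: the point count fits in 32 bits, but the flattened index n_points * k does not.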

> I know people don't like them but this kind of seems like the mako template should be rewritten in C++ and use templates. I'm not suggesting you throw this work away, I just wanted to mention it.

I agree, but I would keep a C++ rewrite as a separate issue from this one, which fixes a known bug.

- add `_use_int32_t` boolean to KDTree
  - `true` if number of points < `UINT32_MAX`
    - int32 used in calculations
    - results returned as `np.uint32`
  - otherwise `false`
    - int64 used
    - results returned as `np.uint64`
- make `i`, `j` int64 in `search_tree_*_int32_t` functions
  - avoids expressions like `i*k` overflowing if `i` is close to `UINT32_MAX`
- remove `index_bits` argument
- add tests
  - skip the test with n > 2^32 points by default due to RAM/runtime requirements
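The automatic selection described above can be sketched as follows. This is a simplification with a hypothetical helper name; in the merged code the choice is made inside `KDTree` via the `_use_int32_t` boolean:

```python
import numpy as np

UINT32_MAX = np.iinfo(np.uint32).max  # 4294967295


def index_dtype_for(n_points: int) -> type:
    """Return the result index dtype the tree would use.

    Hypothetical helper mirroring the behaviour described above:
    32-bit indices while the point count fits (n_points < UINT32_MAX),
    64-bit indices otherwise. Queries always get correct results either
    way, because the loop counters are widened to int64 internally.
    """
    use_int32 = n_points < UINT32_MAX
    return np.uint32 if use_int32 else np.uint64
```

With this scheme the user never chooses an index width: small datasets keep the old memory footprint and `np.uint32` results, and only genuinely huge datasets pay for 64-bit indices.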
@lkeegan requested a review from @djhoese on January 20, 2025
@djhoese (Collaborator) left a comment

@lkeegan Could you update the description and title of the PR so it describes the new implementation? You could either remove what you had in the description before or put it under an "### Old Implementation" section.

Otherwise this looks pretty good to me. I'm a little scared about how many changes there are, but theoretically that should just be binary code, not runtime execution. @mraspaud what are your thoughts?

@mraspaud (Collaborator) left a comment

LGTM, thanks a lot for tackling this. It’s a lot of changes indeed, but all in order from what I can see.

@mraspaud merged commit 0fd5ce7 into storpipfugl:master on Jan 22, 2025
6 checks passed
@lkeegan changed the title from "add index_bits option to support large datasets" to "Fix incorrect results for large numbers of points" on Jan 22, 2025
@lkeegan (Contributor, author) commented on Jan 22, 2025

@djhoese @mraspaud thanks for reviewing. As requested, I updated the PR title and description to match the final implementation.


Successfully merging this pull request may close these issues.

pydktree breaks down for very large number of data points