
Fix incorrect results for large numbers of points #135

Merged · 2 commits into storpipfugl:master · Jan 22, 2025

Conversation

@lkeegan (Contributor) commented on Jan 17, 2025

  • automatically use 64bit ints for large datasets
  • make i, j int64 in search_tree_*_int32_t functions
    • avoids expressions like i*k overflowing if i is close to UINT32_MAX
  • add tests
    • skip the test with n > 2^32 points by default due to RAM/runtime requirements
### Old implementation

- add `index_bits` optional argument to KDTree
  - default value is 32: preserves existing behaviour & performance
  - user can specify 64 instead to use 64-bit integers
    - this ensures correct results when n_points * k > 2^32
    - uses approx. 50% more RAM than the 32-bit option
    - resolves storpipfugl#38
- in 32-bit int mode add checks to avoid returning incorrect results
  - KDTree checks that `n_points < 2^32`
  - KDTree.query checks that `n_points * k < 2^32`
- update tests
  - parametrize all existing tests to test 32-bit and 64-bit int mode
  - add a query test with n_points * k too large
  - didn't add a test with n_points too large as it would require 16GB RAM to run
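The 32-bit guard logic described above can be sketched in plain Python. This is an illustrative helper with a hypothetical name, not pykdtree's actual code; in the merged PR the checks live inside the `KDTree` constructor and its `query` method:

```python
UINT32_MAX_COUNT = 2**32  # 32-bit indices must stay strictly below 2^32


def check_fits_uint32(n_points: int, k: int = 1) -> None:
    """Raise instead of letting 32-bit indices silently overflow.

    Hypothetical helper mirroring the checks described above: the tree
    itself requires n_points < 2^32, and a k-nearest query additionally
    requires n_points * k < 2^32 because result indices are flattened
    into a single array of length n_points * k.
    """
    if n_points >= UINT32_MAX_COUNT:
        raise OverflowError("n_points too large for 32-bit indices")
    if n_points * k >= UINT32_MAX_COUNT:
        raise OverflowError("n_points * k too large for 32-bit indices")
```

For example, `check_fits_uint32(2**30, 8)` would raise, because the flattened result array needs 2^33 index values even though the point count alone fits in 32 bits.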
@lkeegan (Contributor, author) commented on Jan 17, 2025

Here are the benchmarks from #121 using this PR with index_bits = 32 and 64 (run locally, not on Colab):

[4 benchmark plot images]

Full notebook output here: https://gist.github.com/lkeegan/d9b5174298b6321dff90e8b05cb09ed2

@djhoese (Collaborator) left a comment

Wow, really nice job. This is way less gross than I thought it was going to be.

So the largest issue for me is the index_bits being an integer. I suppose this leaves us with the possibility of allowing 128 or even 16 bit (to save space I guess) integer types in the future. But it seems odd to have an arbitrary integer for this value when it is used as a boolean.

My other thought was "oh why can't you determine it automatically", but now I see there is the self.n > UINT32_MAX case in __init__ and the self.n * k > UINT32_MAX case in the query. Is it possible these are two separate cases in the C code? Like, would it theoretically work to have the tree be constructed as 32-bit, but due to a high k value the resulting query operation would need to be 64-bit and the necessary C types would "just work"?

I know people don't like them, but this kind of seems like the Mako template should be rewritten in C++ and use C++ templates. I'm not suggesting you throw this work away, I just wanted to mention it.

@lkeegan (Contributor, author) commented on Jan 20, 2025

> Wow, really nice job. This is way less gross than I thought it was going to be.
>
> So the largest issue for me is the index_bits being an integer. I suppose this leaves us with the possibility of allowing 128 or even 16 bit (to save space I guess) integer types in the future. But it seems odd to have an arbitrary integer for this value when it is used as a boolean.

Happy to change this to a boolean if preferred (assuming another option is unlikely to be added, since you probably want to avoid having to change the API if another type is added later).

> My other thought was "oh why can't you determine it automatically", but now I see there is the self.n > UINT32_MAX case in __init__ and the self.n * k > UINT32_MAX case in the query. Is it possible these are two separate cases in the C code? Like, would it theoretically work to have the tree be constructed as 32-bit, but due to a high k value the resulting query operation would need to be 64-bit and the necessary C types would "just work"?

This is a nice idea. Initially I thought it wouldn't work without adding a lot more code, but it should be possible if we make the i, j, k variables always int64 in the search_tree function to avoid them overflowing (they're only used as loop counters, so this wouldn't affect RAM use or performance). From a quick look through, the rest of the code should be fine. I'll see if this works, as it would be a nicer user interface if the index size could be determined automatically (then index_bits could even be removed completely).
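The overflow being discussed can be reproduced with plain integer arithmetic (the numbers here are illustrative, not taken from the PR):

```python
UINT32_MAX = 2**32 - 1

# A flat result index i * k wraps modulo 2^32 once it exceeds UINT32_MAX,
# so results for point i would be written to, or read from, the wrong slot.
i = 3_000_000_000  # point index that still fits in a uint32 on its own
k = 4              # neighbours requested per query point

true_flat = i * k                       # 12_000_000_000: the intended offset
wrapped_flat = true_flat & UINT32_MAX   # 3_410_065_408: what a uint32 stores

# Widening the loop counters to int64, as proposed here, keeps the product
# exact: int64 can represent 12_000_000_000 directly, so no wrap occurs.
```

This is why the tree can be built with 32-bit indices yet still return wrong results for a large k: the point count fits in 32 bits, but the flattened index n_points * k does not.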

> I know people don't like them but this kind of seems like the mako template should be rewritten in C++ and use templates. I'm not suggesting you throw this work away, I just wanted to mention it.

I agree, but I would keep a C++ rewrite as a separate issue from this one, which fixes a known bug.

- add `_use_int32_t` boolean to KDTree
  - `true` if number of points < `UINT32_MAX`
    - int32 used in calculations
    - results returned as `np.uint32`
  - otherwise `false`
    - int64 used
    - results returned as `np.uint64`
- make `i`, `j` int64 in `search_tree_*_int32_t` functions
  - avoids expressions like `i*k` overflowing if `i` is close to `UINT32_MAX`
- remove `index_bits` argument
- add tests
  - skip the test with n > 2^32 points by default due to RAM/runtime requirements
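The automatic selection described above can be sketched as follows. This is a simplification with a hypothetical helper name; in the merged code the choice is made inside `KDTree` via the `_use_int32_t` boolean:

```python
import numpy as np

UINT32_MAX = np.iinfo(np.uint32).max  # 4294967295


def index_dtype_for(n_points: int) -> type:
    """Return the result index dtype the tree would use.

    Hypothetical helper mirroring the behaviour described above:
    32-bit indices while the point count fits (n_points < UINT32_MAX),
    64-bit indices otherwise. Queries always get correct results either
    way, because the loop counters are widened to int64 internally.
    """
    use_int32 = n_points < UINT32_MAX
    return np.uint32 if use_int32 else np.uint64
```

With this scheme the user never chooses an index width: small datasets keep the old memory footprint and `np.uint32` results, and only genuinely huge datasets pay for 64-bit indices.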
@lkeegan requested a review from @djhoese on January 20, 2025
@djhoese (Collaborator) left a comment

@lkeegan Could you update the description and title of the PR so it describes the new implementation? You could either remove what you had in the description before or put it under an "### Old Implementation" section.

Otherwise this looks pretty good to me. I'm a little scared about how many changes there are, but theoretically that should just be binary code, not runtime execution. @mraspaud what are your thoughts?

@mraspaud (Collaborator) left a comment

LGTM, thanks a lot for tackling this. It’s a lot of changes indeed, but all in order from what I can see.

@mraspaud merged commit 0fd5ce7 into storpipfugl:master on Jan 22, 2025
6 checks passed
@lkeegan changed the title from "add index_bits option to support large datasets" to "Fix incorrect results for large numbers of points" on Jan 22, 2025
@lkeegan (Contributor, author) commented on Jan 22, 2025

@djhoese @mraspaud thanks for reviewing. As requested, I updated the PR title and description to match the final implementation.


Successfully merging this pull request may close these issues.

pydktree breaks down for very large number of data points