-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pydktree breaks down for very large number of data points #38
Comments
It's a pointer arithmetics overflow problem: https://github.com/storpipfugl/pykdtree/blob/master/pykdtree/_kdtree_core.c#L1410 I'll look into giving pykdtree an overhaul to support contemporary data set sizes |
lkeegan
added a commit
to lkeegan/pykdtree
that referenced
this issue
Jan 16, 2025
- replace all 32-bit integers with 64-bit integers - previously results would silently be incorrect if `num_points * k > 2^32-1` - caused by integer overflow (see storpipfugl#38) - for e.g. k=20 this occurred for ~200 million points (~a few GB of RAM) - now results are correct for any (practical) number of points - overflow would now only occur for k=20 with ~10^17 points (> trillions of GB of RAM) - perf implications??
lkeegan
added a commit
to lkeegan/pykdtree
that referenced
this issue
Jan 16, 2025
- replace all 32-bit integers with 64-bit integers - previously the results would silently be incorrect if `num_points * k > 2^32-1` - caused by integer overflow (see storpipfugl#38) - for e.g. k=20 this occurred for ~200 million points (~a few GB of RAM) - now the results are correct for any (practical) number of points - overflow would now only occur for k=20 with ~10^17 points (> trillions of GB of RAM) - resolves storpipfugl#38
lkeegan
added a commit
to lkeegan/pykdtree
that referenced
this issue
Jan 17, 2025
- add `index_bits` optional argument to KDTree - default value is 32: preserves existing behaviour & performance - user can specify 64 instead to use 64-bit integers - this ensures correct results when n_points * k > 2^32 - uses approx 50% more RAM than 32 bit option - resolves storpipfugl#38 - in 32-bit int mode add checks to avoid returning incorrect results - KDTree checks that `n_points < 2^32` - KDTree.query checks that `n_points * k < 2^32` - update tests - parametrize all existing tests to test 32-bit and 64-bit int mode - add a query test with n_points * k too large - didn't add a test with n_points too large as it would require 16GB RAM to run
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
I have been using pykdtree to obtain nearest neighbours and it seems that it breaks down for a very large dataset. I managed to reproduce the problem with the following example:
The result of the previous code is that the minimum distance to the 32nd neighbouring particles to some particle is zero, which is incorrect and, indeed very unlikely. It turns out that zero is assigned to many more than just one particle. I fact, it is zero for a large fraction of the particles. Doing
returns 365782272. I.e., it is 0 for ~73 % of the whole sample. This is clearly the wrong answer.
I discovered the problem when using py-sphviewer, which relies on pyktree to find the smoothing length of particles in cosmological simulations. When the number of particles within the simulated volumes is very large (several hundreds of millions), pykdtree assigns a wrong distance of 0 between individual particles and their 32nd neighbours.
Any idea on what might be causing this weird behaviour? I also checked with either single and double precision.
The text was updated successfully, but these errors were encountered: