Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pydktree breaks down for very large number of data points #38

Closed
alejandrobll opened this issue Oct 22, 2018 · 1 comment · Fixed by #135
Closed

pydktree breaks down for very large number of data points #38

alejandrobll opened this issue Oct 22, 2018 · 1 comment · Fixed by #135

Comments

@alejandrobll
Copy link

alejandrobll commented Oct 22, 2018

I have been using pykdtree to obtain nearest neighbours and it seems that it breaks down for a very large dataset. I managed to reproduce the problem with the following example:

from pykdtree.kdtree import KDTree
import numpy as np

pos = np.random.rand(int(5e8),3)
nb = 32
tree = KDTree(pos)
d, idx = tree.query(pos, k=nb)
h = d[:,nb-1]
print np.min(h)

The result of the previous code is that the minimum distance to the 32nd neighbouring particles to some particle is zero, which is incorrect and, indeed very unlikely. It turns out that zero is assigned to many more than just one particle. I fact, it is zero for a large fraction of the particles. Doing

import numpy as np

k, = np.where(h == 0)
print(len(k))

returns 365782272. I.e., it is 0 for ~73 % of the whole sample. This is clearly the wrong answer.

I discovered the problem when using py-sphviewer, which relies on pyktree to find the smoothing length of particles in cosmological simulations. When the number of particles within the simulated volumes is very large (several hundreds of millions), pykdtree assigns a wrong distance of 0 between individual particles and their 32nd neighbours.

Any idea on what might be causing this weird behaviour? I also checked with either single and double precision.

@storpipfugl
Copy link
Owner

It's a pointer arithmetics overflow problem: https://github.com/storpipfugl/pykdtree/blob/master/pykdtree/_kdtree_core.c#L1410

I'll look into giving pykdtree an overhaul to support contemporary data set sizes

lkeegan added a commit to lkeegan/pykdtree that referenced this issue Jan 16, 2025
- replace all 32-bit integers with 64-bit integers
- previously results would silently be incorrect if `num_points * k > 2^32-1`
  - caused by integer overflow (see storpipfugl#38)
  - for e.g. k=20 this occurred for ~200 million points (~a few GB of RAM)
- now results are correct for any (practical) number of points
  - overflow would now only occur for k=20 with ~10^17 points (> trillions of GB of RAM)
- perf implications??
lkeegan added a commit to lkeegan/pykdtree that referenced this issue Jan 16, 2025
- replace all 32-bit integers with 64-bit integers
- previously the results would silently be incorrect if `num_points * k > 2^32-1`
  - caused by integer overflow (see storpipfugl#38)
  - for e.g. k=20 this occurred for ~200 million points (~a few GB of RAM)
- now the results are correct for any (practical) number of points
  - overflow would now only occur for k=20 with ~10^17 points (> trillions of GB of RAM)
  - resolves storpipfugl#38
lkeegan added a commit to lkeegan/pykdtree that referenced this issue Jan 17, 2025
- add `index_bits` optional argument to KDTree
  - default value is 32: preserves existing behaviour & performance
  - user can specify 64 instead to use 64-bit integers
    - this ensures correct results when n_points * k > 2^32
    - uses approx 50% more RAM than 32 bit option
    - resolves storpipfugl#38
- in 32-bit int mode add checks to avoid returning incorrect results
  - KDTree checks that `n_points < 2^32`
  - KDTree.query checks that `n_points * k < 2^32`
- update tests
  - parametrize all existing tests to test 32-bit and 64-bit int mode
  - add a query test with n_points * k too large
  - didn't add a test with n_points too large as it would require 16GB RAM to run
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants