Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated benchmarks against scipy.spatial.cKDTree #121

Open
VeckoTheGecko opened this issue Aug 5, 2024 · 3 comments
Open

Updated benchmarks against scipy.spatial.cKDTree #121

VeckoTheGecko opened this issue Aug 5, 2024 · 3 comments

Comments

@VeckoTheGecko
Copy link

We're deciding whether to keep pykdtree as a dependency, or rely on scipy.spatial.cKDTree.

Do you have access to updated benchmarks (or can otherwise provide some insight into the usecase of pykdtree over scipy's implementation?). The ones in the readme look to be 12 years old.

@djhoese
Copy link
Collaborator

djhoese commented Aug 5, 2024

Yes, benchmarks should definitely be rerun. CPUs have definitely changed and scipy has probably seen many rounds of optimizations to their implementation since the original benchmarks. There are also alternatives that use CUDA or other GPU methods that have popped up since then.

We in the PyTroll community use pykdtree for our nearest neighbor resampling of geolocated atmospheric satellite data.

Related: #100 for another project and #92 for my brainstorming on a rewrite for kdtree to be faster for CPU caches.

@VPetukhov
Copy link

Just as a reference, benchmarks here show that pykdtree is still the fastest. And my own benchmarks show that its single-threaded performance is ~3 times better than scipy cKDTree with 32 threads.

@GiacomoNassif
Copy link

https://colab.research.google.com/drive/1tRE3fXE7wIFnvw91UzwCHeo9DglFAGx7?usp=sharing

I made some benchmarks on google colab, it seems very clear the pykdtree is the fastest for build (by far) compared to sklearn + scipy. But for query it seems slower on the Colab machine

image
image
image

It's a pretty heavy query. But its similar to my usecase.

import numpy as np
from scipy.spatial import KDTree as ScipyKDTree
from sklearn.neighbors import KDTree as SklearnKDTree
from pykdtree.kdtree import KDTree as PyKDTree
import time
import matplotlib.pyplot as plt
import gc

# Configurations
data_size = 1_000_000  # Fixed data size for testing
query_points = 200  # Number of query points
k_neighbors = 1000  # Number of nearest neighbors
num_trials = 3  # Number of trials per configuration
leaf_sizes = [1, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10_000]  # Range of leaf sizes to test

# Results storage
scipy_build_times = []
pykdtree_build_times = []
sklearn_build_times = []
scipy_query_times = []
pykdtree_query_times = []
sklearn_query_times = []
scipy_total_times = []
pykdtree_total_times = []
sklearn_total_times = []

# Generate random data and queries
data = np.random.rand(data_size, 2).astype(np.float32)
queries = np.random.rand(query_points, 2).astype(np.float32)

# Function to flush cache
def flush_cache():
    dummy = np.random.rand(100000, 2).astype(np.float32)
    _ = np.sum(dummy)  # Perform some operation to ensure memory is used
    gc.collect()  # Explicitly invoke garbage collection

# Optimization loop over leaf sizes
for leaf_size in leaf_sizes:
    print(f"Processing leaf size: {leaf_size}")
    scipy_min_build_time = float('inf')
    scipy_min_query_time = float('inf')
    pykdtree_min_build_time = float('inf')
    pykdtree_min_query_time = float('inf')
    sklearn_min_build_time = float('inf')
    sklearn_min_query_time = float('inf')
    
    for _ in range(num_trials):
        flush_cache()  # Flush cache before each trial
        
        # Test ScipyKDTree
        start = time.time()
        scipy_tree = ScipyKDTree(data, leafsize=leaf_size)
        build_time = time.time() - start
        start = time.time()
        scipy_tree.query(queries, k=k_neighbors)
        query_time = time.time() - start
        scipy_min_build_time = min(scipy_min_build_time, build_time)
        scipy_min_query_time = min(scipy_min_query_time, query_time)
        
        flush_cache()
        
        # Test PyKDTree
        start = time.time()
        pykd_tree = PyKDTree(data, leafsize=leaf_size)
        build_time = time.time() - start
        start = time.time()
        pykd_tree.query(queries, k=k_neighbors)
        query_time = time.time() - start
        pykdtree_min_build_time = min(pykdtree_min_build_time, build_time)
        pykdtree_min_query_time = min(pykdtree_min_query_time, query_time)
        
        flush_cache()
        
        # Test SklearnKDTree
        start = time.time()
        sklearn_tree = SklearnKDTree(data, leaf_size=leaf_size)
        build_time = time.time() - start
        start = time.time()
        sklearn_tree.query(queries, k=k_neighbors)
        query_time = time.time() - start
        sklearn_min_build_time = min(sklearn_min_build_time, build_time)
        sklearn_min_query_time = min(sklearn_min_query_time, query_time)
    
    scipy_build_times.append(scipy_min_build_time)
    scipy_query_times.append(scipy_min_query_time)
    scipy_total_times.append(scipy_min_build_time + scipy_min_query_time)
    
    pykdtree_build_times.append(pykdtree_min_build_time)
    pykdtree_query_times.append(pykdtree_min_query_time)
    pykdtree_total_times.append(pykdtree_min_build_time + pykdtree_min_query_time)
    
    sklearn_build_times.append(sklearn_min_build_time)
    sklearn_query_times.append(sklearn_min_query_time)
    sklearn_total_times.append(sklearn_min_build_time + sklearn_min_query_time)

# Plot total times
plt.figure(figsize=(10, 6))
plt.plot(leaf_sizes, scipy_total_times, label="Scipy Total Time", marker="o")
plt.plot(leaf_sizes, pykdtree_total_times, label="PyKDTree Total Time", marker="^")
plt.plot(leaf_sizes, sklearn_total_times, label="Sklearn Total Time", marker="s")
plt.xscale("log")
plt.xlabel("Leaf Size (log scale)")
plt.ylabel("Total Time (s)")
plt.title("Total Time (Build + Query) vs Leaf Size")
plt.legend()
plt.grid(True)
plt.show()

# Plot build times
plt.figure(figsize=(10, 6))
plt.plot(leaf_sizes, scipy_build_times, label="Scipy Build Time", marker="o")
plt.plot(leaf_sizes, pykdtree_build_times, label="PyKDTree Build Time", marker="^")
plt.plot(leaf_sizes, sklearn_build_times, label="Sklearn Build Time", marker="s")
plt.xscale("log")
plt.xlabel("Leaf Size (log scale)")
plt.ylabel("Build Time (s)")
plt.title("Build Time vs Leaf Size")
plt.legend()
plt.grid(True)
plt.show()

# Plot query times
plt.figure(figsize=(10, 6))
plt.plot(leaf_sizes, scipy_query_times, label="Scipy Query Time", marker="o")
plt.plot(leaf_sizes, pykdtree_query_times, label="PyKDTree Query Time", marker="^")
plt.plot(leaf_sizes, sklearn_query_times, label="Sklearn Query Time", marker="s")
plt.xscale("log")
plt.xlabel("Leaf Size (log scale)")
plt.ylabel("Query Time (s)")
plt.title("Query Time vs Leaf Size")
plt.legend()
plt.grid(True)
plt.show()

# Calculate relative speedups
relative_scipy_pykdtree = np.array(scipy_total_times) / np.array(pykdtree_total_times)
relative_sklearn_pykdtree = np.array(sklearn_total_times) / np.array(pykdtree_total_times)

# Plot relative speedups
plt.figure(figsize=(10, 6))
plt.plot(leaf_sizes, relative_scipy_pykdtree, label="Scipy/PyKDTree Total Time Ratio", marker="d")
plt.plot(leaf_sizes, relative_sklearn_pykdtree, label="Sklearn/PyKDTree Total Time Ratio", marker="x")
plt.xscale("log")
plt.xlabel("Leaf Size (log scale)")
plt.ylabel("Relative Speedup")
plt.axhline(1, color='gray', linestyle='--', label="Equal Performance")
plt.title("Relative Speedup: PyKDTree vs Scipy and Sklearn")
plt.legend()
plt.grid(True)
plt.show()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants