Updated benchmarks against `scipy.spatial.cKDTree` #121

VeckoTheGecko · 2024-08-05T15:00:33Z

We're deciding whether to keep pykdtree as a dependency, or rely on scipy.spatial.cKDTree.

Do you have access to updated benchmarks (or can otherwise provide some insight into the usecase of pykdtree over scipy's implementation?). The ones in the readme look to be 12 years old.

The text was updated successfully, but these errors were encountered:

djhoese · 2024-08-05T15:20:13Z

Yes, benchmarks should definitely be rerun. CPUs have definitely changed and scipy has probably seen many rounds of optimizations to their implementation since the original benchmarks. There are also alternatives that use CUDA or other GPU methods that have popped up since then.

We in the PyTroll community use pykdtree for our nearest neighbor resampling of geolocated atmospheric satellite data.

Related: #100 for another project and #92 for my brainstorming on a rewrite for kdtree to be faster for CPU caches.

VPetukhov · 2024-12-07T15:31:11Z

Just as a reference, benchmarks here show that pykdtree is still the fastest. And my own benchmarks show that its single-threaded performance is ~3 times better than scipy cKDTree with 32 threads.

GiacomoNassif · 2024-12-29T09:54:02Z

https://colab.research.google.com/drive/1tRE3fXE7wIFnvw91UzwCHeo9DglFAGx7?usp=sharing

I made some benchmarks on google colab, it seems very clear the pykdtree is the fastest for build (by far) compared to sklearn + scipy. But for query it seems slower on the Colab machine

It's a pretty heavy query. But its similar to my usecase.

import numpy as np
from scipy.spatial import KDTree as ScipyKDTree
from sklearn.neighbors import KDTree as SklearnKDTree
from pykdtree.kdtree import KDTree as PyKDTree
import time
import matplotlib.pyplot as plt
import gc

# Configurations
data_size = 1_000_000  # Fixed data size for testing
query_points = 200  # Number of query points
k_neighbors = 1000  # Number of nearest neighbors
num_trials = 3  # Number of trials per configuration
leaf_sizes = [1, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10_000]  # Range of leaf sizes to test

# Results storage
scipy_build_times = []
pykdtree_build_times = []
sklearn_build_times = []
scipy_query_times = []
pykdtree_query_times = []
sklearn_query_times = []
scipy_total_times = []
pykdtree_total_times = []
sklearn_total_times = []

# Generate random data and queries
data = np.random.rand(data_size, 2).astype(np.float32)
queries = np.random.rand(query_points, 2).astype(np.float32)

# Function to flush cache
def flush_cache():
    dummy = np.random.rand(100000, 2).astype(np.float32)
    _ = np.sum(dummy)  # Perform some operation to ensure memory is used
    gc.collect()  # Explicitly invoke garbage collection

# Optimization loop over leaf sizes
for leaf_size in leaf_sizes:
    print(f"Processing leaf size: {leaf_size}")
    scipy_min_build_time = float('inf')
    scipy_min_query_time = float('inf')
    pykdtree_min_build_time = float('inf')
    pykdtree_min_query_time = float('inf')
    sklearn_min_build_time = float('inf')
    sklearn_min_query_time = float('inf')
    
    for _ in range(num_trials):
        flush_cache()  # Flush cache before each trial
        
        # Test ScipyKDTree
        start = time.time()
        scipy_tree = ScipyKDTree(data, leafsize=leaf_size)
        build_time = time.time() - start
        start = time.time()
        scipy_tree.query(queries, k=k_neighbors)
        query_time = time.time() - start
        scipy_min_build_time = min(scipy_min_build_time, build_time)
        scipy_min_query_time = min(scipy_min_query_time, query_time)
        
        flush_cache()
        
        # Test PyKDTree
        start = time.time()
        pykd_tree = PyKDTree(data, leafsize=leaf_size)
        build_time = time.time() - start
        start = time.time()
        pykd_tree.query(queries, k=k_neighbors)
        query_time = time.time() - start
        pykdtree_min_build_time = min(pykdtree_min_build_time, build_time)
        pykdtree_min_query_time = min(pykdtree_min_query_time, query_time)
        
        flush_cache()
        
        # Test SklearnKDTree
        start = time.time()
        sklearn_tree = SklearnKDTree(data, leaf_size=leaf_size)
        build_time = time.time() - start
        start = time.time()
        sklearn_tree.query(queries, k=k_neighbors)
        query_time = time.time() - start
        sklearn_min_build_time = min(sklearn_min_build_time, build_time)
        sklearn_min_query_time = min(sklearn_min_query_time, query_time)
    
    scipy_build_times.append(scipy_min_build_time)
    scipy_query_times.append(scipy_min_query_time)
    scipy_total_times.append(scipy_min_build_time + scipy_min_query_time)
    
    pykdtree_build_times.append(pykdtree_min_build_time)
    pykdtree_query_times.append(pykdtree_min_query_time)
    pykdtree_total_times.append(pykdtree_min_build_time + pykdtree_min_query_time)
    
    sklearn_build_times.append(sklearn_min_build_time)
    sklearn_query_times.append(sklearn_min_query_time)
    sklearn_total_times.append(sklearn_min_build_time + sklearn_min_query_time)

# Plot total times
plt.figure(figsize=(10, 6))
plt.plot(leaf_sizes, scipy_total_times, label="Scipy Total Time", marker="o")
plt.plot(leaf_sizes, pykdtree_total_times, label="PyKDTree Total Time", marker="^")
plt.plot(leaf_sizes, sklearn_total_times, label="Sklearn Total Time", marker="s")
plt.xscale("log")
plt.xlabel("Leaf Size (log scale)")
plt.ylabel("Total Time (s)")
plt.title("Total Time (Build + Query) vs Leaf Size")
plt.legend()
plt.grid(True)
plt.show()

# Plot build times
plt.figure(figsize=(10, 6))
plt.plot(leaf_sizes, scipy_build_times, label="Scipy Build Time", marker="o")
plt.plot(leaf_sizes, pykdtree_build_times, label="PyKDTree Build Time", marker="^")
plt.plot(leaf_sizes, sklearn_build_times, label="Sklearn Build Time", marker="s")
plt.xscale("log")
plt.xlabel("Leaf Size (log scale)")
plt.ylabel("Build Time (s)")
plt.title("Build Time vs Leaf Size")
plt.legend()
plt.grid(True)
plt.show()

# Plot query times
plt.figure(figsize=(10, 6))
plt.plot(leaf_sizes, scipy_query_times, label="Scipy Query Time", marker="o")
plt.plot(leaf_sizes, pykdtree_query_times, label="PyKDTree Query Time", marker="^")
plt.plot(leaf_sizes, sklearn_query_times, label="Sklearn Query Time", marker="s")
plt.xscale("log")
plt.xlabel("Leaf Size (log scale)")
plt.ylabel("Query Time (s)")
plt.title("Query Time vs Leaf Size")
plt.legend()
plt.grid(True)
plt.show()

# Calculate relative speedups
relative_scipy_pykdtree = np.array(scipy_total_times) / np.array(pykdtree_total_times)
relative_sklearn_pykdtree = np.array(sklearn_total_times) / np.array(pykdtree_total_times)

# Plot relative speedups
plt.figure(figsize=(10, 6))
plt.plot(leaf_sizes, relative_scipy_pykdtree, label="Scipy/PyKDTree Total Time Ratio", marker="d")
plt.plot(leaf_sizes, relative_sklearn_pykdtree, label="Sklearn/PyKDTree Total Time Ratio", marker="x")
plt.xscale("log")
plt.xlabel("Leaf Size (log scale)")
plt.ylabel("Relative Speedup")
plt.axhline(1, color='gray', linestyle='--', label="Equal Performance")
plt.title("Relative Speedup: PyKDTree vs Scipy and Sklearn")
plt.legend()
plt.grid(True)
plt.show()

VeckoTheGecko mentioned this issue Aug 5, 2024

Two libraries used for KDTree in Parcels. Use only one? OceanParcels/Parcels#1616

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Updated benchmarks against `scipy.spatial.cKDTree` #121

Updated benchmarks against `scipy.spatial.cKDTree` #121

VeckoTheGecko commented Aug 5, 2024

djhoese commented Aug 5, 2024

VPetukhov commented Dec 7, 2024

GiacomoNassif commented Dec 29, 2024

Updated benchmarks against scipy.spatial.cKDTree #121

Updated benchmarks against scipy.spatial.cKDTree #121

Comments

VeckoTheGecko commented Aug 5, 2024

djhoese commented Aug 5, 2024

VPetukhov commented Dec 7, 2024

GiacomoNassif commented Dec 29, 2024

Updated benchmarks against `scipy.spatial.cKDTree` #121

Updated benchmarks against `scipy.spatial.cKDTree` #121