
Speed-up Indexing with custom metrics #245

Open
akurniawan opened this issue Jul 31, 2024 · 1 comment

Comments

@akurniawan

Hi, thanks for the great package! Currently I'm trying to build a kNN index with a custom DTW metric, but building the index for just 90k data points takes around 5 hours. Do you have any suggestions to improve this?

Below is the code I use to build the index and perform a query:

import numpy as np
from numba import jit

@jit(nopython=True, fastmath=True)
def dtw_numba(x, y):
    """
    Compute the Dynamic Time Warping (DTW) distance between two sequences.
    
    Parameters:
    x : array-like
        First sequence.
    y : array-like
        Second sequence.
        
    Returns:
    float
        The DTW distance between sequences x and y.
    """
    n, m = len(x), len(y)
    dtw_matrix = np.full((n + 1, m + 1), np.inf)
    dtw_matrix[0, 0] = 0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            dtw_matrix[i, j] = cost + min(dtw_matrix[i - 1, j],    # insertion
                                          dtw_matrix[i, j - 1],    # deletion
                                          dtw_matrix[i - 1, j - 1]) # match

    return np.sqrt(dtw_matrix[n, m])

import pynndescent

# Build the NN-descent index over the training sequences using the
# custom DTW metric, then query the 100 nearest neighbours of one vector.
index = pynndescent.NNDescent(flat_inj_vecs, metric=dtw_numba)

index.query([flat_vecs[10]], k=100)

Thank you!

@lmcinnes
Owner

lmcinnes commented Aug 1, 2024

I think the real catch here is that DTW is a pretty expensive metric to compute, so you are going to face a bit of an uphill battle no matter what you do. One good option might be ANNchor, which is an ANN approach specifically designed for expensive metrics (essentially, it finds cheap approximations to the metric tailored to the training data) and would probably be a good fit for your needs.
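As a rough illustration, a minimal sketch of what that could look like, assuming ANNchor's Annchor(X, metric, ...) interface and reusing the dtw_numba metric and data from the code above; the parameter values here are illustrative assumptions, not tuned recommendations:

import annchor

# Hypothetical sketch: build an approximate k-NN graph under the expensive
# DTW metric. ANNchor uses anchor points to derive cheap bounds on the
# metric and only computes the exact distance for a budgeted fraction of pairs.
ann = annchor.Annchor(flat_inj_vecs,     # the 90k sequences from above
                      dtw_numba,         # expensive custom metric
                      n_anchors=25,      # illustrative value
                      n_neighbours=100,  # neighbours per point
                      p_work=0.1)        # fraction of brute-force work allowed
ann.fit()

# Approximate neighbour graph for every point in flat_inj_vecs.
print(ann.neighbour_graph)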
