Fixed #592 Add Top-K Nearest Neighbors for Normalized Matrix Profile (#…

…595) * Added test function to test TopK scrump in AB_join * Refactored * Added definition of parameter k to docstring * Improved docstring * Removed trailing colon * Cleaned code * Avoided allocating new memory in inner for-loop * Fixed typos * Improved comments * Avoided allocating new memory in each iteration * Same ndim in output regardless of value of k * Revised docstrings * Enhanced function to perform shift left as well * Enhanced test function to test newly added functionality * Fixed format * Fixed format * Removed/Renamed intermediate variables * Renamed variable for the sake of consistency * Avoided shape mismatch by reshaping ndarray * Refactored * Fixed comment * Refacored and Minor restructuring of lines * Modified stimp after changing output shape in scrump * Add pragma no cover * Revised Docstrings * Fixed docstring * Revised docstring * Removed unnecessary dangling else * Removed unnecessary comment * Revised structure of test function so, it follows the structure of the performant version * Replaced ravel with flatten to get copy of array * Changed the type of input parameter and revised docstring * Update the value of parameter to match its type * Update the value of parameter to match its type * Correct format * Changed output structure of naive.scrump * Correct format * Add test function for scrump_plus_plus for TopK * Add naive version to merge peason profiles * Add test function for merging pearson profiles * Corret format * Add performant function to merge pearson profiles * Optimize function * Avoid creating new memory * Improve docstring * Refactored * Avoid creating new memory in for-loop * Update test function * Revise function to make it parallelizable * Full test and coverage in 1hr * Revise docstrings * Revise docstrings * Optimize function * Optimize function * Rename variable to improve readability * Revise comments * Improve comments and docstrings * Correct naive implementation * Enhance naive function to support top matrix profile * Enhace performant function to support topk matrix profile * Update existing test functions * Correct format * Fix shape of array * Fix shape of array * Add kind keyword for sorting * Fix bugs * Remove ineffective inner prange * Temporarily added parameter k to avoid decorator failure * Improve comments * Improve comments * Improve docstring * Add KNN test function for stumpi * Fix shape of output for KNN test * Full test and coverage 1 hr * Avoid using searchsort when k is 1 * Revise code according to top k matrix profile structure * Remove if condition * Improve dosctrings * Avoid allocating new memory * Avoid allocating new memory * Improve comments * Remove numpy.where to avoid copying unchanged values * Remove unnecessary trailing colon * Replace negative np.inf with np.NINF * delete a wrong file * Avoid advance indexing by using chain slicing so it can be run by njit * Improve docstring * Added gpu_searchsorted checks when GPUs unavailable * Added error checks and pytest ignore warning * Improve docstrings * minor changes in if-block and dosctring * Improve docstrings * Improve comments * minor changes * Correct format * Improve docstrings * optimize functions * Remove redundant import * minor change * Revise docstrings * Fixed black formatting after conflict resolution * Correct docstring * Revise docstrings * minor change * Revise comments * Avoid redundant allocation of memory * Revise docstrings and comments * rename variables * minor correction * Fix indexing * Add new test function * Modify test function * Avoid dumplicate in naive prescrump * Add parameter assume_unique to handle duplicates * Add test function to test for duplicates in topk_merge * Add parameter assume_unique to performant merge_topk * fix test function * Fix bug * Revise prescrump to avoid duplicates * Avoid duplocates in scrump * Revise test function to consider new parameter * Fix bug * Revise naive scrump to avoid duplicates * Add comment * minor optimization * Correct style * Correct style * increase threshold * Specifiy kind in sort * minor change * specify kind in sort * minor changes * De-otpimize if condition Due to numerical erorrs, we need to avoid partial traversal of array * Update scrump * minor changes * add new test function * optimize if condition * Give priority to PA in case of ties between IA and IB * Remove trailing colon * update test function * revise function to avoid adding new parameter * Update module scrump and improvee its readability * Fix syntax * update test functions * minor fix * correct format * Improve docstring * Avoid overlap while merging matrix profiles * Add function to find overlapping values * replace numpy function with our implementation * Avoid unnecessary call of a function * Revise docsting and comment * Improve test function * Remove comment * Add test function to ensure duplicates are avoided * Improve comments * Enhance naive version to avoid duplicates while merging * Add test function and revise naive version * Improve code readability and comment * Update top-k profile by getting insertion index In NearestNeighbor case, the distance between sequence i and its NN is the smallest. However, due to `imprecision` in calculation, it is possible that its corresponding distance, i.e. distance between seq i and its NN, is not the smallest value in its top-k neighbors. So, instead of inserting it at index 0, we use numpy.searchsorted to find the correct insertion index. * Merge nested if statements into one * Remove blank lines * Fix typo * Improve comment * Improve comments * Improve docstring * Remove unnecessary comments * passing copy of variable as input * minor change in test functions * Correct style * Revise comment * Remove comment * Revise comment * Fix format * Remove unnecessary newline * Return 1D array for matrix profile when `k` is 1 * Remove unnecessary flattening operatiton on array * Fix comments * Make matrix profile and mp index 1D when k=1 * Revise tests functions * Improve Docstrings * Make prescrump output 1D when k is one * minor change * update test functions * Modify merge_topk to support 1D input * Fix merge_topk * Fix shape of variables in test functions * Remove unnecessary flatten operation * Update test function for case k=1 * revise comment * Avoid using return in the middle of code * Add new private function to get 2D ouput when k=1 * Remove check for 1D in merge_topk * Revise test functions * Revise docstring to provide description for 1D case * Add overlap check in merge_topk with 1D input * Add overlap check in 1D and revise docstring * Add separate test function for _merge_topk 1D case * Add preprocessing function for prescrump * Update test function * fix missing argument * Fix Docstring * Put back the missing decorator * Add preprocessing function in prescraamp * Revise naive function * Fix value of imprecision in test functions * create overlaps randomly for test merge_topk in 1D case * Revise docstrings * Fix docstrings * minor changes * minor fix * change variable name * change variables names * convert attr to property attr to get 1D when k is 1 * avoid calling performant function in a naive function * minor modification on z_norm functions * fix function * revise docstrings * change variable name * Relocate comment * minor changes * fix uint * fixed uint * fixed test function * fixed calling function * Removed redundant return statement Co-authored-by: Sean Law <[email protected]> Co-authored-by: Sean M. Law <[email protected]>
TDAmeritrade · Nov 9, 2022 · e8a7dd4 · e8a7dd4
1 parent 65d3e85
commit e8a7dd4
Show file tree

Hide file tree

Showing 22 changed files with 2,545 additions and 579 deletions.
diff --git a/stumpy/aamp.py b/stumpy/aamp.py
@@ -255,7 +255,8 @@ def _aamp(
     return np.power(P[0, :, :], 1.0 / p), I[0, :, :]
 
 
-def aamp(T_A, m, T_B=None, ignore_trivial=True, p=2.0):
+def aamp(T_A, m, T_B=None, ignore_trivial=True, p=2.0, k=1):
+    # function needs to be changed to return top-k matrix profile
     """
     Compute the non-normalized (i.e., without z-normalization) matrix profile
 
@@ -282,6 +283,11 @@ def aamp(T_A, m, T_B=None, ignore_trivial=True, p=2.0):
     p : float, default 2.0
         The p-norm to apply for computing the Minkowski distance.
 
+    k : int, default 1
+        The number of top `k` smallest distances used to construct the matrix profile.
+        Note that this will increase the total computational time and memory usage
+        when k > 1.
+
     Returns
     -------
     out : numpy.ndarray

diff --git a/stumpy/aamped.py b/stumpy/aamped.py
@@ -12,7 +12,8 @@
 logger = logging.getLogger(__name__)
 
 
-def aamped(dask_client, T_A, m, T_B=None, ignore_trivial=True, p=2.0):
+def aamped(dask_client, T_A, m, T_B=None, ignore_trivial=True, p=2.0, k=1):
+    # function needs to be revised to return top-k matrix profile
     """
     Compute the non-normalized (i.e., without z-normalization) matrix profile
 
@@ -46,6 +47,11 @@ def aamped(dask_client, T_A, m, T_B=None, ignore_trivial=True, p=2.0):
     p : float, default 2.0
         The p-norm to apply for computing the Minkowski distance.
 
+    k : int, default 1
+        The number of top `k` smallest distances used to construct the matrix profile.
+        Note that this will increase the total computational time and memory usage
+        when k > 1.
+
     Returns
     -------
     out : numpy.ndarray

diff --git a/stumpy/aampi.py b/stumpy/aampi.py
@@ -8,6 +8,7 @@
 
 
 class aampi:
+    # needs to be enhanced to support top-k matrix profile
     """
     Compute an incremental non-normalized (i.e., without z-normalization) matrix profile
     for streaming data
@@ -28,6 +29,11 @@ class aampi:
     p : float, default 2.0
         The p-norm to apply for computing the Minkowski distance.
 
+    k : int, default 1
+        The number of top `k` smallest distances used to construct the matrix profile.
+        Note that this will increase the total computational time and memory usage
+        when k > 1.
+
     Attributes
     ----------
     P_ : numpy.ndarray
@@ -62,7 +68,7 @@ class aampi:
     Note that we have extended this algorithm for AB-joins as well.
     """
 
-    def __init__(self, T, m, egress=True, p=2.0):
+    def __init__(self, T, m, egress=True, p=2.0, k=1):
         """
         Initialize the `stumpi` object
 
@@ -81,6 +87,12 @@ def __init__(self, T, m, egress=True, p=2.0):
 
         p : float, default 2.0
             The p-norm to apply for computing the Minkowski distance.
+
+
+        k : int, default 1
+            The number of top `k` smallest distances used to construct the matrix
+            profile. Note that this will increase the total computational time and
+            memory usage when k > 1.
         """
         self._T = core._preprocess(T)
         core.check_window_size(m, max_size=self._T.shape[-1])

diff --git a/stumpy/core.py b/stumpy/core.py
@@ -209,6 +209,20 @@ def _gpu_aamp_stimp_driver_not_found(*args, **kwargs):  # pragma: no cover
     driver_not_found()
 
 
+def _gpu_searchsorted_left_driver_not_found(*args, **kwargs):  # pragma: no cover
+    """
+    Dummy function to raise CudaSupportError driver not found error.
+    """
+    driver_not_found()
+
+
+def _gpu_searchsorted_right_driver_not_found(*args, **kwargs):  # pragma: no cover
+    """
+    Dummy function to raise CudaSupportError driver not found error.
+    """
+    driver_not_found()
+
+
 def get_pkg_name():  # pragma: no cover
     """
     Return package name.
@@ -240,7 +254,7 @@ def rolling_window(a, window):
     return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
 
 
-def z_norm(a, axis=0):
+def z_norm(a, axis=0, threshold=config.STUMPY_STDDEV_THRESHOLD):
     """
     Calculate the z-normalized input array `a` by subtracting the mean and
     dividing by the standard deviation along a given axis.
@@ -253,13 +267,16 @@ def z_norm(a, axis=0):
     axis : int, default 0
         NumPy array axis
 
+    threshold : float, default to config.STUMPY_STDDEV_THRESHOLD
+        A non-nan std value being less than `threshold` will be replaced with 1.0
+
     Returns
     -------
     output : numpy.ndarray
         An array with z-normalized values computed along a specified axis.
     """
     std = np.std(a, axis, keepdims=True)
-    std[std == 0] = 1
+    std[np.less(std, threshold, where=~np.isnan(std))] = 1.0
 
     return (a - np.mean(a, axis, keepdims=True)) / std
 
@@ -2559,6 +2576,201 @@ def _select_P_ABBA_value(P_ABBA, k, custom_func=None):
     return MPdist
 
 
+@njit
+def _merge_topk_PI(PA, PB, IA, IB):
+    """
+    Merge two top-k matrix profiles `PA` and `PB`, and update `PA` (in place).
+    When the inputs are 1D arrays, PA[i] is updated if it is greater than PB[i] and
+    IA[i] != IB[i]. In such case, PA[i] and IA[i] are replaced with PB[i] and IB[i],
+    respectively. (Note that it might happen that IA[i]==IB[i] but PA[i] != PB[i].
+    This situation can occur if there is slight imprecision in numerical calculations.
+    In that case, we do not update PA[i] and IA[i]. While updating PA[i] and IA[i]
+    is harmless in this case, we avoid doing that so to be consistent with the merging
+    process when the inputs are 2D arrays)
+    When the inputs are 2D arrays, we always prioritize the values of `PA` over the
+    values of `PB` in case of ties. (i.e., values from `PB` are always inserted to
+    the right of values from `PA`). Also, update `IA` accordingly. In case of
+    overlapping values between two arrays IA[i] and IB[i], the ones in IB[i] (and
+    their corresponding values in PB[i]) are ignored throughout the updating process
+    of IA[i] (and PA[i]).
+
+    Unlike `_merge_topk_ρI`, where `top-k` largest values are kept, this function
+    keeps `top-k` smallest values.
+
+    Parameters
+    ----------
+    PA : numpy.ndarray
+        A (top-k) matrix profile where values in each row are sorted in ascending
+        order. `PA` must be 1- or 2-dimensional.
+
+    PB : numpy.ndarray
+        A (top-k) matrix profile where values in each row are sorted in ascending
+        order. `PB` must have the same shape as `PA`.
+
+    IA : numpy.ndarray
+        A (top-k) matrix profile indices corresponding to `PA`
+
+    IB : numpy.ndarray
+        A (top-k) matrix profile indices corresponding to `PB`
+
+    Returns
+    -------
+    None
+    """
+    if PA.ndim == 1:
+        mask = (PB < PA) & (IB != IA)
+        PA[mask] = PB[mask]
+        IA[mask] = IB[mask]
+    else:
+        k = PA.shape[1]
+        tmp_P = np.empty(k, dtype=np.float64)
+        tmp_I = np.empty(k, dtype=np.int64)
+        for i in range(PA.shape[0]):
+            overlap = set(IB[i]).intersection(set(IA[i]))
+            aj, bj = 0, 0
+            idx = 0
+            # 2 * k iterations are required to traverse both A and B if needed.
+            for _ in range(2 * k):
+                if idx >= k:
+                    break
+                if bj < k and PB[i, bj] < PA[i, aj]:
+                    if IB[i, bj] not in overlap:
+                        tmp_P[idx] = PB[i, bj]
+                        tmp_I[idx] = IB[i, bj]
+                        idx += 1
+                    bj += 1
+                else:
+                    tmp_P[idx] = PA[i, aj]
+                    tmp_I[idx] = IA[i, aj]
+                    idx += 1
+                    aj += 1
+
+            PA[i] = tmp_P
+            IA[i] = tmp_I
+
+
+@njit
+def _merge_topk_ρI(ρA, ρB, IA, IB):
+    """
+    Merge two top-k pearson profiles `ρA` and `ρB`, and update `ρA` (in place).
+    When the inputs are 1D arrays, ρA[i] is updated if it is less than ρB[i] and
+    IA[i] != IB[i]. In such case, ρA[i] and IA[i] are replaced with ρB[i] and IB[i],
+    respectively. (Note that it might happen that IA[i]==IB[i] but ρA[i] != ρB[i].
+    This situation can occur if there is slight imprecision in numerical calculations.
+    In that case, we do not update ρA[i] and IA[i]. While updating ρA[i] and IA[i]
+    is harmless in this case, we avoid doing that so to be consistent with the merging
+    process when the inputs are 2D arrays)
+    When the inputs are 2D arrays, we always prioritize the values of `ρA` over
+    the values of `ρB` in case of ties. (i.e., values from `ρB` are always inserted
+    to the left of values from `ρA`). Also, update `IA` accordingly. In case of
+    overlapping values between two arrays IA[i] and IB[i], the ones in IB[i] (and
+    their corresponding values in ρB[i]) are ignored throughout the updating process
+    of IA[i] (and ρA[i]).
+
+    Unlike `_merge_topk_PI`, where `top-k` smallest values are kept, this function
+    keeps `top-k` largest values.
+
+    Parameters
+    ----------
+    ρA : numpy.ndarray
+        A (top-k) pearson profile where values in each row are sorted in ascending
+        order. `ρA` must be 1- or 2-dimensional.
+
+    ρB : numpy.ndarray
+        A (top-k) pearson profile, where values in each row are sorted in ascending
+        order. `ρB` must have the same shape as `ρA`.
+
+    IA : numpy.ndarray
+        A (top-k) matrix profile indices corresponding to `ρA`
+
+    IB : numpy.ndarray
+        A (top-k) matrix profile indices corresponding to `ρB`
+
+    Returns
+    -------
+    None
+    """
+    if ρA.ndim == 1:
+        mask = (ρB > ρA) & (IB != IA)
+        ρA[mask] = ρB[mask]
+        IA[mask] = IB[mask]
+    else:
+        k = ρA.shape[1]
+        tmp_ρ = np.empty(k, dtype=np.float64)
+        tmp_I = np.empty(k, dtype=np.int64)
+        last_idx = k - 1
+        for i in range(len(ρA)):
+            overlap = set(IB[i]).intersection(set(IA[i]))
+            aj, bj = last_idx, last_idx
+            idx = last_idx
+            # 2 * k iterations are required to traverse both A and B if needed.
+            for _ in range(2 * k):
+                if idx < 0:
+                    break
+                if bj >= 0 and ρB[i, bj] > ρA[i, aj]:
+                    if IB[i, bj] not in overlap:
+                        tmp_ρ[idx] = ρB[i, bj]
+                        tmp_I[idx] = IB[i, bj]
+                        idx -= 1
+                    bj -= 1
+                else:
+                    tmp_ρ[idx] = ρA[i, aj]
+                    tmp_I[idx] = IA[i, aj]
+                    idx -= 1
+                    aj -= 1
+
+            ρA[i] = tmp_ρ
+            IA[i] = tmp_I
+
+
+@njit
+def _shift_insert_at_index(a, idx, v, shift="right"):
+    """
+    If `shift=right` (default), all elements in `a[idx:]` are shifted to the right by
+    one element, `v` in inserted at index `idx` and the last element is discarded.
+    If `shift=left`, all elements in `a[:idx]` are shifted to the left by one element,
+    `v` in inserted at index `idx-1`, and the first element is discarded. In both cases,
+    `a` is updated in place and its length remains unchanged.
+
+    Note that all unrecognized `shift` inputs will default to `shift=right`.
+
+
+    Parameters
+    ----------
+    a: numpy.ndarray
+        A 1d array
+
+    idx: int
+        The index at which the value `v` should be inserted. This can be any
+        integer number from `0` to `len(a)`. When `idx=len(a)` and `shift="right"`,
+        OR when `idx=0` and `shift="left"`, then no change will occur on
+        the input array `a`.
+
+    v: float
+        The value that should be inserted into array `a` at index `idx`
+
+    shift: str, default "right"
+        The value that indicates whether the shifting of elements should be towards
+        the right or left. If `shift="right"` (default), all elements in `a[idx:]`
+        are shifted to the right by one element. If `shift="left"`, all elements
+        in `a[:idx]` are shifted to the left by one element.
+
+    Returns
+    -------
+    None
+    """
+    if shift == "left":
+        if 0 < idx <= len(a):
+            a[: idx - 1] = a[1:idx]
+            # elements were shifted to the left, thus the insertion index becomes
+            # `idx-1`
+            a[idx - 1] = v
+    else:
+        if 0 <= idx < len(a):
+            a[idx + 1 :] = a[idx:-1]
+            a[idx] = v
+
+
 def _check_P(P, threshold=1e-6):
     """
     Check if the 1-dimensional matrix profile values are too small and

diff --git a/stumpy/gpu_aamp.py b/stumpy/gpu_aamp.py
@@ -339,7 +339,9 @@ def _gpu_aamp(
     return profile_fname, indices_fname
 
 
-def gpu_aamp(T_A, m, T_B=None, ignore_trivial=True, device_id=0, p=2.0):
+def gpu_aamp(T_A, m, T_B=None, ignore_trivial=True, device_id=0, p=2.0, k=1):
+    # function needs to be revised to return (top-k) matrix profile and
+    # matrix profile indices
     """
     Compute the non-normalized (i.e., without z-normalization) matrix profile with one
     or more GPU devices
@@ -375,6 +377,11 @@ def gpu_aamp(T_A, m, T_B=None, ignore_trivial=True, device_id=0, p=2.0):
     p : float, default 2.0
         The p-norm to apply for computing the Minkowski distance.
 
+    k : int, default 1
+        The number of top `k` smallest distances used to construct the matrix profile.
+        Note that this will increase the total computational time and memory usage
+        when k > 1.
+
     Returns
     -------
     out : numpy.ndarray