Skip to content

Commit

Permalink
Fixed #592 Add Top-K Nearest Neighbors for Normalized Matrix Profile (#…
Browse files Browse the repository at this point in the history
…595)

* Added test function to test TopK scrump in AB_join

* Refactored

* Added definition of parameter k to docstring

* Improved docstring

* Removed trailing colon

* Cleaned code

* Avoided allocating new memory in inner for-loop

* Fixed typos

* Improved comments

* Avoided allocating new memory in each iteration

* Same ndim in output regardless of value of k

* Revised docstrings

* Enhanced function to perform shift left as well

* Enhanced test function to test newly added functionality

* Fixed format

* Fixed format

* Removed/Renamed intermediate variables

* Renamed variable for the sake of consistency

* Avoided shape mismatch by reshaping ndarray

* Refactored

* Fixed comment

* Refacored and Minor restructuring of lines

* Modified stimp after changing output shape in scrump

* Add pragma no cover

* Revised Docstrings

* Fixed docstring

* Revised  docstring

* Removed unnecessary dangling else

* Removed unnecessary comment

* Revised structure of  test function

so, it follows the structure of the performant version

* Replaced ravel with flatten to get copy of array

* Changed the type of input parameter and revised docstring

* Update the value of parameter to match its type

* Update the value of parameter to match its type

* Correct format

* Changed output structure of naive.scrump

* Correct format

* Add test function for scrump_plus_plus for TopK

* Add naive version to merge peason profiles

* Add test function for  merging pearson profiles

* Corret format

* Add performant function to merge pearson profiles

* Optimize function

* Avoid creating new memory

* Improve docstring

* Refactored

* Avoid creating new memory in for-loop

* Update test function

* Revise function to make it parallelizable

* Full test and coverage in 1hr

* Revise docstrings

* Revise docstrings

* Optimize function

* Optimize function

* Rename variable to improve readability

* Revise comments

* Improve  comments and docstrings

* Correct naive implementation

* Enhance naive function to support top  matrix profile

* Enhace performant function to support topk matrix profile

* Update existing test functions

* Correct format

* Fix shape of array

* Fix shape of array

* Add kind keyword for sorting

* Fix bugs

* Remove ineffective inner prange

* Temporarily added parameter k to avoid decorator failure

* Improve comments

* Improve comments

* Improve docstring

* Add KNN test function for stumpi

* Fix shape of output for KNN test

* Full test and coverage 1 hr

* Avoid using searchsort when k is 1

* Revise code according to top k matrix profile structure

* Remove if condition

* Improve dosctrings

* Avoid allocating new memory

* Avoid allocating new memory

* Improve comments

* Remove numpy.where to avoid copying unchanged values

* Remove unnecessary trailing colon

* Replace negative np.inf with np.NINF

* delete a wrong file

* Avoid advance indexing by using chain slicing so it can be run by njit

* Improve docstring

* Added gpu_searchsorted checks when GPUs unavailable

* Added error checks and pytest ignore warning

* Improve docstrings

* minor changes in if-block and dosctring

* Improve docstrings

* Improve  comments

* minor changes

* Correct format

* Improve docstrings

* optimize functions

* Remove redundant import

* minor change

* Revise docstrings

* Fixed black formatting after conflict resolution

* Correct docstring

* Revise docstrings

* minor change

* Revise comments

* Avoid redundant allocation of memory

* Revise docstrings and comments

* rename variables

* minor correction

* Fix indexing

* Add new test function

* Modify test function

* Avoid dumplicate in naive prescrump

* Add parameter assume_unique to handle duplicates

* Add test function to test for duplicates in topk_merge

* Add parameter assume_unique to performant merge_topk

* fix test function

* Fix bug

* Revise prescrump to avoid duplicates

* Avoid duplocates in scrump

* Revise test function to consider new parameter

* Fix bug

* Revise naive scrump to avoid duplicates

* Add comment

* minor optimization

* Correct style

* Correct style

* increase threshold

* Specifiy kind in sort

* minor change

* specify kind in sort

* minor changes

* De-otpimize if condition

Due to numerical erorrs, we need to avoid partial traversal of array

* Update scrump

* minor changes

* add new test function

* optimize if condition

* Give priority to PA in case of ties between IA and IB

* Remove trailing colon

* update test function

* revise function to avoid adding new parameter

* Update module scrump and improvee its readability

* Fix syntax

* update test functions

* minor fix

* correct format

* Improve docstring

* Avoid overlap while merging matrix profiles

* Add function to find overlapping values

* replace numpy function with our implementation

* Avoid unnecessary call of a function

* Revise docsting and comment

* Improve test function

* Remove comment

* Add test function to ensure duplicates are avoided

* Improve comments

* Enhance naive version to avoid duplicates while merging

* Add test function and revise naive version

* Improve code readability and comment

* Update top-k profile by getting insertion index

In NearestNeighbor case, the distance  between sequence i and its NN
is the smallest. However, due to `imprecision` in calculation, it is
possible that its corresponding distance, i.e. distance  between seq i
and its NN, is not the smallest value in its  top-k neighbors. So,
instead of inserting it at index 0, we use numpy.searchsorted to find
the correct insertion index.

* Merge nested if statements into one

* Remove blank lines

* Fix typo

* Improve comment

* Improve comments

* Improve docstring

* Remove unnecessary comments

* passing copy of variable as input

* minor change in test functions

* Correct style

* Revise comment

* Remove comment

* Revise comment

* Fix format

* Remove unnecessary newline

* Return 1D array for matrix profile when `k` is 1

* Remove unnecessary flattening operatiton on array

* Fix comments

* Make matrix profile and mp index 1D when k=1

* Revise tests functions

* Improve Docstrings

* Make prescrump output 1D when k is one

* minor change

* update test functions

* Modify merge_topk to support 1D input

* Fix merge_topk

* Fix shape of variables in test functions

* Remove unnecessary flatten operation

* Update test function for case k=1

* revise comment

* Avoid using return in the middle of code

* Add new private function to get 2D ouput when k=1

* Remove check for 1D in merge_topk

* Revise test functions

* Revise docstring to provide description for 1D case

* Add overlap check in merge_topk with 1D input

* Add overlap check in 1D and revise docstring

* Add separate test function for _merge_topk 1D case

* Add preprocessing function for prescrump

* Update test function

* fix missing argument

* Fix Docstring

* Put back the missing decorator

* Add preprocessing function in prescraamp

* Revise naive function

* Fix value of imprecision in test functions

* create overlaps randomly for test merge_topk in 1D case

* Revise docstrings

* Fix docstrings

* minor changes

* minor fix

* change variable name

* change variables names

* convert attr to property attr to get 1D when k is 1

* avoid calling performant function in a naive function

* minor modification on z_norm functions

* fix function

* revise docstrings

* change variable name

* Relocate comment

* minor changes

* fix uint

* fixed uint

* fixed test function

* fixed calling function

* Removed redundant return statement

Co-authored-by: Sean Law <[email protected]>
Co-authored-by: Sean M. Law <[email protected]>
  • Loading branch information
3 people authored Nov 9, 2022
1 parent 65d3e85 commit e8a7dd4
Show file tree
Hide file tree
Showing 22 changed files with 2,545 additions and 579 deletions.
8 changes: 7 additions & 1 deletion stumpy/aamp.py
Original file line number Diff line number Diff line change
Expand Up @@ -255,7 +255,8 @@ def _aamp(
return np.power(P[0, :, :], 1.0 / p), I[0, :, :]


def aamp(T_A, m, T_B=None, ignore_trivial=True, p=2.0):
def aamp(T_A, m, T_B=None, ignore_trivial=True, p=2.0, k=1):
# function needs to be changed to return top-k matrix profile
"""
Compute the non-normalized (i.e., without z-normalization) matrix profile
Expand All @@ -282,6 +283,11 @@ def aamp(T_A, m, T_B=None, ignore_trivial=True, p=2.0):
p : float, default 2.0
The p-norm to apply for computing the Minkowski distance.
k : int, default 1
The number of top `k` smallest distances used to construct the matrix profile.
Note that this will increase the total computational time and memory usage
when k > 1.
Returns
-------
out : numpy.ndarray
Expand Down
8 changes: 7 additions & 1 deletion stumpy/aamped.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,8 @@
logger = logging.getLogger(__name__)


def aamped(dask_client, T_A, m, T_B=None, ignore_trivial=True, p=2.0):
def aamped(dask_client, T_A, m, T_B=None, ignore_trivial=True, p=2.0, k=1):
# function needs to be revised to return top-k matrix profile
"""
Compute the non-normalized (i.e., without z-normalization) matrix profile
Expand Down Expand Up @@ -46,6 +47,11 @@ def aamped(dask_client, T_A, m, T_B=None, ignore_trivial=True, p=2.0):
p : float, default 2.0
The p-norm to apply for computing the Minkowski distance.
k : int, default 1
The number of top `k` smallest distances used to construct the matrix profile.
Note that this will increase the total computational time and memory usage
when k > 1.
Returns
-------
out : numpy.ndarray
Expand Down
14 changes: 13 additions & 1 deletion stumpy/aampi.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@


class aampi:
# needs to be enhanced to support top-k matrix profile
"""
Compute an incremental non-normalized (i.e., without z-normalization) matrix profile
for streaming data
Expand All @@ -28,6 +29,11 @@ class aampi:
p : float, default 2.0
The p-norm to apply for computing the Minkowski distance.
k : int, default 1
The number of top `k` smallest distances used to construct the matrix profile.
Note that this will increase the total computational time and memory usage
when k > 1.
Attributes
----------
P_ : numpy.ndarray
Expand Down Expand Up @@ -62,7 +68,7 @@ class aampi:
Note that we have extended this algorithm for AB-joins as well.
"""

def __init__(self, T, m, egress=True, p=2.0):
def __init__(self, T, m, egress=True, p=2.0, k=1):
"""
Initialize the `stumpi` object
Expand All @@ -81,6 +87,12 @@ def __init__(self, T, m, egress=True, p=2.0):
p : float, default 2.0
The p-norm to apply for computing the Minkowski distance.
k : int, default 1
The number of top `k` smallest distances used to construct the matrix
profile. Note that this will increase the total computational time and
memory usage when k > 1.
"""
self._T = core._preprocess(T)
core.check_window_size(m, max_size=self._T.shape[-1])
Expand Down
216 changes: 214 additions & 2 deletions stumpy/core.py
Original file line number Diff line number Diff line change
Expand Up @@ -209,6 +209,20 @@ def _gpu_aamp_stimp_driver_not_found(*args, **kwargs): # pragma: no cover
driver_not_found()


def _gpu_searchsorted_left_driver_not_found(*args, **kwargs): # pragma: no cover
"""
Dummy function to raise CudaSupportError driver not found error.
"""
driver_not_found()


def _gpu_searchsorted_right_driver_not_found(*args, **kwargs): # pragma: no cover
"""
Dummy function to raise CudaSupportError driver not found error.
"""
driver_not_found()


def get_pkg_name(): # pragma: no cover
"""
Return package name.
Expand Down Expand Up @@ -240,7 +254,7 @@ def rolling_window(a, window):
return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)


def z_norm(a, axis=0):
def z_norm(a, axis=0, threshold=config.STUMPY_STDDEV_THRESHOLD):
"""
Calculate the z-normalized input array `a` by subtracting the mean and
dividing by the standard deviation along a given axis.
Expand All @@ -253,13 +267,16 @@ def z_norm(a, axis=0):
axis : int, default 0
NumPy array axis
threshold : float, default to config.STUMPY_STDDEV_THRESHOLD
A non-nan std value being less than `threshold` will be replaced with 1.0
Returns
-------
output : numpy.ndarray
An array with z-normalized values computed along a specified axis.
"""
std = np.std(a, axis, keepdims=True)
std[std == 0] = 1
std[np.less(std, threshold, where=~np.isnan(std))] = 1.0

return (a - np.mean(a, axis, keepdims=True)) / std

Expand Down Expand Up @@ -2559,6 +2576,201 @@ def _select_P_ABBA_value(P_ABBA, k, custom_func=None):
return MPdist


@njit
def _merge_topk_PI(PA, PB, IA, IB):
"""
Merge two top-k matrix profiles `PA` and `PB`, and update `PA` (in place).
When the inputs are 1D arrays, PA[i] is updated if it is greater than PB[i] and
IA[i] != IB[i]. In such case, PA[i] and IA[i] are replaced with PB[i] and IB[i],
respectively. (Note that it might happen that IA[i]==IB[i] but PA[i] != PB[i].
This situation can occur if there is slight imprecision in numerical calculations.
In that case, we do not update PA[i] and IA[i]. While updating PA[i] and IA[i]
is harmless in this case, we avoid doing that so to be consistent with the merging
process when the inputs are 2D arrays)
When the inputs are 2D arrays, we always prioritize the values of `PA` over the
values of `PB` in case of ties. (i.e., values from `PB` are always inserted to
the right of values from `PA`). Also, update `IA` accordingly. In case of
overlapping values between two arrays IA[i] and IB[i], the ones in IB[i] (and
their corresponding values in PB[i]) are ignored throughout the updating process
of IA[i] (and PA[i]).
Unlike `_merge_topk_ρI`, where `top-k` largest values are kept, this function
keeps `top-k` smallest values.
Parameters
----------
PA : numpy.ndarray
A (top-k) matrix profile where values in each row are sorted in ascending
order. `PA` must be 1- or 2-dimensional.
PB : numpy.ndarray
A (top-k) matrix profile where values in each row are sorted in ascending
order. `PB` must have the same shape as `PA`.
IA : numpy.ndarray
A (top-k) matrix profile indices corresponding to `PA`
IB : numpy.ndarray
A (top-k) matrix profile indices corresponding to `PB`
Returns
-------
None
"""
if PA.ndim == 1:
mask = (PB < PA) & (IB != IA)
PA[mask] = PB[mask]
IA[mask] = IB[mask]
else:
k = PA.shape[1]
tmp_P = np.empty(k, dtype=np.float64)
tmp_I = np.empty(k, dtype=np.int64)
for i in range(PA.shape[0]):
overlap = set(IB[i]).intersection(set(IA[i]))
aj, bj = 0, 0
idx = 0
# 2 * k iterations are required to traverse both A and B if needed.
for _ in range(2 * k):
if idx >= k:
break
if bj < k and PB[i, bj] < PA[i, aj]:
if IB[i, bj] not in overlap:
tmp_P[idx] = PB[i, bj]
tmp_I[idx] = IB[i, bj]
idx += 1
bj += 1
else:
tmp_P[idx] = PA[i, aj]
tmp_I[idx] = IA[i, aj]
idx += 1
aj += 1

PA[i] = tmp_P
IA[i] = tmp_I


@njit
def _merge_topk_ρI(ρA, ρB, IA, IB):
"""
Merge two top-k pearson profiles `ρA` and `ρB`, and update `ρA` (in place).
When the inputs are 1D arrays, ρA[i] is updated if it is less than ρB[i] and
IA[i] != IB[i]. In such case, ρA[i] and IA[i] are replaced with ρB[i] and IB[i],
respectively. (Note that it might happen that IA[i]==IB[i] but ρA[i] != ρB[i].
This situation can occur if there is slight imprecision in numerical calculations.
In that case, we do not update ρA[i] and IA[i]. While updating ρA[i] and IA[i]
is harmless in this case, we avoid doing that so to be consistent with the merging
process when the inputs are 2D arrays)
When the inputs are 2D arrays, we always prioritize the values of `ρA` over
the values of `ρB` in case of ties. (i.e., values from `ρB` are always inserted
to the left of values from `ρA`). Also, update `IA` accordingly. In case of
overlapping values between two arrays IA[i] and IB[i], the ones in IB[i] (and
their corresponding values in ρB[i]) are ignored throughout the updating process
of IA[i] (and ρA[i]).
Unlike `_merge_topk_PI`, where `top-k` smallest values are kept, this function
keeps `top-k` largest values.
Parameters
----------
ρA : numpy.ndarray
A (top-k) pearson profile where values in each row are sorted in ascending
order. `ρA` must be 1- or 2-dimensional.
ρB : numpy.ndarray
A (top-k) pearson profile, where values in each row are sorted in ascending
order. `ρB` must have the same shape as `ρA`.
IA : numpy.ndarray
A (top-k) matrix profile indices corresponding to `ρA`
IB : numpy.ndarray
A (top-k) matrix profile indices corresponding to `ρB`
Returns
-------
None
"""
if ρA.ndim == 1:
mask = (ρB > ρA) & (IB != IA)
ρA[mask] = ρB[mask]
IA[mask] = IB[mask]
else:
k = ρA.shape[1]
tmp_ρ = np.empty(k, dtype=np.float64)
tmp_I = np.empty(k, dtype=np.int64)
last_idx = k - 1
for i in range(len(ρA)):
overlap = set(IB[i]).intersection(set(IA[i]))
aj, bj = last_idx, last_idx
idx = last_idx
# 2 * k iterations are required to traverse both A and B if needed.
for _ in range(2 * k):
if idx < 0:
break
if bj >= 0 and ρB[i, bj] > ρA[i, aj]:
if IB[i, bj] not in overlap:
tmp_ρ[idx] = ρB[i, bj]
tmp_I[idx] = IB[i, bj]
idx -= 1
bj -= 1
else:
tmp_ρ[idx] = ρA[i, aj]
tmp_I[idx] = IA[i, aj]
idx -= 1
aj -= 1

ρA[i] = tmp_ρ
IA[i] = tmp_I


@njit
def _shift_insert_at_index(a, idx, v, shift="right"):
"""
If `shift=right` (default), all elements in `a[idx:]` are shifted to the right by
one element, `v` in inserted at index `idx` and the last element is discarded.
If `shift=left`, all elements in `a[:idx]` are shifted to the left by one element,
`v` in inserted at index `idx-1`, and the first element is discarded. In both cases,
`a` is updated in place and its length remains unchanged.
Note that all unrecognized `shift` inputs will default to `shift=right`.
Parameters
----------
a: numpy.ndarray
A 1d array
idx: int
The index at which the value `v` should be inserted. This can be any
integer number from `0` to `len(a)`. When `idx=len(a)` and `shift="right"`,
OR when `idx=0` and `shift="left"`, then no change will occur on
the input array `a`.
v: float
The value that should be inserted into array `a` at index `idx`
shift: str, default "right"
The value that indicates whether the shifting of elements should be towards
the right or left. If `shift="right"` (default), all elements in `a[idx:]`
are shifted to the right by one element. If `shift="left"`, all elements
in `a[:idx]` are shifted to the left by one element.
Returns
-------
None
"""
if shift == "left":
if 0 < idx <= len(a):
a[: idx - 1] = a[1:idx]
# elements were shifted to the left, thus the insertion index becomes
# `idx-1`
a[idx - 1] = v
else:
if 0 <= idx < len(a):
a[idx + 1 :] = a[idx:-1]
a[idx] = v


def _check_P(P, threshold=1e-6):
"""
Check if the 1-dimensional matrix profile values are too small and
Expand Down
9 changes: 8 additions & 1 deletion stumpy/gpu_aamp.py
Original file line number Diff line number Diff line change
Expand Up @@ -339,7 +339,9 @@ def _gpu_aamp(
return profile_fname, indices_fname


def gpu_aamp(T_A, m, T_B=None, ignore_trivial=True, device_id=0, p=2.0):
def gpu_aamp(T_A, m, T_B=None, ignore_trivial=True, device_id=0, p=2.0, k=1):
# function needs to be revised to return (top-k) matrix profile and
# matrix profile indices
"""
Compute the non-normalized (i.e., without z-normalization) matrix profile with one
or more GPU devices
Expand Down Expand Up @@ -375,6 +377,11 @@ def gpu_aamp(T_A, m, T_B=None, ignore_trivial=True, device_id=0, p=2.0):
p : float, default 2.0
The p-norm to apply for computing the Minkowski distance.
k : int, default 1
The number of top `k` smallest distances used to construct the matrix profile.
Note that this will increase the total computational time and memory usage
when k > 1.
Returns
-------
out : numpy.ndarray
Expand Down
Loading

0 comments on commit e8a7dd4

Please sign in to comment.