
Update Scaling preprocessors (#69)
* Renamed ZNormalizer to StandardScaler

* Implement RobustScaler.py

* Included more tests
LouisCarpentier42 authored Jan 7, 2025
1 parent c8c81eb commit 5d438ab
Showing 20 changed files with 730 additions and 471 deletions.
2 changes: 2 additions & 0 deletions docs/additional_information/changelog.rst
@@ -12,6 +12,7 @@ Added
- Implemented ``ClusterBasedLocalOutlierFactor`` (CBLOF) anomaly detector.
- Implemented ``KMeansAnomalyDetector`` anomaly detector.
- Implemented ``CopulaBasedOutlierDetector`` (COPOD) anomaly detector.
+- Implemented ``RobustScaler`` preprocessor.

Changed
^^^^^^^
@@ -25,6 +26,7 @@ Changed

Fixed
^^^^^
+- Renamed ``ZNormalizer`` to ``StandardScaler`` to align with the scikit-learn naming.


[0.2.3] - 2024-12-02
3 changes: 2 additions & 1 deletion docs/api/preprocessing.rst
@@ -12,7 +12,8 @@ Preprocessing module
.. autoclass:: dtaianomaly.preprocessing.ChainedPreprocessor
.. autoclass:: dtaianomaly.preprocessing.Identity
.. autoclass:: dtaianomaly.preprocessing.MinMaxScaler
-.. autoclass:: dtaianomaly.preprocessing.ZNormalizer
+.. autoclass:: dtaianomaly.preprocessing.StandardScaler
+.. autoclass:: dtaianomaly.preprocessing.RobustScaler
.. autoclass:: dtaianomaly.preprocessing.MovingAverage
.. autoclass:: dtaianomaly.preprocessing.ExponentialMovingAverage
.. autoclass:: dtaianomaly.preprocessing.SamplingRateUnderSampler
8 changes: 4 additions & 4 deletions docs/getting_started/examples/quantitative_evaluation.rst
@@ -45,9 +45,9 @@ is applied.
 preprocessors = [
     Identity(),
-    ZNormalizer(),
-    ChainedPreprocessor([MovingAverage(10), ZNormalizer()]),
-    ChainedPreprocessor([ExponentialMovingAverage(0.8), ZNormalizer()])
+    StandardScaler(),
+    ChainedPreprocessor([MovingAverage(10), StandardScaler()]),
+    ChainedPreprocessor([ExponentialMovingAverage(0.8), StandardScaler()])
 ]
We will now initialize our anomaly detectors. Each anomaly detector will be combined with each
@@ -124,7 +124,7 @@ as follows:
{ 'type': <name-of-component>, 'optional-param': <value-optional-parameter>}
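For instance, a hypothetical entry for the new ``RobustScaler`` could look as follows (the ``quantile_range`` value here is purely illustrative):

{ 'type': 'RobustScaler', 'quantile_range': (10.0, 90.0) }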
The ``'type'`` equals the name of the component, for example ``'LocalOutlierFactor'``
-or ``'ZNormalizer'``. This string must exactly match the object name of the component
+or ``'StandardScaler'``. This string must exactly match the object name of the component
you want to add to the workflow. In addition, it is possible to define hyperparameters
of each component. For example for ``'LocalOutlierFactor'``, you must define a
``'window_size'``, but can optionally also define a ``'stride'``. An error will be
97 changes: 97 additions & 0 deletions dtaianomaly/preprocessing/RobustScaler.py
@@ -0,0 +1,97 @@

import numpy as np
from typing import Optional, Tuple
from sklearn.exceptions import NotFittedError

from dtaianomaly.utils import get_dimension
from dtaianomaly.preprocessing.Preprocessor import Preprocessor


class RobustScaler(Preprocessor):
"""
Scale the time series using robust statistics.
The :py:class:`~dtaianomaly.preprocessing.RobustScaler` is similar to
:py:class:`~dtaianomaly.preprocessing.StandardScaler`, but uses robust
statistics rather than mean and standard deviation. The center of the data
is computed via the median, and the scale is computed as the range between
two quantiles (by default uses the IQR). This ensures that scaling is less
affected by outliers.
For a time series :math:`x`, center :math:`c` and scale :math:`s`, observation
:math:`x_i` is scaled to observation :math:`y_i` using the following equation:
.. math::
y_i = \\frac{x_i - c}{s}
Notice the similarity with the formula for standard scaling. For multivariate
time series, each attribute is scaled independently, each with an independent
scale and center.
Parameters
----------
quantile_range: tuple of (float, float), default = (25.0, 75.0)
Quantile range used to compute the ``scale_`` of the robust scaler.
By default, this is equal to the Inter Quantile Range (IQR). The first
value of the quantile range corresponds to the smallest quantile, the
second value corresponds to the larger quantile. If the first value is
not smaller than the second value, an error will be thrown. The values
must also both be in the range [0, 100].
Attributes
----------
center_: array-like of shape (n_attributes)
The median value in each attribute of the training data.
scale_: array-like of shape (n_attributes)
The quantile range for each attribute of the training data.
Raises
------
NotFittedError
If the `transform` method is called before fitting this StandardScaler.
"""
    quantile_range: Tuple[float, float]
    center_: np.ndarray
    scale_: np.ndarray

    def __init__(self, quantile_range: Tuple[float, float] = (25.0, 75.0)):
        if not isinstance(quantile_range, tuple):
            raise TypeError("`quantile_range` should be a tuple")
        if len(quantile_range) != 2:
            raise ValueError("'quantile_range' should consist of exactly two values (length of 2)")
        if not isinstance(quantile_range[0], (float, int)) or isinstance(quantile_range[0], bool):
            raise TypeError("The first element of `quantile_range` should be a float or int")
        if not isinstance(quantile_range[1], (float, int)) or isinstance(quantile_range[1], bool):
            raise TypeError("The second element of `quantile_range` should be a float or int")
        if quantile_range[0] < 0.0:
            raise ValueError("the first element in 'quantile_range' must be at least 0.0")
        if quantile_range[1] > 100.0:
            raise ValueError("the second element in 'quantile_range' must be at most 100.0")
        if not quantile_range[0] < quantile_range[1]:
            raise ValueError("the first element in 'quantile_range' must be smaller than the second element in 'quantile_range'")
        self.quantile_range = quantile_range

    def _fit(self, X: np.ndarray, y: Optional[np.ndarray] = None) -> 'RobustScaler':
        if get_dimension(X) == 1:
            # Univariate case: compute a single center and scale.
            # NaN-aware percentiles are used, consistent with nanmedian above.
            self.center_ = np.array([np.nanmedian(X)])
            q_min = np.nanpercentile(X, q=self.quantile_range[0])
            q_max = np.nanpercentile(X, q=self.quantile_range[1])
            self.scale_ = np.array([q_max - q_min])
        else:
            # Multivariate case: compute a center and scale per attribute.
            self.center_ = np.nanmedian(X, axis=0)
            q_min = np.nanpercentile(X, q=self.quantile_range[0], axis=0)
            q_max = np.nanpercentile(X, q=self.quantile_range[1], axis=0)
            self.scale_ = q_max - q_min
        return self

    def _transform(self, X: np.ndarray, y: Optional[np.ndarray] = None) -> Tuple[np.ndarray, Optional[np.ndarray]]:
        if not (hasattr(self, 'center_') and hasattr(self, 'scale_')):
            raise NotFittedError(f'Call `fit` before using transform on {str(self)}')
        if not ((len(X.shape) == 1 and self.center_.shape[0] == 1) or X.shape[1] == self.center_.shape[0]):
            n_attributes = 1 if len(X.shape) == 1 else X.shape[1]
            raise AttributeError(f'Trying to robust scale a time series with {n_attributes} attributes while it was fitted on {self.center_.shape[0]} attributes!')

        # Observations that scale to NaN (e.g., because the scale is zero) are kept unchanged.
        X_ = (X - self.center_) / self.scale_
        return np.where(np.isnan(X_), X, X_), y
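
A quick usage sketch of the new scaler (illustrative values; this assumes the public ``fit``/``transform`` wrappers of ``Preprocessor`` follow the sklearn convention, i.e. ``fit`` returns ``self`` and ``transform`` returns the tuple produced by ``_transform``):

import numpy as np
from dtaianomaly.preprocessing import RobustScaler, StandardScaler

# A univariate series with a single extreme outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0])

robust, _ = RobustScaler().fit(x).transform(x)
standard, _ = StandardScaler().fit(x).transform(x)

# The robust scaling of the inliers is driven by the median and IQR and is
# barely affected by the outlier; standard scaling squeezes the inliers
# together because the outlier inflates the standard deviation.
print(robust[:5], standard[:5])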
dtaianomaly/preprocessing/ZNormalizer.py → dtaianomaly/preprocessing/StandardScaler.py (renamed)
@@ -6,9 +6,9 @@
from dtaianomaly.preprocessing.Preprocessor import Preprocessor


-class ZNormalizer(Preprocessor):
+class StandardScaler(Preprocessor):
"""
-    Rescale to zero mean, unit variance.
+    Standard scale the data: rescale to zero mean, unit variance.
Rescale to zero mean and unit variance. A mean value and standard
deviation are computed on a training set, after which these values
@@ -37,7 +37,7 @@ class ZNormalizer(Preprocessor):
Raises
------
NotFittedError
-        If the `transform` method is called before fitting this MinMaxScaler.
+        If the `transform` method is called before fitting this StandardScaler.
"""
min_std: float
mean_: np.array
@@ -46,7 +46,7 @@ class ZNormalizer(Preprocessor):
def __init__(self, min_std: float = 1e-9):
self.min_std = min_std

-    def _fit(self, X: np.ndarray, y: Optional[np.ndarray] = None) -> 'ZNormalizer':
+    def _fit(self, X: np.ndarray, y: Optional[np.ndarray] = None) -> 'StandardScaler':
if len(X.shape) == 1 or X.shape[1] == 1:
# univariate case
self.mean_ = np.array([np.nanmean(X)])
@@ -62,7 +62,7 @@ def _transform(self, X: np.ndarray, y: Optional[np.ndarray] = None) -> Tuple[np.
if not (hasattr(self, 'mean_') and hasattr(self, 'std_')):
raise NotFittedError(f'Call `fit` before using transform on {str(self)}')
if not ((len(X.shape) == 1 and self.mean_.shape[0] == 1) or X.shape[1] == self.mean_.shape[0]):
-        raise AttributeError(f'Trying to z-normalize a time series with {X.shape[0]} attributes while it was fitted on {self.min_.shape[0]} attributes!')
+        raise AttributeError(f'Trying to standard scale a time series with {X.shape[0]} attributes while it was fitted on {self.mean_.shape[0]} attributes!')

# If the std of all attributes is 0, then no transformation happens
if np.all((self.std_ < self.min_std)):
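Note that the ``min_std`` guard means a fully constant series is passed through unchanged; a quick check (a sketch, under the same assumptions about the public ``fit``/``transform`` wrappers as above):

import numpy as np
from dtaianomaly.preprocessing import StandardScaler

constant = np.ones(100)
scaled, _ = StandardScaler().fit(constant).transform(constant)
assert np.array_equal(scaled, constant)  # std is below min_std: no transformation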
8 changes: 5 additions & 3 deletions dtaianomaly/preprocessing/__init__.py
@@ -8,24 +8,26 @@
from .Preprocessor import Preprocessor, check_preprocessing_inputs, Identity
from .ChainedPreprocessor import ChainedPreprocessor
from .MinMaxScaler import MinMaxScaler
-from .ZNormalizer import ZNormalizer
+from .StandardScaler import StandardScaler
from .MovingAverage import MovingAverage
from .ExponentialMovingAverage import ExponentialMovingAverage
from .UnderSampler import SamplingRateUnderSampler, NbSamplesUnderSampler
from .Differencing import Differencing
from .PiecewiseAggregateApproximation import PiecewiseAggregateApproximation
+from .RobustScaler import RobustScaler

__all__ = [
'Preprocessor',
'check_preprocessing_inputs',
'Identity',
'ChainedPreprocessor',
'MinMaxScaler',
-    'ZNormalizer',
+    'StandardScaler',
'MovingAverage',
'ExponentialMovingAverage',
'SamplingRateUnderSampler',
'NbSamplesUnderSampler',
'Differencing',
-    'PiecewiseAggregateApproximation'
+    'PiecewiseAggregateApproximation',
+    'RobustScaler'
]
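With these exports in place, both scalers can be imported directly from the preprocessing package (a minimal sanity check, assuming ``dtaianomaly`` is installed):

from dtaianomaly.preprocessing import StandardScaler, RobustScaler

scaler = RobustScaler(quantile_range=(25.0, 75.0))  # the default IQR
normalizer = StandardScaler()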
7 changes: 5 additions & 2 deletions dtaianomaly/workflow/workflow_from_config.py
@@ -345,10 +345,10 @@ def preprocessing_entry(entry):
raise TypeError(f'Too many parameters given for entry: {entry}')
return preprocessing.MinMaxScaler()

-    elif processing_type == 'ZNormalizer':
+    elif processing_type == 'StandardScaler':
         if len(entry_without_type) > 0:
             raise TypeError(f'Too many parameters given for entry: {entry}')
-        return preprocessing.ZNormalizer()
+        return preprocessing.StandardScaler()

elif processing_type == 'MovingAverage':
return preprocessing.MovingAverage(**entry_without_type)
@@ -368,6 +368,9 @@ def preprocessing_entry(entry):
elif processing_type == 'PiecewiseAggregateApproximation':
return preprocessing.PiecewiseAggregateApproximation(**entry_without_type)

+    elif processing_type == 'RobustScaler':
+        return preprocessing.RobustScaler(**entry_without_type)

elif processing_type == 'ChainedPreprocessor':
if len(entry_without_type) != 1:
raise TypeError(f'ChainedPreprocessor must have base_preprocessors as key: {entry}')
6 changes: 3 additions & 3 deletions notebooks/Config.json
@@ -20,14 +20,14 @@
],
"preprocessors": [
{"type": "Identity"},
{"type": "ZNormalizer"},
{"type": "StandardScaler"},
{"type": "ChainedPreprocessor", "base_preprocessors": [
{"type": "MovingAverage", "window_size": 10},
{"type": "ZNormalizer"}
{"type": "StandardScaler"}
]},
{"type": "ChainedPreprocessor", "base_preprocessors": [
{"type": "ExponentialMovingAverage", "alpha": 0.8},
{"type": "ZNormalizer"}
{"type": "StandardScaler"}
]}
],
"detectors": [
2 changes: 1 addition & 1 deletion notebooks/Industrial-anomaly-detection.ipynb
@@ -313,7 +313,7 @@
"source": [
"##### (2) Preprocessors\n",
"\n",
"Next, we can define zero, one or multiple preprocessors to process the data. ``dtaianomaly`` already offers a number of preprocessors, (e.g., ``MinMaxScaler``, ``ZNormalizer``, ``MovingAverage``, ``ChainedPreprocessor``, etc.), but it is also possible to develop a custom preprocessor. For example, the wind turbine data has missing values, which typically cannot be handled by anomaly detectors. To cope with these, we define an ``Imputer`` preprocessor as below. All we need to do for this is add ``Preprocessor`` as a parent of the class and implement the ``._fit()`` and ``._transform()`` methods. For the ``Imputer``, no fitting is required, and the missing values are replaced by the previous observed value. Note that more complex imputation strategies could be implemented as well. "
"Next, we can define zero, one or multiple preprocessors to process the data. ``dtaianomaly`` already offers a number of preprocessors, (e.g., ``MinMaxScaler``, ``StandardScaler``, ``MovingAverage``, ``ChainedPreprocessor``, etc.), but it is also possible to develop a custom preprocessor. For example, the wind turbine data has missing values, which typically cannot be handled by anomaly detectors. To cope with these, we define an ``Imputer`` preprocessor as below. All we need to do for this is add ``Preprocessor`` as a parent of the class and implement the ``._fit()`` and ``._transform()`` methods. For the ``Imputer``, no fitting is required, and the missing values are replaced by the previous observed value. Note that more complex imputation strategies could be implemented as well. "
],
"id": "6d23ff059c6f7832"
},
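As a rough illustration of that pattern, a forward-fill ``Imputer`` could look like the sketch below (hypothetical code, not part of this commit; it assumes only the ``Preprocessor`` interface shown in ``RobustScaler.py`` above):

import numpy as np
from typing import Optional, Tuple
from dtaianomaly.preprocessing import Preprocessor

class Imputer(Preprocessor):
    """Replace each missing value by the previous observed value."""

    def _fit(self, X: np.ndarray, y: Optional[np.ndarray] = None) -> 'Imputer':
        return self  # No fitting is required for forward-fill imputation.

    def _transform(self, X: np.ndarray, y: Optional[np.ndarray] = None) -> Tuple[np.ndarray, Optional[np.ndarray]]:
        X_ = X.astype(float)  # astype returns a copy, so X itself is not modified
        # Walk along the time axis and carry the last observed value forward.
        # Leading missing values (with no previous observation) remain NaN.
        for t in range(1, X_.shape[0]):
            X_[t] = np.where(np.isnan(X_[t]), X_[t - 1], X_[t])
        return X_, y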