Merge pull request #14 from lifelonglab/example_update

Update example and add additional options for heatmap

Nyderx authored Aug 14, 2024
2 parents 9c8ba8a + 0edbc17 commit f040018
Showing 5 changed files with 92 additions and 45 deletions.
52 changes: 33 additions & 19 deletions docs/index.md
@@ -1,43 +1,57 @@
### What is pyCLAD?

pyCLAD is a unified framework for continual anomaly detection. Its main goal is to foster successful scientific
development in the field by providing robust, off-the-shelf implementations of common functionalities for continual
anomaly detection, minimizing the risk of error-prone tasks and fostering replicability. pyCLAD also facilitates the
design and implementation of experimental pipelines, providing a streamlined, unified, and fully reproducible
execution workflow.

It also provides a simple and convenient infrastructure for designing new strategies, models, and evaluation
procedures, enabling researchers to avoid repetitive tasks, focus on the creative and scientific aspects of their
work, and reduce the friction of low-level implementation details.

The core coding infrastructure is influenced by PyTorch, a very popular deep learning framework, and by scikit-learn,
the leading machine learning and data analysis library in Python.

### What is continual anomaly detection?

**Continual Learning** puts emphasis on machine learning models that continuously adapt to new challenges in dynamic
environments while retaining past knowledge.

**Anomaly Detection** is the process of detecting deviations from the normal behavior of a process. It has a very wide
range of applications, including monitoring cyber-physical systems, human conditions, and network traffic.

**Continual anomaly detection** lies at the intersection of these two fields. Its strategies pursue goals that set
them apart from other types of anomaly detection approaches. One possible categorization is the following:

- **Offline**: Models are trained once on background data and do not require updates (examples: post-incident analysis,
breast cancer detection). This approach is static in nature and does not provide adaptation.

- **Online**: Models are updated as new data is observed, assuming that the most recent information is the most
relevant. This approach is popular in real-world dynamic applications where adaptation is necessary, but it makes
models prone to forgetting past knowledge.

- **Continual**: Models are updated to simultaneously consider *adaptation* to new conditions and *knowledge retention*
of previously observed (and potentially recurring) conditions. This behavior attempts to overcome the limitations of
both offline and online anomaly detection in complex scenarios; the toy sketch below illustrates the idea.
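
As a rough illustration of the continual regime, here is a toy sketch (not the pyCLAD API): the detector adapts to
each new concept while retaining earlier ones through the simplest possible mechanism, full replay of previously seen
data. pyCLAD's `CumulativeStrategy`, used in the example below, follows the same replay idea.

```python
import numpy as np


class CumulativeToyDetector:
    """Toy continual strategy: refit on all data observed so far (full replay)."""

    def __init__(self):
        self.buffer = []

    def learn(self, train_data: np.ndarray) -> None:
        self.buffer.append(train_data)
        data = np.concatenate(self.buffer)  # adaptation + retention via replay
        self.mean = data.mean(axis=0)  # stand-in for fitting a real detector

    def score(self, x: np.ndarray) -> np.ndarray:
        # Distance from the mean of all data seen so far, as a toy anomaly score
        return np.linalg.norm(x - self.mean, axis=1)
```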

If you want to learn more about continual anomaly detection, we
recommend [this open-access paper](https://ieeexplore.ieee.org/abstract/document/10473036/).

### How do I install pyCLAD?

pyCLAD is available as a [Python package on PyPI](https://pypi.org/project/pyclad/). Therefore, it can be installed
using tools such as pip and conda.

#### Conda

```
conda install -c conda-forge pyclad
```

#### Pip

```
pip install pyclad
```
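
To check that the installation succeeded, import the package (the import name `pyclad` matches the module paths used
in the examples below):

```
python -c "import pyclad"
```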
@@ -50,8 +64,8 @@ We do not include them in default installation to avoid putting heavy dependencies
pyCLAD supports the use of any model from the pyOD library, some of which may require the installation of additional
packages (see the [pyOD docs](https://pyod.readthedocs.io/en/latest/)).
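
For example, a pyOD detector can be plugged in through one of the provided adapters; a minimal sketch using the
`IsolationForestAdapter` import path that appears in this commit's example script:

```python
from pyclad.models.adapters.pyod_adapters import IsolationForestAdapter

# Wraps pyOD's IsolationForest behind pyCLAD's model interface
model = IsolationForestAdapter()
```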


### Citing pyCLAD

A paper describing pyCLAD is currently under review at the SoftwareX journal. Feel free to use pyCLAD in your research,
but please check back before your final submission to see whether we already have a DOI for pyCLAD. Thank you!

58 changes: 36 additions & 22 deletions examples/concept_incremental_example.py
@@ -13,7 +13,7 @@
from pyclad.metrics.continual.average_continual import ContinualAverage
from pyclad.metrics.continual.backward_transfer import BackwardTransfer
from pyclad.metrics.continual.forward_transfer import ForwardTransfer
from pyclad.models.adapters.pyod_adapters import IsolationForestAdapter
from pyclad.output.json_writer import JsonOutputWriter
from pyclad.scenarios.concept_incremental import ConceptIncrementalScenario
from pyclad.strategies.baselines.cumulative import CumulativeStrategy
@@ -22,35 +22,45 @@
logging.basicConfig(level=logging.DEBUG, handlers=[logging.FileHandler("debug.log"), logging.StreamHandler()])


def _generate_normal_dist(mean, cov):
    # Train on normal samples only; test on half normal, half anomalous points
    # drawn from a shifted-mean distribution (labels: 0 = normal, 1 = anomaly).
    train_data = np.random.multivariate_normal(mean, cov, (100,))
    test_data = np.concatenate([np.random.multivariate_normal(mean, cov, (50,)),
                                np.random.multivariate_normal([3 * m for m in mean], cov, (50,))])
    test_labels = np.array([0] * 50 + [1] * 50)
    return train_data, test_data, test_labels


if __name__ == "__main__":
    """
    This example shows how to create a simple dataset with 4 concepts and carry out a concept-incremental scenario
    with the CumulativeStrategy and an IsolationForest model.
    """
    concept1_train_data, concept1_test_data, concept1_test_labels = _generate_normal_dist((2, 2), [[1, 0], [0, 1]])
    concept2_train_data, concept2_test_data, concept2_test_labels = _generate_normal_dist((50, 50), [[1, 0], [0, 1]])
    concept3_train_data, concept3_test_data, concept3_test_labels = _generate_normal_dist((5, 5), [[1, 0], [0, 1]])
    concept4_train_data, concept4_test_data, concept4_test_labels = _generate_normal_dist((20, 20), [[1, 0], [0, 1]])

    concept1_train = Concept("concept1", data=concept1_train_data)
    concept1_test = Concept("concept1", data=concept1_test_data, labels=concept1_test_labels)

    concept2_train = Concept("concept2", data=concept2_train_data)
    concept2_test = Concept("concept2", data=concept2_test_data, labels=concept2_test_labels)

    concept3_train = Concept("concept3", data=concept3_train_data)
    concept3_test = Concept("concept3", data=concept3_test_data, labels=concept3_test_labels)

    concept4_train = Concept("concept4", data=concept4_train_data)
    concept4_test = Concept("concept4", data=concept4_test_data, labels=concept4_test_labels)

    # Build a dataset based on the previously created concepts
    dataset = ConceptsDataset(
        name="GeneratedDataset",
        train_concepts=[concept1_train, concept2_train, concept3_train, concept4_train],
        test_concepts=[concept1_test, concept2_test, concept3_test, concept4_test],
    )

    # Define the model, strategy, and callbacks
    model = IsolationForestAdapter()
    strategy = CumulativeStrategy(model)
    callbacks = [
        ConceptMetricCallback(
@@ -60,9 +70,11 @@
        TimeEvaluationCallback(),
    ]

    # Execute the concept-incremental scenario
    scenario = ConceptIncrementalScenario(dataset, strategy=strategy, callbacks=callbacks)
    scenario.run()

    # Save the results
    output_writer = JsonOutputWriter(pathlib.Path("output.json"))
    output_writer.write([model, dataset, strategy, *callbacks])

@@ -73,5 +85,7 @@
    sns.scatterplot(x=concept2_test.data[:, 0], y=concept2_test.data[:, 1], label="concept2_test")
    sns.scatterplot(x=concept3_train.data[:, 0], y=concept3_train.data[:, 1], label="concept3")
    sns.scatterplot(x=concept3_test.data[:, 0], y=concept3_test.data[:, 1], label="concept3_test")
    sns.scatterplot(x=concept4_train.data[:, 0], y=concept4_train.data[:, 1], label="concept4")
    sns.scatterplot(x=concept4_test.data[:, 0], y=concept4_test.data[:, 1], label="concept4_test")
    plt.legend()
    plt.show()
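
A file like the `output.json` written above is what the heatmap example in the next file loads; a minimal sketch of
reading it back (key names taken from `examples/plot_heatmap_example.py` below):

```python
import json
import pathlib

with pathlib.Path("output.json").open() as fp:
    loaded_data = json.load(fp)

# Per-concept ROC-AUC results recorded by ConceptMetricCallback
concepts_order = loaded_data["concept_metric_callback_ROC-AUC"]["concepts_order"]
metric_matrix = loaded_data["concept_metric_callback_ROC-AUC"]["metric_matrix"]
```
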
5 changes: 3 additions & 2 deletions examples/plot_heatmap_example.py
@@ -9,7 +9,8 @@
    loaded_data = json.load(fp)
concepts_order = loaded_data["concept_metric_callback_ROC-AUC"]["concepts_order"]
metric_matrix = loaded_data["concept_metric_callback_ROC-AUC"]["metric_matrix"]
names_mapping = {"concept1": "C1", "concept2": "C2", "concept3": "C3", "concept4": "C4"}
plot_metric_heatmap(
    metric_matrix, concepts_order, names_mapping=names_mapping, annotate=True, ignore_upper_diagonal=True,
    output_path=pathlib.Path("heatmap.pdf"),
)
20 changes: 19 additions & 1 deletion src/pyclad/analysis/scenario_heatmap.py
@@ -2,11 +2,21 @@
from typing import Dict, List

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib.axes import Axes


def _create_upper_diagonal_mask(tasks_no: int) -> np.ndarray:
    # Boolean mask that is True strictly above the diagonal, i.e. for cells where
    # the evaluated concept comes later in concepts_order than the learned one.
    mask = np.zeros((tasks_no, tasks_no))
    for i in range(tasks_no):
        for j in range(tasks_no):
            if j > i:
                mask[i, j] = True
    return mask


def plot_metric_heatmap(
    matrix: Dict,
    concepts_order: List[str],
@@ -18,6 +28,7 @@ def plot_metric_heatmap(
    annotate: bool = False,
    color_palette: str = "plasma",
    figsize: tuple = (6, 5),
    ignore_upper_diagonal: bool = False,
):
    sns.set_theme(style="darkgrid")
    sns.set(rc={"figure.figsize": figsize})
@@ -38,13 +49,20 @@
    df = pd.DataFrame(data, columns=["learned_concept", "evaluated_concept", "metric_value"])
    df = df.pivot(index="learned_concept", columns="evaluated_concept", values="metric_value")
    p: Axes = sns.heatmap(
        df,
        vmin=0,
        vmax=1,
        center=0.5,
        cmap=sns.color_palette(color_palette, as_cmap=True),
        annot=annotate,
        mask=_create_upper_diagonal_mask(len(concepts_order)) if ignore_upper_diagonal else None,
    )
    p.set_xlabel(xlabel)
    p.set_ylabel(ylabel)
    p.set_title(title)

    if output_path is not None:
        plt.tight_layout()
        plt.savefig(output_path)

    return p
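
For reference, a quick look at what the mask helper above produces for three concepts (a sketch; the entries are
floats because the array is initialized with `np.zeros`, and the `1.` cells are the ones the heatmap hides):

```python
>>> _create_upper_diagonal_mask(3)
array([[0., 1., 1.],
       [0., 0., 1.],
       [0., 0., 0.]])
```
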
2 changes: 1 addition & 1 deletion src/pyclad/strategies/baselines/cumulative.py
@@ -27,4 +27,4 @@ def name(self) -> str:
return "Cumulative"

def additional_info(self) -> Dict:
return {"model": self._model.name(), "buffer_size": len(self._replay)}
return {"model": self._model.name(), "buffer_size": len(np.concatenate(self._replay))}
