Merge pull request #14 from lifelonglab/example_update

Update example and add additional options for heatmap

Nyderx authored Aug 14, 2024
2 parents 9c8ba8a + 0edbc17 commit f040018
Showing 5 changed files with 92 additions and 45 deletions.
52 changes: 33 additions & 19 deletions docs/index.md
@@ -1,43 +1,57 @@
### What is pyCLAD?

pyCLAD is a unified framework for continual anomaly detection. Its main goal is to foster successful scientific
development in the field by providing robust, off-the-shelf implementations of common functionalities for continual
anomaly detection, minimizing the risk of error-prone tasks and fostering replicability. pyCLAD also facilitates the
design and implementation of experimental pipelines, providing a streamlined, unified, and fully reproducible
execution workflow.

It also provides a simple and convenient infrastructure for designing new strategies, models, and evaluation
procedures, enabling researchers to avoid repetitive tasks, focus on the creative and scientific aspects of their
work, and reduce the friction of low-level implementation details.

The core coding infrastructure is influenced by PyTorch, a very popular deep learning framework, and by scikit-learn,
the leading machine learning and data analysis library in Python.

### What is continual anomaly detection?

**Continual Learning** puts emphasis on machine learning models that continuously adapt to new challenges in dynamic
environments while retaining past knowledge.

**Anomaly Detection** is the process of detecting deviations from the normal behavior of a process. It has a very wide
range of applications, including monitoring cyber-physical systems, human conditions, and network traffic.

**Continual anomaly detection** lies at the intersection of these two fields. Its strategies pursue goals that set
them apart from other types of anomaly detection approaches. One possible categorization is the following:

- **Offline**: Models are trained once on background data and do not require updates (examples: post-incident analysis,
breast cancer detection). This approach is static in nature and does not provide adaptation.

- **Online**: Models are updated as new data is observed, assuming that the most recent information is the most
relevant. This approach is popular in real-world dynamic applications where adaptation is necessary, but it makes
models prone to forgetting past knowledge.

- **Continual**: Models are updated to simultaneously consider *adaptation* to new conditions and *knowledge retention*
of previously observed (and potentially recurring) conditions. This behavior attempts to overcome the limitations of
both offline and online anomaly detection in complex scenarios; the toy sketch below illustrates the idea.
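
As a rough illustration of the continual regime, here is a toy sketch (not the pyCLAD API): the detector adapts to
each new concept while retaining earlier ones through the simplest possible mechanism, full replay of previously seen
data. pyCLAD's `CumulativeStrategy`, used in the example below, follows the same replay idea.

```python
import numpy as np


class CumulativeToyDetector:
    """Toy continual strategy: refit on all data observed so far (full replay)."""

    def __init__(self):
        self.buffer = []

    def learn(self, train_data: np.ndarray) -> None:
        self.buffer.append(train_data)
        data = np.concatenate(self.buffer)  # adaptation + retention via replay
        self.mean = data.mean(axis=0)  # stand-in for fitting a real detector

    def score(self, x: np.ndarray) -> np.ndarray:
        # Distance from the mean of all data seen so far, as a toy anomaly score
        return np.linalg.norm(x - self.mean, axis=1)
```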

If you want to learn more about continual anomaly detection, we
recommend [this open-access paper](https://ieeexplore.ieee.org/abstract/document/10473036/).

### How do I install pyCLAD?

pyCLAD is available as a [Python package on PyPI](https://pypi.org/project/pyclad/). Therefore, it can be installed
using tools such as pip and conda.

#### Conda

```
conda install -c conda-forge pyclad
```

#### Pip

```
pip install pyclad
```
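
To check that the installation succeeded, import the package (the import name `pyclad` matches the module paths used
in the examples below):

```
python -c "import pyclad"
```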
@@ -50,8 +64,8 @@ We do not include them in default installation to avoid putting heavy dependencies
pyCLAD supports the use of any model from the pyOD library, some of which may require the installation of additional
packages (see the [pyOD docs](https://pyod.readthedocs.io/en/latest/)).
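
For example, a pyOD detector can be plugged in through one of the provided adapters; a minimal sketch using the
`IsolationForestAdapter` import path that appears in this commit's example script:

```python
from pyclad.models.adapters.pyod_adapters import IsolationForestAdapter

# Wraps pyOD's IsolationForest behind pyCLAD's model interface
model = IsolationForestAdapter()
```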


### Citing pyCLAD

A paper describing pyCLAD is currently under review at the SoftwareX journal. Feel free to use pyCLAD in your research,
but please check back before your final submission to see whether we already have a DOI for pyCLAD. Thank you!

58 changes: 36 additions & 22 deletions examples/concept_incremental_example.py
@@ -13,7 +13,7 @@
from pyclad.metrics.continual.average_continual import ContinualAverage
from pyclad.metrics.continual.backward_transfer import BackwardTransfer
from pyclad.metrics.continual.forward_transfer import ForwardTransfer
from pyclad.models.adapters.pyod_adapters import IsolationForestAdapter
from pyclad.output.json_writer import JsonOutputWriter
from pyclad.scenarios.concept_incremental import ConceptIncrementalScenario
from pyclad.strategies.baselines.cumulative import CumulativeStrategy
@@ -22,35 +22,45 @@
logging.basicConfig(level=logging.DEBUG, handlers=[logging.FileHandler("debug.log"), logging.StreamHandler()])


def _generate_normal_dist(mean, cov):
    # Train on normal samples only; test on half normal, half anomalous points
    # drawn from a shifted-mean distribution (labels: 0 = normal, 1 = anomaly).
    train_data = np.random.multivariate_normal(mean, cov, (100,))
    test_data = np.concatenate([np.random.multivariate_normal(mean, cov, (50,)),
                                np.random.multivariate_normal([3 * m for m in mean], cov, (50,))])
    test_labels = np.array([0] * 50 + [1] * 50)
    return train_data, test_data, test_labels


if __name__ == "__main__":
    """
    This example shows how to create a simple dataset with 4 concepts and carry out a concept-incremental scenario
    with the CumulativeStrategy and an IsolationForest model.
    """
    concept1_train_data, concept1_test_data, concept1_test_labels = _generate_normal_dist((2, 2), [[1, 0], [0, 1]])
    concept2_train_data, concept2_test_data, concept2_test_labels = _generate_normal_dist((50, 50), [[1, 0], [0, 1]])
    concept3_train_data, concept3_test_data, concept3_test_labels = _generate_normal_dist((5, 5), [[1, 0], [0, 1]])
    concept4_train_data, concept4_test_data, concept4_test_labels = _generate_normal_dist((20, 20), [[1, 0], [0, 1]])

    concept1_train = Concept("concept1", data=concept1_train_data)
    concept1_test = Concept("concept1", data=concept1_test_data, labels=concept1_test_labels)

    concept2_train = Concept("concept2", data=concept2_train_data)
    concept2_test = Concept("concept2", data=concept2_test_data, labels=concept2_test_labels)

    concept3_train = Concept("concept3", data=concept3_train_data)
    concept3_test = Concept("concept3", data=concept3_test_data, labels=concept3_test_labels)

    concept4_train = Concept("concept4", data=concept4_train_data)
    concept4_test = Concept("concept4", data=concept4_test_data, labels=concept4_test_labels)

    # Build a dataset based on the previously created concepts
    dataset = ConceptsDataset(
        name="GeneratedDataset",
        train_concepts=[concept1_train, concept2_train, concept3_train, concept4_train],
        test_concepts=[concept1_test, concept2_test, concept3_test, concept4_test],
    )

    # Define the model, strategy, and callbacks
    model = IsolationForestAdapter()
    strategy = CumulativeStrategy(model)
    callbacks = [
        ConceptMetricCallback(
@@ -60,9 +70,11 @@
        TimeEvaluationCallback(),
    ]

    # Execute the concept-incremental scenario
    scenario = ConceptIncrementalScenario(dataset, strategy=strategy, callbacks=callbacks)
    scenario.run()

    # Save the results
    output_writer = JsonOutputWriter(pathlib.Path("output.json"))
    output_writer.write([model, dataset, strategy, *callbacks])

@@ -73,5 +85,7 @@
    sns.scatterplot(x=concept2_test.data[:, 0], y=concept2_test.data[:, 1], label="concept2_test")
    sns.scatterplot(x=concept3_train.data[:, 0], y=concept3_train.data[:, 1], label="concept3")
    sns.scatterplot(x=concept3_test.data[:, 0], y=concept3_test.data[:, 1], label="concept3_test")
    sns.scatterplot(x=concept4_train.data[:, 0], y=concept4_train.data[:, 1], label="concept4")
    sns.scatterplot(x=concept4_test.data[:, 0], y=concept4_test.data[:, 1], label="concept4_test")
    plt.legend()
    plt.show()
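
A file like the `output.json` written above is what the heatmap example in the next file loads; a minimal sketch of
reading it back (key names taken from `examples/plot_heatmap_example.py` below):

```python
import json
import pathlib

with pathlib.Path("output.json").open() as fp:
    loaded_data = json.load(fp)

# Per-concept ROC-AUC results recorded by ConceptMetricCallback
concepts_order = loaded_data["concept_metric_callback_ROC-AUC"]["concepts_order"]
metric_matrix = loaded_data["concept_metric_callback_ROC-AUC"]["metric_matrix"]
```
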
5 changes: 3 additions & 2 deletions examples/plot_heatmap_example.py
@@ -9,7 +9,8 @@
    loaded_data = json.load(fp)
concepts_order = loaded_data["concept_metric_callback_ROC-AUC"]["concepts_order"]
metric_matrix = loaded_data["concept_metric_callback_ROC-AUC"]["metric_matrix"]
names_mapping = {"concept1": "C1", "concept2": "C2", "concept3": "C3", "concept4": "C4"}
plot_metric_heatmap(
    metric_matrix, concepts_order, names_mapping=names_mapping, annotate=True, ignore_upper_diagonal=True,
    output_path=pathlib.Path("heatmap.pdf"),
)
20 changes: 19 additions & 1 deletion src/pyclad/analysis/scenario_heatmap.py
@@ -2,11 +2,21 @@
from typing import Dict, List

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib.axes import Axes


def _create_upper_diagonal_mask(tasks_no: int) -> np.ndarray:
    # Boolean mask that is True strictly above the diagonal, i.e. for cells where
    # the evaluated concept comes later in concepts_order than the learned one.
    mask = np.zeros((tasks_no, tasks_no))
    for i in range(tasks_no):
        for j in range(tasks_no):
            if j > i:
                mask[i, j] = True
    return mask


def plot_metric_heatmap(
    matrix: Dict,
    concepts_order: List[str],
@@ -18,6 +28,7 @@ def plot_metric_heatmap(
    annotate: bool = False,
    color_palette: str = "plasma",
    figsize: tuple = (6, 5),
    ignore_upper_diagonal: bool = False,
):
    sns.set_theme(style="darkgrid")
    sns.set(rc={"figure.figsize": figsize})
@@ -38,13 +49,20 @@
    df = pd.DataFrame(data, columns=["learned_concept", "evaluated_concept", "metric_value"])
    df = df.pivot(index="learned_concept", columns="evaluated_concept", values="metric_value")
    p: Axes = sns.heatmap(
        df,
        vmin=0,
        vmax=1,
        center=0.5,
        cmap=sns.color_palette(color_palette, as_cmap=True),
        annot=annotate,
        mask=_create_upper_diagonal_mask(len(concepts_order)) if ignore_upper_diagonal else None,
    )
    p.set_xlabel(xlabel)
    p.set_ylabel(ylabel)
    p.set_title(title)

    if output_path is not None:
        plt.tight_layout()
        plt.savefig(output_path)

    return p
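
For reference, a quick look at what the mask helper above produces for three concepts (a sketch; the entries are
floats because the array is initialized with `np.zeros`, and the `1.` cells are the ones the heatmap hides):

```python
>>> _create_upper_diagonal_mask(3)
array([[0., 1., 1.],
       [0., 0., 1.],
       [0., 0., 0.]])
```
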
2 changes: 1 addition & 1 deletion src/pyclad/strategies/baselines/cumulative.py
@@ -27,4 +27,4 @@ def name(self) -> str:
return "Cumulative"

def additional_info(self) -> Dict:
return {"model": self._model.name(), "buffer_size": len(self._replay)}
return {"model": self._model.name(), "buffer_size": len(np.concatenate(self._replay))}
