PATH: A Dataset for Evaluating Online Anomaly Detection Approaches for Discrete Multivariate Time Series
We propose a diverse, extensive, and non-trivial data set generated via state-of-the-art simulation tools, reflecting the realistic behaviour of an automotive powertrain, including its multivariate, dynamic, and variable-state properties. To cater for both unsupervised and semi-supervised anomaly detection settings, as well as for time series generation and forecasting, different versions of the data set are made available: depending on the task, training and test subsets are offered in contaminated and clean versions. The preprint of the paper corresponding to this repository can be found on arXiv.
The multivariate time series composing the proposed data set are generated using a Simulink simulation model and are provided, so you do not have to run the simulations yourself. Simulation is both financially and computationally expensive, but if you really do want to simulate yourself, the relevant scripts for data set generation can be found in the `simulation_model/EV` folder, which should be set as the working directory:
- `simulation_script_normal_parallel.m` performs the simulation of normal sequences
- `simulation_script_anomaly_parallel.m` performs the simulation of anomalous sequences
The scripts are written to leverage multi-core CPUs to run multiple simulations at once.
Each simulation yields a multivariate time series and is saved as a .mat file in the format `A_B_C_D_E.mat`, where:

- `A` = drive cycle
- `B` = battery temperature in °C multiplied by 10
- `C` = battery state of charge in % multiplied by 10
- `D` = label (anomaly type, control, or normal)
- `E` = start of anomaly, if applicable
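As an illustration, here is a minimal sketch of how such a file name could be decoded in Python. The example file name and the drive-cycle value are hypothetical, and the sketch assumes that no field itself contains an underscore:

```python
# Hypothetical example file name; the values are illustrative only.
file_name = "WLTP_250_800_normal_0.mat"

# Assumes exactly five underscore-separated fields (A_B_C_D_E).
a, b, c, d, e = file_name.removesuffix(".mat").split("_")

drive_cycle = a                          # A: drive cycle
battery_temperature_c = int(b) / 10      # B: battery temperature in °C (stored x10)
state_of_charge_pct = int(c) / 10        # C: state of charge in % (stored x10)
label = d                                # D: anomaly type, control, or normal
anomaly_start = e                        # E: start of anomaly, if applicable

print(drive_cycle, battery_temperature_c, state_of_charge_pct, label, anomaly_start)
```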
Note that to generate the data set, Matlab and Simulink as well as the following toolboxes are required:
- Parallel Computing Toolbox
- Statistics and Machine Learning Toolbox
- Powertrain Blockset
- Simscape
- Simscape Fluids
- Simscape Electrical
The Matlab version used for simulation is 23.2 (R2023b), which applies to Simulink and all toolboxes as well. After simulation, all subsequent steps (data processing, model training, inference, evaluation) are carried out in Python 3.10.
The data set consists of three stages, each with an associated folder:

- `0_simulation`, where the raw simulation output is saved
- `1_postsim`, where files are saved after post-simulation processing
- `2_preprocessed`, where files are saved after downsampling, standardising, and windowing
The contents of the `1_postsim` folder can be found on Zenodo and consist of the following pickle files:

- `normal.pkl`, which contains all nominal sequences
- `anomalous.pkl`, which contains all anomalous sequences
- `control.pkl`, which contains all control counterparts to `anomalous.pkl`
- `training.pkl`, which contains all pre-determined folds for training
- `training_clean.pkl`, a version of `training.pkl` without anomalous sequences
- `testing.pkl`, which contains all pre-determined folds for testing
- `testing_clean.pkl`, a version of `testing.pkl` without anomalous sequences
Each pickle file is a list of 2D NumPy arrays, each representing a multivariate time series. The name of the corresponding .mat file (and, by extension, the label) is stored in each array's metadata and can be read by calling `array.dtype.metadata['file_name']`.
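A minimal sketch of loading one of these files and reading the metadata, following the description above (the download path is an assumption):

```python
import pickle

# Path assumed; point this at wherever the Zenodo files were downloaded.
with open("1_postsim/normal.pkl", "rb") as f:
    sequences = pickle.load(f)

for array in sequences:
    # Each entry is a 2D NumPy array (time steps x channels) whose dtype
    # metadata carries the originating .mat file name, and thus the label.
    print(array.dtype.metadata["file_name"], array.shape)
```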
The raw simulation output sequences in the `0_simulation` folder are not provided due to data host limitations.
We decided to omit the data belonging to the `2_preprocessed` folder, as its contents are specific to the TensorFlow data pipeline and the same data host limitations would apply. If needed, the contents can be obtained by running `1_data.py`; for more details, see the Reproducing Results section below.
Working scripts for OmniAnomaly, TCN-AE, SISVAE, LW-VAE, and TeVAE can be found in the `src` folder:
- `0_postsim.py` performs post-simulation processing (trimming, adding noise, comparing with control simulations, splitting into folds and training/test subsets) on the outputs of the simulation (.mat files). This script is only relevant if you want to generate the data set yourself and process it after simulation.
- `1_data.py` performs data processing prior to training (downsampling, standardising, windowing, converting to `tf.data`); a sketch of these steps is shown after this list.
- `2_training.py` performs model training.
- `3_inference.py` performs inference on the validation and test subsets.
- `4_evaluation.py` evaluates the results from inference.
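For orientation, the following is a minimal sketch of the kind of preprocessing `1_data.py` performs on a single sequence: downsampling, per-channel standardisation, and windowing into a `tf.data` dataset. The window size, stride, and downsampling factor are placeholder assumptions; the actual values are defined in `1_data.py`:

```python
import numpy as np
import tensorflow as tf

def preprocess(sequence, window_size=256, stride=1, factor=2):
    """Downsample, standardise, and window one 2D sequence (time steps x channels)."""
    sequence = sequence[::factor]              # naive downsampling by decimation
    mean = sequence.mean(axis=0)
    std = sequence.std(axis=0) + 1e-8          # avoid division by zero
    sequence = (sequence - mean) / std         # per-channel standardisation
    # Slide a window over the time axis and wrap the result in tf.data.
    return tf.keras.utils.timeseries_dataset_from_array(
        sequence.astype(np.float32), targets=None,
        sequence_length=window_size, sequence_stride=stride, batch_size=None)

# Hypothetical usage on the first array loaded from training.pkl:
# windows = preprocess(sequences[0])
```

Note that a real pipeline would standardise with statistics computed on the training subset only; the sketch computes them per sequence for brevity.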
❗ Important ❗ Users who do not wish to generate the data set themselves can skip `0_postsim.py` and use the `training(_clean).pkl` and `testing(_clean).pkl` pickle files in the `1_postsim` data folder.
The remaining scripts can be executed in the order listed above to obtain the results in the paper.
Utility functions can be found in the `ts_processor` script in the `ts_functions` folder in this repository.
Custom model classes for each of the tested approaches can be found in the `model_garden` folder in this repository.
Typically, a `.env` file should be excluded from version control, though we have added a dummy one (`.env_dummy`) to illustrate the file structure.
`requirements.txt` (venv) and `pyproject.toml` (uv) contain all libraries used.
TSADIS requires a separate environment on Python 3.9 due to incompatibility with Python 3.10. See the `README.md` in the `tsadis` folder in this repository for more details.