Synthetic Time-Series

Generate synthetic time-series using generative adversarial networks. This project holds an end-to-end system for generating time-series datasets, with database support, training scripts that run on compute clusters, post-training model registration, and interactive model inference with time-series visualization.



Software Architecture

(figure: software architecture diagram)


Docker Structure

(figure: Docker structure diagram)


How It Works

  1. Create a dataset on the TS Generation page. The dataset is sent to the API, which saves it in the MongoDB database along with the configuration parameters used.
  2. From the TS Database page, we query the API and automatically fetch all the datasets available in the database. We can then inspect and interact with the visualized datasets (time-series).
  3. We are now ready to initiate a training session by submitting a train job to the Ray cluster. In addition to the training functions, we are required to set a model name and the name of the dataset we want. From the training script, we submit the job to Ray, which runs the job, saves each model after each training run, and finally loops through all of the trials and registers the best one in the database. While the job is running, we can inspect the progression of each trial in MLflow.
  4. As the TS Operations page loads, we fetch all registered models from the model registry. By selecting the model we want, we can send an inference request to the API with a given model name, version, and inference parameters. The request prompts the API to load the registered model from the MLflow model registry (or use a locally cached version). The API then runs a forward pass on the provided data and returns a prediction response. Finally, the UI application processes the response and renders an interactive visualization of the prediction (a rough request sketch follows this list).
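
As a rough illustration of step 4, an inference request could look like the sketch below. The endpoint path and payload field names are hypothetical placeholders, not the project's actual API schema (see synthetic_data/api for that):

    import requests

    BACKEND_URL = "http://backend:8502"  # BACKEND_HOST/BACKEND_PORT from .env

    # Hypothetical payload: field names are illustrative only.
    payload = {
        "model_name": "cgan-multi-harmonic",  # a registered model name (example)
        "version": 1,                         # model registry version
        "params": {"num_samples": 10},        # inference parameters
    }

    response = requests.post(f"{BACKEND_URL}/inference", json=payload)
    response.raise_for_status()
    prediction = response.json()  # the UI renders this as an interactive plot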

User-Interface

HOME

(screenshot: Home page)


TS Generation

(screenshot: TS Generation page)


TS Database

(screenshot: TS Database page)


TS Operations

(screenshot: TS Operations page)

File structure

.
├── docker-compose.yml
├── Dockerfile
├── pyproject.toml
├── README.md
├── setup.py
└── synthetic_data
    ├── api
    │   └── *.py
    ├── app
    │   ├── *.py
    │   └── pages
    │       └── *.py
    ├── common
    │   └── *.py
    └── mlops
        ├── datasets
        │   └── *.py
        ├── models
        │   └── *.py
        ├── tools
        │   └── *.py
        ├── train_*.py
        ├── train_*.sh
        └── transforms
            └── *.py

Prerequisites

  • Docker Engine (for running the application; see Usage below)
  • Python 3.7 (only for local training; see Training below)

Usage

  1. Follow instructions for installing Docker Engine.

  2. Clone the repository

    git clone git@github.com:ML4ITS/synthetic-data.git
    cd synthetic-data
  3. Create an environment file (.env) with the following credentials:

    # Hostname of service/server running Ray ML (aka. the Ray compute cluster) 
    COMPUTATION_HOST=<REPLACE>
    # Port of service/server running Ray ML (aka. the Ray compute cluster) 
    RAY_PORT=<REPLACE>
    
    # Hostname of service/server running MLflow
    APPLICATION_HOST=<REPLACE>
    # Port of service/server running MLflow
    MODELREG_PORT=<REPLACE>
    
    # Select the name of your database
    DATABASE_NAME=<REPLACE>
    
    # Protect the database with your username & password
    DATABASE_USERNAME=<REPLACE>
    DATABASE_PASSWORD=<REPLACE>
    
    # Hostname of database (aka. the name of the container when running Docker)
    DATABASE_HOST=mongodb
    DATABASE_PORT=27017
    
    # Hostname of service/server running the API (aka. the name of the container when running Docker)
    BACKEND_HOST=backend
    BACKEND_PORT=8502

    These credentials are stored in the .env file and are picked up by dotenv to populate the various config classes when the application runs.
    When interfacing with Ray Tune / MLflow, we use the underlying server configuration:

      import os


      class ServerConfig:
          """Server endpoints read from environment variables (see .env)."""

          @property
          def APPLICATION_HOST(self):
              # hostname of the service running MLflow
              return os.getenv("APPLICATION_HOST")

          @property
          def COMPUTATION_HOST(self):
              # hostname of the Ray compute cluster
              return os.getenv("COMPUTATION_HOST")

    (see config.py for more details)

  4. Run the following command to start the application:

      docker-compose up --build -d

Training

  1. Create a virtual environment and install the dependencies

    NOTE: local development requires Python 3.7, because of the timesynth library

      virtualenv venv -p=python3.7
      source venv/bin/activate
      pip install -e .
  2. Run the following shell script to train your C-GAN/WGAN-GP model:

    NOTE: adjust training parameters as needed inside their respective *.py files

      sh synthetic_data/mlops/train_cgan.sh
      # or
      sh synthetic_data/mlops/train_gan_gp.sh
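
Under the hood, these scripts submit trials to the Ray cluster (see How It Works, step 3). The sketch below is a minimal illustration of that flow using the classic Ray Tune API; names and values are placeholders, and the actual entry points live in synthetic_data/mlops/train_*.py:

    import ray
    from ray import tune

    def train_fn(config):
        # placeholder trainable; the real scripts train C-GAN / WGAN-GP models
        for epoch in range(config["epochs"]):
            loss = 1.0 / (epoch + 1)  # dummy metric
            tune.report(loss=loss)    # progression shows up per trial

    # address built from COMPUTATION_HOST / RAY_PORT in the .env file
    ray.init(address="ray://<COMPUTATION_HOST>:<RAY_PORT>")

    analysis = tune.run(
        train_fn,
        config={"epochs": 10, "lr": tune.grid_search([1e-4, 2e-4])},
    )
    print(analysis.get_best_config(metric="loss", mode="min"))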

Evaluation

The following performance indications and visualizations are based on two models: the WGAN-GP model and the C-GAN model. Both were trained on the same datasets. The WGAN-GP model was trained with a learning rate of 0.0002, a batch size of 128, and a total of 1000 epochs (or 9990050 global steps). The C-GAN model was trained with a learning rate of 0.0002, a batch size of 128, and a total of 300 epochs (or 3000050 global steps).

---

In the space of deep learning, GANs, as opposed to e.g. object detectors, do not have a direct, straightforward way of measuring performance. Where object detectors can rely on intersection over union as a simple and easy evaluation metric for measuring bounding box accuracy, GAN models are much more difficult to guide and interpret in terms of training and evaluation.

Commonly, GAN models are used for image generation, a task that is not directly related to time-series generation. For image generation, popular evaluation metrics such as Inception Score (IS) and Fréchet Inception Distance (FID) have been used to evaluate the performance of GAN models. Both of these metrics rely on a pre-trained image classifier (e.g. Inception-v3) developed for the 2D domain. This leaves us with the challenge of evaluating the performance of GAN models for time-series generation, as we are working in the 1D domain.
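
For reference, FID compares Gaussian statistics (feature means and covariances) of real (r) and generated (g) samples, which is why it needs a feature extractor suited to the data domain:

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)$$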

---

Efforts have been made to evaluate the performance of GAN models in terms of time-series generation. The analysis.ipynb notebook shows various experiments and evaluations such as average cosine similarity scoring, t-SNE, PCA, and latent-space interpolation. By looking at a few of them, we can evaluate the models visually.
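
As a rough idea of the cosine similarity scoring mentioned above, a minimal sketch (the notebook's actual implementation may differ):

    import numpy as np

    def average_cosine_similarity(real: np.ndarray, fake: np.ndarray) -> float:
        """Mean pairwise cosine similarity between two sets of sequences.

        real, fake: arrays of shape (n_samples, sequence_length). Values
        close to 1 indicate the generated sequences align closely with the
        real ones. This is an illustrative sketch, not the notebook's code.
        """
        real_unit = real / np.linalg.norm(real, axis=1, keepdims=True)
        fake_unit = fake / np.linalg.norm(fake, axis=1, keepdims=True)
        return float((real_unit @ fake_unit.T).mean())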

The most straightforward way to evaluate the performance visually is to generate a handful of samples (e.g. 10) and compare them one-to-one with the real time-series data.

WGAN-GP: Randomly sampled original data vs. Random generated data

(figure: WGAN-GP, original vs. generated samples)

C-GAN: Sequentially sampled original data vs. Conditionally generated data

(figure: C-GAN, original vs. conditionally generated samples)

t-SNE and PCA

t-SNE and PCA are two slightly less common ways of evaluating performance, but they can help uncover insights by displaying clusters visually. For instance, using PCA, we can sample the original and generated data per class, cluster them in pairs, and visualize the distributions as seen below.
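
A minimal sketch of that PCA comparison, assuming scikit-learn and matplotlib are available (file names and shapes are hypothetical):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    # Hypothetical arrays of shape (n_samples, sequence_length) for one class
    real = np.load("real_1hz.npy")
    fake = np.load("fake_1hz.npy")

    pca = PCA(n_components=2).fit(real)  # fit the projection on the real data
    real_2d, fake_2d = pca.transform(real), pca.transform(fake)

    plt.scatter(real_2d[:, 0], real_2d[:, 1], alpha=0.5, label="original")
    plt.scatter(fake_2d[:, 0], fake_2d[:, 1], alpha=0.5, label="generated")
    plt.legend()
    plt.show()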

C-GAN: Condition on 1-10 Hz

(figure: C-GAN PCA clusters, conditions 1-10 Hz)

---

Latent-space exploration

To investigate how the models generalize as they are presented with various latent-space inputs, we can, by interpolation, manipulate the inputs to discover different but similar generated sequences. The examples below show output sequences based on latent vectors drawn from 10 different noise distributions, with 200 spherical linear interpolation (slerp) steps between each one. For the setup and how to perform these interpolations, see the slerp.ipynb notebook.

Both models were trained on the same multi-harmonic dataset, consisting of 10 000 time series evenly distributed between 1-10 Hz. Using conditions/labels, we can manipulate (embed) the input latent space to make the generator output the desired frequencies.
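
A minimal slerp sketch (the setup in slerp.ipynb may differ; the latent dimension of 100 below is an assumption):

    import numpy as np

    def slerp(v0: np.ndarray, v1: np.ndarray, t: float) -> np.ndarray:
        """Spherical linear interpolation between two latent vectors."""
        dot = np.clip(
            np.dot(v0 / np.linalg.norm(v0), v1 / np.linalg.norm(v1)), -1.0, 1.0
        )
        theta = np.arccos(dot)  # angle between the normalized vectors
        if np.isclose(theta, 0.0):
            return (1.0 - t) * v0 + t * v1  # parallel vectors: plain lerp
        return (np.sin((1.0 - t) * theta) * v0 + np.sin(t * theta) * v1) / np.sin(theta)

    # 200 interpolation steps between two noise vectors; each step is fed
    # through the generator to produce one output sequence.
    z0, z1 = np.random.randn(100), np.random.randn(100)
    path = np.stack([slerp(z0, z1, t) for t in np.linspace(0.0, 1.0, 200)])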

WGAN-GP: latent space interpolation (slerp)

(figure: WGAN-GP slerp interpolation)


C-GAN: latent space interpolation (slerp)

(figure: C-GAN slerp interpolation)

Future suggestions

  • Create a unified 'model-registration-method' for the training scripts
  • Migrate the model trainers to PyTorch Lightning
  • Refactor the LSTM to train with the new MultiHarmonicDataset
  • Implement IS and FID using some kind of 1-D classifier (e.g. an Inception-v3-like model, but for 1-D data)
  • Experiment with training/evaluating models using TSTR or TRTS
  • Experiment with other datasets (e.g. synthetic, real, etc.)
  • Experiment with other evaluation metrics for GANs
  • Experiment with subtracting some Gaussian component and looking at the residuals
  • Refactor certain components of the application, and remove unnecessary/unused methods
  • Experiment with different types of generated time-series (e.g. Gaussian process, pseudo-periodic, auto-regressive, etc.)
