This repository provides a comprehensive toolkit for generating synthetic data using seven different models. The toolkit evaluates the generated data for utility, similarity/fidelity, and privacy, and is tailored to tabular datasets with binary classification targets (e.g., True/False, Yes/No).
The project implements the following models for synthetic data generation:
- CopulaGAN
- CTGAN
- Gaussian Copula
- TVAE
- Gaussian Multivariate
- WGAN
- ARF
Install the package using pip:
pip install synthius
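To confirm that the installation succeeded, one option is to query the installed distribution's metadata from Python. The sketch below uses only the standard library and assumes the distribution name matches the name passed to pip (`synthius`). `installed_version` is a hypothetical helper written for this example, not part of the package.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # The distribution is not installed in the active environment.
        return None


if __name__ == "__main__":
    found = installed_version("synthius")
    if found is None:
        print("synthius is not installed in this environment")
    else:
        print(f"synthius {found} is installed")
```

If the check reports that the package is missing, make sure the interpreter running the script is the same one that `pip` installed into.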
To understand how to use this package, explore the three example Jupyter notebooks included in the repository:
-
  - Demonstrates how to generate synthetic data using seven different models.
  - Update paths and configurations (e.g., file paths, target column) to fit your dataset.
  - Run the cells to generate synthetic datasets.
-
  - Evaluates the utility of the generated synthetic data.
  - Update the paths as needed to analyze your data.
-
  - Provides examples of computing metrics for evaluating synthetic data, including:
    - Utility
    - Fidelity/Similarity
    - Privacy
  - Update paths and dataset-specific configurations and run the cells to compute the results.
These notebooks serve as practical examples to demonstrate how to effectively utilize the toolkit.
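Because the toolkit targets binary classification problems, it may help to check the target column before updating the notebook configuration. The following is a minimal sketch using only the standard library; it is not part of the toolkit. The function names are hypothetical, and the file path and column name in the usage comment are placeholders.

```python
import csv
from collections import Counter


def target_value_counts(csv_path: str, target_column: str) -> Counter:
    """Count how often each distinct value occurs in the target column."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames is None or target_column not in reader.fieldnames:
            raise KeyError(f"column {target_column!r} not found in {csv_path}")
        # Values are compared as raw strings; an empty cell counts as its own value.
        return Counter(row[target_column] for row in reader)


def is_binary_target(csv_path: str, target_column: str) -> bool:
    """Return True when the target column holds exactly two distinct values."""
    return len(target_value_counts(csv_path, target_column)) == 2


# Example usage (placeholder path and column name):
# print(target_value_counts("path/to/your_data.csv", "your_target_column"))
# print(is_binary_target("path/to/your_data.csv", "your_target_column"))
```

Inspecting the counts also gives a quick view of class balance, which can matter when interpreting the utility results.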
Mac users may encounter errors during installation. To resolve these issues, install the required dependencies and set up the environment:
- Install dependencies using Homebrew:
  brew install libomp llvm
- Set up the environment:
  export PATH="/opt/homebrew/opt/llvm/bin:$PATH"
  export CC=$(brew --prefix llvm)/bin/clang
  export CXX=$(brew --prefix llvm)/bin/clang++
  export CXXFLAGS="-I$(brew --prefix llvm)/include -I$(brew --prefix libomp)/include"
  export LDFLAGS="-L$(brew --prefix llvm)/lib -L$(brew --prefix libomp)/lib -lomp"
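Before retrying the installation, it can be useful to confirm that the variables exported above are visible to the process that will run pip. The sketch below is an optional sanity check and assumes nothing beyond the variable names used in the commands above. `missing_build_vars` and `unresolved_compilers` are hypothetical helper names.

```python
import os
import shutil
from typing import Dict, List, Mapping

# Variable names taken from the export commands in this section.
EXPECTED_VARS = ("CC", "CXX", "CXXFLAGS", "LDFLAGS")


def missing_build_vars(env: Mapping[str, str]) -> List[str]:
    """Return the expected variable names that are unset or empty in env."""
    return [name for name in EXPECTED_VARS if not env.get(name)]


def unresolved_compilers(env: Mapping[str, str]) -> Dict[str, str]:
    """Map CC/CXX to their values when no executable can be found for them."""
    return {
        name: env[name]
        for name in ("CC", "CXX")
        if env.get(name) and shutil.which(env[name]) is None
    }


if __name__ == "__main__":
    print("unset variables:", missing_build_vars(os.environ) or "none")
    print("compilers not found:", unresolved_compilers(os.environ) or "none")
```

Run it in the same shell session where the `export` commands were executed, since exported variables do not carry over to new terminal windows unless they are added to the shell profile.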
Special thanks to all contributors and the libraries used in this project.