This repository contains the code and data used for a research project by Sergio Arroni. The goal of the project is to create a synthetic tabular data generator using diffusion models. The different synthetic datasets can be found in ./all_results/synthetic, within their respective folders.
- data/: Contains the data used in the project, divided into subfolders by type.
- img/: Images used for training and analysis.
- prep/: Data preprocessing scripts.
- all_results/: Results obtained during project development.
- final_results_tmp/: Temporary results of the model.
- doc/: Research work and resources used.
- main.py: Main script of the project.
- requirements.txt: List of dependencies needed to run the project.
- .gitignore: Files and folders ignored by git.
- Clone the repository:
  git clone https://github.com/SergioArroni/Synthetic-Tabular-Data-Generator.git
  cd TFM
- Install dependencies:
  pip install -r requirements.txt
- Create and activate a virtual environment:
  python3 -m venv venv
  source venv/bin/activate
- Install PyTorch with CUDA 12 support, following the instructions on the PyTorch website:
  pip install torch torchvision torchaudio
- Verify the CUDA installation, following the instructions on the CUDA Toolkit website:
  nvcc --version
- Enter the virtual environment:
  .\.venv\Scripts\activate
- Set the PYTHONPATH (in case of path issues):
  $env:PYTHONPATH="Your/Code/Path"
- Run the preprocessing script:
  python prep/prep.py
- Modify main.py as needed for the experiment and execute:
  python main.py
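The `nvcc --version` check in the installation steps reports the installed CUDA Toolkit, but it does not show whether PyTorch itself can reach a GPU. As a complementary check, the following minimal sketch can be run inside the virtual environment before launching main.py. The helper name `cuda_status` is hypothetical and is not part of this repository.

```python
import importlib.util


def cuda_status() -> str:
    """Report whether PyTorch is installed and can see a CUDA device.

    Hypothetical helper for illustration; not part of the repository.
    """
    # find_spec lets us detect a missing torch without raising ImportError.
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"

    import torch

    if torch.cuda.is_available():
        # torch.version.cuda is the CUDA version this PyTorch build targets.
        return f"cuda available (built for CUDA {torch.version.cuda})"
    return "cpu only"


if __name__ == "__main__":
    print(cuda_status())
```

If this reports "cpu only" while `nvcc --version` succeeds, the installed PyTorch wheel is probably a CPU-only build, and the install command from the PyTorch website should be checked again.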
The main script initializes seeds, preprocesses data if required, loads data, sets up and trains a diffusion model, and evaluates the model.
- Boolean Flags: prep controls data preprocessing; load controls whether to load a pre-trained model or train a new one.
- Training and Evaluation: Uses PyTorch for training, with options to adjust model parameters such as epochs and batch size.
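The control flow described above can be sketched roughly as follows. This is only an illustration of how the prep and load flags might gate each stage; the function names `set_seed` and `run_experiment` are assumptions, and the real main.py may be organized differently. Each stage is represented by a label instead of the actual work.

```python
import random


def set_seed(seed: int) -> None:
    """Seed the random number generators that are available (hypothetical helper)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is optional for this sketch
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # PyTorch is optional for this sketch


def run_experiment(prep: bool, load: bool, seed: int = 42) -> list[str]:
    """Return the ordered list of stages that the two flags select."""
    set_seed(seed)
    steps = ["seed"]
    if prep:
        # Assumption: preprocessing runs only when prep is True.
        steps.append("preprocess")
    steps.append("load_data")
    # load=True reuses a pre-trained model; load=False trains a new one.
    steps.append("load_model" if load else "train_model")
    steps.append("evaluate")
    return steps


if __name__ == "__main__":
    print(run_experiment(prep=True, load=False))
```

Keeping both flags as plain booleans makes it easy to rerun only the evaluation on an already trained model by setting prep to False and load to True.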
The Test class in test.py evaluates the model's performance in terms of efficiency, quality, and privacy.
- Evaluation Methods: Includes methods for generating synthetic data and for evaluating efficiency, quality, and privacy.
- Customization: The tests can be modified depending on the specific evaluation criteria needed for the experiment.
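To make the three evaluation axes concrete, here is a minimal, framework-free sketch. The metrics are illustrative assumptions, not the ones implemented in test.py: efficiency is measured as generation time, quality as the average gap between real and synthetic column means, and privacy as the smallest Euclidean distance from any synthetic row to any real row. The `evaluate` function and its `generate` callback are hypothetical names.

```python
import math
import time
from typing import Callable, Sequence

Row = Sequence[float]


def column_means(rows: Sequence[Row]) -> list[float]:
    """Mean of each column in a row-major table."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def evaluate(generate: Callable[[int], list[Row]], real: Sequence[Row]) -> dict[str, float]:
    """Score a generator on efficiency, quality, and privacy (illustrative metrics)."""
    # Efficiency: wall-clock time needed to generate as many rows as the real data has.
    start = time.perf_counter()
    synthetic = generate(len(real))
    elapsed = time.perf_counter() - start

    # Quality: average absolute difference between real and synthetic column means.
    real_means = column_means(real)
    synth_means = column_means(synthetic)
    mean_gap = sum(abs(r - s) for r, s in zip(real_means, synth_means)) / len(real_means)

    # Privacy: a very small minimum distance suggests a real record was nearly copied.
    closest = min(math.dist(s, r) for s in synthetic for r in real)

    return {
        "efficiency_seconds": elapsed,
        "quality_mean_gap": mean_gap,
        "privacy_min_distance": closest,
    }


if __name__ == "__main__":
    real_rows = [[0.0, 0.0], [2.0, 2.0]]
    print(evaluate(lambda n: [[1.0, 1.0]] * n, real_rows))
```

Passing the generator as a callback keeps the evaluation independent of any particular model, which matches the idea that the tests can be adapted per experiment.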
The DiffusionModel class defines the architecture and training procedure for the diffusion model.
- Components:
  - Encoder: Converts input into a higher-dimensional latent representation.
  - Transformer: Processes the latent representation.
  - Decoder: Converts the processed latent representation back into the original input size.
- Diffusion Process: Iteratively adds noise controlled by beta values and adjusted based on the standard deviation of the activations.
- Training: Uses PyTorch for backpropagation and gradient clipping.
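The idea of noise controlled by beta values can be illustrated with the standard forward diffusion step, x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps, where eps is standard Gaussian noise. The sketch below uses plain Python instead of PyTorch so that it runs anywhere, assumes a linear beta schedule, and leaves out the adjustment based on the standard deviation of the activations, which is specific to this project's DiffusionModel. The function names are hypothetical.

```python
import math
import random


def linear_betas(num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> list[float]:
    """Linearly spaced noise levels (an assumed schedule, not necessarily the project's)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def forward_noise(x: list[float], betas: list[float], rng: random.Random) -> list[float]:
    """Apply one Gaussian noising step per beta value to a single row of features."""
    for beta in betas:
        keep = math.sqrt(1.0 - beta)   # how much of the previous signal survives
        scale = math.sqrt(beta)        # how much fresh noise is mixed in
        x = [keep * v + scale * rng.gauss(0.0, 1.0) for v in x]
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    row = [1.0, -2.0, 0.5]
    print(forward_noise(row, linear_betas(10), rng))
```

Larger beta values destroy the original signal faster. With every beta equal to zero the row is returned unchanged, which is a convenient sanity check for the schedule.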
The project is structured to easily allow the addition of new models. This is achieved by using a strategy pattern, which ensures that new models can be integrated without altering the existing codebase significantly. This makes the project flexible and adaptable to various experimental needs.
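As a rough illustration of the strategy pattern mentioned above, the sketch below defines a common interface that every model implements, so the surrounding pipeline never needs to know which concrete model it is running. The names `GenerativeModel`, `MeanModel`, and `run_pipeline` are hypothetical and are not taken from the repository.

```python
from abc import ABC, abstractmethod


class GenerativeModel(ABC):
    """Strategy interface: any model that can be fitted and sampled from."""

    @abstractmethod
    def fit(self, rows: list[list[float]]) -> None:
        """Learn from real tabular rows."""

    @abstractmethod
    def sample(self, n: int) -> list[list[float]]:
        """Produce n synthetic rows."""


class MeanModel(GenerativeModel):
    """Toy strategy that always emits the column means of the training data."""

    def fit(self, rows: list[list[float]]) -> None:
        count = len(rows)
        self.means = [sum(col) / count for col in zip(*rows)]

    def sample(self, n: int) -> list[list[float]]:
        return [list(self.means) for _ in range(n)]


def run_pipeline(model: GenerativeModel, rows: list[list[float]], n: int) -> list[list[float]]:
    """The pipeline depends only on the interface, not on a concrete model."""
    model.fit(rows)
    return model.sample(n)


if __name__ == "__main__":
    print(run_pipeline(MeanModel(), [[0.0, 2.0], [2.0, 4.0]], 3))
```

Adding a new model then amounts to writing one more subclass and passing an instance to the pipeline, with no changes to the pipeline code itself.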
To contribute, please follow these steps:
- Fork the repository.
- Create a new branch (git checkout -b feature/new-feature).
- Make your changes and commit them (git commit -am 'Add new feature').
- Push your changes (git push origin feature/new-feature).
- Open a Pull Request.
Sergio Arroni - [email protected]
- More details can be found in the official repository.