Synthetic Tabular Data Generator Based on Diffusion Models

Description

This repository contains the code and data used for a research project by Sergio Arroni. The goal of the project is to create a synthetic tabular data generator using diffusion models. The datasets can be found in ./all_results/synthetic, each in its own folder.

Repository Structure

data/: Contains the data used in the project, divided into subfolders by type.
img/: Images used for training and analysis.
prep/: Data preprocessing scripts.
all_results/: Results obtained during project development.
final_results_tmp/: Temporary results of the model.
doc/: Research work and resources used.
main.py: Main script of the project.
requirements.txt: List of dependencies needed to run the project.
.gitignore: Files and folders ignored by git.

Installation

Cloning the Repository and Installing Dependencies

Clone the repository:

```
git clone https://github.com/SergioArroni/Synthetic-Tabular-Data-Generator.git
cd Synthetic-Tabular-Data-Generator
```

Install dependencies:
```
pip install -r requirements.txt
```

Setting Up PyTorch and CUDA 12

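Create and activate a virtual environment (ideally before installing the dependencies, so that they are installed into the environment):
```
python3 -m venv venv
source venv/bin/activate
```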

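Install PyTorch with CUDA 12 support. The exact command depends on your operating system and CUDA version, so take it from the installation selector on the PyTorch website. The default install command is: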
```
pip install torch torchvision torchaudio
```
Verify the CUDA installation. If `nvcc` is not found, install the CUDA Toolkit following the instructions on the CUDA Toolkit website:
```
nvcc --version
```
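
`nvcc --version` reports the toolkit installed on the system, but it does not confirm that the installed PyTorch build can use the GPU. A minimal sketch of a check from Python is shown below. The helper name `cuda_report` is hypothetical and not part of this repository; it returns a fallback result when PyTorch is not installed.

```python
def cuda_report():
    """Return a small dict describing whether PyTorch can see a CUDA device."""
    try:
        import torch  # imported lazily so the check also runs without PyTorch
    except ImportError:
        return {"torch_installed": False, "cuda_available": False}

    available = torch.cuda.is_available()
    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_available": available,
        # CUDA version the PyTorch build was compiled against (None for CPU builds)
        "cuda_build": torch.version.cuda,
    }
    if available:
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    print(cuda_report())
```

If `cuda_available` is `False` while `nvcc` works, the installed PyTorch wheel is probably a CPU-only build.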

Usage

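Activate the virtual environment (the folder name must match the one created during installation):
```
# Windows (PowerShell)
.\venv\Scripts\activate
# Linux/macOS
source venv/bin/activate
```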
If Python cannot find the project's modules, set PYTHONPATH to your code path (PowerShell syntax):
```
$env:PYTHONPATH="Your/Code/Path"
```
Run the preprocessing script:
```
python prep/prep.py
```
Modify main.py as needed for the experiment and execute:
```
python main.py
```

Explanation of Main Components

`main.py`

The main script initializes seeds, preprocesses data if required, loads data, sets up and trains a diffusion model, and evaluates the model.

Boolean Flags: `prep` controls data preprocessing; `load` controls whether to load a pre-trained model or train a new one.
Training and Evaluation: Uses PyTorch for training with options to adjust model parameters like epochs and batch size.
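
The flow described above can be sketched roughly as follows. This is an illustrative outline, not the contents of `main.py`: the function names `set_seed` and `run_experiment`, the default values, and the placeholder step labels are all hypothetical. Only the `prep` and `load` flags and the order of the steps come from the description.

```python
import random


def set_seed(seed: int) -> None:
    """Seed the stdlib RNG; a real script would also seed NumPy and PyTorch."""
    random.seed(seed)


def run_experiment(prep: bool, load: bool, seed: int = 0,
                   epochs: int = 10, batch_size: int = 64) -> list:
    """Return the ordered list of steps the flags select (placeholders only)."""
    set_seed(seed)
    steps = ["init_seeds"]
    if prep:
        steps.append("preprocess_data")  # stands in for the preprocessing step
    steps.append("load_data")
    if load:
        steps.append("load_pretrained_model")
    else:
        steps.append(f"train_model(epochs={epochs}, batch_size={batch_size})")
    steps.append("evaluate_model")
    return steps


if __name__ == "__main__":
    for step in run_experiment(prep=True, load=False):
        print(step)
```

Keeping the two flags as plain booleans makes it cheap to rerun only the evaluation (`prep=False, load=True`) once the data is preprocessed and a model has been trained.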

`test.py`

The Test class in test.py evaluates the model's performance in terms of efficiency, quality, and privacy.

Evaluation Methods: Includes methods for generating synthetic data and for evaluating efficiency, quality, and privacy.
Customization: The tests can be modified depending on the specific evaluation criteria needed for the experiment.
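
To make the three evaluation axes concrete, here is a minimal sketch of what such checks could look like. The metrics are assumptions chosen for illustration and are not taken from `test.py`: efficiency is measured as wall-clock generation time, quality as the gap between real and synthetic column means, and privacy as the distance from a synthetic row to its closest real row. All function names are hypothetical.

```python
import math
import time


def column_means(rows):
    """Mean of each column in a list of equal-length numeric rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def quality_gap(real, synthetic):
    """Mean absolute difference between real and synthetic column means."""
    real_means, syn_means = column_means(real), column_means(synthetic)
    return sum(abs(r - s) for r, s in zip(real_means, syn_means)) / len(real_means)


def min_distance_to_real(real, synthetic):
    """Smallest Euclidean distance from any synthetic row to any real row.

    A value of 0 means a real record was reproduced exactly, a privacy red flag.
    """
    return min(math.dist(s, r) for s in synthetic for r in real)


def timed_generation(generate, n_rows):
    """Efficiency proxy: return (rows, seconds) for one call to `generate`."""
    start = time.perf_counter()
    rows = generate(n_rows)
    return rows, time.perf_counter() - start


if __name__ == "__main__":
    real = [[1.0, 10.0], [3.0, 30.0]]
    # Stand-in generator; a trained model's sampling method would go here.
    synthetic, seconds = timed_generation(lambda n: [[2.0, 20.0]] * n, 2)
    print(quality_gap(real, synthetic), min_distance_to_real(real, synthetic), seconds)
```

A lower quality gap is better, while a larger minimum distance suggests the generator is not simply copying training records.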

`diffusion_model.py`

The DiffusionModel class defines the architecture and training procedure for the diffusion model.

Components:
- Encoder: Converts input into a higher-dimensional latent representation.
- Transformer: Processes the latent representation.
- Decoder: Converts the processed latent representation back into the original input size.
Diffusion Process: Iteratively adds noise to the data. The amount of noise is controlled by beta values and adjusted according to the standard deviation of the activations.
Training: Uses PyTorch for backpropagation and gradient clipping.
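
The noising step can be illustrated with a small pure-Python sketch. It is an assumption-laden illustration, not the code in `diffusion_model.py`: it uses a DDPM-style update with a linear beta schedule, and scaling the noise by the standard deviation of the current values is only one possible reading of the description above. The function names and default beta values are hypothetical.

```python
import math
import random
import statistics


def linear_betas(n_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise levels, a common (assumed) choice of schedule."""
    if n_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (n_steps - 1)
    return [beta_start + i * step for i in range(n_steps)]


def add_noise(x, betas, rng):
    """Apply one noising step per beta to the vector `x` and return the result."""
    x = list(x)
    for beta in betas:
        # Scale the noise by the spread of the current values (assumed reading
        # of "adjusted according to the standard deviation of the activations").
        scale = statistics.pstdev(x) if len(x) > 1 else 1.0
        x = [math.sqrt(1.0 - beta) * v + math.sqrt(beta) * scale * rng.gauss(0.0, 1.0)
             for v in x]
    return x


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for reproducibility
    row = [0.5, -1.2, 3.3, 0.0]
    print(add_noise(row, linear_betas(100), rng))
```

With `beta = 0` the step leaves the input unchanged, and larger betas push the values further toward pure noise. The model is then trained to reverse this corruption.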

Extendability

The project is structured so that new models can be added easily. It uses the strategy pattern, so a new model can be integrated without significant changes to the existing code. This keeps the project adaptable to different experimental needs.
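
As a rough illustration of the strategy pattern in this setting, the sketch below defines a common interface and one toy model that implements it. The class and method names (`GenerativeModel`, `MeanModel`, `fit`, `sample`, `run`) are hypothetical and chosen for the example; the repository's actual interface may look different.

```python
from abc import ABC, abstractmethod


class GenerativeModel(ABC):
    """Common interface every model strategy implements (hypothetical)."""

    @abstractmethod
    def fit(self, rows):
        """Train on `rows` and return self so calls can be chained."""

    @abstractmethod
    def sample(self, n_rows):
        """Return `n_rows` synthetic rows."""


class MeanModel(GenerativeModel):
    """Toy strategy: memorizes column means and repeats them."""

    def fit(self, rows):
        self.means = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def sample(self, n_rows):
        return [list(self.means) for _ in range(n_rows)]


def run(model: GenerativeModel, rows, n_rows):
    """Pipeline code depends only on the interface, not on a concrete model."""
    return model.fit(rows).sample(n_rows)


if __name__ == "__main__":
    print(run(MeanModel(), [[1.0, 2.0], [3.0, 4.0]], n_rows=2))
```

Adding another model then means writing one more subclass and passing it to `run`; the pipeline code itself stays untouched.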

Contributions

To contribute, please follow these steps:

Fork the repository.
Create a new branch (git checkout -b feature/new-feature).
Make your changes and commit them (git commit -am 'Add new feature').
Push your changes (git push origin feature/new-feature).
Open a Pull Request.

Contact

Sergio Arroni - [email protected]

Resources

More details can be found in the official repository.
