Updated Repo TheRensselaerIDEA/synthetic_data
Supplemental Material for the ESANN 2019 Submission "Preserving privacy using synthetic data models and applications in health informatics education". This includes a supplemental section to the paper, located at supplemental_material.pdf, and code for all of the generative methods and metrics.
The code for the Gaussian multivariate method is located in the generators/gaussian_multivariate.py file. It uses the scikit-learn Gaussian mixture method.
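As a rough illustration of fitting a Gaussian mixture and sampling from it with scikit-learn, the sketch below uses toy data. The number of components, the covariance type, and the toy data are assumptions made for this example, not settings taken from generators/gaussian_multivariate.py.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy stand-in for a real table: 200 rows, 2 numeric features.
real = rng.normal(loc=[0.0, 5.0], scale=[1.0, 2.0], size=(200, 2))

# Fit a mixture of Gaussians to the real data.
# n_components=3 is an arbitrary choice for this sketch.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(real)

# sample() returns the drawn rows and the index of the mixture
# component that each row was drawn from.
synthetic, component_ids = gmm.sample(n_samples=50)
print(synthetic.shape)
```

The synthetic rows come from the fitted density rather than from the original records, which is the basic idea behind using a mixture model as a generator.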
The code for the Wasserstein GAN method is located in the generators/wgan.py file. It uses TensorFlow to build the GAN and is based on the paper "Improved Training of Wasserstein GANs" and its author's repository, https://github.com/igul222/improved_wgan_training.
The architecture of the HealthGAN is as follows:
- Generator
  - Input: 100 noise nodes sampled from a normal distribution in the latent space
  - Dense layer with 2 x (number of features in the data) nodes
  - Rectified linear unit activation
  - Dense layer with 1.5 x (number of features in the data) nodes
  - Rectified linear unit activation
  - Dense layer with (number of features in the data) nodes
  - Sigmoid activation
- Discriminator
  - Input: number of features in the data
  - Dense layer with 64 nodes
  - Leaky rectified linear unit activation
  - Dense layer with 128 nodes
  - Leaky rectified linear unit activation
  - Dense layer with 256 nodes
  - Leaky rectified linear unit activation
  - Dense layer with 1 node
  - No activation function
- Batch size is computed as the size of the training data divided by the number of critic iterations
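To make the layer sizes concrete, the following sketch computes the widths listed above for a given number of features, along with the batch size. The function names are hypothetical, and truncating 1.5 x features to an integer and using integer division for the batch size are assumptions; generators/wgan.py may round differently.

```python
def healthgan_layer_widths(n_features, latent_dim=100):
    """Return (generator_widths, discriminator_widths) as lists of layer sizes."""
    generator = [
        latent_dim,              # noise input
        2 * n_features,          # first dense layer (ReLU)
        int(1.5 * n_features),   # second dense layer (ReLU); truncation is an assumption
        n_features,              # output layer (sigmoid)
    ]
    # Input, three leaky-ReLU dense layers, and a single linear output node.
    discriminator = [n_features, 64, 128, 256, 1]
    return generator, discriminator


def batch_size(n_train_rows, critic_iters):
    """Training-set size divided by the number of critic iterations (integer division assumed)."""
    return n_train_rows // critic_iters


gen, disc = healthgan_layer_widths(n_features=10)
print(gen)                  # [100, 20, 15, 10]
print(disc)                 # [10, 64, 128, 256, 1]
print(batch_size(1000, 5))  # 200
```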
The code for the additive noise model method is located in the generators/additive_noise_model.py file. It uses the random forest classifiers from scikit-learn.
The code for the Parzen windows method is located in the generators/parzen_windows.py file. It uses the kernel density method from scikit-learn.
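For intuition, sampling from a Gaussian kernel density estimate amounts to picking a training row at random and adding Gaussian noise whose standard deviation is the bandwidth. The sketch below hand-rolls that idea in plain Python; it is not the scikit-learn call used in the repository, and the function name, bandwidth, and toy data are assumptions.

```python
import random


def sample_parzen(train_rows, n_samples, bandwidth, seed=None):
    """Draw samples from a Gaussian Parzen-window estimate of train_rows."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        # Choose one kernel center uniformly at random...
        center = rng.choice(train_rows)
        # ...then perturb each coordinate with Gaussian noise of width `bandwidth`.
        samples.append([rng.gauss(x, bandwidth) for x in center])
    return samples


train = [[0.1, 0.9], [0.2, 0.8], [0.7, 0.3]]
synthetic = sample_parzen(train, n_samples=5, bandwidth=0.05, seed=0)
print(len(synthetic), len(synthetic[0]))
```

A smaller bandwidth keeps the samples closer to the original rows. With a bandwidth of zero the method reduces to resampling the training data.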
This method just copies the original data and therefore there isn't any code included.
This method was carried out using the open-source software ARX.
The generators/sdv_converter.py file contains code to convert the data into values from 0 to 1 as described in the supplemental material. This is used for the Wasserstein GAN method to ensure the values generated are reasonable.
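Because the generator ends in a sigmoid, its outputs lie between 0 and 1, so the training data needs to live on the same scale and generated rows need to be mapped back. The sketch below shows plain per-column min-max scaling as one simple way to do this. It is an illustrative assumption, not the conversion described in the supplemental material, and the function names are hypothetical.

```python
def fit_min_max(rows):
    """Record (min, max) for every column of a list of numeric rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def to_unit_interval(rows, ranges):
    """Scale each column to [0, 1]; constant columns map to 0.0."""
    return [
        [0.0 if hi == lo else (x - lo) / (hi - lo) for x, (lo, hi) in zip(row, ranges)]
        for row in rows
    ]


def from_unit_interval(rows, ranges):
    """Map values in [0, 1] back to the original column ranges."""
    return [[lo + u * (hi - lo) for u, (lo, hi) in zip(row, ranges)] for row in rows]


data = [[20.0, 1.0], [40.0, 3.0], [60.0, 2.0]]
ranges = fit_min_max(data)
scaled = to_unit_interval(data, ranges)
print(scaled)                              # [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
print(from_unit_interval(scaled, ranges))  # [[20.0, 1.0], [40.0, 3.0], [60.0, 2.0]]
```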
The nearest neighbor adversarial accuracy is calculated using the metrics/nn_adversarial_accuracy.py file.
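As a rough guide to what this metric measures, the sketch below assumes the following definition: for each real point, check whether its nearest synthetic neighbor is farther away than its nearest other real point, do the same with the roles swapped, and average the two fractions. The Euclidean distance, the function names, and the toy data are assumptions; metrics/nn_adversarial_accuracy.py remains the reference implementation.

```python
import math


def _nn_distance(point, others, skip_index=None):
    """Distance from point to its nearest neighbor in others (optionally leave one out)."""
    best = math.inf
    for j, other in enumerate(others):
        if j == skip_index:
            continue
        best = min(best, math.dist(point, other))
    return best


def nn_adversarial_accuracy(real, synth):
    """Average of two fractions: points that are closer to their own set than to the other set."""
    real_hits = sum(
        _nn_distance(r, synth) > _nn_distance(r, real, skip_index=i)
        for i, r in enumerate(real)
    )
    synth_hits = sum(
        _nn_distance(s, real) > _nn_distance(s, synth, skip_index=i)
        for i, s in enumerate(synth)
    )
    return 0.5 * (real_hits / len(real) + synth_hits / len(synth))


real = [[0.0, 0.0], [0.0, 1.0]]
far_synth = [[10.0, 10.0], [10.0, 11.0]]
print(nn_adversarial_accuracy(real, far_synth))  # 1.0
print(nn_adversarial_accuracy(real, real))       # 0.0
```

Under this definition, a value near 1 means the two sets are easy to tell apart, a value near 0 means the synthetic data sits on top of the real records, and a value near 0.5 means the two sets are hard to distinguish.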
The nearest neighbor utility is calculated using the metrics/nn_utility.py file.