ICT4S'23: Energy cost and machine learning accuracy impact of k-anonymisation and synthetic data techniques
This repository contains the code for the ICT4S'23 paper. The experiments can be reproduced by following the paper. The presentation of the paper was recorded, and the recording is available here. The slides are provided in this repository. All files should be documented well enough for reproduction and understanding. Any remaining questions or comments may be sent to my e-mail.
The article and/or this repository should be cited as:
@inproceedings{de2023energy,
title={Energy cost and machine learning accuracy impact of k-anonymisation and synthetic data techniques},
author={de Reus, Pepijn and Oprescu, Ana and van Elsen, Koen},
booktitle={2023 International Conference on ICT for Sustainability (ICT4S)},
pages={57--65},
year={2023},
organization={IEEE}
}
The repository is structured with the following folders and files:
This folder contains the data sets obtained from the UCI Machine Learning Repository, stored in two separate folders. It also contains two Python files that clean and preprocess the data sets as described in the Experimental Setup section of the paper. After running these files, the Energy folder will be available, containing the energy consumption measurements of the data cleaning and preprocessing.
The benchmark folder contains the Python scripts for three different machine learning models and one script (run_results.py) that combines these three models to obtain results. After running this script, the Performance folder will be filled with the accuracy and energy consumption measurements for this benchmark.
The folders for Anonymisation and Synthetic data each have their own readme file with an introduction and instructions for the code. These folders are used for anonymising or synthesising the data and performing the experiments, after which the results are stored in the respective folder. The hyperparameters used in our paper are included in the synthetic data generation code and the anonymisation code.
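As a reminder of what the anonymisation step aims for, the following is a minimal sketch of a k-anonymity check: a table is k-anonymous if every combination of quasi-identifier values occurs at least k times. It is not the code from the Anonymisation folder; the function name `is_k_anonymous` and the toy column names are assumptions made for illustration only.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    # Count how often each combination of quasi-identifier values appears.
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in counts.values())


# Hypothetical, already generalised toy records (not taken from the UCI data sets).
rows = [
    {"age": "30-39", "zip": "10**", "income": ">50K"},
    {"age": "30-39", "zip": "10**", "income": "<=50K"},
    {"age": "40-49", "zip": "11**", "income": ">50K"},
    {"age": "40-49", "zip": "11**", "income": ">50K"},
]

print(is_k_anonymous(rows, ["age", "zip"], k=2))
print(is_k_anonymous(rows, ["age", "zip"], k=3))
```

Each group in the toy table has exactly two records, so the check passes for k=2 and fails for k=3.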
Two notebooks are provided that use the results to summarise and visualise the data from the experiments. One notebook contains the plots used in Figures 3 and 4 of the paper. The other notebook contains the code required for the Mann-Whitney U test presented in Table V.
The notebooks used for Tables II-IV and VI are available in the folders Anonymisation and Synthetic data.
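For readers unfamiliar with the Mann-Whitney U test mentioned above, the sketch below computes the U statistic directly from its pairwise definition. It is an illustration only and not the notebook's code: the function name `mann_whitney_u` and the sample values are hypothetical, and no p-value is computed (a statistics library such as SciPy's `scipy.stats.mannwhitneyu` would normally be used for that).

```python
def mann_whitney_u(x, y):
    """Return (U_x, U_y) for two samples using the pairwise definition.

    U_x counts the pairs (a, b), with a from x and b from y, where a > b.
    Ties count as one half.
    """
    u_x = 0.0
    for a in x:
        for b in y:
            if a > b:
                u_x += 1.0
            elif a == b:
                u_x += 0.5
    # The two statistics always sum to len(x) * len(y).
    u_y = len(x) * len(y) - u_x
    return u_x, u_y


# Hypothetical measurements for two experimental conditions (made-up numbers).
baseline = [10.1, 10.4, 9.8, 10.0]
treated = [11.2, 11.0, 10.9, 11.5]

u_baseline, u_treated = mann_whitney_u(baseline, treated)
print(u_baseline, u_treated)
```

A U value of 0 for one sample means every one of its observations is smaller than every observation in the other sample, which is the strongest possible separation between the two groups.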
The .gitignore file is set up to ignore preprocessed data sets and results to keep the repository small. It also ignores .ipynb files and their checkpoints, as these were used for development purposes only.