This is the CleanML Benchmark for Joint Data Cleaning and Machine Learning.
The codebase is located in: https://github.com/chu-data-lab/CleanML
The details of the benchmark methodology and design are described in the paper:
CleanML: A Benchmark for Joint Data Cleaning and Machine Learning [Experiments and Analysis]
This project has been tested with Python 3.6 and Python 3.7.
python3 -m venv venv
source venv/bin/activate
pip install -U pip
pip install -r requirements.txt
To run experiments, download and unzip the datasets. Place them under the project home directory and execute the following command from the project home directory:
python3 main.py --run_experiments [--dataset <name>] [--cpu <num_cpu>] [--log]
--dataset: the experiment dataset. If not specified, the program runs experiments on all datasets.
--cpu: the number of CPUs used for the experiments. Default is 1.
--log: whether to log the experiment process.
The experimental results for each dataset will be saved in the /result directory as a JSON file named <dataset name>_result.json. Each result is a key-value pair. The key is a string in the format "<dataset>/<split seed>/<error type>/<clean method>/<ML model>/<random search seed>". The value is a set of key-value pairs, one for each evaluation metric and its result. Our experimental results are provided in result.zip.
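The key format above can be split back into its six components when post-processing a result file. The following is a minimal sketch, not part of the codebase: the `ResultKey` tuple, the `parse_result_key` helper, and the sample key and metric name are all hypothetical, and it assumes no component contains a "/" character.

```python
import json
from collections import namedtuple

# Field names mirror the documented key format; the tuple itself is illustrative.
ResultKey = namedtuple(
    "ResultKey",
    ["dataset", "split_seed", "error_type", "clean_method", "model", "search_seed"],
)

def parse_result_key(key):
    """Split a result key into its six documented components (all kept as strings)."""
    parts = key.split("/")
    if len(parts) != 6:
        raise ValueError("expected 6 '/'-separated fields, got %d: %r" % (len(parts), key))
    return ResultKey(*parts)

# Hypothetical sample standing in for the contents of a <dataset name>_result.json file.
sample_text = '{"ExampleData/1/missing_values/impute_mean/logistic_regression/42": {"test_acc": 0.9}}'
sample = json.loads(sample_text)

# Re-index the results by structured key so they can be filtered by component.
parsed = {parse_result_key(k): v for k, v in sample.items()}
```

With the results indexed this way, entries can be filtered by component, for example by keeping only the keys whose `error_type` matches a given name.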
To run the analysis that populates the relations described in the paper, unzip result.zip and execute the following command from the project home directory:
python3 main.py --run_analysis [--alpha <value>]
--alpha: the significance level for the multiple hypothesis test. Default is 0.05.
The relations R1, R2 and R3 will be saved in the /analysis directory. Our analysis results are provided in analysis.zip.
To add a new dataset, first create a new folder with the dataset name under /data and create a raw folder under the new folder. The raw folder must contain the raw data, named raw.csv. For a dataset with inconsistencies, it must also contain the inconsistency-cleaned version of the data, named inconsistency_clean_raw.csv. For a dataset with mislabels, it must also contain the mislabel-cleaned version of the data, named mislabel_clean_raw.csv. The directory structure looks like:
.
└── data
    └── new_dataset
        └── raw
            ├── raw.csv
            ├── inconsistency_clean_raw.csv (for dataset with inconsistencies)
            └── mislabel_clean_raw.csv (for dataset with mislabels)
Then add a dictionary to /schema/dataset.py and append it to the datasets array at the end of the file.
The new dictionary must contain the following keys:
data_dir: the name of the dataset.
error_types: a list of error types that the dataset contains.
label: the label of the ML task.
The following keys are optional:
class_imbalance: whether the dataset is class imbalanced.
categorical_variables: a list of categorical attributes.
text_variables: a list of text attributes.
key_columns: a list of key columns used for deduplication.
drop_variables: a list of irrelevant attributes.
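As an illustration of the keys listed above, a dataset entry might look like the sketch below. Every value (the column names, the error type names) is hypothetical, and the `missing_required_keys` helper is not part of the codebase; it only shows one way to check an entry before appending it.

```python
# Hypothetical dataset entry; all column and error type names are placeholders.
new_dataset = {
    # required keys
    "data_dir": "new_dataset",
    "error_types": ["missing_values", "outliers"],
    "label": "target",
    # optional keys
    "class_imbalance": False,
    "categorical_variables": ["city"],
    "text_variables": ["description"],
    "key_columns": ["id"],
    "drop_variables": ["timestamp"],
}

REQUIRED_KEYS = {"data_dir", "error_types", "label"}

def missing_required_keys(entry):
    """Return the required keys absent from a dataset dictionary, sorted by name."""
    return sorted(REQUIRED_KEYS - set(entry))

# Stand-in for the datasets array at the end of /schema/dataset.py.
datasets = []
if not missing_required_keys(new_dataset):
    datasets.append(new_dataset)
```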
To add a new error type, add a dictionary to /schema/error_type.py and append it to the error_types array at the end of the file.
The new dictionary must contain the following keys:
name: the name of the error type.
cleaning_methods: a dictionary, {cleaning method name: cleaning method object}.
To add a new ML model, add a dictionary to /schema/model.py and append it to the models array at the end of the file.
The new dictionary must contain the following keys:
name: the name of the model.
fn: the function of the model.
fixed_params: parameters not to be tuned.
hyperparams: the hyperparameter to be tuned.
hyperparams_type: the type of the hyperparameter, "real" or "int".
hyperparams_range: the search range. Specify the range in log scale for real-type hyperparameters.
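The sketch below shows what a model entry with these keys might look like and how a random search could draw a value from it. It rests on several assumptions not confirmed by this README: that fn is a class (or callable) accepting keyword arguments, that a "real" range holds base-10 exponents, and that an "int" range is inclusive. `PlaceholderClassifier`, `sample_hyperparam`, and `build_model` are hypothetical names.

```python
import random

class PlaceholderClassifier:
    """Hypothetical stand-in for a real model class; it only stores its parameters."""
    def __init__(self, **params):
        self.params = params

# Hypothetical model entry using the documented keys.
new_model = {
    "name": "placeholder_classifier",
    "fn": PlaceholderClassifier,
    "fixed_params": {"max_iter": 100},
    "hyperparams": "C",
    "hyperparams_type": "real",
    "hyperparams_range": [-5, 5],  # assumption: base-10 exponents, i.e. 1e-5 to 1e5
}

def sample_hyperparam(model, rng):
    """Draw one candidate value for the tuned hyperparameter."""
    low, high = model["hyperparams_range"]
    if model["hyperparams_type"] == "real":
        # Assumed interpretation: sample the exponent uniformly, then exponentiate.
        return 10 ** rng.uniform(low, high)
    return rng.randint(low, high)  # inclusive on both ends

def build_model(model, rng):
    """Instantiate the model with its fixed parameters plus one sampled value."""
    params = dict(model["fixed_params"])
    params[model["hyperparams"]] = sample_hyperparam(model, rng)
    return model["fn"](**params)

clf = build_model(new_model, random.Random(0))
```

Sampling the exponent rather than the value itself spreads candidates evenly across orders of magnitude, which is the usual reason for giving real-valued ranges in log scale.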
To add a new cleaning method, add a class to /schema/cleaning_method.py.
The class must contain two methods:
fit(dataset, dirty_train): takes in the dataset dictionary and the dirty training set. Computes statistics or trains models on the training set for data cleaning.
clean(dirty_train, dirty_test): takes in the dirty training set and the dirty test set. Cleans the errors in the training set and the test set. Returns (clean_train, indicator_train, clean_test, indicator_test), which are the cleaned versions of the datasets and indicators that mark the locations of the errors.
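A minimal sketch of a class with this interface follows. To stay dependency-free it represents each dataset as a list of row dictionaries with None marking a missing value; that representation, the `MeanImputer` class, and all names in the usage lines are assumptions for illustration and may differ from what the codebase expects.

```python
class MeanImputer:
    """Illustrative cleaning method: fill missing (None) values with the training mean."""

    def fit(self, dataset, dirty_train):
        # Only the "label" key of the dataset dictionary is used, to skip the label column.
        label = dataset.get("label")
        columns = [col for col in dirty_train[0] if col != label]
        self.means = {}
        for col in columns:
            values = [row[col] for row in dirty_train if row[col] is not None]
            # Statistics come from the training set only, so the test set never leaks in.
            self.means[col] = sum(values) / len(values) if values else 0.0

    def _clean_rows(self, rows):
        clean, indicator = [], []
        for row in rows:
            flags = {col: row.get(col) is None for col in self.means}
            fixed = dict(row)  # copy, so the dirty input is left untouched
            for col, is_missing in flags.items():
                if is_missing:
                    fixed[col] = self.means[col]
            clean.append(fixed)
            indicator.append(flags)
        return clean, indicator

    def clean(self, dirty_train, dirty_test):
        clean_train, indicator_train = self._clean_rows(dirty_train)
        clean_test, indicator_test = self._clean_rows(dirty_test)
        return clean_train, indicator_train, clean_test, indicator_test

# Hypothetical usage with a tiny numeric dataset.
train = [{"age": 20.0, "y": 0}, {"age": None, "y": 1}, {"age": 40.0, "y": 1}]
test = [{"age": None, "y": 0}]
cleaner = MeanImputer()
cleaner.fit({"label": "y"}, train)
clean_train, ind_train, clean_test, ind_test = cleaner.clean(train, test)

# A matching error type entry (names are placeholders) would reference the object.
new_error_type = {"name": "missing_values_demo", "cleaning_methods": {"impute_mean_demo": cleaner}}
```

Keeping fit and clean separate means the same training statistics are applied to both splits, so the test set is cleaned without influencing what was learned.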
We consider the "BD" and "CD" scenarios in our paper. To investigate other scenarios, add them to /schema/scenario.py.