Code for the first project of the Machine Learning Course (CS-433) at EPFL.
In this document, we provide instructions on how to run this code and information about how the project is organized. Please refer to the report in `report.pdf` for a detailed description of the project, the motivation for the choices we made, and the results we obtained.
- Make sure `python` is installed on your system. We used v3.11.8 for development.
- Install all the packages listed in `requirements.txt` (`numpy` and `matplotlib`). For example, if you use pip for package management, you can do so with:
```bash
pip install -r requirements.txt
```
- Place the raw data (the `x_train.csv`, `y_train.csv` and `x_test.csv` files) in the `data_raw` folder. You can download the data here.
The project is organized as follows:

- `run.py` contains the full pipeline, which performs all of our experiments.
- `models.py` contains the implementation of SVM, Logistic Regression, Ridge Regression and PCA.
- `columns.py` contains the result of manual interpretation of the dataset features; features (names of columns) are separated into categorical and numerical.
- `analysis.ipynb` generates our figures.
- `data_preprocessing.py` is where we transform the raw dataset into all the versions we compare (see the columns of Table 1).
- `helpers.py` is a modified version of the file with the same name provided for the ML labs; it contains functions for reading the CSV files of the raw dataset, as well as functions for creating the AICrowd submission.
The implementation of the ML methods is in the `implementations.py` file. The methods are implemented as functions with the interface specified by the project description. The implemented methods are (a short usage sketch follows the list):
- `mean_squared_error_gd(y, tx, initial_w, max_iters, gamma)`: Linear regression using gradient descent
- `mean_squared_error_sgd(y, tx, initial_w, max_iters, gamma)`: Linear regression using stochastic gradient descent
- `least_squares(y, tx)`: Least squares regression using normal equations
- `ridge_regression(y, tx, lambda_)`: Ridge regression using normal equations
- `logistic_regression(y, tx, initial_w, max_iters, gamma)`: Logistic regression using gradient descent
- `reg_logistic_regression(y, tx, lambda_, initial_w, max_iters, gamma)`: Regularized logistic regression using gradient descent
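As a quick illustration, the sketch below calls two of these functions on a tiny synthetic dataset. The array shapes follow the usual course convention (`y` of shape `(N,)`, `tx` of shape `(N, D)`), and each function is assumed to return a `(w, loss)` pair as specified by the project description; the data and hyperparameter values are purely illustrative.

```python
import numpy as np
from implementations import ridge_regression, logistic_regression

# Tiny synthetic dataset: N = 100 samples, D = 3 features.
rng = np.random.default_rng(0)
tx = rng.normal(size=(100, 3))
y = (tx @ np.array([0.5, -1.0, 2.0]) > 0).astype(float)  # binary labels in {0, 1}

# Ridge regression via the normal equations; lambda_ is the regularization strength.
w_ridge, loss_ridge = ridge_regression(y, tx, lambda_=0.1)

# Logistic regression via gradient descent, starting from a zero weight vector.
initial_w = np.zeros(tx.shape[1])
w_log, loss_log = logistic_regression(y, tx, initial_w, max_iters=100, gamma=0.1)
```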
The original data is expected to be stored in the `data_raw` folder. This includes the `x_train.csv`, `y_train.csv` and `x_test.csv` files. The pipeline we use for the medical dataset preprocesses this raw data and stores the preprocessed data in the `data_clean` folder, along with checkpoints and results. If you want to change these paths (`data_raw` and `data_clean`), you can do so in the configuration, as explained below.
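For reference, a typical layout with the default paths looks roughly like this (the contents of `data_clean` are created by the pipeline and depend on which parts of it you run):

```
.
├── data_raw/
│   ├── x_train.csv
│   ├── y_train.csv
│   └── x_test.csv
└── data_clean/               # preprocessed data, checkpoints and results
    └── runs/<timestamp>/     # per-run outputs, including the AICrowd submission file
```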
All data preprocessing is done in the `data_preprocessing.py` file, where the main entry point is the `get_all_data` function. This function reads the raw data, preprocesses it according to the configuration, and returns the preprocessed data. Please refer to its docstring and the comments in the code for more information.
The main pipeline is implemented in the `run.py` file. To run the pipeline, simply run the following command:
```bash
python run.py
```
The current settings evaluate all the models on all the data preprocessing configurations, trying all possible combinations of hyperparameters. This takes several hours on a laptop! If you only want to run the best model, i.e. the one that produced the best submission predictions on AICrowd, the sections Configuration and Generating final AICrowd predictions explain how to do that. Furthermore, they explain the configuration options that allow you to narrow down the search space of hyperparameters and data pipelines.
## Configuration

The configuration of the pipeline is done at the top of the `run.py` file, in a global dictionary called `cfg`:
```python
### global config
cfg = {
    "raw_data_path": "data_raw",
    "clean_data_path": "data_clean",
    "allow_load_clean_data": False,
    "remap_labels_to_01": True,
    "seed": 0,
    "scoring_fn": f1,
    "eval_frac": 0.1,
    "retrain_on_all_data_after_eval": True,
    "train": {
        "retrain_selected_on_all_data": True,
        "cv": {
            "k_folds": 5,
            "shuffle": True,
        },
        # "holdout": {
        #     "split_frac": 0.2,
        #     "seed": 0,
        # },
    },
}
```
Explanation of the configuration:

- `raw_data_path`: path to the folder containing the raw data (default: `data_raw`)
- `clean_data_path`: path to the folder where the preprocessed data and results will be stored (default: `data_clean`)
- `allow_load_clean_data`: if `True`, the pipeline will try to load the preprocessed data from the `clean_data_path` folder, falling back to preprocessing the raw data if it is not found. If `False`, the pipeline will always preprocess the raw data and store it in the `clean_data_path` folder (default: `False`)
- `remap_labels_to_01`: if `True`, the pipeline will remap the labels to 0 and 1. For the methods to work correctly, the labels should be 0 and 1 (default: `True`)
- `seed`: seed to ensure reproducibility of the results (default: `0`)
- `scoring_fn`: scoring function to use for the cross-validation (default: `f1`, options: `f1`, `accuracy`)
- `eval_frac`: fraction of the data to use for the final evaluation of the models (default: `0.1`)
- `retrain_on_all_data_after_eval`: if `True`, the pipeline will retrain the best selected model on all the data after the final evaluation (default: `True`)
- `train`: configuration for the training part of the pipeline
  - `retrain_selected_on_all_data`: if `True`, the pipeline will retrain the selected model on all the data after the cross-validation (default: `True`)
  - `cv`: configuration for the cross-validation
    - `k_folds`: number of folds for the cross-validation (default: `5`)
    - `shuffle`: if `True`, the data will be shuffled before the cross-validation, otherwise every `k`-th sample will be assigned to the same fold (default: `True`)
  - `holdout`: configuration for the holdout validation (computationally cheaper than cross-validation)
    - `split_frac`: fraction of the data to use for the holdout validation (default: `0.2`)
    - `seed`: seed to ensure reproducibility of the results (default: `0`)
As you can see, `holdout` is commented out by default; this is how we specify that we want to use cross-validation instead of holdout validation. If you want to use holdout validation, simply uncomment the `holdout` configuration, specify the desired parameters, and comment out the `cv` configuration, as shown below.
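For example, after making that switch, the `train` section of `cfg` would look like this (the values are the commented-out defaults shown above):

```python
"train": {
    "retrain_selected_on_all_data": True,
    # "cv": {
    #     "k_folds": 5,
    #     "shuffle": True,
    # },
    "holdout": {
        "split_frac": 0.2,
        "seed": 0,
    },
},
```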
The configuration of the hyperparameter search is done separately from `cfg`, in the `runs` dictionary, which has the following structure:
```python
### data-model combinations to run
runs = {
    "data": {
        "<data preprocessing name>": <preprocessing config or None>,
    },
    "models": {
        "<model name>": {
            "model_cls": <model class>,
            "hyperparam_search": <hyperparameter search space config>,
        },
    },
}
```
Explanation of the `runs` configuration:

- `data`: dictionary containing the data preprocessing configurations. The key is the name of the data preprocessing, and the value is the configuration for the data preprocessing (a `dict`, or `None` if no preprocessing is needed). The currently supported keys are (see the example after this list):
  - `process_cols`: columns to clean and use (`all`, `selected`, or an integer representing the percentage of columns to use)
  - `pca_kwargs`: configuration for the PCA preprocessing: `None` if this PCA step should be omitted, or a dictionary with the following keys if it should be included:
    - `max_frac_of_nan`: maximum fraction of NaN values in a column to keep it in the PCA preprocessing (between 0 and 1)
    - `n_components`: number of components to keep in the PCA preprocessing (you need to specify either this or `min_explained_variance`, not both)
    - `min_explained_variance`: minimum explained variance to keep in the PCA preprocessing (between 0 and 1; you need to specify either this or `n_components`, not both)
  - `standardize_num`: if `True`, the numerical columns will be standardized (default: `True`)
  - `onehot_cat`: if `True`, the categorical columns will be one-hot encoded (default: `True`)
  - `skip_rule_transformations`: if `True`, the rule-based transformations will be skipped (default: `False`)
- `models`: dictionary containing the models to run and their hyperparameter search spaces. The key is the name of the model, and the value is a dictionary with the following keys (see the example after this list):
  - `model_cls`: the class of the model to run
  - `hyperparam_search`: the hyperparameter search space configuration. This is a dictionary mapping each hyperparameter name to the list of values to try for it.
  - Please refer to the `models.py` file to see the available models and their hyperparameters.
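As an illustrative sketch of a search space, the entry below reuses the `LogisticRegression` hyperparameter names from the AICrowd configuration further down; the grids of values are made up, and every combination of the listed values gets evaluated:

```python
"models": {
    "Logistic Regression": {
        "model_cls": LogisticRegression,
        "hyperparam_search": {
            # Each key maps to a list of candidate values; all combinations are tried.
            "gamma": [None],
            "use_line_search": [True],
            "optim_algo": ["lbfgs"],
            "optim_kwargs": [{"epochs": 1}],
            "class_weights": [{0: 1, 1: 1}, {0: 1, 1: 4}],  # 2 class weightings
            "reg_mul": [0, 0.1, 1.0],                        # 3 regularization strengths
            "verbose": [False],
        },
    },
},
```

With these lists, 2 × 3 = 6 hyperparameter combinations would be evaluated for this model.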
## Generating final AICrowd predictions

To generate the submission we used for AICrowd, comment out or remove all the data pipelines other than `"All columns": {"process_cols": "all", "pca_kwargs": None}`. Furthermore, use only the following model dictionary in the `models` configuration (currently commented out in the `run.py` file):
"Logistic Regression": { ### AICrowd submission
"model_cls": LogisticRegression,
"hyperparam_search": {
"gamma": [None],
"use_line_search": [True],
"optim_algo": ["lbfgs"],
"optim_kwargs": [{"epochs": 1}],
"class_weights": [{0: 1, 1: 4}],
"reg_mul": [0],
"verbose": [False],
},
},
The `cfg` configuration should be kept as is (`seed=0`).
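Putting the two pieces together, the `runs` dictionary for reproducing the submission would look roughly like this (the same values as above, just assembled into one dictionary):

```python
### data-model combination for the AICrowd submission
runs = {
    "data": {
        "All columns": {"process_cols": "all", "pca_kwargs": None},
    },
    "models": {
        "Logistic Regression": {
            "model_cls": LogisticRegression,
            "hyperparam_search": {
                "gamma": [None],
                "use_line_search": [True],
                "optim_algo": ["lbfgs"],
                "optim_kwargs": [{"epochs": 1}],
                "class_weights": [{0: 1, 1: 4}],
                "reg_mul": [0],
                "verbose": [False],
            },
        },
    },
}
```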
After running `python run.py`, the pipeline will train the model on all the data and store the predictions in the `data_clean/runs/<current-timestep>` folder (default path; `<current-timestep>` will be a timestamp of the run). The predictions will be stored in a file whose name ends with `submission.csv` in the same folder (for logistic regression this is `Logistic_Regression_submission.csv`). This file can be submitted to AICrowd.