Optimizing an ML Pipeline in Azure

Overview

This project is part of the Udacity Azure ML Nanodegree. In this project, we build and optimize an Azure ML pipeline using the Python SDK and a provided Scikit-learn model. This model is then compared to an Azure AutoML run.

Architecture diagram

Summary

The dataset used in this project contains bank marketing data. The aim is to predict whether a customer subscribes to a fixed-term deposit. The best-performing model was a VotingEnsemble found by AutoML, with an accuracy of 0.9175.

Scikit-learn Pipeline

First, the dataset was imported from the specified URL of the Bank Marketing data. It was then pre-processed by the clean_data function in train.py and split into training and test sets. A Logistic Regression model was then trained, with the hyperparameters C and max_iter tuned using HyperDrive.
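HyperDrive passes each sampled hyperparameter value to the training script as a command-line argument. The sketch below shows one way train.py could receive C and max_iter. The argument names `--C` and `--max_iter`, the helper name `parse_hyperparameters`, and the defaults (scikit-learn's LogisticRegression defaults) are assumptions, not taken from the project's actual script.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse the two tuned hyperparameters from the command line.

    The flag names and defaults are assumptions for illustration.
    """
    parser = argparse.ArgumentParser(description="Logistic Regression training script")
    parser.add_argument("--C", type=float, default=1.0,
                        help="Inverse of regularization strength")
    parser.add_argument("--max_iter", type=int, default=100,
                        help="Maximum number of solver iterations")
    return parser.parse_args(argv)


# Simulate the arguments HyperDrive might pass for one child run.
args = parse_hyperparameters(["--C", "0.5", "--max_iter", "200"])
print(f"C={args.C}, max_iter={args.max_iter}")
```

The parsed values would then be handed to the model constructor, for example `LogisticRegression(C=args.C, max_iter=args.max_iter)`.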

RandomParameterSampling was used as the parameter sampler. It draws each hyperparameter value at random from its search space, supports both continuous and discrete distributions, and covers the space with less computation than an exhaustive grid search.
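The idea behind random sampling can be sketched in plain Python, without the Azure SDK. The search ranges below are hypothetical, since the project's actual ranges are not stated here. In the SDK, the equivalent search space would be declared with the `uniform` and `choice` expressions from `azureml.train.hyperdrive`.

```python
import random


def sample_config(rng):
    """Draw one hyperparameter configuration at random.

    The ranges are hypothetical and chosen only for illustration.
    """
    return {
        "C": rng.uniform(0.01, 10.0),                 # continuous range
        "max_iter": rng.choice([50, 100, 150, 200]),  # discrete choices
    }


rng = random.Random(42)  # fixed seed so the sketch is repeatable
configs = [sample_config(rng) for _ in range(5)]
for config in configs:
    print(config)
```

Each configuration is drawn independently, so the number of trials can be capped at any budget instead of growing with the size of a grid.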

BanditPolicy was used as the early stopping policy. It is configured with a slack factor and an evaluation_interval: at each evaluation interval, any run whose metric falls outside the slack factor of the best-performing run is terminated, which avoids spending compute on unpromising runs.
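The slack-factor rule can be illustrated with a small function. This is a simplified sketch of the rule for a metric that is being maximized, not the SDK's implementation. The function name `should_terminate`, the slack factor of 0.1, and the accuracy values are all assumptions for illustration.

```python
def should_terminate(run_metric, best_metric, slack_factor):
    """Return True if a run falls outside the allowed slack.

    For a maximized metric, a run is cut when its metric is below
    best_metric / (1 + slack_factor).
    """
    return run_metric < best_metric / (1.0 + slack_factor)


best_accuracy = 0.91  # hypothetical best run so far
for accuracy in (0.90, 0.85, 0.80):
    decision = should_terminate(accuracy, best_accuracy, slack_factor=0.1)
    print(f"accuracy={accuracy:.2f} terminate={decision}")
```

The evaluation_interval setting controls how often this check is applied, so a larger interval gives slow-starting runs more time before they can be stopped.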

AutoML

With AutoML, the VotingEnsemble model performed best, with an accuracy of 91.75%. AutoMLConfig was set up with experiment_timeout_minutes=30, task="classification", primary_metric="accuracy", label_column_name="y", n_cross_validations=5, and training_data set to the concatenation of the x and y values returned by the clean_data function.
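The settings listed above could be gathered as follows. This is a sketch under assumptions: the helper name `build_automl_config` is hypothetical, the import path assumes the Azure ML SDK v1 package `azureml-train-automl`, and other arguments a real run might need (such as a compute target) are left out because they are not stated in this README.

```python
# Settings quoted in the text above, collected in one place.
automl_settings = {
    "experiment_timeout_minutes": 30,
    "task": "classification",
    "primary_metric": "accuracy",
    "label_column_name": "y",
    "n_cross_validations": 5,
}


def build_automl_config(training_data):
    """Build an AutoMLConfig from the settings above.

    Hypothetical helper; the SDK is imported lazily so the settings
    can be inspected on a machine without it installed.
    """
    from azureml.train.automl import AutoMLConfig
    return AutoMLConfig(training_data=training_data, **automl_settings)


print(automl_settings)
```

Here `training_data` stands for the combined features and label column described above.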

The hyperparameters reported by AutoML for the VotingEnsemble model are min_samples_leaf=0.01, min_samples_split=0.01, min_weight_fraction_leaf=0.0, n_estimators=10 and n_jobs=1. The weights assigned to the ensemble's member models were [0.13333333333333333, 0.13333333333333333, 0.13333333333333333, 0.3333333333333333, 0.13333333333333333, 0.06666666666666667, 0.06666666666666667].
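To show how such weights act, the sketch below combines the predicted probabilities of seven member models into one weighted average (soft voting). The weights are the ones reported above. The member probabilities, the 0.5 decision threshold, and the function name `soft_vote` are hypothetical and only illustrate the mechanism.

```python
# Ensemble weights as reported above.
weights = [
    0.13333333333333333, 0.13333333333333333, 0.13333333333333333,
    0.3333333333333333, 0.13333333333333333,
    0.06666666666666667, 0.06666666666666667,
]


def soft_vote(member_probs, weights):
    """Weighted average of the members' positive-class probabilities."""
    if len(member_probs) != len(weights):
        raise ValueError("one probability is needed per ensemble member")
    total_weight = sum(weights)
    weighted_sum = sum(w * p for w, p in zip(weights, member_probs))
    return weighted_sum / total_weight


# Hypothetical probabilities that one customer subscribes,
# one value per member model.
member_probs = [0.62, 0.48, 0.55, 0.71, 0.40, 0.35, 0.58]
ensemble_prob = soft_vote(member_probs, weights)
prediction = "yes" if ensemble_prob >= 0.5 else "no"
print(f"ensemble probability={ensemble_prob:.3f}, prediction={prediction}")
```

The fourth member carries a weight of about 0.33, so its opinion counts roughly as much as any two or three of the others combined.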

Pipeline comparison

The accuracy using HyperDrive was 91.18%, while AutoML achieved 91.75%, a difference of 0.57 percentage points. The difference arises because HyperDrive tuned only the hyperparameters of a fixed model (Logistic Regression), whereas AutoML was free to try various models and pick the best one. Thus HyperDrive is a good fit when the model is already chosen and compute is limited, and AutoML when we want to search across many model types and can afford the extra computation.

Future work

Some areas of improvement for future experiments can be:

The existing data has a class imbalance problem, so we can pre-process it better or collect more data to balance the classes, and explore which features are important; better-quality data can lead to better performance.
We can also train more complex algorithms, such as deep neural networks, for better accuracy.
We can try different combinations of hyperparameters such as C and max_iter with HyperDrive, and also try other loss-related parameters.
In AutoML, we can try other numbers of cross-validation folds to improve accuracy.
We can optimize the early stopping policy so that more time is spent on finding the best model.
We can explore different algorithms with different evaluation metrics to get a better understanding of performance, since accuracy is not the only statistical metric.
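The points about class imbalance and alternative metrics can be made concrete with a small example. The confusion-matrix counts below are hypothetical: 1000 customers, 110 of whom subscribe, scored by a model that always predicts "no".

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for a model that never predicts a subscription.
baseline = classification_metrics(tp=0, fp=0, fn=110, tn=890)
print(baseline)
```

Such a model scores 89% accuracy while finding no subscribers at all, which is why recall, F1, or a similar metric is worth tracking alongside accuracy.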

Output Screenshots

Hyperdrive Run

AutoML Run

Best accuracy

somyadwivedi-mriirs/nd00333_AZMLND_Optimizing_a_Pipeline_in_Azure-Starter_Files
