The aim of this project is to highlight the different phases of a deep learning project, from data preparation to serving the final model through an app.
Specifically, the steps covered are:
- Downloading and preparing the dataset
- Training an object recognition model on Google Colab, using Detectron2 framework
- Extending Detectron2 with custom neural networks
- Serving the model on https://www.streamlit.io/
- (Additional code to serve the model with a very basic REST API is also provided)
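As a sketch of the first step, the balloon dataset ships with annotations in VGG Image Annotator (VIA) JSON format, which must be converted into Detectron2's list-of-dicts dataset format before training. The snippet below is an illustrative version of that conversion, not the repository's exact code (field names follow Detectron2's dataset conventions; the annotation filename is the one used in the official balloon tutorial):

```python
import json
import os

def balloon_to_detectron2_dicts(img_dir, annotation_file="via_region_data.json"):
    """Convert VIA-style balloon annotations into Detectron2's
    list-of-dicts format (illustrative sketch)."""
    with open(os.path.join(img_dir, annotation_file)) as f:
        via_annotations = json.load(f)

    dataset_dicts = []
    for idx, record in enumerate(via_annotations.values()):
        objs = []
        # Note: "regions" is a dict in older VIA exports, a list in newer ones.
        for region in record["regions"].values():
            shape = region["shape_attributes"]
            px, py = shape["all_points_x"], shape["all_points_y"]
            # Flatten polygon points into [x0, y0, x1, y1, ...]
            poly = [coord for xy in zip(px, py) for coord in xy]
            objs.append({
                "bbox": [min(px), min(py), max(px), max(py)],
                "bbox_mode": 0,          # BoxMode.XYXY_ABS in Detectron2
                "segmentation": [poly],
                "category_id": 0,        # single class: balloon
            })
        dataset_dicts.append({
            "file_name": os.path.join(img_dir, record["filename"]),
            "image_id": idx,
            "annotations": objs,
        })
    return dataset_dicts
```

The resulting list can then be registered with Detectron2's `DatasetCatalog` so the trainer can consume it.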
This project is organized in two folders:
- The `MaskRCNN_finetune` folder contains all the deep-learning-related code. Specifically, it contains code to download and extract the well-known balloon dataset, as well as code to extend Detectron2 with new models (MobileNetV2 and VoVNet-19) and to fine-tune them on our dataset.
- The `REST_API_flask` folder contains the code to serve the trained model with an API built with Flask.
This project is designed to run on Google Colab but should be reproducible without (too much) hassle on any Linux machine with a CUDA-enabled device.
Please use `notebooks/object_recognition.ipynb` to run the deep learning code and `notebooks/object_recognition_REST_API.ipynb` to use the REST API.
In image classification problems, there is usually a single object of interest. For that specific object, we build models to predict whether that object belongs to a specific class. For example, given a picture of an animal, the model should tell you whether it is a cat or a dog.
However, the real world is much more complex. What if there are both cats and dogs in the image? What if we need to know exactly where the dogs are and where the cats are? What if they overlap, with a dog walking in front of a cat?
Image segmentation techniques such as Mask R-CNN allow us to detect multiple objects, classify them separately, and localize them within a single image.
To start with, a good introductory read is: A Brief History of CNNs in Image Segmentation: From R-CNN to Mask R-CNN
Most Mask R-CNN models use ResNet-101 as a backbone, which is a humongous model. According to the original paper, the inference time is lower-bounded at 200 ms, which is very slow. We provide code to try out different backbones; a comparison table is available at the end of this document.
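Per-image inference times like the ones in the comparison table below can be measured with a simple wall-clock loop. Here is a minimal sketch; the `predictor` is a stand-in for any callable model (in the real project it would be a `detectron2.engine.DefaultPredictor`), and the dummy inputs only illustrate the shape of the loop:

```python
import time

def average_inference_ms(predictor, images, warmup=2):
    """Average per-image wall-clock inference time in milliseconds.
    A few warm-up calls are discarded first so one-off initialisation
    (CUDA context, lazy weight loading) does not skew the average."""
    for img in images[:warmup]:
        predictor(img)
    start = time.perf_counter()
    for img in images:
        predictor(img)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(images)

# Stand-in predictor for illustration; replace with DefaultPredictor(cfg)
# and real images to reproduce the numbers in the table below.
dummy_predictor = lambda img: {"instances": []}
avg_ms = average_inference_ms(dummy_predictor, [None] * 10)
```

For stable numbers on a GPU, the model should also be run on identically sized inputs, since image resolution dominates inference cost.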
Please upload `notebooks/object_recognition.ipynb` to Google Colab and run the cells to reproduce the results.
The REST API uses ngrok and Flask and is pretty straightforward in its current state. To use it, please upload `notebooks/object_recognition_REST_API.ipynb` to Google Colab and follow the cells.
The app first downloads and instantiates the model when the API is launched; then you can:
- Upload a local image
- Run inference
- View the predicted masks in the browser
- Download the predicted masks
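The Flask side of this flow can be sketched roughly as follows. This is an illustrative outline, not the repository's actual routes: the route names and the `run_inference` placeholder are assumptions, and in the real app the placeholder would call the trained Mask R-CNN and render the masks.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def run_inference(image_bytes):
    """Placeholder for the real model call; the project's API would
    feed the uploaded image to the trained Mask R-CNN here."""
    return {"num_instances": 0, "masks": []}

@app.route("/health")
def health():
    # Simple liveness check, useful once the app is tunnelled via ngrok.
    return jsonify(status="ok")

@app.route("/predict", methods=["POST"])
def predict():
    uploaded = request.files.get("image")
    if uploaded is None:
        return jsonify(error="no image uploaded"), 400
    return jsonify(run_inference(uploaded.read()))
```

With ngrok forwarding a public URL to the local Flask port, the same endpoints become reachable from outside the Colab VM.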
| Model | Inference Time | AP50 (Val) | AP50 (Test) |
|---|---|---|---|
| ResNet-50-FPN | 134 ms | 90 | 84 |
| ResNet-101-FPN | 179 ms | 95 | 88 |
| MobileNetV2-FPN | 98.3 ms | 95 | 62 |
| VoVNet-19-FPN | 95.6 ms | 90 | 85 |
- Final trained weights for each model are available on Dropbox. The download links below can be used directly as a config entry or as a script argument; they are also provided in the `inference_config`:
  - ResNet-50-FPN: https://www.dropbox.com/s/yn7m8xnva068glq/ResNet50_FPN_model_final.pth?dl=1
  - ResNet-101-FPN: https://www.dropbox.com/s/otp52ccygc2t3or/ResNet101_FPN_model_final.pth?dl=1
  - MobileNetV2-FPN: https://www.dropbox.com/s/tn6fhy829ckp5ar/MobileNetV2_FPN_model_final.pth?dl=1
  - VoVNet-FPN: https://www.dropbox.com/s/smm7t8jsyp05m4r/VoVNet19_FPN_model_final.pth?dl=1
- Visual outputs for each model are available under `reports/figures/model_name/predicted_images/`
- Training metrics for each model are available under `models/model_name/`
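To illustrate the "script argument" route for the weights, here is a hedged sketch of a command-line interface for an inference script. The argument names are hypothetical, but the key point is real: Detectron2's checkpointer accepts a URL in `cfg.MODEL.WEIGHTS`, so a Dropbox `?dl=1` link can be passed straight through.

```python
import argparse

def build_arg_parser():
    """Hypothetical CLI for an inference script; the Dropbox link can be
    passed directly as the weights argument."""
    parser = argparse.ArgumentParser(description="Run Mask R-CNN inference")
    parser.add_argument("--weights", required=True,
                        help="Local path or URL to a .pth checkpoint")
    parser.add_argument("--image", required=True,
                        help="Path to the input image")
    return parser

args = build_arg_parser().parse_args([
    "--weights",
    "https://www.dropbox.com/s/yn7m8xnva068glq/ResNet50_FPN_model_final.pth?dl=1",
    "--image", "balloon.jpg",
])
# args.weights now holds the URL to assign to cfg.MODEL.WEIGHTS
```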