Skip to content

Latest commit

 

History

History
61 lines (49 loc) · 3.99 KB

README.md

File metadata and controls

61 lines (49 loc) · 3.99 KB

RetinaNet

Introduction

The project is an implementation of RetinaNet, a state-of-the-art object detection model based on deep learning. The project involved training RetinaNet on the MSCOCO dataset with the goal of achieving high accuracy in detecting objects in images.

Table of Contents

Dataset

The Microsoft Common Objects in Context (MSCOCO) dataset was chosen for its diversity and size. It contains over 330,000 images, with 200,000 labeled images spanning 80 object categories. This dataset is widely used for benchmarking object detection models.

Data Preprocessing

The preprocessing steps included:

  1. Resizing images: Adjusting images to a uniform size for consistent input.
  2. Normalizing pixel values: Scaling pixel values to a standard range.
  3. Data augmentation: Applying techniques such as horizontal flipping, rotation, and scaling to increase the diversity of the training data.

Model Architecture

RetinaNet is a single unified network composed of a backbone network and two task-specific subnetworks:

  1. Backbone Network: Responsible for feature extraction using a deep convolutional neural network (CNN) pre-trained on a large dataset.
  2. Feature Pyramid Network (FPN): Enhances the backbone by creating a feature map pyramid with high resolution.
  3. Classification Subnet: A CNN that predicts the probability of each anchor box containing an object for each class.
  4. Regression Subnet: A CNN that predicts the bounding box coordinates for each anchor box.

Focal Loss

The Focal Loss function addresses the class imbalance during training. It modifies the standard cross-entropy loss by adding a modulating factor to down-weight the loss assigned to well-classified examples, focusing more on hard, misclassified examples.

Focal Loss is defined as: [ FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t) ]

where:

  • ( p_t ) is the model's estimated probability for the true class.
  • ( \alpha_t ) balances the importance of positive/negative examples.
  • ( \gamma ) is a focusing parameter that adjusts the rate at which easy examples are down-weighted.

Training and Evaluation

Training

The RetinaNet model was trained on the MSCOCO dataset using the AdamW optimizer with a learning rate scheduler. The training steps included:

  1. Model Initialization: Initialized with a pre-trained ResNet50 backbone.
  2. Configuration: Used hyperparameters such as a learning rate of 0.0001, batch size of 16, and trained for 10 epochs.
  3. Loss Functions: Focal Loss for classification and Binary Cross-Entropy for regression.
  4. Monitoring: Performance monitored using Mean Average Precision (mAP) at different Intersection over Union (IoU) thresholds.

Evaluation

The evaluation was performed on the MSCOCO validation set using metrics such as Average Precision (AP) and Average Recall (AR) at various IoU thresholds.

Further Improvements

  1. Extended Training: Train for more epochs or initialize with deeper backbones (e.g., ResNet101, EfficientNet).
  2. Inference Optimization: Use techniques like model pruning, quantization, and knowledge distillation to enhance inference speed and deploy on edge devices.
  3. Advanced Data Augmentation: Employ methods such as MixUp, CutMix, and mosaic augmentation to improve generalization.

References

  1. Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal Loss for Dense Object Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  2. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common Objects in Context. European Conference on Computer Vision (ECCV).