In this project, I develop, train, and evaluate models for image captioning, inspired by BLIP's approach. The goal is to create a system that can generate descriptive and accurate captions for images. Additionally, I build a demo web app here to showcase these models in action, providing an interactive platform for users to experience the capabilities of AI-driven image captioning firsthand.
The Flickr30k dataset is divided into training and testing sets with a 70/30 split.
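For reference, here is a minimal sketch of how such a 70/30 split can be reproduced with the Hugging Face `datasets` library (the Hub dataset ID and split name below are assumptions; the repo's data module may load Flickr30k differently):

```python
from datasets import load_dataset

# Hub ID and split name are assumptions; the project's own data module
# may instead read the Flickr30k images and captions from a local folder.
dataset = load_dataset("nlphuji/flickr30k", split="test")

# Reproduce a 70/30 train/test split with a fixed seed for repeatability.
splits = dataset.train_test_split(test_size=0.3, seed=42)
train_set, test_set = splits["train"], splits["test"]
print(f"train: {len(train_set)} images, test: {len(test_set)} images")
```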
| Model | Test WER (%) | Test BLEU@4 (%) | Train WER (%) | Train BLEU@4 (%) | Config | Checkpoint | Report | Paper |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BLIP Base | 59.15 | 14.11 | 55.61 | 16.11 | Config | HuggingFace | Wandb | arXiv |
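As a rough illustration of how such metrics can be computed with the Hugging Face `evaluate` library (a sketch only, not necessarily the exact evaluation code used in this repo):

```python
import evaluate

# Hypothetical captions for illustration; these are not the model outputs
# behind the numbers in the table above.
predictions = ["a dog runs across a grassy field"]
references = [["a brown dog is running through the grass"]]

wer = evaluate.load("wer")
bleu = evaluate.load("bleu")

# WER compares each prediction against a single reference string.
wer_score = wer.compute(predictions=predictions, references=[r[0] for r in references])
# BLEU@4 uses n-gram precision up to 4-grams and supports multiple references.
bleu_score = bleu.compute(predictions=predictions, references=references, max_order=4)

print(f"WER: {100 * wer_score:.2f}  BLEU@4: {100 * bleu_score['bleu']:.2f}")
```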
You can use this notebook (Colab) or this demo on HuggingFace for inference. You can also run the Streamlit demo offline with the following command from the root directory:
streamlit run src/app.py
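For programmatic inference, here is a minimal sketch using the `transformers` library (shown with the public `Salesforce/blip-image-captioning-base` weights; the fine-tuned checkpoint from the table above can be loaded the same way):

```python
import requests
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

# Public BLIP base weights; swap in the fine-tuned checkpoint ID if desired.
model_id = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

# Caption an example image fetched from the web.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

inputs = processor(images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```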
# clone project
git clone https://github.com/tanthinhdt/imcap
cd imcap
# [OPTIONAL] create conda environment
conda create -n imcap python=3.11.10
conda activate imcap
# install pytorch according to instructions
# https://pytorch.org/get-started/
# install requirements
pip install -r requirements.txt
# clone project
git clone https://github.com/tanthinhdt/imcap
cd imcap
# create conda environment and install dependencies
conda env create -f environment.yaml -n imcap
# activate conda environment
conda activate imcap
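After installing with either method, you can quickly verify that PyTorch is set up and whether a GPU is visible:

```python
import torch

# Optional sanity check: print the installed version and GPU availability.
print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```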
Train the model with the default configuration:
# train on CPU
python src/train.py trainer=cpu
# train on GPU
python src/train.py trainer=gpu
Train the model with an experiment configuration of your choice from configs/experiment/:
python src/train.py experiment=experiment_name.yaml
You can override any parameter from the command line like this:
python src/train.py trainer.max_epochs=20 data.batch_size=64
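These overrides are handled by Hydra, which composes the configs under `configs/` and merges command-line overrides before training starts. A minimal sketch of such an entry point (the config path and name here are assumptions; see `src/train.py` for the actual implementation):

```python
import hydra
from omegaconf import DictConfig, OmegaConf

# Config path and name are assumptions; the real entry point lives in src/train.py.
@hydra.main(version_base="1.3", config_path="../configs", config_name="train")
def main(cfg: DictConfig) -> None:
    # Overrides such as trainer.max_epochs=20 are already merged into cfg here.
    print(OmegaConf.to_yaml(cfg))


if __name__ == "__main__":
    main()
```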