A TensorFlow implementation of the image-to-text model described in the paper:
"Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge."
Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan.
IEEE Transactions on Pattern Analysis and Machine Intelligence (2016).
Full text available at: http://arxiv.org/abs/1609.06647
Author: Chris Shallue
Pull requests and issues: @cshallue
The Show and Tell model is a deep neural network that learns how to describe the content of images. For example:
![Example captions](g3doc/example_captions.jpg)

The Show and Tell model is an example of an encoder-decoder neural network. It works by first "encoding" an image into a fixed-length vector representation, and then "decoding" the representation into a natural language description.
The image encoder is a deep convolutional neural network. This type of network is widely used for image tasks and is currently state-of-the-art for object recognition and detection. Our particular choice of network is the Inception v3 image recognition model pretrained on the ILSVRC-2012-CLS image classification dataset.
The decoder is a long short-term memory (LSTM) network. This type of network is commonly used for sequence modeling tasks such as language modeling and machine translation. In the Show and Tell model, the LSTM network is trained as a language model conditioned on the image encoding.
Words in the captions are represented with an embedding model. Each word in the vocabulary is associated with a fixed-length vector representation that is learned during training.
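As a rough illustration of the lookup described above, the sketch below maps each word of a toy vocabulary to a fixed-length vector. The vocabulary, the vector length of 4, the initialization range and the helper name `build_embedding_table` are all assumptions made for this example; in the actual model the vectors are trainable parameters that are updated during training.

```python
import random


def build_embedding_table(vocab, dim, seed=0):
    """Return {word: vector} with small random initial values (hypothetical helper)."""
    rng = random.Random(seed)
    return {word: [rng.uniform(-0.08, 0.08) for _ in range(dim)] for word in vocab}


# Toy vocabulary, including assumed start and end markers.
vocab = ["<S>", "a", "man", "riding", "</S>"]
embeddings = build_embedding_table(vocab, dim=4)

# Looking up a caption yields one fixed-length vector per word.
caption = ["<S>", "a", "man", "</S>"]
caption_vectors = [embeddings[word] for word in caption]
```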
The following diagram illustrates the model architecture.
![Show and Tell Architecture](g3doc/show_and_tell_architecture.png)

In this diagram, $\{s_0, s_1, \dots, s_{N-1}\}$ are the words of the caption and $\{w_e s_0, w_e s_1, \dots, w_e s_{N-1}\}$ are their corresponding word embedding vectors. The outputs $\{p_1, p_2, \dots, p_N\}$ of the LSTM are probability distributions generated by the model for the next word in the sentence. The terms $\{\log p_1(s_1), \log p_2(s_2), \dots, \log p_N(s_N)\}$ are the log-likelihoods of the correct word at each step; the negated sum of these terms is the minimization objective of the model.
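The objective can be made concrete with a small worked example. The sketch below computes the negated sum of log-likelihoods for a three-step toy caption; the probability values and the function name `caption_loss` are invented for illustration and do not come from the model code.

```python
import math


def caption_loss(step_distributions, target_words):
    """Negated sum of the log-likelihoods of the correct word at each step."""
    return -sum(
        math.log(distribution[word])
        for distribution, word in zip(step_distributions, target_words)
    )


# One (toy) probability distribution over the next word per step.
distributions = [
    {"a": 0.5, "man": 0.25, "</S>": 0.25},
    {"a": 0.25, "man": 0.5, "</S>": 0.25},
    {"a": 0.25, "man": 0.25, "</S>": 0.5},
]
targets = ["a", "man", "</S>"]

loss = caption_loss(distributions, targets)
```

Here every correct word is assigned probability 0.5, so the loss equals 3 · ln 2. Assigning higher probability to the correct words lowers the loss.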
During the first phase of training the parameters of the Inception v3 model are kept fixed: it is simply a static image encoder function. A single trainable layer is added on top of the Inception v3 model to transform the image embedding into the word embedding vector space. The model is trained with respect to the parameters of the word embeddings, the parameters of the layer on top of Inception v3 and the parameters of the LSTM. In the second phase of training, all parameters - including the parameters of Inception v3 - are trained to jointly fine-tune the image encoder and the LSTM.
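The two phases differ only in which parameter groups are updated. A minimal sketch of that rule follows; the group names and the function `trainable_groups` are hypothetical labels chosen for this example, not identifiers from the code base.

```python
# Hypothetical labels for the four parameter groups described above.
PARAMETER_GROUPS = (
    "inception_v3",
    "image_embedding_layer",
    "word_embeddings",
    "lstm",
)


def trainable_groups(train_inception):
    """Return the parameter groups that are updated in the given phase."""
    return [
        group
        for group in PARAMETER_GROUPS
        if train_inception or group != "inception_v3"
    ]


phase_one = trainable_groups(train_inception=False)
phase_two = trainable_groups(train_inception=True)
```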
Given a trained model and an image we use beam search to generate captions for that image. Captions are generated word-by-word, where at each step t we use the set of sentences already generated with length t - 1 to generate a new set of sentences with length t. We keep only the top k candidates at each step, where the hyperparameter k is called the beam size. We have found the best performance with k = 3.
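The procedure above can be sketched in a few lines. This is a generic beam search over a toy next-word table, not the inference code shipped with the model: the function `beam_search`, the `toy_model` table, the `<S>` and `</S>` markers and the `max_len` cutoff are all assumptions made for illustration.

```python
import math


def beam_search(next_word_probs, start, end, beam_size=3, max_len=20):
    """Return up to beam_size (sentence, log_prob) pairs, best first.

    next_word_probs(sentence) must return {word: probability} for the next word.
    """
    beam = [([start], 0.0)]
    complete = []
    for _ in range(max_len):
        # Extend every partial sentence by one word.
        candidates = []
        for sentence, log_prob in beam:
            for word, p in next_word_probs(sentence).items():
                if p > 0.0:
                    candidates.append((sentence + [word], log_prob + math.log(p)))
        # Keep only the top k candidates of this length.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for sentence, log_prob in candidates[:beam_size]:
            if sentence[-1] == end:
                complete.append((sentence, log_prob))
            else:
                beam.append((sentence, log_prob))
        if not beam:
            break
    complete.sort(key=lambda c: c[1], reverse=True)
    return complete[:beam_size]


def toy_model(sentence):
    """Toy next-word distribution that depends only on the previous word."""
    table = {
        "<S>": {"a": 0.6, "the": 0.4},
        "a": {"dog": 0.5, "cat": 0.5},
        "the": {"dog": 0.9, "cat": 0.1},
        "dog": {"</S>": 1.0},
        "cat": {"</S>": 1.0},
    }
    return table[sentence[-1]]


captions = beam_search(toy_model, "<S>", "</S>", beam_size=3)
```

In this toy table a greedy decoder would pick "a" first (probability 0.6) and end with a sentence of probability 0.3, whereas the beam keeps "the" alive and finds "the dog" with probability 0.36.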
The time required to train the Show and Tell model depends on your specific hardware and computational capacity. In this guide we assume you will be running training on a single machine with a GPU. In our experience on an NVIDIA Tesla K20m GPU the initial training phase takes 1-2 weeks. The second training phase may take several additional weeks to achieve peak performance (but you can stop this phase early and still get reasonable results).
It is possible to achieve a speed-up by implementing distributed training across a cluster of machines with GPUs, but that is not covered in this guide.
Although this code can run on a CPU, training may be approximately 10 times slower than on a GPU.
First ensure that you have installed the following required packages:
- Bazel (instructions)
- TensorFlow 1.0 or greater (instructions)
- NumPy (instructions)
- Natural Language Toolkit (NLTK):
    - First install NLTK (instructions)
    - Then install the NLTK data (instructions)
To train the model you will need to provide training data in native TFRecord format. The TFRecord format consists of a set of sharded files containing serialized tf.SequenceExample protocol buffers. Each tf.SequenceExample proto contains an image (JPEG format), a caption and metadata such as the image id.

Each caption is a list of words. During preprocessing, a dictionary is created that assigns each word in the vocabulary to an integer-valued id. Each caption is encoded as a list of integer word ids in the tf.SequenceExample protos.
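The word-to-id encoding can be illustrated with a short sketch. It is not the preprocessing script itself: the frequency-based id ordering, the `min_count` threshold, the use of an extra id for unknown words and the helper names `build_vocab` and `encode` are assumptions made for this example.

```python
from collections import Counter


def build_vocab(captions, min_count=1):
    """Assign an integer id to every word seen at least min_count times."""
    counts = Counter(word for caption in captions for word in caption)
    # Most frequent words get the smallest ids; ties are broken alphabetically.
    words = sorted(
        (word for word, count in counts.items() if count >= min_count),
        key=lambda word: (-counts[word], word),
    )
    return {word: index for index, word in enumerate(words)}


def encode(caption, vocab, unk_id):
    """Encode a caption as a list of integer word ids."""
    return [vocab.get(word, unk_id) for word in caption]


captions = [
    ["a", "man", "riding", "a", "wave"],
    ["a", "person", "riding", "a", "surfboard"],
]
vocab = build_vocab(captions)
unk_id = len(vocab)  # Assumed id for words outside the vocabulary.

ids = encode(["a", "man", "riding", "a", "bicycle"], vocab, unk_id)
```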
We have provided a script to download and preprocess the [MSCOCO](http://mscoco.org/) image captioning data set into this format. Downloading and preprocessing the data may take several hours depending on your network and computer speed. Please be patient.
Before running the script, ensure that your hard disk has at least 150GB of available space for storing the downloaded and processed data.
# Location to save the MSCOCO data.
MSCOCO_DIR="${HOME}/im2txt/data/mscoco"
# Build the preprocessing script.
bazel build im2txt/download_and_preprocess_mscoco
# Run the preprocessing script.
bazel-bin/im2txt/download_and_preprocess_mscoco "${MSCOCO_DIR}"
The final line of the output should read:
2016-09-01 16:47:47.296630: Finished processing all 20267 image-caption pairs in data set 'test'.
When the script finishes you will find 256 training, 4 validation and 8 testing files in `MSCOCO_DIR`. The files will match the patterns `train-?????-of-00256`, `val-?????-of-00004` and `test-?????-of-00008`, respectively.
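To check what these patterns match, the sketch below generates shard names and tests them against the training pattern with `fnmatch`. The zero-based, five-digit shard index is an assumption inferred from the patterns above, and `shard_names` is a hypothetical helper.

```python
from fnmatch import fnmatch


def shard_names(split, num_shards):
    """Generate file names of the form <split>-<index>-of-<num_shards>."""
    return ["%s-%05d-of-%05d" % (split, i, num_shards) for i in range(num_shards)]


train_files = shard_names("train", 256)

# Each "?" in the pattern matches exactly one character of the shard index.
all_match = all(fnmatch(name, "train-?????-of-00256") for name in train_files)
```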
The Show and Tell model requires a pretrained Inception v3 checkpoint file to initialize the parameters of its image encoder submodel.
This checkpoint file is provided by the TensorFlow-Slim image classification library, which offers a suite of pre-trained image classification models. The library's documentation describes the available models in more detail.
Run the following commands to download the Inception v3 checkpoint.
# Location to save the Inception v3 checkpoint.
INCEPTION_DIR="${HOME}/im2txt/data"
mkdir -p "${INCEPTION_DIR}"
wget "http://download.tensorflow.org/models/inception_v3_2016_08_28.tar.gz"
tar -xvf "inception_v3_2016_08_28.tar.gz" -C "${INCEPTION_DIR}"
rm "inception_v3_2016_08_28.tar.gz"
Note that the Inception v3 checkpoint will only be used for initializing the parameters of the Show and Tell model. Once the Show and Tell model starts training it will save its own checkpoint files containing the values of all its parameters (including copies of the Inception v3 parameters). If training is stopped and restarted, the parameter values will be restored from the latest Show and Tell checkpoint and the Inception v3 checkpoint will be ignored. In other words, the Inception v3 checkpoint is only used in the 0-th global step (initialization) of training the Show and Tell model.
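The restore rule described above amounts to a simple decision at startup. The sketch below expresses it over a plain list of file names; it is not the training script's logic verbatim. The `model.ckpt-<step>.index` naming, the function `choose_restore_source` and the returned labels are assumptions made for illustration.

```python
import re


def choose_restore_source(train_dir_files, inception_checkpoint):
    """Return (source, checkpoint) to restore parameters from at startup."""
    steps = []
    for name in train_dir_files:
        # Assumed naming scheme for saved Show and Tell checkpoints.
        match = re.fullmatch(r"model\.ckpt-(\d+)\.index", name)
        if match:
            steps.append(int(match.group(1)))
    if steps:
        # Training was restarted: the Inception v3 checkpoint is ignored.
        return ("show_and_tell", "model.ckpt-%d" % max(steps))
    # Global step 0: initialize the image encoder from Inception v3.
    return ("inception_v3", inception_checkpoint)


fresh_start = choose_restore_source([], "inception_v3.ckpt")
restart = choose_restore_source(
    ["checkpoint", "model.ckpt-1000.index", "model.ckpt-25000.index", "graph.pbtxt"],
    "inception_v3.ckpt",
)
```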
Run the training script.
# Directory containing preprocessed MSCOCO data.
MSCOCO_DIR="${HOME}/im2txt/data/mscoco"
# Inception v3 checkpoint file.
INCEPTION_CHECKPOINT="${HOME}/im2txt/data/inception_v3.ckpt"
# Directory to save the model.
MODEL_DIR="${HOME}/im2txt/model"
# Build the model.
bazel build -c opt im2txt/...
# Run the training script.
bazel-bin/im2txt/train \
--input_file_pattern="${MSCOCO_DIR}/train-?????-of-00256" \
--inception_checkpoint_file="${INCEPTION_CHECKPOINT}" \
--train_dir="${MODEL_DIR}/train" \
--train_inception=false \
--number_of_steps=1000000
Run the evaluation script in a separate process. This will log evaluation metrics to TensorBoard, which allows training progress to be monitored in real time.
Note that you may run out of memory if you run the evaluation script on the same GPU as the training script. You can run the command `export CUDA_VISIBLE_DEVICES=""` to force the evaluation script to run on CPU. If evaluation runs too slowly on CPU, you can decrease the value of `--num_eval_examples`.
MSCOCO_DIR="${HOME}/im2txt/data/mscoco"
MODEL_DIR="${HOME}/im2txt/model"
# Ignore GPU devices (only necessary if your GPU is currently memory
# constrained, for example, by running the training script).
export CUDA_VISIBLE_DEVICES=""
# Run the evaluation script. This will run in a loop, periodically loading the
# latest model checkpoint file and computing evaluation metrics.
bazel-bin/im2txt/evaluate \
--input_file_pattern="${MSCOCO_DIR}/val-?????-of-00004" \
--checkpoint_dir="${MODEL_DIR}/train" \
--eval_dir="${MODEL_DIR}/eval"
Run a TensorBoard server in a separate process for real-time monitoring of training progress and evaluation metrics.
MODEL_DIR="${HOME}/im2txt/model"
# Run a TensorBoard server.
tensorboard --logdir="${MODEL_DIR}"
Your model will already be able to generate reasonable captions after the first phase of training. Try it out! (See [Generating Captions](#generating-captions)).
You can further improve the performance of the model by running a second training phase to jointly fine-tune the parameters of the Inception v3 image submodel and the LSTM.
# Restart the training script with --train_inception=true.
bazel-bin/im2txt/train \
--input_file_pattern="${MSCOCO_DIR}/train-?????-of-00256" \
--train_dir="${MODEL_DIR}/train" \
--train_inception=true \
--number_of_steps=3000000 # Additional 2M steps (assuming 1M in initial training).
Note that training will proceed much more slowly now, and the model will continue to improve by a small amount for a long time. We have found that it will improve slowly for an additional 2-2.5 million steps before it begins to overfit. This may take several weeks on a single GPU. If you don't care about absolutely optimal performance then feel free to halt training sooner by stopping the training script or passing a smaller value to the flag `--number_of_steps`. Your model will still work reasonably well.
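The comment on the fine-tuning command implies that `--number_of_steps` is a target for the total global step rather than a count of additional steps. Under that reading, the remaining work can be computed as below; `additional_steps` is a hypothetical helper written for this example.

```python
def additional_steps(number_of_steps, completed_steps):
    """Steps still to run when number_of_steps is a total global-step target."""
    return max(0, number_of_steps - completed_steps)


# 1M steps were completed in the first phase; the second phase targets 3M.
remaining = additional_steps(number_of_steps=3000000, completed_steps=1000000)
```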
Your trained Show and Tell model can generate captions for any JPEG image! The following command line will generate captions for an image from the test set.
# Directory containing model checkpoints.
CHECKPOINT_DIR="${HOME}/im2txt/model/train"
# Vocabulary file generated by the preprocessing script.
VOCAB_FILE="${HOME}/im2txt/data/mscoco/word_counts.txt"
# JPEG image file to caption.
IMAGE_FILE="${HOME}/im2txt/data/mscoco/raw-data/val2014/COCO_val2014_000000224477.jpg"
# Build the inference binary.
bazel build -c opt im2txt/run_inference
# Ignore GPU devices (only necessary if your GPU is currently memory
# constrained, for example, by running the training script).
export CUDA_VISIBLE_DEVICES=""
# Run inference to generate captions.
bazel-bin/im2txt/run_inference \
--checkpoint_path="${CHECKPOINT_DIR}" \
--vocab_file="${VOCAB_FILE}" \
--input_files="${IMAGE_FILE}"
Example output:
Captions for image COCO_val2014_000000224477.jpg:
0) a man riding a wave on top of a surfboard . (p=0.040413)
1) a person riding a surf board on a wave (p=0.017452)
2) a man riding a wave on a surfboard in the ocean . (p=0.005743)
Note: you may get different results. Some variation between different models is expected.
Here is the image:
![Surfer](g3doc/COCO_val2014_000000224477.jpg)
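If you want to post-process the captions, the example output shown earlier can be parsed with a small regular expression. This sketch assumes the output keeps exactly the line format of that example; `parse_captions` and `CAPTION_LINE` are hypothetical names, not part of the inference binary.

```python
import re

# Matches lines such as "0) a man riding a wave (p=0.040413)".
CAPTION_LINE = re.compile(r"^\s*(\d+)\) (.*) \(p=([0-9.]+)\)$")


def parse_captions(output):
    """Return a list of (rank, caption, probability) tuples."""
    results = []
    for line in output.splitlines():
        match = CAPTION_LINE.match(line)
        if match:
            rank, caption, probability = match.groups()
            results.append((int(rank), caption, float(probability)))
    return results


sample = """Captions for image COCO_val2014_000000224477.jpg:
  0) a man riding a wave on top of a surfboard . (p=0.040413)
  1) a person riding a surf board on a wave (p=0.017452)
  2) a man riding a wave on a surfboard in the ocean . (p=0.005743)"""

parsed = parse_captions(sample)
```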