Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio (2015): "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention"
- Attention mechanism in image caption generation, allowing the decoder RNN to focus on specific parts of the image
- Attention is especially helpful when the image contains a lot of clutter
- Finds correspondences between words and image patches by feeding the feature maps of a lower convolutional layer (before pooling) into the RNN, so spatial information is preserved (see the extraction sketch below)
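A minimal sketch of this feature-extraction step in PyTorch, assuming torchvision's pretrained VGG19 as the encoder (the paper used a 14x14x512 feature map from Oxford VGGnet); the exact layer cut and variable names here are illustrative:

```python
import torch
import torchvision

# Truncate the CNN before its final max-pool so the output keeps a 14x14
# spatial grid instead of being pooled away.
vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1").features[:-1].eval()

image = torch.randn(1, 3, 224, 224)            # stand-in for a preprocessed image
with torch.no_grad():
    fmap = vgg(image)                          # (1, 512, 14, 14)

# Flatten the grid into L = 196 annotation vectors of dimension D = 512,
# one per image region; these are what the decoder RNN attends over.
annotations = fmap.flatten(2).transpose(1, 2)  # (1, 196, 512)
```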
- This low-level representation needs a powerful steering mechanism; two variants are proposed:
- Deterministic "soft" attention: a weighted average over all image regions, trained end-to-end with standard backprop (see the sketch after this sub-list)
  - Tends to attend to smooth, diffuse areas of the image
  - Doubly stochastic regularization: encourages the attention weights at each location to sum to roughly 1 over all decoding steps, so the model cannot fixate on only a few regions
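A sketch of the soft-attention step and the doubly stochastic penalty, assuming a PyTorch decoder; the module structure, the dimensions, and the penalty weight `lam` are illustrative assumptions rather than the paper's exact settings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Deterministic soft attention: a weighted average over all L annotation
    vectors a_i (one per image region), steered by the decoder state h."""

    def __init__(self, annot_dim=512, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.proj_a = nn.Linear(annot_dim, attn_dim)   # score term from each region
        self.proj_h = nn.Linear(hidden_dim, attn_dim)  # score term from decoder state
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, a, h):
        # a: (batch, L, annot_dim) annotation vectors; h: (batch, hidden_dim)
        e = self.score(torch.tanh(self.proj_a(a) + self.proj_h(h).unsqueeze(1)))
        alpha = F.softmax(e.squeeze(-1), dim=1)     # (batch, L) weights over regions
        z = (alpha.unsqueeze(-1) * a).sum(dim=1)    # expected context vector, fully differentiable
        return z, alpha

def doubly_stochastic_penalty(alphas, lam=1.0):
    """alphas: (batch, T, L) weights collected over all T decode steps.
    Pushes each region's total attention across time toward 1."""
    return lam * ((1.0 - alphas.sum(dim=1)) ** 2).sum(dim=1).mean()
```

At each decode step the context vector `z` plays the role of the single image feature a vanilla captioner would use, and the penalty is simply added to the caption loss.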
- Stochastic "hard" attention: selects exactly one region per step, sampled from a multinoulli (categorical) distribution (see the sketch after this sub-list)
  - Produces multiple "sharp" attention areas in the image
  - Analogous to reinforcement learning: learn the sequence of attention actions that maximizes the reward
  - The high variance of the gradient estimator is reduced with a moving-average baseline and an entropy term
  - Trained with sampling methods, maximizing an approximate variational lower bound on the log-likelihood
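A sketch of one hard-attention update, reduced to a plain REINFORCE (score-function) step with a moving-average baseline and an entropy bonus; it omits the rest of the paper's variational objective, and names such as `hard_attention_step` and the 0.9 baseline decay are assumptions:

```python
import torch

def hard_attention_step(a, alpha, baseline, reward, entropy_weight=0.01):
    """One REINFORCE-style update for stochastic hard attention.
    a: (batch, L, D) annotation vectors; alpha: (batch, L) attention
    probabilities; reward: (batch,), e.g. the caption log-likelihood;
    baseline: scalar moving average of past rewards."""
    dist = torch.distributions.Categorical(probs=alpha)
    s = dist.sample()                                 # pick ONE region per example
    z = a[torch.arange(a.size(0)), s]                 # hard (non-differentiable) context

    # Score-function estimator: advantage-weighted log-prob, with an entropy
    # bonus so the attention distribution does not collapse too early.
    advantage = (reward - baseline).detach()
    loss = -(dist.log_prob(s) * advantage).mean() - entropy_weight * dist.entropy().mean()

    # Moving-average baseline: reduces the variance of the gradient estimate.
    baseline = 0.9 * baseline + 0.1 * reward.mean().item()
    return z, loss, baseline
```

Subtracting the baseline leaves the gradient estimate unbiased while cutting its variance, which is what makes this sampling-based training workable in practice.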
- Qualitatively, the decoder attends to sensible image regions while generating each word, much like human attention