Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio (2015): "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention"
- Attention mechanism in image caption generation, allowing the decoder RNN to focus on specific parts of the image
- Attention is especially helpful when the image contains a lot of clutter
- Finds correspondences between words and image patches by feeding the feature maps of a lower convolutional layer (before pooling) into the RNN, so spatial information is preserved (see the extraction sketch below)
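A minimal sketch of this feature-extraction step in PyTorch, assuming torchvision's pretrained VGG19 as the encoder (the paper used a 14x14x512 feature map from Oxford VGGnet); the exact layer cut and variable names here are illustrative:

```python
import torch
import torchvision

# Truncate the CNN before its final max-pool so the output keeps a 14x14
# spatial grid instead of being pooled away.
vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1").features[:-1].eval()

image = torch.randn(1, 3, 224, 224)            # stand-in for a preprocessed image
with torch.no_grad():
    fmap = vgg(image)                          # (1, 512, 14, 14)

# Flatten the grid into L = 196 annotation vectors of dimension D = 512,
# one per image region; these are what the decoder RNN attends over.
annotations = fmap.flatten(2).transpose(1, 2)  # (1, 196, 512)
```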
- This low-level representation needs a powerful steering mechanism; two variants are proposed:
- Deterministic "soft" attention: a weighted average over all image regions, trained end-to-end with standard backprop (see the sketch after this sub-list)
  - Tends to attend to smooth, diffuse areas of the image
  - Doubly stochastic regularization: encourages the attention weights at each location to sum to roughly 1 over all decoding steps, so the model cannot fixate on only a few regions
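A sketch of the soft-attention step and the doubly stochastic penalty, assuming a PyTorch decoder; the module structure, the dimensions, and the penalty weight `lam` are illustrative assumptions rather than the paper's exact settings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Deterministic soft attention: a weighted average over all L annotation
    vectors a_i (one per image region), steered by the decoder state h."""

    def __init__(self, annot_dim=512, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.proj_a = nn.Linear(annot_dim, attn_dim)   # score term from each region
        self.proj_h = nn.Linear(hidden_dim, attn_dim)  # score term from decoder state
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, a, h):
        # a: (batch, L, annot_dim) annotation vectors; h: (batch, hidden_dim)
        e = self.score(torch.tanh(self.proj_a(a) + self.proj_h(h).unsqueeze(1)))
        alpha = F.softmax(e.squeeze(-1), dim=1)     # (batch, L) weights over regions
        z = (alpha.unsqueeze(-1) * a).sum(dim=1)    # expected context vector, fully differentiable
        return z, alpha

def doubly_stochastic_penalty(alphas, lam=1.0):
    """alphas: (batch, T, L) weights collected over all T decode steps.
    Pushes each region's total attention across time toward 1."""
    return lam * ((1.0 - alphas.sum(dim=1)) ** 2).sum(dim=1).mean()
```

At each decode step the context vector `z` plays the role of the single image feature a vanilla captioner would use, and the penalty is simply added to the caption loss.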
- Stochastic "hard" attention: selects exactly one region per step, sampled from a multinoulli (categorical) distribution (see the sketch after this sub-list)
  - Produces multiple "sharp" attention areas in the image
  - Analogous to reinforcement learning: learn the sequence of attention actions that maximizes the reward
  - The high variance of the gradient estimator is reduced with a moving-average baseline and an entropy term
  - Trained with sampling methods, maximizing an approximate variational lower bound on the log-likelihood
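A sketch of one hard-attention update, reduced to a plain REINFORCE (score-function) step with a moving-average baseline and an entropy bonus; it omits the rest of the paper's variational objective, and names such as `hard_attention_step` and the 0.9 baseline decay are assumptions:

```python
import torch

def hard_attention_step(a, alpha, baseline, reward, entropy_weight=0.01):
    """One REINFORCE-style update for stochastic hard attention.
    a: (batch, L, D) annotation vectors; alpha: (batch, L) attention
    probabilities; reward: (batch,), e.g. the caption log-likelihood;
    baseline: scalar moving average of past rewards."""
    dist = torch.distributions.Categorical(probs=alpha)
    s = dist.sample()                                 # pick ONE region per example
    z = a[torch.arange(a.size(0)), s]                 # hard (non-differentiable) context

    # Score-function estimator: advantage-weighted log-prob, with an entropy
    # bonus so the attention distribution does not collapse too early.
    advantage = (reward - baseline).detach()
    loss = -(dist.log_prob(s) * advantage).mean() - entropy_weight * dist.entropy().mean()

    # Moving-average baseline: reduces the variance of the gradient estimate.
    baseline = 0.9 * baseline + 0.1 * reward.mean().item()
    return z, loss, baseline
```

Subtracting the baseline leaves the gradient estimate unbiased while cutting its variance, which is what makes this sampling-based training workable in practice.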
- Qualitatively, the decoder attends to sensible image regions while generating each word, much like human attention