
Show, attend and tell: neural image caption generation with visual attention

Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio (2015)

Key points

  • Attention mechanism in image caption generation, allowing the decoder RNN to focus on specific parts of the image
    • Can be important when there is a lot of clutter
  • Find correspondence between words and image patches by feeding features from a lower convolutional layer (before pooling) into the RNN
    • This low-level representation needs a powerful steering mechanism:
      1. Deterministic "soft" attention: trained end-to-end with backprop; the context is a weighted average of all regions in the image (see the soft-attention sketch after this list)
        • Attention maps tend to be smooth, spread over larger areas of the image
        • Doubly stochastic regularization: encourages each region's attention weights to sum to roughly 1 over all time steps, so soft attention does not fixate on only a few regions
      2. Stochastic "hard" attention: select only 1 of the regions, sampled from a multinoulli distribution (see the hard-attention sketch after this list)
        • Attention maps show multiple "sharp", well-localized areas of the image
        • Analogous to reinforcement learning: learn the sequence of attention "actions" that maximizes the reward (the caption log-likelihood)
        • High gradient variance is reduced with a moving-average baseline and an entropy term
        • Trained using sampling methods (maximizing an approximate variational lower bound)
  • The decoder attends to sensible image regions while generating each word, similar to human attention
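
Below is a minimal NumPy sketch of the soft-attention branch, assuming a simple additive attention MLP; the parameter names (`W_a`, `W_h`, `w`), the sizes, and the regularization weight `lam` are illustrative assumptions rather than the paper's exact formulation. It shows the two pieces summarized above: the context vector as a weighted average of all regions, and the doubly stochastic penalty over time steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: L annotation vectors (e.g. a 14x14 conv feature map
# flattened to 196 regions), each of dimension D, and a decoder state of size H.
L, D, H = 196, 512, 256

a = rng.standard_normal((L, D))           # annotation vectors from a lower conv layer
W_a = rng.standard_normal((D, H)) * 0.01  # illustrative attention-MLP parameters
W_h = rng.standard_normal((H, H)) * 0.01
w = rng.standard_normal(H) * 0.01

def soft_attention(a, h):
    """Attention weights over regions and the expected (soft) context vector."""
    scores = np.tanh(a @ W_a + h @ W_h) @ w   # (L,) unnormalized alignment scores
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                      # softmax over the L regions
    z = alpha @ a                             # weighted average of all regions
    return alpha, z

# One toy decoding run of T steps with random hidden states.
T = 5
alphas = np.stack([soft_attention(a, rng.standard_normal(H))[0] for _ in range(T)])

# Doubly stochastic regularization: push each region's attention, summed over
# all T steps, towards 1 so that no part of the image is ignored.
lam = 1.0  # assumed regularization weight
penalty = lam * ((1.0 - alphas.sum(axis=0)) ** 2).sum()
print(alphas.shape, penalty)
```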
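A corresponding sketch of the hard-attention branch: one region is sampled from a multinoulli distribution and the attention network is trained with a REINFORCE-like signal. The reward value, the baseline smoothing factor `beta`, and the entropy weight below are placeholder assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 196, 512
a = rng.standard_normal((L, D))         # annotation vectors (one per region)

# In the model, alpha would come from the attention network as in the soft case;
# here it is a random distribution purely for illustration.
alpha = rng.dirichlet(np.ones(L))

# Hard attention: sample exactly one region index from the multinoulli
# distribution and use that region's features as the context vector.
s = rng.choice(L, p=alpha)
z = a[s]

# REINFORCE-style learning signal (sketch): the reward is the caption
# log-likelihood; a moving-average baseline and an entropy bonus reduce variance.
baseline = 0.0                          # running average of past rewards
beta = 0.9                              # baseline smoothing factor (assumed value)
entropy_weight = 0.01                   # entropy bonus weight (assumed value)

reward = -2.3                           # placeholder log p(caption | image, s)
baseline = beta * baseline + (1 - beta) * reward
advantage = reward - baseline           # multiplies grad log p(s | alpha) in the update
entropy = -(alpha * np.log(alpha + 1e-12)).sum()
surrogate = advantage * np.log(alpha[s] + 1e-12) + entropy_weight * entropy
print(s, z.shape, surrogate)
```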