Skip to content

Latest commit

 

History

History
22 lines (20 loc) · 1.2 KB

dl_18.md

File metadata and controls

22 lines (20 loc) · 1.2 KB

WaveNet: a generative model for raw audio

Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu (2016)

Key points

  • Deep net (CNN) for generating raw audio
  • Fully probabilistic and autoregressive
    • Each sample is conditioned on the samples at all previous timesteps
  • Functionality:
    • Generate speech of many different speakers, switch easily by conditioning on speaker identity
    • Generate musical fragments
    • Speech recognition (architecture looks promising for this)
  • Components:
    • Joint probabilities, convolutional layers without pooling
    • Dilated causal convolutions: very large receptive fields, efficient
      • Causal: to ensure model doesn't violate order
      • Similar to pooling/strided convolution, but input size = output size
      • Dilated: greatly increase receptive field size at little cost
    • Softmax: flexible, easy to model
    • Residual and skip connections: speed up convergence, train deeper models
    • Conditional WaveNet: global (speaker ID) and local (linguistic features) modelling
    • Context stacks: computational ease + less capacity and longer timescales