WaveNet: A Generative Model for Raw Audio. Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu (2016)
- Deep net (CNN) for generating raw audio
- Fully probabilistic and autoregressive
- Each sample is conditioned on the samples at all previous timesteps
- Functionality:
  - Generate speech of many different speakers; switch easily by conditioning on speaker identity
  - Generate musical fragments
  - Speech recognition (the architecture looks promising for this)
- Components:
  - Joint probability of the waveform factorised into a product of per-sample conditionals (equation below); stacks of convolutional layers with no pooling
  - Dilated causal convolutions: very large receptive fields, computed efficiently (sketch after this list)
    - Causal: each prediction depends only on past samples, so the model cannot violate the ordering of the data
    - Dilated: the filter skips input values, so the receptive field grows greatly at little cost; similar to pooling/strided convolution, but output size = input size
  - Softmax output over µ-law-quantised samples (256 values): more flexible than continuous mixture models and easy to train (sketch below)
  - Residual and skip connections: speed up convergence and make it possible to train much deeper models (block sketched below)
  - Conditional WaveNets: global conditioning (e.g. speaker identity) and local conditioning (e.g. linguistic features for TTS)
  - Context stacks: a separate, smaller network that conditions on a longer part of the audio signal, giving longer timescales with less capacity and modest computational cost
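
The "joint probability" above is the paper's autoregressive factorisation of a waveform $\mathbf{x} = \{x_1, \ldots, x_T\}$ into per-sample conditionals, each modelled by the same stack of convolutional layers:

$$
p(\mathbf{x}) = \prod_{t=1}^{T} p\left(x_t \mid x_1, \ldots, x_{t-1}\right)
$$

Generation proceeds one sample at a time, feeding each new sample back into the network.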
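
A minimal sketch of a dilated causal convolution, assuming PyTorch and kernel size 2 as in the paper; the class name, channel count, and stack size are illustrative, not taken from the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv1d(nn.Module):
    """1-D convolution that is causal (no access to future samples) and dilated."""

    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        # Left-padding makes the convolution causal: each output sees only current and past inputs.
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):
        # x: (batch, channels, time). Pad on the left only, then convolve.
        x = F.pad(x, (self.left_pad, 0))
        return self.conv(x)

# Stacking dilations 1, 2, 4, ..., 512 (as in the paper) gives a receptive field of
# 1 + sum((kernel_size - 1) * d) = 1 + (1 + 2 + ... + 512) = 1024 samples.
stack = nn.Sequential(*[CausalDilatedConv1d(32, dilation=2 ** i) for i in range(10)])
x = torch.randn(1, 32, 16000)   # one second of dummy features at 16 kHz
print(stack(x).shape)           # output length equals input length: (1, 32, 16000)
```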
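
The softmax in the paper is taken over µ-law companded and quantised sample values (256 bins) rather than raw 16-bit amplitudes. A small NumPy sketch of that preprocessing; the function names are my own.

```python
import numpy as np

def mu_law_encode(audio, mu=255):
    """Map audio in [-1, 1] to 256 discrete classes via mu-law companding."""
    compressed = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    # Shift from [-1, 1] to integer bins 0..255, the targets of the softmax output layer.
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int64)

def mu_law_decode(bins, mu=255):
    """Invert the companding to recover an approximate waveform in [-1, 1]."""
    compressed = 2 * (bins.astype(np.float64) / mu) - 1
    return np.sign(compressed) * ((1 + mu) ** np.abs(compressed) - 1) / mu

samples = np.array([-1.0, -0.1, 0.0, 0.3, 1.0])
print(mu_law_encode(samples))   # integer class indices in [0, 255]
```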
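
A sketch of one residual block combining the gated activation unit, the residual and skip connections, and optional global conditioning (e.g. on a speaker embedding), again assuming PyTorch. The gating z = tanh(filter) * sigmoid(gate) with the condition added inside both gates follows the paper; the layer names and how the condition is projected are my choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """One WaveNet-style block: gated dilated conv plus residual and skip 1x1 convolutions."""

    def __init__(self, channels, skip_channels, dilation, cond_channels=None):
        super().__init__()
        self.left_pad = dilation                        # (kernel_size - 1) * dilation for kernel size 2
        self.filter_conv = nn.Conv1d(channels, channels, 2, dilation=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, 2, dilation=dilation)
        self.residual_1x1 = nn.Conv1d(channels, channels, 1)
        self.skip_1x1 = nn.Conv1d(channels, skip_channels, 1)
        if cond_channels is not None:
            # Global conditioning (e.g. a speaker embedding) is added inside both gates.
            self.cond_filter = nn.Linear(cond_channels, channels)
            self.cond_gate = nn.Linear(cond_channels, channels)

    def forward(self, x, h=None):
        # x: (batch, channels, time); h: optional (batch, cond_channels) global condition.
        padded = F.pad(x, (self.left_pad, 0))           # causal padding, as in the previous sketch
        f = self.filter_conv(padded)
        g = self.gate_conv(padded)
        if h is not None:
            f = f + self.cond_filter(h).unsqueeze(-1)   # broadcast the condition over time
            g = g + self.cond_gate(h).unsqueeze(-1)
        z = torch.tanh(f) * torch.sigmoid(g)            # gated activation unit
        skip = self.skip_1x1(z)                         # collected across blocks for the output head
        residual = self.residual_1x1(z) + x             # residual connection speeds up convergence
        return residual, skip
```

In the full network, many such blocks with increasing dilations are stacked; the skip outputs are summed, passed through ReLU and 1x1 convolutions, and end in the 256-way softmax from the previous sketch.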