WaveNet: A Generative Model for Raw Audio. Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu (2016)
- Deep net (CNN) for generating raw audio
- Fully probabilistic and autoregressive
- Each sample is conditioned on the samples at all previous timesteps
- Functionality:
  - Generate speech of many different speakers; switch easily by conditioning on speaker identity
  - Generate musical fragments
  - Speech recognition (the architecture looks promising for this)
- Components:
  - Joint probability of the waveform factorised into a product of per-sample conditionals (equation below); stacks of convolutional layers with no pooling
  - Dilated causal convolutions: very large receptive fields, computed efficiently (sketch after this list)
    - Causal: each prediction depends only on past samples, so the model cannot violate the ordering of the data
    - Dilated: the filter skips input values, so the receptive field grows greatly at little cost; similar to pooling/strided convolution, but output size = input size
  - Softmax output over µ-law-quantised samples (256 values): more flexible than continuous mixture models and easy to train (sketch below)
  - Residual and skip connections: speed up convergence and make it possible to train much deeper models (block sketched below)
  - Conditional WaveNets: global conditioning (e.g. speaker identity) and local conditioning (e.g. linguistic features for TTS)
  - Context stacks: a separate, smaller network that conditions on a longer part of the audio signal, giving longer timescales with less capacity and modest computational cost
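
The "joint probability" above is the paper's autoregressive factorisation of a waveform $\mathbf{x} = \{x_1, \ldots, x_T\}$ into per-sample conditionals, each modelled by the same stack of convolutional layers:

$$
p(\mathbf{x}) = \prod_{t=1}^{T} p\left(x_t \mid x_1, \ldots, x_{t-1}\right)
$$

Generation proceeds one sample at a time, feeding each new sample back into the network.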
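
A minimal sketch of a dilated causal convolution, assuming PyTorch and kernel size 2 as in the paper; the class name, channel count, and stack size are illustrative, not taken from the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv1d(nn.Module):
    """1-D convolution that is causal (no access to future samples) and dilated."""

    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        # Left-padding makes the convolution causal: each output sees only current and past inputs.
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):
        # x: (batch, channels, time). Pad on the left only, then convolve.
        x = F.pad(x, (self.left_pad, 0))
        return self.conv(x)

# Stacking dilations 1, 2, 4, ..., 512 (as in the paper) gives a receptive field of
# 1 + sum((kernel_size - 1) * d) = 1 + (1 + 2 + ... + 512) = 1024 samples.
stack = nn.Sequential(*[CausalDilatedConv1d(32, dilation=2 ** i) for i in range(10)])
x = torch.randn(1, 32, 16000)   # one second of dummy features at 16 kHz
print(stack(x).shape)           # output length equals input length: (1, 32, 16000)
```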
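
The softmax in the paper is taken over µ-law companded and quantised sample values (256 bins) rather than raw 16-bit amplitudes. A small NumPy sketch of that preprocessing; the function names are my own.

```python
import numpy as np

def mu_law_encode(audio, mu=255):
    """Map audio in [-1, 1] to 256 discrete classes via mu-law companding."""
    compressed = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    # Shift from [-1, 1] to integer bins 0..255, the targets of the softmax output layer.
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int64)

def mu_law_decode(bins, mu=255):
    """Invert the companding to recover an approximate waveform in [-1, 1]."""
    compressed = 2 * (bins.astype(np.float64) / mu) - 1
    return np.sign(compressed) * ((1 + mu) ** np.abs(compressed) - 1) / mu

samples = np.array([-1.0, -0.1, 0.0, 0.3, 1.0])
print(mu_law_encode(samples))   # integer class indices in [0, 255]
```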
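
A sketch of one residual block combining the gated activation unit, the residual and skip connections, and optional global conditioning (e.g. on a speaker embedding), again assuming PyTorch. The gating z = tanh(filter) * sigmoid(gate) with the condition added inside both gates follows the paper; the layer names and how the condition is projected are my choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """One WaveNet-style block: gated dilated conv plus residual and skip 1x1 convolutions."""

    def __init__(self, channels, skip_channels, dilation, cond_channels=None):
        super().__init__()
        self.left_pad = dilation                        # (kernel_size - 1) * dilation for kernel size 2
        self.filter_conv = nn.Conv1d(channels, channels, 2, dilation=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, 2, dilation=dilation)
        self.residual_1x1 = nn.Conv1d(channels, channels, 1)
        self.skip_1x1 = nn.Conv1d(channels, skip_channels, 1)
        if cond_channels is not None:
            # Global conditioning (e.g. a speaker embedding) is added inside both gates.
            self.cond_filter = nn.Linear(cond_channels, channels)
            self.cond_gate = nn.Linear(cond_channels, channels)

    def forward(self, x, h=None):
        # x: (batch, channels, time); h: optional (batch, cond_channels) global condition.
        padded = F.pad(x, (self.left_pad, 0))           # causal padding, as in the previous sketch
        f = self.filter_conv(padded)
        g = self.gate_conv(padded)
        if h is not None:
            f = f + self.cond_filter(h).unsqueeze(-1)   # broadcast the condition over time
            g = g + self.cond_gate(h).unsqueeze(-1)
        z = torch.tanh(f) * torch.sigmoid(g)            # gated activation unit
        skip = self.skip_1x1(z)                         # collected across blocks for the output head
        residual = self.residual_1x1(z) + x             # residual connection speeds up convergence
        return residual, skip
```

In the full network, many such blocks with increasing dilations are stacked; the skip outputs are summed, passed through ReLU and 1x1 convolutions, and end in the 256-way softmax from the previous sketch.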