contributors: @GitYCC
- In this paper we propose WaveGlow: a flow-based network capable of generating high quality speech from mel-spectrograms.
- Text-to-speech synthesis is typically done in two steps:
  - the first transforms the text into time-aligned features, such as a mel-spectrogram, or F0 frequencies and other linguistic features
  - the second transforms these time-aligned features into audio samples; this second model is called a vocoder (we focus on this!)
- WaveGlow = Glow + WaveNet
- WaveGlow is implemented using only a single network, trained using only a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable.
- problem of auto-regressive models: can’t fully utilize parallel processors like GPUs or TPUs
- 3 neural network based models that can synthesize speech without auto-regression:
  - Parallel WaveNet
  - Clarinet
  - MCNN for spectrogram inversion
- However, these 3 models are more difficult to train and implement than the auto-regressive models. All three require compound loss functions, either to improve audio quality or to avoid problems with mode collapse.
- $X$: audio sample
- $Z$: latent space
- assume a spherical Gaussian: $z\sim \mathcal{N}(z;0,I)$
- $z$ is mapped to $x$ by many invertible layers: $x=f_0\circ f_1\circ \dots f_k(z)$, and therefore $z=f_k^{-1}\circ \dots f_1^{-1}\circ f_0^{-1}(x)$
- we can relate $p_\theta(x)$ and $p_\theta(z)$ by a change of variables:
  $$ \log p_\theta(x)=\log p_\theta(z)+\sum_{i=1}^{k}\log |\det(J(f^{-1}_{i}(x)))| $$
- inference time: once the network is trained, doing inference is simply a matter of randomly sampling $z$ values from a Gaussian and running them through the network.
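
To make the change of variables concrete, here is a minimal sketch with a single invertible layer, the elementwise affine map $x = a \odot z + b$. The function names and the values of `a` and `b` are illustrative assumptions, not part of WaveGlow, and NumPy is used only to keep the example small.

```python
import numpy as np

def log_standard_normal(z):
    # log density of a spherical Gaussian N(0, I), summed over dimensions
    return -0.5 * np.sum(z ** 2 + np.log(2.0 * np.pi), axis=-1)

def inverse_flow(x, a, b):
    # f^{-1}: x -> z for the toy flow x = a * z + b
    z = (x - b) / a
    # log|det J(f^{-1})| of an elementwise affine map is -sum(log|a|)
    log_det = -np.sum(np.log(np.abs(a)))
    return z, log_det

def log_likelihood(x, a, b):
    # log p(x) = log p(z) + log|det J(f^{-1}(x))|
    z, log_det = inverse_flow(x, a, b)
    return log_standard_normal(z) + log_det

def sample(n, a, b, rng):
    # inference: draw z from the Gaussian and run it through f
    z = rng.standard_normal((n, a.shape[0]))
    return a * z + b

a = np.array([2.0, 0.25])   # assumed scale of the toy flow
b = np.array([1.0, -1.0])   # assumed shift of the toy flow
rng = np.random.default_rng(0)
x = sample(4, a, b, rng)
ll = log_likelihood(x, a, b)
```

For this toy flow the result can be checked against the closed-form density of $\mathcal{N}(b, \mathrm{diag}(a^2))$, which is what the change of variables should reproduce.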
squeeze to vectors
we take groups of 8 audio samples as vectors, which we call the "squeeze" operation

```python
# audio: (batch, time)
n_group = 8
audio = (
    audio.unfold(1, n_group, n_group)  # (batch, time // n_group, n_group)
    .permute(0, 2, 1)                  # (batch, n_group = channels, time // n_group)
)
```
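
Since every step of the flow has to be invertible, the squeeze needs an exact inverse as well. A minimal NumPy sketch of the round trip follows; the helper names `squeeze` and `unsqueeze` are hypothetical, and it assumes the number of samples is a multiple of `n_group`.

```python
import numpy as np

def squeeze(audio, n_group=8):
    # (batch, time) -> (batch, n_group, time // n_group)
    batch, time = audio.shape
    assert time % n_group == 0, "time must be a multiple of n_group"
    return audio.reshape(batch, time // n_group, n_group).transpose(0, 2, 1)

def unsqueeze(x):
    # (batch, n_group, frames) -> (batch, n_group * frames)
    batch, n_group, frames = x.shape
    return x.transpose(0, 2, 1).reshape(batch, frames * n_group)

audio = np.arange(32, dtype=np.float32).reshape(2, 16)
squeezed = squeeze(audio)       # each frame holds 8 consecutive samples as channels
restored = unsqueeze(squeezed)  # recovers the original waveform layout
```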
invertible 1x1 convolution
- In order to avoid the same channels staying in the unmodified half after multiple coupling layers, we need a generalized permutation operation => invertible 1x1 convolution
- The $W$ weights of these convolutions are initialized to be orthonormal and hence invertible.
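
A rough sketch of such a layer in NumPy. The helper names are hypothetical, the QR-based initialization is one common way to obtain an orthonormal matrix and is an assumption here, and counting one $\log|\det W|$ per batch element and frame is likewise an assumption about how the term is accumulated.

```python
import numpy as np

def init_orthonormal(channels, rng):
    # QR decomposition of a random square matrix yields an orthonormal Q
    q, _ = np.linalg.qr(rng.standard_normal((channels, channels)))
    return q

def conv1x1(x, w):
    # x: (batch, channels, frames); mixes channels independently at each frame
    return np.einsum("oc,bct->bot", w, x)

def log_det_term(x, w):
    # assumption: each of the batch * frames positions contributes log|det W|
    batch, _, frames = x.shape
    _, logabsdet = np.linalg.slogdet(w)
    return batch * frames * logabsdet

rng = np.random.default_rng(0)
w = init_orthonormal(8, rng)
x = rng.standard_normal((2, 8, 5))
y = conv1x1(x, w)                     # channel mixing
x_rec = conv1x1(y, np.linalg.inv(w))  # undone by the inverse matrix
log_det = log_det_term(x, w)          # close to 0 at initialization, since |det W| = 1
```

Unlike a fixed permutation, `w` is free to move away from orthonormality during training, which is why the log-determinant has to be tracked in the loss.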
affine coupling layer

$$ x_a,x_b=split(x) $$
$$ (\log s,t)=WN(x_a,\text{mel-spectrogram}) $$
$$ x_{b'}=s\odot x_b + t $$
$$ f_{coupling}^{-1}(x)=concat(x_a,x_{b'}) $$
- The $WN$ architecture is similar to WaveNet
  - $WN()$ uses layers of dilated convolutions with gated-tanh nonlinearities, as well as residual connections and skip connections
  - different from WaveNet: our convolutions have 3 taps and are not causal
- The affine coupling layer is also where we include the mel-spectrogram in order to condition the generated result on the input. The upsampled mel-spectrograms are added before the gated-tanh nonlinearities of each layer as in WaveNet.
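
The coupling equations can be sketched as follows. `toy_wn` is a hypothetical stand-in for $WN()$ (a single tanh layer, not the dilated-convolution stack described above), and the mel-spectrogram is assumed to be already upsampled to one frame per squeezed audio vector. The point is only that $x_a$ passes through unchanged, so $s$ and $t$ can be recomputed exactly when inverting.

```python
import numpy as np

def toy_wn(x_a, mel, w_x, w_mel):
    # Stand-in for WN(): any function of (x_a, mel) works; it need not be invertible.
    h = np.tanh(np.einsum("oc,bct->bot", w_x, x_a)
                + np.einsum("om,bmt->bot", w_mel, mel))
    log_s, t = np.split(h, 2, axis=1)
    return log_s, t

def coupling_forward(x, mel, w_x, w_mel):
    # training direction: x_a is copied, x_b is scaled and shifted
    x_a, x_b = np.split(x, 2, axis=1)
    log_s, t = toy_wn(x_a, mel, w_x, w_mel)
    x_b_new = np.exp(log_s) * x_b + t
    log_det = np.sum(log_s)  # log|det J| is the sum of log s
    return np.concatenate([x_a, x_b_new], axis=1), log_det

def coupling_inverse(y, mel, w_x, w_mel):
    # x_a is unchanged, so s and t are recomputed and undone exactly
    x_a, x_b_new = np.split(y, 2, axis=1)
    log_s, t = toy_wn(x_a, mel, w_x, w_mel)
    x_b = (x_b_new - t) * np.exp(-log_s)
    return np.concatenate([x_a, x_b], axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 5))    # (batch, channels, frames)
mel = rng.standard_normal((2, 3, 5))  # toy conditioning with 3 assumed mel channels
w_x = rng.standard_normal((8, 4))     # toy weights acting on x_a
w_mel = rng.standard_normal((8, 3))   # toy weights acting on the mel input
y, log_det = coupling_forward(x, mel, w_x, w_mel)
x_rec = coupling_inverse(y, mel, w_x, w_mel)
```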
early outputs
- Rather than having all channels go through all the layers, we found it useful to output 2 of the channels to the loss function after every 4 coupling layers. After going through all the layers of the network, the final vectors are concatenated with all of the previously output channels to make the final $z$. Outputting some dimensions early makes it easier for the network to add information at multiple time scales, and helps gradients propagate to earlier layers, much like skip connections.
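
The channel bookkeeping can be sketched like this. The sketch assumes 12 layers and 8 squeezed channels, and it assumes the first 2 channels are the ones split off; `flows` is a list of placeholder callables, each standing in for a 1x1 convolution plus coupling layer.

```python
import numpy as np

def forward_with_early_outputs(x, flows, n_early_every=4, n_early_size=2):
    # x: (batch, channels, frames)
    outputs = []
    for k, flow in enumerate(flows):
        if k > 0 and k % n_early_every == 0:
            # send n_early_size channels straight to the loss
            outputs.append(x[:, :n_early_size])
            x = x[:, n_early_size:]
        x = flow(x)
    outputs.append(x)  # channels that went through every layer
    z = np.concatenate(outputs, axis=1)
    return z, [o.shape[1] for o in outputs]

def identity(x):
    # placeholder; a real flow would transform x invertibly
    return x

x = np.arange(2 * 8 * 3, dtype=float).reshape(2, 8, 3)
z, sizes = forward_with_early_outputs(x, [identity] * 12)
```

With these assumed settings, 2 channels leave after layer 4, 2 more after layer 8, and the remaining 4 pass through all 12 layers, so the concatenated $z$ has the same 8 channels as the input.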
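After adding all the terms from the coupling layers, the final likelihood becomes:

$$ \log p_\theta(x)=-\frac{z(x)^T z(x)}{2\sigma^2}+\sum_{j=0}^{\#coupling}\log s_j(x,\text{mel-spectrogram})+\sum_{k=0}^{\#conv}\log \det |W_k| $$

The first term is the log-likelihood of the spherical Gaussian, where $\sigma^2$ is its assumed variance; the remaining terms account for the change of variables.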
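
A hedged sketch of how the three kinds of terms (Gaussian prior, coupling scales, 1x1 convolution determinants) could be summed in code. The function name and argument layout are hypothetical, constants that do not depend on the parameters are dropped, and counting one $\log|\det W|$ per batch element and frame is an assumption.

```python
import numpy as np

def waveglow_log_likelihood(z, log_s_list, w_list, sigma=1.0):
    # z: (batch, channels, frames), the final latent including early outputs
    # Gaussian prior term: -z^T z / (2 sigma^2)
    prior = -np.sum(z ** 2) / (2.0 * sigma ** 2)
    # coupling layers: sum of log s over all elements of every layer
    coupling = sum(np.sum(log_s) for log_s in log_s_list)
    # 1x1 convolutions: log|det W| once per (batch, frame) position (assumption)
    positions = z.shape[0] * z.shape[2]
    conv = sum(positions * np.linalg.slogdet(w)[1] for w in w_list)
    return prior + coupling + conv

# toy inputs, chosen so the three terms are easy to verify by hand
z = np.ones((1, 8, 2))
log_s_list = [np.full((1, 4, 2), 0.1)]
w_list = [2.0 * np.eye(8)]
total = waveglow_log_likelihood(z, log_s_list, w_list)
```

Training would minimize the negative of this value, which is the single cost function mentioned at the top of these notes.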
audio quality comparison