Given a set $\mathcal{X}$, let $S(\mathcal{X})$ denote the set of sequences of elements from $\mathcal{X}$,
$$S(\mathcal{X}) = \bigcup_{t=1}^\infty \mathcal{X}^t.$$
When the input is a sequence $\mathbf{x} \in S(\mathbb{R}^p)$ of variable length $T(\mathbf{x})$, the historical approach is to use a recurrent model which maintains a recurrent state $\mathbf{h}_t \in \mathbb{R}^q$ updated at each time step $t$.
Formally, for $t=1, ..., T(\mathbf{x})$,
$$
\mathbf{h}_t = \phi(\mathbf{x}_t, \mathbf{h}_{t-1};\theta),
$$
where $\phi : \mathbb{R}^p \times \mathbb{R}^q \to \mathbb{R}^q$ and $\mathbf{h}_0 \in \mathbb{R}^q$.
Predictions can be computed at any time step $t$ from the recurrent state,
$$y_t = \psi(\mathbf{h}_t;\theta),$$
with $\psi : \mathbb{R}^q \to \mathbb{R}^C$.
Even though the number of steps $T$ depends on $\mathbf{x}$, this is a standard computational graph, and automatic differentiation can deal with it as usual.
In the case of recurrent neural networks, this is referred to as backpropagation through time.
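As an illustration (not part of the original slides), here is a minimal PyTorch sketch of such a variable-length recurrence: `phi` and `psi` are placeholder parameterizations of the update and readout, and calling `.backward()` on the prediction differentiates through the unrolled loop, i.e., backpropagation through time.

```python
import torch

p, q, C = 3, 5, 2                      # input, state and output dimensions (arbitrary)
theta_phi = torch.nn.Linear(p + q, q)  # parameters of phi: (x_t, h_{t-1}) -> h_t
theta_psi = torch.nn.Linear(q, C)      # parameters of psi: h_t -> y_t

def phi(x_t, h_prev):
    # One recurrent update h_t = phi(x_t, h_{t-1}; theta).
    return torch.tanh(theta_phi(torch.cat([x_t, h_prev])))

def psi(h_t):
    # Readout y_t = psi(h_t; theta).
    return theta_psi(h_t)

x = torch.randn(7, p)   # a sequence of length T(x) = 7
h = torch.zeros(q)      # h_0
for x_t in x:           # the unrolled computational graph has T(x) steps
    h = phi(x_t, h)
y = psi(h)

# Backpropagation through time: gradients flow back through all steps.
y.sum().backward()
```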
class: middle
.width-100[]
Elman networks
Elman networks consist of $\phi$ and $\psi$ defined as primitive neuron units, such as logistic regression units
$$
\begin{aligned}
\mathbf{h}_t &= \sigma_h\left( \mathbf{W}^T_{xh} \mathbf{x}_t + \mathbf{W}^T_{hh} \mathbf{h}_{t-1} + \mathbf{b}_h \right) \\
y_t &= \sigma_y\left( \mathbf{W}_y^T \mathbf{h}_t + b_y \right)
\end{aligned}
$$
where $\mathbf{W}_{xh} \in \mathbb{R}^{p\times q}$, $\mathbf{W}_{hh} \in \mathbb{R}^{q\times q}$, $\mathbf{W}_{y} \in \mathbb{R}^{q}$, $\mathbf{b}_{h} \in \mathbb{R}^{q}$, $b_{y} \in \mathbb{R}$, $\mathbf{h}_0=\mathbf{0}$, and where $\sigma_h$ and $\sigma_y$ are non-linear activation functions, such as the sigmoid function, $\text{tanh}$ or $\text{ReLU}$.
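As a sketch (my own illustration, with arbitrary dimensions), these equations translate directly into code; `torch.nn.RNN` implements essentially the same recurrence.

```python
import torch

p, q = 4, 8
W_xh = torch.nn.Parameter(0.1 * torch.randn(p, q))
W_hh = torch.nn.Parameter(0.1 * torch.randn(q, q))
b_h = torch.nn.Parameter(torch.zeros(q))
W_y = torch.nn.Parameter(0.1 * torch.randn(q))
b_y = torch.nn.Parameter(torch.zeros(1))

def elman_step(x_t, h_prev):
    # h_t = sigma_h(W_xh^T x_t + W_hh^T h_{t-1} + b_h), with sigma_h = tanh.
    h_t = torch.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)
    # y_t = sigma_y(W_y^T h_t + b_y), with sigma_y = sigmoid.
    y_t = torch.sigmoid(h_t @ W_y + b_y)
    return h_t, y_t

x = torch.randn(10, p)  # a sequence of length T = 10
h = torch.zeros(q)      # h_0 = 0
for x_t in x:
    h, y = elman_step(x_t, h)
```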
class: middle
Benchmark example
Learn to recognize variable-length sequences that are palindromes.
For training, we use sequences of random sizes, from $1$ to $10$.
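One possible way to generate such training data, as a sketch only (the exact benchmark setup is not specified here; sequences of digit tokens are assumed):

```python
import random

def make_example(max_len=10, vocab=10):
    # Draw a sequence of random length between 1 and max_len.
    T = random.randint(1, max_len)
    seq = [random.randrange(vocab) for _ in range(T)]
    if random.random() < 0.5:
        # Turn it into a palindrome by mirroring its first half.
        for i in range(T // 2):
            seq[T - 1 - i] = seq[i]
    label = int(seq == seq[::-1])  # 1 if palindrome, 0 otherwise
    return seq, label
```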
Computing the recurrent states forward in time does not make use of future input values $\mathbf{x}_{t+1:T}$, even though they are known.
RNNs can be made bidirectional by consuming the sequence in both directions.
Effectively, this amounts to running the same (single-direction) RNN twice:
once over the original sequence $\mathbf{x}_{1:T}$,
once over the reversed sequence $\mathbf{x}_{T:1}$.
The recurrent states of the bidirectional RNN are then the concatenation of the two resulting sequences of recurrent states.
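A sketch of this two-pass construction in PyTorch (illustrative only; note that `torch.nn.RNN(bidirectional=True)` uses separate forward and backward weights, whereas the same cell is reused below, following the description above):

```python
import torch

p, q, T = 4, 8, 10
rnn = torch.nn.RNN(input_size=p, hidden_size=q, batch_first=True)

x = torch.randn(1, T, p)                 # one sequence x_{1:T}
h_fwd, _ = rnn(x)                        # recurrent states over x_{1:T}
h_bwd, _ = rnn(torch.flip(x, dims=[1]))  # recurrent states over x_{T:1}
h_bwd = torch.flip(h_bwd, dims=[1])      # re-align the backward states in time

# Bidirectional states: concatenation along the feature dimension, shape (1, T, 2q).
h_bi = torch.cat([h_fwd, h_bwd], dim=-1)
```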
class: middle
.center.width-80[]
Stacked RNNs
Recurrent networks can be viewed as layers producing sequences $\mathbf{h}_{1:T}^\ell$ of activations.
As for dense layers, recurrent layers can be composed in series to form a .bold[stack] of recurrent networks.
.center.width-100[]
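As a sketch (illustrative), stacking amounts to feeding the state sequence of one recurrent layer as the input sequence of the next; PyTorch's built-in `num_layers` argument packages the same construction.

```python
import torch

p, q, T = 4, 8, 10
x = torch.randn(1, T, p)

# Manual stack: the state sequence of layer 1 is the input sequence of layer 2.
layer1 = torch.nn.RNN(input_size=p, hidden_size=q, batch_first=True)
layer2 = torch.nn.RNN(input_size=q, hidden_size=q, batch_first=True)
h1, _ = layer1(x)   # h^1_{1:T}
h2, _ = layer2(h1)  # h^2_{1:T}

# Equivalent built-in form.
stacked = torch.nn.RNN(input_size=p, hidden_size=q, num_layers=2, batch_first=True)
```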
class: middle
.center.width-80[]
Gating
When unfolded through time, the graph of computation of a recurrent network can grow very deep, and training involves dealing with vanishing gradients.
RNN cells should include a pass-through, or additive paths, so that the recurrent state does not go repeatedly through a squashing non-linearity.
For instance, the recurrent state update can be a per-component weighted average of its previous value $\mathbf{h}_{t-1}$ and a full update $\bar{\mathbf{h}}_t$, with the weighting $\mathbf{z}_t$ depending on the input and the recurrent state, hence acting as a forget gate.
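In code, such a gated update could look as follows (a sketch; this gate parameterization is one common choice, not necessarily the exact one intended above):

```python
import torch

p, q = 4, 8
W_z = torch.nn.Linear(p + q, q)  # gate parameters
W_h = torch.nn.Linear(p + q, q)  # full-update parameters

def gated_step(x_t, h_prev):
    xh = torch.cat([x_t, h_prev])
    z_t = torch.sigmoid(W_z(xh))  # per-component weights in (0, 1)
    h_bar = torch.tanh(W_h(xh))   # full update
    # Weighted average: the pass-through term z_t * h_prev is never squashed.
    return z_t * h_prev + (1 - z_t) * h_bar
```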
The long short-term memory model (LSTM; Hochreiter and Schmidhuber, 1997) is an instance of the previous gated recurrent cell, with the following changes:
The recurrent state is split into two parts $\mathbf{c}_t$ and $\mathbf{h}_t$, where
$\mathbf{c}_t$ is the cell state and
$\mathbf{h}_t$ is the output state.
A forget gate $\mathbf{f}_t$ selects the cell state information to erase.
An input gate $\mathbf{i}_t$ selects the cell state information to update.
An output gate $\mathbf{o}_t$ selects the cell state information to output.
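A sketch of the resulting LSTM update, written out explicitly (illustrative; `torch.nn.LSTMCell` provides the same cell):

```python
import torch

p, q = 4, 8
W_f = torch.nn.Linear(p + q, q)  # forget gate
W_i = torch.nn.Linear(p + q, q)  # input gate
W_o = torch.nn.Linear(p + q, q)  # output gate
W_c = torch.nn.Linear(p + q, q)  # candidate cell update

def lstm_step(x_t, h_prev, c_prev):
    xh = torch.cat([x_t, h_prev])
    f_t = torch.sigmoid(W_f(xh))      # what to erase from the cell state
    i_t = torch.sigmoid(W_i(xh))      # what to write to the cell state
    o_t = torch.sigmoid(W_o(xh))      # what to expose as the output state
    c_bar = torch.tanh(W_c(xh))       # candidate update
    c_t = f_t * c_prev + i_t * c_bar  # additive, gated cell state update
    h_t = o_t * torch.tanh(c_t)       # output state
    return h_t, c_t
```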
The gated recurrent unit (GRU; Cho et al., 2014) is another gated recurrent cell.
It uses two gates instead of three: an update gate $\mathbf{z}_t$ and a reset gate $\mathbf{r}_t$.
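A sketch of the GRU update (illustrative; the roles of $\mathbf{z}_t$ and $1-\mathbf{z}_t$ are swapped in some references, and `torch.nn.GRUCell` provides the built-in cell):

```python
import torch

p, q = 4, 8
W_z = torch.nn.Linear(p + q, q)  # update gate
W_r = torch.nn.Linear(p + q, q)  # reset gate
W_h = torch.nn.Linear(p + q, q)  # candidate state

def gru_step(x_t, h_prev):
    xh = torch.cat([x_t, h_prev])
    z_t = torch.sigmoid(W_z(xh))  # update gate
    r_t = torch.sigmoid(W_r(xh))  # reset gate
    # Candidate state, computed from the input and the reset recurrent state.
    h_bar = torch.tanh(W_h(torch.cat([x_t, r_t * h_prev])))
    # Gated interpolation between the previous and the candidate state.
    return z_t * h_prev + (1 - z_t) * h_bar
```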
???
GRUs perform similarly to LSTMs for language or speech modeling, but with fewer parameters.
However, LSTMs remain strictly stronger than GRUs.
The standard strategy to address exploding gradients is gradient norm clipping, which rescales the gradient norm to a fixed threshold $\delta$ whenever it exceeds it:
$$\tilde{\nabla} f = \frac{\nabla f}{||\nabla f||} \min(||\nabla f||, \delta).$$
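In PyTorch this is typically done after the backward pass with `torch.nn.utils.clip_grad_norm_` (sketch, with an arbitrary model and threshold):

```python
import torch

delta = 1.0
model = torch.nn.Linear(10, 1)  # any model
loss = model(torch.randn(8, 10)).pow(2).mean()
loss.backward()

# Rescale the total gradient norm to delta whenever it exceeds delta.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=delta)
```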
Let us consider a simplified RNN, with no inputs, no bias, an identity activation function $\sigma$ (as in the positive part of a ReLU) and the initial recurrent state $\mathbf{h}_0$ set to the identity matrix.
For a sequence of size $n$, we get
$$\mathbf{h}_n = \mathbf{W}(\mathbf{W}(\mathbf{W}(...(\mathbf{W}\mathbf{h}_0)...))) = \mathbf{W}^n\mathbf{h}_0 = \mathbf{W}^n I = \mathbf{W}^n.$$
Ideally, we would like $\mathbf{W}^n$ to neither vanish nor explode as $n$ increases.
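A quick numerical illustration (not from the original slides): rescaling $\mathbf{W}$ so that its spectral radius is below, at, or above $1$ drives $||\mathbf{W}^n||$ to zero, keeps it roughly stable, or makes it blow up as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 10))

for scale in [0.8, 1.0, 1.2]:
    # Rescale W so that its spectral radius (largest |eigenvalue|) equals `scale`.
    A = scale * W / np.max(np.abs(np.linalg.eigvals(W)))
    print(scale, np.linalg.norm(np.linalg.matrix_power(A, 50)))
```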
???
Do on the iPad
class: middle
Fibonacci digression
The Fibonacci sequence is
$$0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, ...$$
It grows fast! But how fast?
class: middle
In matrix form, the Fibonacci sequence is equivalently expressed as