Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks (2006), Alex Graves et al.
contributors: @GitYCC
[paper]
- drawbacks of HMMs and CRFs
  - require a significant amount of task-specific knowledge, e.g.
    - to design the state models for HMMs
    - to choose the input features for CRFs
  - require explicit (and often questionable) dependency assumptions to make inference tractable, e.g.
    - the assumption that observations are independent for HMMs
- RNNs require no prior knowledge of the data, beyond the choice of input and output representation. They can be trained discriminatively, and their internal state provides a powerful, general mechanism for modelling time series. In addition, they tend to be robust to temporal and spatial noise.
- However, with standard training objectives RNNs can only be trained to make a series of independent label classifications. This means that the training data must be pre-segmented, and that the network outputs must be post-processed to give the final label sequence.
- For this reason, the applicability of RNNs to general sequence labelling has so far been limited.
- This paper presents a novel method for training RNNs to label unsegmented sequences directly.
- A CTC network has a softmax output layer with one more unit than there are labels in $L$.
- The activation of the extra unit is the probability of observing a ‘blank’, or no label.
- These outputs define the probabilities of all possible ways of aligning all possible label sequences with the input sequence.
- The total probability of any one label sequence can then be found by summing the probabilities of its different alignments.
- $L'=L\cup\{\text{blank}\}$: the output alphabet is the label alphabet $L$ extended with an extra blank symbol
- $\pi$ denotes a path, i.e. a length-$T$ sequence over $L'$, whose probability is the product of the corresponding softmax outputs: $p(\pi|x)=\prod_{t=1}^{T}y_{\pi_t}^{t}$
- mapping $B$: simply removing all blanks and repeated labels from the paths (e.g. $B(a{-}ab{-}) = B({-}aa{-}{-}abb) = aab$)
- We use $B$ to define the conditional probability of a given labelling $l$ as the sum of the probabilities of all the paths corresponding to it (see the sketch below):

$$ p(l|x)=\sum_{\pi\in B^{-1}(l)}p(\pi|x) $$
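To make the mapping concrete, here is a minimal Python sketch (not from the paper) of the collapse function $B$ and a brute-force computation of $p(l|x)$ that simply enumerates every path. `BLANK`, `collapse`, and `brute_force_prob` are illustrative names; labels are assumed to be integer indices into the softmax output, with the blank at index 0.

```python
import itertools
import numpy as np

BLANK = 0  # assumed index of the blank unit in the softmax output

def collapse(path):
    """B: merge repeated labels, then remove blanks, e.g. (a,-,a,b,-) -> (a,a,b)."""
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(int(s))
        prev = s
    return tuple(out)

def brute_force_prob(y, label):
    """p(l|x) = sum over all paths pi with B(pi) = l of prod_t y[t, pi_t].

    y: (T, |L'|) matrix of softmax outputs. Only feasible for tiny T."""
    T, K = y.shape
    total = 0.0
    for path in itertools.product(range(K), repeat=T):
        if collapse(path) == tuple(label):
            total += float(np.prod([y[t, k] for t, k in enumerate(path)]))
    return total
```

Enumerating all $|L'|^T$ paths is exponential in $T$; the forward-backward recursions below compute the same quantity efficiently.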
- Decoding (inference)
  - The first method (best path decoding) is based on the assumption that the most probable path will correspond to the most probable labelling: $h(x)\approx B(\pi^*)$, where $\pi^*=\arg\max_{\pi}p(\pi|x)$ (a greedy sketch follows this list).
  - The second method (prefix search decoding) relies on the fact that, by modifying the forward-backward algorithm, we can efficiently calculate the probabilities of successive extensions of labelling prefixes.
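Best path decoding is then a one-liner on top of the collapse function above; a sketch under the same assumptions (prefix search decoding is more involved and is not sketched here):

```python
def best_path_decode(y):
    """h(x) ~ B(pi*), where pi* takes the most probable output unit at each timestep."""
    path = np.argmax(y, axis=1)  # greedy per-timestep argmax over L'
    return collapse(path)        # remove repeated labels and blanks
```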
To allow for blanks in the output paths, we consider a modified label sequence $l'$, with blanks added to the beginning and the end of $l$ and inserted between every pair of labels, so that $|l'|=2|l|+1$.
Initialization:

$$ \alpha_1(1)=y_b^1,\quad \alpha_1(2)=y_{l_1}^1,\quad \alpha_1(s)=0,\ \forall s>2 $$
Recursion:
$$ \alpha_t(s)= \begin{cases} \left[\alpha_{t-1}(s)+\alpha_{t-1}(s-1)\right]y_{l'_s}^t & \text{if } l'_s=b \text{ or } l'_{s-2}=l'_s \\ \left[\alpha_{t-1}(s)+\alpha_{t-1}(s-1)+\alpha_{t-1}(s-2)\right]y_{l'_s}^t & \text{otherwise} \end{cases} $$
Mapping from $\alpha$ to the labelling probability: $p(l|x)$ is the total forward probability at the final timestep, with and without the trailing blank (a sketch of the full forward pass follows):

$$ p(l|x)=\alpha_T(|l'|)+\alpha_T(|l'|-1) $$
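A sketch of the forward pass following the initialization and recursion above; `ctc_forward` is an illustrative name, and labels are integer indices with the blank at index 0 as before.

```python
def ctc_forward(y, label):
    """Compute alpha[t, s] over the padded sequence l' = (b, l_1, b, l_2, ..., b)."""
    T = y.shape[0]
    lp = [BLANK]
    for c in label:
        lp += [c, BLANK]                 # l', length 2|l| + 1
    S = len(lp)
    alpha = np.zeros((T, S))
    # initialization: alpha_1(1) = y_b^1, alpha_1(2) = y_{l_1}^1, alpha_1(s) = 0 otherwise
    alpha[0, 0] = y[0, BLANK]
    if S > 1:
        alpha[0, 1] = y[0, lp[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s - 1 >= 0:
                a += alpha[t - 1, s - 1]
            # the s-2 term is allowed only if l'_s is not blank and differs from l'_{s-2}
            if s - 2 >= 0 and lp[s] != BLANK and lp[s] != lp[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * y[t, lp[s]]
    # p(l|x) = alpha_T(|l'|) + alpha_T(|l'| - 1)
    p = alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)
    return alpha, p
```

For small inputs, `brute_force_prob(y, label)` from the earlier sketch should return the same `p`, which makes a convenient sanity check.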
Initialization:

$$ \beta_T(|l'|)=y_b^T,\quad \beta_T(|l'|-1)=y_{l_{|l|}}^T,\quad \beta_T(s)=0,\ \forall s<|l'|-1 $$
Recursion:

$$ \beta_t(s)= \begin{cases} \left[\beta_{t+1}(s)+\beta_{t+1}(s+1)\right]y_{l'_s}^t & \text{if } l'_s=b \text{ or } l'_{s+2}=l'_s \\ \left[\beta_{t+1}(s)+\beta_{t+1}(s+1)+\beta_{t+1}(s+2)\right]y_{l'_s}^t & \text{otherwise} \end{cases} $$
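The backward pass mirrors the forward one; note that, in the definition used here, $\beta_t(s)$ also includes the emission $y_{l'_s}^t$ at time $t$. Again an illustrative sketch under the same assumptions:

```python
def ctc_backward(y, label):
    """Compute beta[t, s] over the padded sequence l'."""
    T = y.shape[0]
    lp = [BLANK]
    for c in label:
        lp += [c, BLANK]
    S = len(lp)
    beta = np.zeros((T, S))
    # initialization: beta_T(|l'|) = y_b^T, beta_T(|l'|-1) = y_{l_|l|}^T, 0 otherwise
    beta[T - 1, S - 1] = y[T - 1, BLANK]
    if S > 1:
        beta[T - 1, S - 2] = y[T - 1, lp[S - 2]]
    for t in range(T - 2, -1, -1):
        for s in range(S):
            b = beta[t + 1, s]
            if s + 1 < S:
                b += beta[t + 1, s + 1]
            # the s+2 term is allowed only if l'_s is not blank and differs from l'_{s+2}
            if s + 2 < S and lp[s] != BLANK and lp[s] != lp[s + 2]:
                b += beta[t + 1, s + 2]
            beta[t, s] = b * y[t, lp[s]]
    return beta
```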
Since the training examples are independent, the maximum likelihood objective $O^{ML}(S,N_w)=-\sum_{(x,z)\in S}\ln p(z|x)$ can be differentiated one example at a time:

$$ \frac{\partial O^{ML}(\{(x,z)\},N_w)}{\partial y_k^t}=-\frac{\partial \ln p(z|x)}{\partial y_k^t} $$
From (3) and the definitions of $\alpha$ and $\beta$, for any $t$ the labelling probability can be written as a sum over the symbols of $l'$:

$$ p(l|x)=\sum_{s=1}^{|l'|}\frac{\alpha_t(s)\beta_t(s)}{y_{l'_s}^t} $$
Back to (13): only the terms with $l'_s=k$ depend on $y_k^t$, so define $lab(l,k)=\{s: l'_s=k\}$. Since $\alpha_t(s)=(\cdots)\,y_{l'_s}^t$ and $\beta_t(s)=y_{l'_s}^t\,(\cdots)$, where the omitted factors do not involve $y_k^t$, differentiating gives

$$ \frac{\partial p(l|x)}{\partial y_k^t}=\sum_{s\in lab(l,k)}\frac{\partial}{\partial y_k^t}\left[\frac{\alpha_t(s)\beta_t(s)}{y_{l'_s}^t}\right]=\sum_{s\in lab(l,k)}\frac{(\cdots)\,y_{l'_s}^t\times y_{l'_s}^t\,(\cdots)}{[y_{l'_s}^{t}]^2} $$
$$ \Rightarrow \frac{\partial p(l|x)}{\partial y_k^t}=\frac{1}{[y_{k}^{t}]^2}\sum_{s\in lab(l,k)}\alpha_t(s)\beta_t(s) $$

(using $y_{l'_s}^t=y_k^t$ for every $s\in lab(l,k)$, so the factor can be pulled out of the sum)
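Combining the two passes gives this gradient with respect to the softmax outputs; a sketch reusing the hypothetical `ctc_forward` and `ctc_backward` above. The gradient of the per-example loss $-\ln p(l|x)$ then follows by dividing by $-p(l|x)$.

```python
def ctc_output_grad(y, label):
    """dp(l|x)/dy[t, k] = (1 / y[t, k]^2) * sum_{s in lab(l, k)} alpha[t, s] * beta[t, s]."""
    alpha, p = ctc_forward(y, label)
    beta = ctc_backward(y, label)
    lp = [BLANK]
    for c in label:
        lp += [c, BLANK]
    grad = np.zeros_like(y)
    for t in range(y.shape[0]):
        for s, k in enumerate(lp):   # summing over positions s with l'_s = k builds lab(l, k)
            grad[t, k] += alpha[t, s] * beta[t, s] / (y[t, k] ** 2)
    nll_grad = -grad / p             # gradient of -ln p(l|x)
    return grad, nll_grad
```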