- Introduction
- Task 1: Choosing a model and making it efficient
- Task 2: fine tuning on GLUE
- Further work
The main efficiency bottleneck in Transformer models is its self-attention mechanism.
In short:
- This due to the fact that each token representation is updated by attending to all token in the sentence (attending to all token in a Encoder, and the leftwards tokens in a Decoder).
In more detail:
-
Fact 1:
- If
$A \in \mathbb{R}^{p\times q}$ ,$B \in \mathbb{R}^{q\times r}$ , then the matrix multiplication$C = AB \in \mathbb{R}^{p\times r}$ :- has time complexity
$O(pqr)$ .- Each element
$C_{ij} = \sum\limits_{k=1}^{q} A_{ik}B_{kj}$ requires$O(q)$ computations. - We have to compute
$pr$ such elements$\implies O(pqr)$ .
- Each element
- has space complexity
$O(pr)$ .- The space required to store the elements in
$AB$ .
- The space required to store the elements in
- has time complexity
- If
-
Fact 2:
Following (Vaswani et a., 2017) and assuming that:
-
$X \in \mathbb{R}^{n \times d_{\text{model}}}$ , where-
$X$ — the input embeddings. -
$n$ — sequence length, and -
$d_{\text{model}}$ — the model embedding dimension.
-
-
$W_i^Q, W_i^K \in \mathbb{R}^{d_\text{model}\times d_k}$ . -
$W_i^V \in \mathbb{R}^{d_\text{model}\times d_v}$ (often$d_v = d_k = d$ ).
then (single head) attention is defined as:
-
Since
Therefore, training Transformer-based models can be prohibitively costly, specially on long sequences.
Different approaches has been proposed to improve the efficiency of training these models. Among them:
- Incorporate structural priors on attention such as
- sparsity (Child et al. 2019).
- pooling-based compression (Rae et al. 2020).
- clustering/binning/convolution techniques, which applies
$k$ -means clustering to learn dynamic sparse attention regions (e.g. Roy et al. 2020).
- Restrict the attention mechanism to attend to local neighborhoods (Parmar et al. 2018).
- Using a Reformer approach (Kitaev et al. 2020)
- where locality sensitive hashing is used to group together tokens of similar embeddings).
- Sliding windows (Beltagy et al. 2020).
- Truncated targeting (Chelba et al. 2020).
First, the choice of the Transformer model depends on the downstream tasks we are asked to trained the model on.
Since we’ll need to test the models in some language understanding tasks (GLUE) we have to choose a Transformer text model.
The variety of Transformer-base text model is vast. A survey may be more exhaustive but some models can be found here.
In general, they can be further subdivided depending on whether they rely solely on the encoder, the decoder or both. This also ultimately defines the array of tasks they can be applied to
For simplicity’s sake we chose BERT, mainly for two reasons:
- Its architecture is not complex. Moreover, it’s faithfully based on the original Transformer implementation of (Vaswani et a., 2017), which we thought would make simpler to modify the self-attention sub-layer.
- Modifying the attention sub-layer implies influencing the representation power of the model. At this point we thought that an encoder-based model would be simpler to manipulate in comparison to only-decoder or encoder-decoder models.
- Otherwise, we’d need not only take care of changing the self-attention sub-layer in some way, but also making the cross-attention work (in hopefully some meaningful way).
The given choice (BERT) also affects the possible approaches we can pursue in making the transformer more efficient. More detail in next section.
- General approaches taken to deal with the attention bottleneck were outlined above.
- Given the encoder-based nature of BERT (and the nature of the downstream task) we’re prompted to not to choose some approaches, among them:
-
Autoregressive methods:
- Such as the ones discussed in Finetuning Pretrained Transformers into RNNs and Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention.
- Reason:
- Although there are BERT variants modified to run text generation task (e.g. Rothe et al. 2020), we decided not to pursue that approach, since the downstream tasks on GLUE are more related to Natural Language Understanding than to pure generation.
-
Linformer (Wang et al. 2020):
- Idea:
- Approximate self-attention with an approximate low-rank matrix.
- In other words, project
$KW_i^K,\ VW_i^V \in \mathbb{R}^{n\times d}$ into a lower dimensional space using linear projections with matrices$E_i, F_i \in \mathbb R^{n×k}$ , where$i$ stand for the$i$ -th layer in the Encoder.
- Drawback:
- It assumes a fixed sequence length. In other words, for every possible sequence length
$n_j$ we’d need a new pair of matrices$E_i, F_i \in \mathbb R^{n_j×k}$ (either for the whole model, if we set$E_i = E$ and$F_i = F$ , or for every encoder layer). - We could have just padded every batch to
max_seq_length
, and set$n_j =$ max_seq_length
but we chose not to do that, assuming it may be harmful for dowstream performance.
- It assumes a fixed sequence length. In other words, for every possible sequence length
- Idea:
-
Reformer (Kitaev et al. 2020):
-
Idea:
- Use a multi-round hashing scheme when computing dot-product attention, which in theory reduces the self-attention complexity to
$O(n \log(n))$ .
- Use a multi-round hashing scheme when computing dot-product attention, which in theory reduces the self-attention complexity to
-
Drawback:
- Their complexity term has a large constant
$128^2$ and it is only more efficient than the vanilla transformer when sequence length is extremely long (Wang et al. 2020)
- Their complexity term has a large constant
-
-
Sparse Transformer and Block give attention (e.g. Child et al. 2019):
- Idea:
- Reduce the attention dot-product computation to just a subset of the query-key values.
- Drawback:
- It leads to performance degradation.
- Idea:
-
- Source: Lee-Thorp et al. 2021.
To be concise, the paper proposes to replace the self-attention sub-layer completely with a Fourier Transform layer:
The computed Fourier Transform is 2D: along the sequence-length dimension and along the embedding dimensions. That is, for a matrix
where
In this way, the time complexity is reduced from
The space complexity depends on the FFT implementation:
- If recursive, then
$O(n\log n)$ . - If iterative, then
$O(n)$ . - We’ll assume that the PyTorch implementations in
torch.fft
is efficient, so we expect a linear space complexity.
To implement this approach we decided to extend the HuggingFace’s (HF) transformers
library for two reasons:
- By suggestion, we are not supposed to write the whole architecture from scratch and only modify the needed layers
$\implies$ the less we code we write the better (: - For practical purposes we would like to spend as less time as possible implementing all the necessary code for fine-tuning (preprocessing datasets, creating the training loop, evaluating, plotting, etc).
transformers
provides a high-level interface that takes care of all that pretty easily by using itsAuto
andTrainer
classes).
For this reason the code is in separate repository (https://github.com/Joqsan/transformers) and not inside this project.
We know that HF’s transformers
already include the FNet model (see here) but, as expected, we ignore it and opt to:
-
Begin with a copy of their BERT code. See commit 8ffbcff. Copy-pasted from here.
-
Change the self-attention sub-layer as describe above (see image). In the transformers codebase this amounts to:
- Modifying the corresponding
PreTrainedConfig
subclass to match the FNet hyperparameters. - Replacing
BertSelfAttention
.
All this is done in commit dd08150
✅ We also change the class name to avoid conflicting with the names in
modeling_bert.py
. See commit 823bb6e. - Modifying the corresponding
- We initialize the weights of a
MyFNetModel
to the pre-trained weights of aBertModel.from_pretrained("bert-base-uncased")
model. - We host the weight on HF Hub for easily loading the them for the downstream fine-tuning tasks.
Usually when modifying inner layers in a model we need to fine-tune it in some pre-training task before fine-tuning it in downstream ones (for example, the same task the previous pre-trained weights were obtained from).
Even when doing so, some authors still got drops in performances and recommend training the whole model from scratch (Kasai et al. 2021)
In our case, we replace the self-attention layer with unparametrized Fourier Transform layers
Due to limitation of computational resources pre-training our FNet is too costly, so as in the previous section we decided to use the weights of a pre-trained BERT as starting point for our downstream fine-tuning.
Although we still expect an improvement in computation time across a encoder layer, we also expect a worse-than-expected performance degradation with our FNet. Whether this is the case, we’ll see in the model evaluation below (Spoiler: this was the case ):).
GLUE is a collection of diverse natural language understanding tasks. A summarized description all of the tasks can be found in (Wang et al. 2018).
The tasks we choose were the following:
-
CoLa — The Corpus of Linguistic Acceptability.
- Type of task:
- Binary classification.
- Input:
- Single sentence (English sentence).
- Output:
- Predict whether the English sentence is grammatically correct or not.
- Metric:
- Matthew’s Correlation.
- Type of task:
-
QNLI — Question Natural Language Inference:
- Built from SQuAD (Stanford Question Answering Dataset)
- Type of task:
- Binary classification.
- Input:
- pair of (question, answer).
- Output:
- Predict whether answer is the answer for question.
- Metric:
- Accuracy.
-
RTE — Recognizing Textual Entailment:
- A smaller version of MNLI — Multi-Genre Natural Language Inference.
- Type of task:
- Classification.
- Input:
- pair of sentences.
- Output:
- Predict whether the second sentence is an entailment, contradiction, or neutral with respect to the first one.
- Metric:
- Accuracy.
The script for fine-tuning can be found here. It can be run on any of the GLUE tasks.
- We run the fine-tuning script on Colab (free GPU). The corresponding notebook can be found here.
- Instructions about how to run it locally can be found here.
Access to all logging and plots can be found here.
- For details refer to the tables below.
- In general our custom FNet model perform faster during both training and testing with:
- ~30% increase in training/test steps done per second and in number of samples processed per second (first and second columns).
- ~25% decrease in runtime spent during training/testing.
- The number of operation required to run FNet (
train/total_flos
) also decreased.
QNLI:
CoLA:
RTE:
- As in the paper, we can observe performance degradation with respect to the GLUE metrics.
- But in our case we observe substantially performance decrease across all the tasks in comparison to what vanilla BERT achieved after fine-tuning.
- This may be linked to the fact that we did not carry out any fine-tuning with the whole model to account for the removal of the self-attention layer.
- For CoLA, Matthew’s correlation = 0, which implies that our FNet model was not able find any relationship between the variables (between the pair of sentences).
- We also note that, not only the loss between FNet and BERT are comparatively different, but the FNet almost did not change after 5 epochs.
- This shows that the mixing mechanism provided by Fourier Transform (although more efficient) cannot fully replace the representation power produced by vanilla Transformer.
QNLI:
CoLA:
RTE:
There are a number of step that can be taken in order to extend/complement the above analysis, such as:
- Redeem our sins: pre-train the full FNet model from scratch and repeat the downstream fine-tuning to compare how it performs against vanilla BERT and the naive FNet version discussed above.
- Compare the models’ time and space performance with respect to sequence length.
- Include other approaches for improving efficiency, that are non-orthogonal to FNet, such as using mix precision.