A transformer:
- Sequence-to-sequence system: [Je suis étudiant] → [I am a student].
- Structure: input → Encoder → Decoder → output.
An encoder:
- A neural network that maps an input x1, ..., xn to a representation y1, ..., ym, with m < n.
- In this view, encoding is compression.
A decoder:
- The opposite: it expands the compressed representation back into an output; a sketch of both follows.
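A minimal NumPy sketch of this compression/expansion view; the sizes (n = 8, m = 3) and the tanh nonlinearity are illustrative assumptions, not from the notes:

```python
import numpy as np

# Toy encoder/decoder as dense layers: the encoder compresses an
# n-dimensional input to an m-dimensional code (m < n); the decoder
# expands it back. Sizes and nonlinearity are illustrative.
rng = np.random.default_rng(0)
n, m = 8, 3
W_enc = rng.standard_normal((n, m))   # encoder weights: n -> m
W_dec = rng.standard_normal((m, n))   # decoder weights: m -> n

x = rng.standard_normal(n)            # input x1, ..., xn
code = np.tanh(x @ W_enc)             # compressed y1, ..., ym
x_hat = code @ W_dec                  # decoder output, back to n dims
print(code.shape, x_hat.shape)        # (3,) (8,)
```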
Unrolling:
- The transformer is a stack of encoders plus a stack of decoders; the output of the final encoder is passed to every decoder (see the stack sketch after the decoder parts below).
- The encoders do not share weights.
Encoder parts:
- Self-attention →
- Feed-forward neural network
Decoder parts:
- Self-attention →
- Encoder-decoder attention →
- Feed-forward neural network
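A structural sketch of the stack, assuming the usual 6 layers; the sublayer functions are identity stand-ins so the skeleton runs (real definitions follow in the sections below):

```python
import numpy as np

def self_attention(x):
    return x  # placeholder; see the scaled dot-product sketch below

def enc_dec_attention(y, memory):
    return y  # placeholder; decoder queries would attend to `memory`

def feed_forward(x):
    return x  # placeholder position-wise feed-forward network

def encode(x, num_layers=6):
    for _ in range(num_layers):       # layers do not share weights
        x = feed_forward(self_attention(x))
    return x                          # final encoder output = "memory"

def decode(y, memory, num_layers=6):
    for _ in range(num_layers):       # every decoder layer sees the same memory
        y = feed_forward(enc_dec_attention(self_attention(y), memory))
    return y

memory = encode(np.ones((4, 512)))        # 4 source tokens, model dim 512
out = decode(np.ones((3, 512)), memory)   # 3 target tokens
print(out.shape)                          # (3, 512)
```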
Self-attention:
- Captures relationships among words, e.g. which noun a pronoun like "it" refers to earlier in the sentence.
Queries, keys, values:
- Obtained by multiplying each embedding by a learned matrix (W^Q, W^K, W^V).
- This learns a function for which words relate to which other words.
- A score is computed as the dot product q · k.
- The scores are scaled by sqrt(d_k) and passed through a softmax.
- Each value vector is multiplied by its softmax score, and the weighted values are summed (sketch below).
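A minimal NumPy sketch of these steps, with toy sizes (seq = 5, d_model = 4, d_k = 3) and random stand-ins for the learned matrices W^Q, W^K, W^V:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq, d_model, d_k = 5, 4, 3                  # toy sizes, not the paper's 512/64
X = rng.standard_normal((seq, d_model))      # one embedding per word

W_Q = rng.standard_normal((d_model, d_k))    # learned in a real model
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # embedding times matrix
scores = Q @ K.T / np.sqrt(d_k)              # q . k, scaled by sqrt(d_k)
weights = softmax(scores)                    # one distribution per word
Z = weights @ V                              # values weighted by softmax scores
print(Z.shape)                               # (5, 3): one z per word
```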
Hyperparameters:
- Strictly, the matrices W^Q, W^K, W^V are learned parameters; the hyperparameters are their dimensions (e.g. d_k).
- Keeping these dimensions small limits model capacity, which helps avoid overfitting.
Attention heads:
- Multi-head attention runs several attention mechanisms ("heads") in parallel.
- Some heads may learn little; in practice many can be pruned with little loss.
- Each head produces its own z; the z's are concatenated and projected back to the model dimension (sketch below).
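A sketch of the concatenation step; here each head's z is a random stand-in rather than a real attention output, and the head count is illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
seq, d_model, num_heads = 5, 8, 2
d_k = d_model // num_heads

# Each head would produce its own z via the attention sketch above.
z_per_head = [rng.standard_normal((seq, d_k)) for _ in range(num_heads)]
Z_cat = np.concatenate(z_per_head, axis=-1)            # (seq, num_heads * d_k)
W_O = rng.standard_normal((num_heads * d_k, d_model))  # learned output projection
Z = Z_cat @ W_O                                        # back to (seq, d_model)
print(Z.shape)                                         # (5, 8)
```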
Position encoding:
- A special vector, depending only on the position, is added to each embedding (sketch below).
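A sketch of the sinusoidal encoding from the original paper: a fixed vector, depending only on position, added to each embedding (sizes are toy values):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]        # positions 0 .. seq_len-1
    i = np.arange(d_model)[None, :]          # embedding dimensions
    angle = pos / np.power(10000, 2 * (i // 2) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

X = np.zeros((4, 8))                  # 4 token embeddings, d_model = 8
X = X + positional_encoding(4, 8)     # add position information
print(X[0, :4])                       # each row now differs by position
```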
Residuals:
- The input of each sublayer is added to its output, and the sum is layer-normalized ("add & norm"); this dampens instability during training (sketch below).
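A sketch of the add-and-norm step, with a simple LayerNorm (no learned gain/bias, which a real implementation would have):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)     # per-position mean over features
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def add_and_norm(x, sublayer):
    return layer_norm(x + sublayer(x))      # residual add, then normalize

x = np.random.default_rng(2).standard_normal((4, 8))
out = add_and_norm(x, lambda t: 0.5 * t)    # any sublayer stand-in
print(out.mean(axis=-1).round(6))           # ~0 per position after the norm
```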
Last step:
- Logits → softmax → pick the highest-probability token (greedy decoding; sketch below).
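A sketch of this last step, with a toy vocabulary and a random stand-in for the learned output projection:

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, vocab_size = 8, 10
W_vocab = rng.standard_normal((d_model, vocab_size))  # learned in a real model

h = rng.standard_normal(d_model)          # decoder output for one position
logits = h @ W_vocab                      # one score per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                      # softmax over the vocabulary
token_id = int(np.argmax(probs))          # greedy: pick the highest probability
print(token_id, probs[token_id])
```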