
Transformers

A transformer:

  • Sequence-to-sequence system: [Je suis étudiant] → [I am a student].
  • Structure: Input → [Encoder → Decoder] → Output.

An encoder:

  • A neural network: a layer maps x1, ..., xn → y1, ..., ym, with m < n.
  • It performs compression.
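
A minimal NumPy sketch of this compression view, assuming a single linear layer; the sizes n and m are illustrative, not from the source:

```python
import numpy as np

n, m = 8, 4                      # m < n: the layer compresses
W = np.random.randn(m, n) * 0.1  # weights (learned in practice, random here)
x = np.random.randn(n)           # input x1, ..., xn
y = W @ x                        # output y1, ..., ym
assert y.shape == (m,)
```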

A decoder:

  • The opposite: it maps the compressed representation back to an output sequence.

Unrolling:

  • It is a stack of multiple encoders, and the output of the final encoder is passed to all the decoders.
  • Encoders do not share weights.
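
A sketch of this wiring, assuming hypothetical `encoder_layer` and `decoder_layer` callables (six of each in the original paper):

```python
def run_transformer(x, y, encoder_layers, decoder_layers):
    # Each encoder layer has its own weights (they are not shared).
    for enc in encoder_layers:
        x = enc(x)
    memory = x                    # output of the final encoder
    # Every decoder layer attends to the final encoder's output.
    for dec in decoder_layers:
        y = dec(y, memory)
    return y
```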

Encoder parts:

  • Self-attention →
  • Feed-forward neural network (see the sketch after the decoder list)

Decoder:

  • Self-attention →
  • Encoder-decoder attention →
  • Feed-forward neural network
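
A sketch of how those parts compose, assuming hypothetical `self_attention`, `encoder_decoder_attention`, and `feed_forward` functions (residuals and normalization are covered further below):

```python
def encoder_layer(x):
    x = self_attention(x)                     # every position attends to all positions
    return feed_forward(x)                    # applied to each position independently

def decoder_layer(y, memory):
    y = self_attention(y)                     # attends over the outputs produced so far
    y = encoder_decoder_attention(y, memory)  # attends over the encoder output
    return feed_forward(y)
```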

Self-attention:

  • Captures the relationships among the words in a sequence.

Queries, keys, values:

  • Obtained by multiplying each embedding by a matrix (W^Q, W^K, W^V).
  • Learns a function: which words relate to which other words.
  • A score is computed as q · k.
  • Normalize (divide by √d_k) and apply softmax.
  • Multiply each value by its softmax score, then sum.
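
A minimal NumPy sketch of these steps for a single head; the dimensions and random weights are illustrative:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_model, d_k, seq_len = 8, 4, 3
X = np.random.randn(seq_len, d_model)    # one embedding per word
W_Q = np.random.randn(d_model, d_k) * 0.1
W_K = np.random.randn(d_model, d_k) * 0.1
W_V = np.random.randn(d_model, d_k) * 0.1

Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # matrix times embedding
scores = Q @ K.T / np.sqrt(d_k)          # q · k, normalized by sqrt(d_k)
weights = softmax(scores)                # softmax over each row
Z = weights @ V                          # softmax scores times the values, summed
```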

Hyperparameters:

  • The dimensions of the matrices W^Q, W^K, W^V are hyperparameters.
  • Chosen to avoid overfitting.

Attention heads:

  • Multi-head attention runs several attention mechanisms in parallel, each with its own W^Q, W^K, W^V.
  • Half of the heads may not learn anything useful.
  • The z's from all heads are concatenated and projected back with a matrix W^O.
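
A sketch of the multi-head step, reusing `softmax`, `X`, `d_k`, and `d_model` from the single-head sketch above; `attention_head` is a hypothetical wrapper, and 8 heads are used as in the original paper:

```python
def attention_head(X):
    # Hypothetical wrapper: one head with its own random W_Q, W_K, W_V,
    # computing the single-head attention from the sketch above.
    W_q = np.random.randn(d_model, d_k) * 0.1
    W_k = np.random.randn(d_model, d_k) * 0.1
    W_v = np.random.randn(d_model, d_k) * 0.1
    return softmax((X @ W_q) @ (X @ W_k).T / np.sqrt(d_k)) @ (X @ W_v)

h = 8                                          # number of heads
zs = [attention_head(X) for _ in range(h)]     # each head learns its own view
Z = np.concatenate(zs, axis=-1)                # the z's are concatenated
W_O = np.random.randn(h * d_k, d_model) * 0.1
output = Z @ W_O                               # projected back to the model size
```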

Position encoding:

  • A special set of weights, depending on the position, is added to each embedding.
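
A sketch of the sinusoidal encoding used in the original paper, reusing `X`, `seq_len`, and `d_model` from the attention sketch above:

```python
def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]             # positions 0 .. seq_len-1
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions
    return pe

X = X + positional_encoding(seq_len, d_model)     # added, not concatenated
```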

Residuals:

  • To dampen instabilities, add each sub-layer's input to its output (a residual connection), then normalize.
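
A sketch of this add-and-normalize step, where `sublayer` stands for any of the attention or feed-forward functions above (the learned gain and bias of layer normalization are omitted):

```python
def add_and_norm(x, sublayer, eps=1e-6):
    y = x + sublayer(x)                      # residual: add input and output
    mean = y.mean(axis=-1, keepdims=True)
    std = y.std(axis=-1, keepdims=True)
    return (y - mean) / (std + eps)          # layer normalization
```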

Last step:

  • Logits → softmax → pick the highest one (greedy decoding).
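
A sketch of this final step; the logits here are random stand-ins for the decoder output projected to the vocabulary:

```python
import numpy as np

vocab_size = 100                         # illustrative vocabulary size
logits = np.random.randn(vocab_size)     # stand-in for the projected decoder output
probs = np.exp(logits - logits.max())
probs /= probs.sum()                     # softmax
next_token = probs.argmax()              # pick the highest one
```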