Merge branch 'master' of github.com:patrick-llgc/Learning-Deep-Learning
patrick-llgc committed Oct 13, 2023
2 parents 362af81 + dffe417 commit c471c22
Showing 1 changed file with 1 addition and 0 deletions: paper_notes/transformer.md
@@ -42,6 +42,7 @@ Attention, as opposed to memory, has a constant path length between any two positions.
- There is a fixed linear transformation between the encodings of any two positions a fixed distance apart. "We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PEpos+k can be represented as a linear function of PEpos." (This and the next point are illustrated in the first sketch after this list.)
- The dot-product similarity between positional encodings decays symmetrically and smoothly as the distance between the positions grows.
- The positional embedding is added to, not concatenated with, the word embedding. The PE effectively occupies only a small number of dimensions, and the bulk of the high-dimensional space is left to the WE (the "near orthogonality in high-dim space" property).
- Softmax: The term "soft" derives from the fact that the softmax function is continuous and differentiable. The arg max function, with its result represented as a one-hot vector, is neither continuous nor differentiable. The softmax function thus provides a "softened" version of the arg max (see the second sketch below). ([Source](https://en.wikipedia.org/wiki/Softmax_function))
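
A minimal NumPy sketch (my own illustration, not part of the original notes) of the first two properties above: for any fixed offset k there is a single linear map taking PEpos to PEpos+k, and the dot-product similarity between encodings decays smoothly and symmetrically with distance. The sizes (`max_len=600`, `d_model=64`) and the helper name `sinusoidal_pe` are arbitrary choices for the demo.

```python
import numpy as np

def sinusoidal_pe(max_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding as defined in 'Attention Is All You Need'."""
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]             # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)     # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dims: sin
    pe[:, 1::2] = np.cos(angles)                      # odd dims: cos
    return pe

pe = sinusoidal_pe(max_len=600, d_model=64)

# Property 1: for a fixed offset k, PE[pos + k] is a linear function of PE[pos].
# A single least-squares map M fits all positions at once.
k = 5
M, *_ = np.linalg.lstsq(pe[:-k], pe[k:], rcond=None)
print(np.abs(pe[:-k] @ M - pe[k:]).max())  # near zero: the map is exact up to float error

# Property 2: dot-product similarity peaks at the query position and decays
# smoothly and symmetrically on both sides.
sims = pe @ pe[300]
print(sims[295:306])                       # maximum at position 300, falling off around it
```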
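
And a second tiny sketch (again my own, not from the notes) of softmax as a softened arg max: softmax yields a smooth, differentiable distribution over the logits, and scaling the logits up pushes it toward the one-hot arg max vector.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    z = x - x.max()                  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])

print(softmax(logits))                   # soft weighting, roughly [0.63, 0.23, 0.14]
print(softmax(10 * logits))              # sharpened toward one-hot: ~[1.00, 0.00, 0.00]
print(np.eye(3)[np.argmax(logits)])      # hard arg max as a one-hot vector: [1. 0. 0.]
```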


#### Notes
