gaussian identities added
MarcToussaint committed Aug 30, 2024
1 parent e3b2e30 commit b25bd43
Showing 22 changed files with 809 additions and 177 deletions.
4 changes: 2 additions & 2 deletions docs/index.md
@@ -21,11 +21,11 @@ parts:
## Lecture Notes

<ul>
-{% assign sorted = site.pages | sort: 'date' %}
+{% assign sorted = site.pages | sort: 'order' %}
{% for page in sorted %}
{% if page.tags == 'note' %}
<li>
-<a href="{{site.baseurl}}{{page.url}}">{{ page.title }} ({{ page.date | date: '%B %d, %Y' }})</a>
+<a href="{{site.baseurl}}{{page.url}}">{{ page.title }}</a>
</li>
{% endif %}
{% endfor %}
30 changes: 18 additions & 12 deletions docs/notes/energy.inc
@@ -3,15 +3,17 @@ Probabilities & Energy

Given a density $p(x)$, we call

-$$\begin{aligned}
-E(x) = -\log p(x) + c ~,\end{aligned}$$
+$$\begin{align}
+E(x) = -\log p(x) + c ~,
+\end{align}$$

(for any choice of offset $c\in{\mathbb{R}}$) an energy function.
Conversely, given an energy function $E(x)$, the corresponding density
(called the Boltzmann distribution) is

-$$\begin{aligned}
-p(x) = \textstyle\frac 1Z \exp\lbrace-E(x)\rbrace ~,\end{aligned}$$
+$$\begin{align}
+p(x) = \textstyle\frac 1Z \exp\lbrace-E(x)\rbrace ~,
+\end{align}$$

where $Z$ is the normalization constant to ensure $\int_x p(x) = 1$, and
$c=-\log Z$. From the perspective of physics, one can motivate why
@@ -22,14 +24,16 @@ minimalistically as follows:
Probabilities are *multiplicative* (axiomatically): E.g., the likelihood
of i.i.d.&nbsp;data $D = \lbrace x_i \rbrace_{i=1}^n$ is the product

-$$\begin{aligned}
-P(D) = \prod_{i=1}^n p(x_i) ~.\end{aligned}$$
+$$\begin{align}
+P(D) = \prod_{i=1}^n p(x_i) ~.
+\end{align}$$

We often want to rewrite this with a log to have an *additive*
expression

-$$\begin{aligned}
-E(D) = -\log P(D) = \sum_{i=1}^n E(x_i) ~,\quad E(x_i) = -\log p(x_i) ~.\end{aligned}$$
+$$\begin{align}
+E(D) = -\log P(D) = \sum_{i=1}^n E(x_i) ~,\quad E(x_i) = -\log p(x_i) ~.
+\end{align}$$

The minus is a convention so that we can call the quantity $E(D)$ a
*loss* or *error* &ndash; something we want to minimize instead of
@@ -45,22 +49,24 @@ P(D_2)$, (2) $E(D)$ is additive $E(D_1\cup D_2) = E(D_1) + E(D_2)$, and
(3) there is a mapping $P(D) = \textstyle\frac 1Z f(E(D))$ between both.
Then it follows that

-$$\begin{aligned}
+$$\begin{align}
P(D_1\cup D_2)
&= P(D_1)~ P(D_2) = \textstyle\frac 1{Z_1} f(E(D_1))~ \textstyle\frac 1{Z_2} f(E(D_2)) \\
P(D_1\cup D_2)
&= \textstyle\frac 1{Z_0} f(E(D_1\cup D_2)) = \textstyle\frac 1{Z_0} f(E(D_1) + E(D_2)) \\
\Rightarrow\quad \textstyle\frac 1{Z_1} f(E_1) \textstyle\frac 1{Z_2} f(E_2)
-&= \textstyle\frac 1{Z_0} f(E_1+E_2) ~, \label{exp}\end{aligned}$$
+&= \textstyle\frac 1{Z_0} f(E_1+E_2) ~, \label{exp}
+\end{align}$$

where we defined $E_i=E(D_i)$. The only function to fulfill the last
equation for any $E_1,E_2\in{\mathbb{R}}$ is the exponential function
$f(E) = \exp\lbrace -\beta E \rbrace$ with arbitrary coefficient $\beta$
(and minus sign being a convention, $Z_0 = Z_1
Z_2$). Boiling this down to an individual element $x\in D$, we have

-$$\begin{aligned}
-p(x) = \textstyle\frac 1Z \exp\lbrace-\beta E(x)\rbrace ~,\quad\beta E(x) = -\log p(x) - \log Z ~.\end{aligned}$$
+$$\begin{align}
+p(x) = \textstyle\frac 1Z \exp\lbrace-\beta E(x)\rbrace ~,\quad\beta E(x) = -\log p(x) - \log Z ~.
+\end{align}$$
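
A minimal numerical sketch of these identities on a discretized 1D state space (assuming NumPy; the quadratic energy, the grid, and $\beta=1$ are arbitrary illustrative choices):

```python
import numpy as np

# Discretize a 1D state space and pick an arbitrary energy function.
x = np.linspace(-5.0, 5.0, 1001)
dx = x[1] - x[0]
beta = 1.0
E = 0.5 * x**2                       # illustrative energy; its Boltzmann density is a Gaussian

# Boltzmann distribution: p(x) = exp(-beta*E(x)) / Z
unnormalized = np.exp(-beta * E)
Z = np.sum(unnormalized) * dx        # numerical normalization constant
p = unnormalized / Z
assert np.isclose(np.sum(p) * dx, 1.0)            # p integrates to 1

# Recover the energy from the density: beta*E(x) = -log p(x) - log Z
E_recovered = (-np.log(p) - np.log(Z)) / beta
assert np.allclose(E_recovered, E)

# Energies are additive where probabilities are multiplicative:
# for independent x1, x2: p(x1)*p(x2) = exp(-beta*(E1+E2)) / Z^2
i, j = 100, 700
assert np.isclose(p[i] * p[j], np.exp(-beta * (E[i] + E[j])) / Z**2)
print("Boltzmann identities check out numerically")
```

The same check goes through for any energy function for which $Z$ is finite.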

Partition Function
------------------
3 changes: 2 additions & 1 deletion docs/notes/energy.md
@@ -2,11 +2,12 @@
layout: home
title: "Probabilities, Energy, Boltzmann, and Partition Function"
date: 2024-08-19
+order: 2
tags: note
---

*[Marc Toussaint](https://www.user.tu-berlin.de/mtoussai/), Learning &
-Intelligent Systems Lab, TU Berlin, {{ page.date | date: '%B %d, %Y' }}*
+Intelligent Systems Lab, TU Berlin,* {{ page.date | date: '%B, %Y' }}

[[pdf version](../pdfs/energy.pdf)]

31 changes: 18 additions & 13 deletions docs/notes/entropy.inc
@@ -26,8 +26,9 @@ sample $B$ surprises you more and gives you 2 bits of information.

The entropy

-$$\begin{aligned}
-H(p) &= - \sum_x p(x) \log p(x) = \mathbb{E}_{p(x)}\!\left\{-\log p(x)\right\}\end{aligned}$$
+$$\begin{align}
+H(p) &= - \sum_x p(x) \log p(x) = \mathbb{E}_{p(x)}\!\left\{-\log p(x)\right\}
+\end{align}$$

is the expected neg-log-likelihood. It is a measure of the distribution
$p$ itself, not of a specific sample. It measures how much *on average*
@@ -48,10 +49,10 @@ by a uniform discrete random variable of cardinality
$P\mkern-1pt{}P(p)$".

Given a gaussian distribution
-$p(x) \propto \exp\{-{\frac{1}{2}}(x-\mu)^2/\sigma^2\}$, the
+$p(x) \propto \exp\{-{\textstyle\frac{1}{2}}(x-\mu)^2/\sigma^2\}$, the
neg-log-likelihood of a specific sample $x$ is
-$-\log p(x) = {\frac{1}{2}}(x-\mu)^2/\sigma^2 + \textit{const}$. This
-can be thought of as the *square error* of $x$ from $\mu$, and its
+$-\log p(x) = {\textstyle\frac{1}{2}}(x-\mu)^2/\sigma^2 + \textit{const}$.
+This can be thought of as the *square error* of $x$ from $\mu$, and its
expectation (entropy) is the mean square error. Generally, the
neg-log-likelihood $-\log p(x)$ often relates to an error or loss
function.
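
A small numerical companion to the two points above (assuming NumPy; the example distributions and the Gaussian parameters $\mu=0$, $\sigma=2$ are made up for illustration):

```python
import numpy as np

def entropy_bits(p):
    """Entropy H(p) = E_p[-log2 p(x)] of a discrete distribution, in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                                 # convention: 0 * log 0 = 0
    return float(-np.sum(p * np.log2(p)))

print(entropy_bits([0.5, 0.5]))                  # fair coin: 1 bit
print(entropy_bits([0.25, 0.25, 0.25, 0.25]))    # uniform over 4 outcomes: 2 bits

# The Gaussian neg-log-likelihood is a scaled square error plus a constant:
mu, sigma = 0.0, 2.0
x = np.random.default_rng(0).normal(mu, sigma, size=100_000)
nll = 0.5 * (x - mu)**2 / sigma**2 + 0.5 * np.log(2 * np.pi * sigma**2)

# Its sample mean approximates the (differential) entropy 0.5*log(2*pi*e*sigma^2):
print(nll.mean(), 0.5 * np.log(2 * np.pi * np.e * sigma**2))
```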
@@ -61,8 +62,9 @@ Cross-Entropy

The cross-entropy

-$$\begin{aligned}
-H(p,q) = - \sum_x p(x) \log q(x) = \mathbb{E}_{p(x)}\!\left\{-\log q(x)\right\}\end{aligned}$$
+$$\begin{align}
+H(p,q) = - \sum_x p(x) \log q(x) = \mathbb{E}_{p(x)}\!\left\{-\log q(x)\right\}
+\end{align}$$

is also an expected neg-log-likelihood, but the expectation is
w.r.t.&nbsp;$p$, while the nll is w.r.t.&nbsp;$q$. This corresponds to
@@ -81,8 +83,9 @@ $p_{\bar y}(\cdot) = [0,..,0,1,0,..,0]$ with $1$ for $y=\bar y$. The
cross-entropy is then nothing but the neg-log-likelihood of the true
class label under the learned model:

-$$\begin{aligned}
-H(p_{\bar y}, q_\theta(\cdot|x)) = - \log q_\theta(\bar y|x) ~.\end{aligned}$$
+$$\begin{align}
+H(p_{\bar y}, q_\theta(\cdot|x)) = - \log q_\theta(\bar y|x) ~.
+\end{align}$$
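
A minimal sketch of this identity for a single classification example (assuming NumPy; the 4-class model output `q` and the label index are placeholder values):

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = E_p[-log q(x)]; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(q[mask])))

q = np.array([0.1, 0.7, 0.15, 0.05])   # model output q_theta(.|x), illustrative values
y_bar = 1                              # index of the true class
p_onehot = np.eye(len(q))[y_bar]       # one-hot target distribution p_ybar

# Cross-entropy against the one-hot target equals the neg-log-likelihood of the true class:
print(cross_entropy(p_onehot, q), -np.log(q[y_bar]))   # both ~0.357
```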

Note that we could equally cast a square error loss as a cross-entropy:
If $y$ is continuous, $q_\theta(y|x)$ Gaussian around a mean prediction
@@ -96,16 +99,18 @@ Relative Entropy (KL-divergence)
The Kullback-Leibler divergence, also called relative entropy, is
defined as

-$$\begin{aligned}
+$$\begin{align}
D\big(p\,\big\Vert\,q\big)
&= \sum_x p(x) \log\frac{p(x)}{q(x)}
-= \mathbb{E}_{p(x)}\!\left\{\log\frac{p(x)}{q(x)}\right\} ~.\end{aligned}$$
+= \mathbb{E}_{p(x)}\!\left\{\log\frac{p(x)}{q(x)}\right\} ~.
+\end{align}$$

Given our definitions above, we can rewrite it as

-$$\begin{aligned}
+$$\begin{align}
D\big(p\,\big\Vert\,q\big)
-&= H(p, q) - H(p) ~.\end{aligned}$$
+&= H(p, q) - H(p) ~.
+\end{align}$$
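
A quick numerical check of this decomposition (assuming NumPy; the two distributions `p` and `q` are arbitrary):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

H_p  = -np.sum(p * np.log(p))        # entropy H(p)
H_pq = -np.sum(p * np.log(q))        # cross-entropy H(p, q)
D_pq =  np.sum(p * np.log(p / q))    # relative entropy D(p || q)

assert np.isclose(D_pq, H_pq - H_p)  # D(p||q) = H(p,q) - H(p)
assert D_pq >= 0                     # KL divergence is non-negative
print(H_p, H_pq, D_pq)
```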

Note that we described $H(p)$ as the expected code length when encoding
$p$-samples using a $p$-model; and $H(p, q)$ as the expected code length
3 changes: 2 additions & 1 deletion docs/notes/entropy.md
@@ -2,11 +2,12 @@
layout: home
title: "Entropy, Information, Cross-Entropy, and ML as Minimal Description Length"
date: 2024-08-15
+order: 1
tags: note
---

*[Marc Toussaint](https://www.user.tu-berlin.de/mtoussai/), Learning &
-Intelligent Systems Lab, TU Berlin, {{ page.date | date: '%B %d, %Y' }}*
+Intelligent Systems Lab, TU Berlin,* {{ page.date | date: '%B, %Y' }}

[[pdf version](../pdfs/entropy.pdf)]
