gaussian identities added
MarcToussaint committed Aug 30, 2024
1 parent e3b2e30 commit b25bd43
Showing 22 changed files with 809 additions and 177 deletions.
4 changes: 2 additions & 2 deletions docs/index.md
@@ -21,11 +21,11 @@ parts:
## Lecture Notes

<ul>
-{% assign sorted = site.pages | sort: 'date' %}
+{% assign sorted = site.pages | sort: 'order' %}
{% for page in sorted %}
{% if page.tags == 'note' %}
<li>
-<a href="{{site.baseurl}}{{page.url}}">{{ page.title }} ({{ page.date | date: '%B %d, %Y' }})</a>
+<a href="{{site.baseurl}}{{page.url}}">{{ page.title }}</a>
</li>
{% endif %}
{% endfor %}
30 changes: 18 additions & 12 deletions docs/notes/energy.inc
@@ -3,15 +3,17 @@ Probabilities & Energy

Given a density $p(x)$, we call

-$$\begin{aligned}
-E(x) = -\log p(x) + c ~,\end{aligned}$$
+$$\begin{align}
+E(x) = -\log p(x) + c ~,
+\end{align}$$

(for any choice of offset $c\in{\mathbb{R}}$) an energy function.
Conversely, given an energy function $E(x)$, the corresponding density
(called the Boltzmann distribution) is

-$$\begin{aligned}
-p(x) = \textstyle\frac 1Z \exp\lbrace-E(x)\rbrace ~,\end{aligned}$$
+$$\begin{align}
+p(x) = \textstyle\frac 1Z \exp\lbrace-E(x)\rbrace ~,
+\end{align}$$

where $Z$ is the normalization constant to ensure $\int_x p(x) = 1$, and
$c=-\log Z$. From the perspective of physics, one can motivate why
@@ -22,14 +24,16 @@ minimalistically as follows:
Probabilities are *multiplicative* (axiomatically): E.g., the likelihood
of i.i.d.&nbsp;data $D = \lbrace x_i \rbrace_{i=1}^n$ is the product

-$$\begin{aligned}
-P(D) = \prod_{i=1}^n p(x_i) ~.\end{aligned}$$
+$$\begin{align}
+P(D) = \prod_{i=1}^n p(x_i) ~.
+\end{align}$$

We often want to rewrite this with a log to have an *additive*
expression

-$$\begin{aligned}
-E(D) = -\log P(D) = \sum_{i=1}^n E(x_i) ~,\quad E(x_i) = -\log p(x_i) ~.\end{aligned}$$
+$$\begin{align}
+E(D) = -\log P(D) = \sum_{i=1}^n E(x_i) ~,\quad E(x_i) = -\log p(x_i) ~.
+\end{align}$$

The minus is a convention so that we can call the quantity $E(D)$ a
*loss* or *error* &ndash; something we want to minimize instead of
@@ -45,22 +49,24 @@ P(D_2)$, (2) $E(D)$ is additive $E(D_1\cup D_2) = E(D_1) + E(D_2)$, and
(3) there is a mapping $P(D) = \textstyle\frac 1Z f(E(D))$ between both.
Then it follows that

-$$\begin{aligned}
+$$\begin{align}
P(D_1\cup D_2)
&= P(D_1)~ P(D_2) = \textstyle\frac 1{Z_1} f(E(D_1))~ \textstyle\frac 1{Z_2} f(E(D_2)) \\
P(D_1\cup D_2)
&= \textstyle\frac 1{Z_0} f(E(D_1\cup D_2)) = \textstyle\frac 1{Z_0} f(E(D_1) + E(D_2)) \\
\Rightarrow\quad \textstyle\frac 1{Z_1} f(E_1) \textstyle\frac 1{Z_2} f(E_2)
-&= \textstyle\frac 1{Z_0} f(E_1+E_2) ~, \label{exp}\end{aligned}$$
+&= \textstyle\frac 1{Z_0} f(E_1+E_2) ~, \label{exp}
+\end{align}$$

where we defined $E_i=E(D_i)$. The only function to fulfill the last
equation for any $E_1,E_2\in{\mathbb{R}}$ is the exponential function
$f(E) = \exp\lbrace -\beta E \rbrace$ with arbitrary coefficient $\beta$
(and minus sign being a convention, $Z_0 = Z_1
Z_2$). Boiling this down to an individual element $x\in D$, we have

-$$\begin{aligned}
-p(x) = \textstyle\frac 1Z \exp\lbrace-\beta E(x)\rbrace ~,\quad\beta E(x) = -\log p(x) - \log Z ~.\end{aligned}$$
+$$\begin{align}
+p(x) = \textstyle\frac 1Z \exp\lbrace-\beta E(x)\rbrace ~,\quad\beta E(x) = -\log p(x) - \log Z ~.
+\end{align}$$
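
A minimal numerical sketch of these identities on a discretized 1D state space (assuming NumPy; the quadratic energy, the grid, and $\beta=1$ are arbitrary illustrative choices):

```python
import numpy as np

# Discretize a 1D state space and pick an arbitrary energy function.
x = np.linspace(-5.0, 5.0, 1001)
dx = x[1] - x[0]
beta = 1.0
E = 0.5 * x**2                       # illustrative energy; its Boltzmann density is a Gaussian

# Boltzmann distribution: p(x) = exp(-beta*E(x)) / Z
unnormalized = np.exp(-beta * E)
Z = np.sum(unnormalized) * dx        # numerical normalization constant
p = unnormalized / Z
assert np.isclose(np.sum(p) * dx, 1.0)            # p integrates to 1

# Recover the energy from the density: beta*E(x) = -log p(x) - log Z
E_recovered = (-np.log(p) - np.log(Z)) / beta
assert np.allclose(E_recovered, E)

# Energies are additive where probabilities are multiplicative:
# for independent x1, x2: p(x1)*p(x2) = exp(-beta*(E1+E2)) / Z^2
i, j = 100, 700
assert np.isclose(p[i] * p[j], np.exp(-beta * (E[i] + E[j])) / Z**2)
print("Boltzmann identities check out numerically")
```

The same check goes through for any energy function for which $Z$ is finite.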

Partition Function
------------------
3 changes: 2 additions & 1 deletion docs/notes/energy.md
@@ -2,11 +2,12 @@
layout: home
title: "Probabilities, Energy, Boltzmann, and Partition Function"
date: 2024-08-19
+order: 2
tags: note
---

*[Marc Toussaint](https://www.user.tu-berlin.de/mtoussai/), Learning &
-Intelligent Systems Lab, TU Berlin, {{ page.date | date: '%B %d, %Y' }}*
+Intelligent Systems Lab, TU Berlin,* {{ page.date | date: '%B, %Y' }}

[[pdf version](../pdfs/energy.pdf)]

31 changes: 18 additions & 13 deletions docs/notes/entropy.inc
@@ -26,8 +26,9 @@ sample $B$ surprises you more and gives you 2 bits of information.

The entropy

-$$\begin{aligned}
-H(p) &= - \sum_x p(x) \log p(x) = \mathbb{E}_{p(x)}\!\left\{-\log p(x)\right\}\end{aligned}$$
+$$\begin{align}
+H(p) &= - \sum_x p(x) \log p(x) = \mathbb{E}_{p(x)}\!\left\{-\log p(x)\right\}
+\end{align}$$

is the expected neg-log-likelihood. It is a measure of the distribution
$p$ itself, not of a specific sample. It measures how much *on average*
@@ -48,10 +49,10 @@ by a uniform discrete random variable of cardinality
$P\mkern-1pt{}P(p)$".

Given a gaussian distribution
-$p(x) \propto \exp\{-{\frac{1}{2}}(x-\mu)^2/\sigma^2\}$, the
+$p(x) \propto \exp\{-{\textstyle\frac{1}{2}}(x-\mu)^2/\sigma^2\}$, the
neg-log-likelihood of a specific sample $x$ is
-$-\log p(x) = {\frac{1}{2}}(x-\mu)^2/\sigma^2 + \textit{const}$. This
-can be thought of as the *square error* of $x$ from $\mu$, and its
+$-\log p(x) = {\textstyle\frac{1}{2}}(x-\mu)^2/\sigma^2 + \textit{const}$.
+This can be thought of as the *square error* of $x$ from $\mu$, and its
expectation (entropy) is the mean square error. Generally, the
neg-log-likelihood $-\log p(x)$ often relates to an error or loss
function.
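
A small numerical companion to the two points above (assuming NumPy; the example distributions and the Gaussian parameters $\mu=0$, $\sigma=2$ are made up for illustration):

```python
import numpy as np

def entropy_bits(p):
    """Entropy H(p) = E_p[-log2 p(x)] of a discrete distribution, in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                                 # convention: 0 * log 0 = 0
    return float(-np.sum(p * np.log2(p)))

print(entropy_bits([0.5, 0.5]))                  # fair coin: 1 bit
print(entropy_bits([0.25, 0.25, 0.25, 0.25]))    # uniform over 4 outcomes: 2 bits

# The Gaussian neg-log-likelihood is a scaled square error plus a constant:
mu, sigma = 0.0, 2.0
x = np.random.default_rng(0).normal(mu, sigma, size=100_000)
nll = 0.5 * (x - mu)**2 / sigma**2 + 0.5 * np.log(2 * np.pi * sigma**2)

# Its sample mean approximates the (differential) entropy 0.5*log(2*pi*e*sigma^2):
print(nll.mean(), 0.5 * np.log(2 * np.pi * np.e * sigma**2))
```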
@@ -61,8 +62,9 @@ Cross-Entropy

The cross-entropy

-$$\begin{aligned}
-H(p,q) = - \sum_x p(x) \log q(x) = \mathbb{E}_{p(x)}\!\left\{-\log q(x)\right\}\end{aligned}$$
+$$\begin{align}
+H(p,q) = - \sum_x p(x) \log q(x) = \mathbb{E}_{p(x)}\!\left\{-\log q(x)\right\}
+\end{align}$$

is also an expected neg-log-likelihood, but the expectation is
w.r.t.&nbsp;$p$, while the nll is w.r.t.&nbsp;$q$. This corresponds to
@@ -81,8 +83,9 @@ $p_{\bar y}(\cdot) = [0,..,0,1,0,..,0]$ with $1$ for $y=\bar y$. The
cross-entropy is then nothing but the neg-log-likelihood of the true
class label under the learned model:

-$$\begin{aligned}
-H(p_{\bar y}, q_\theta(\cdot|x)) = - \log q_\theta(\bar y|x) ~.\end{aligned}$$
+$$\begin{align}
+H(p_{\bar y}, q_\theta(\cdot|x)) = - \log q_\theta(\bar y|x) ~.
+\end{align}$$
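
A minimal sketch of this identity for a single classification example (assuming NumPy; the 4-class model output `q` and the label index are placeholder values):

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = E_p[-log q(x)]; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(q[mask])))

q = np.array([0.1, 0.7, 0.15, 0.05])   # model output q_theta(.|x), illustrative values
y_bar = 1                              # index of the true class
p_onehot = np.eye(len(q))[y_bar]       # one-hot target distribution p_ybar

# Cross-entropy against the one-hot target equals the neg-log-likelihood of the true class:
print(cross_entropy(p_onehot, q), -np.log(q[y_bar]))   # both ~0.357
```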

Note that we could equally cast a square error loss as a cross-entropy:
If $y$ is continuous, $q_\theta(y|x)$ Gaussian around a mean prediction
@@ -96,16 +99,18 @@ Relative Entropy (KL-divergence)
The Kullback-Leibler divergence, also called relative entropy, is
defined as

-$$\begin{aligned}
+$$\begin{align}
D\big(p\,\big\Vert\,q\big)
&= \sum_x p(x) \log\frac{p(x)}{q(x)}
-= \mathbb{E}_{p(x)}\!\left\{\log\frac{p(x)}{q(x)}\right\} ~.\end{aligned}$$
+= \mathbb{E}_{p(x)}\!\left\{\log\frac{p(x)}{q(x)}\right\} ~.
+\end{align}$$

Given our definitions above, we can rewrite it as

-$$\begin{aligned}
+$$\begin{align}
D\big(p\,\big\Vert\,q\big)
-&= H(p, q) - H(p) ~.\end{aligned}$$
+&= H(p, q) - H(p) ~.
+\end{align}$$
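
A quick numerical check of this decomposition (assuming NumPy; the two distributions `p` and `q` are arbitrary):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

H_p  = -np.sum(p * np.log(p))        # entropy H(p)
H_pq = -np.sum(p * np.log(q))        # cross-entropy H(p, q)
D_pq =  np.sum(p * np.log(p / q))    # relative entropy D(p || q)

assert np.isclose(D_pq, H_pq - H_p)  # D(p||q) = H(p,q) - H(p)
assert D_pq >= 0                     # KL divergence is non-negative
print(H_p, H_pq, D_pq)
```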

Note that we described $H(p)$ as the expected code length when encoding
$p$-samples using a $p$-model; and $H(p, q)$ as the expected code length
3 changes: 2 additions & 1 deletion docs/notes/entropy.md
@@ -2,11 +2,12 @@
layout: home
title: "Entropy, Information, Cross-Entropy, and ML as Minimal Description Length"
date: 2024-08-15
+order: 1
tags: note
---

*[Marc Toussaint](https://www.user.tu-berlin.de/mtoussai/), Learning &
-Intelligent Systems Lab, TU Berlin, {{ page.date | date: '%B %d, %Y' }}*
+Intelligent Systems Lab, TU Berlin,* {{ page.date | date: '%B, %Y' }}

[[pdf version](../pdfs/entropy.pdf)]
