
Commit

energy note

MarcToussaint committed Aug 19, 2024
1 parent 605fce4 commit 2f0d593

Showing 16 changed files with 406 additions and 86 deletions.
5 changes: 3 additions & 2 deletions docs/index.md
@@ -21,10 +21,11 @@ parts:
## Lecture Notes

<ul>
{% for page in site.pages %}
{% assign sorted = site.pages | sort: 'date' %}
{% for page in sorted %}
{% if page.tags == 'note' %}
<li>
<a href="{{site.baseurl}}{{page.url}}">{{ page.title }} ({{ page.date }})</a>
<a href="{{site.baseurl}}{{page.url}}">{{ page.title }} ({{ page.date | date: '%B %d, %Y' }})</a>
</li>
{% endif %}
{% endfor %}
10 changes: 0 additions & 10 deletions docs/notes/_hide/energy.md

This file was deleted.

155 changes: 155 additions & 0 deletions docs/notes/energy.inc
@@ -0,0 +1,155 @@
Probabilities & Energy
----------------------

Given a density $p(x)$, we call

$$\begin{aligned}
E(x) = -\log p(x) + c ~,\end{aligned}$$

(for any choice of offset $c\in{\mathbb{R}}$) an energy function.
Conversely, given an energy function $E(x)$, the corresponding density
(called Boltzmann distribution) is

$$\begin{aligned}
p(x) = \textstyle\frac 1Z \exp\lbrace-E(x)\rbrace ~,\end{aligned}$$

where $Z$ is the normalization constant to ensure $\int_x p(x) = 1$, and
$c=-\log Z$. From the perspective of physics, one can motivate why
$E(x)$ is called "energy" and derive these relations from other
principles, as mentioned below. Here we first motivate the relation more
minimalistically as follows:

Probabilities are *multiplicative* (axiomatically): E.g., the likelihood
of i.i.d.&nbsp;data $D = \lbrace x_i \rbrace_{i=1}^n$ is the product

$$\begin{aligned}
P(D) = \prod_{i=1}^n p(x_i) ~.\end{aligned}$$

We often want to rewrite this with a log to have an *additive*
expression

$$\begin{aligned}
E(D) = -\log P(D) = \sum_{i=1}^n E(x_i) ~,\quad E(x_i) = -\log p(x_i) ~.\end{aligned}$$

The minus is a convention so that we can call the quantity $E(D)$ a
*loss* or *error* &ndash; something we want to minimize instead of
maximize. We can show that whenever we want to define a quantity $E(x)$
that is a function of probabilities (i.e., $E(x) =
f(p(x))$ for some $f$) and that is additive, then it *needs* to be
defined as $E(x) = -\log p(x)$ (modulo a constant, where the minus is
just a convention):

Let $P(D)$, $E(D)$ be two functions over a space of sets $D$, with
properties (1) $P(D)$ is multiplicative, $P(D_1 \cup D_2) = P(D_1)
P(D_2)$, (2) $E(D)$ is additive $E(D_1\cup D_2) = E(D_1) + E(D_2)$, and
(3) there is a mapping $P(D) = \textstyle\frac 1Z f(E(D))$ between both.
Then it follows that

$$\begin{aligned}
P(D_1\cup D_2)
&= P(D_1)~ P(D_2) = \textstyle\frac 1{Z_1} f(E(D_1))~ \textstyle\frac 1{Z_2} f(E(D_2)) \\
P(D_1\cup D_2)
&= \textstyle\frac 1{Z_0} f(E(D_1\cup D_2)) = \textstyle\frac 1{Z_0} f(E(D_1) + E(D_2)) \\
\Rightarrow\quad \textstyle\frac 1{Z_1} f(E_1) \textstyle\frac 1{Z_2} f(E_2)
&= \textstyle\frac 1{Z_0} f(E_1+E_2) ~, \label{exp}\end{aligned}$$

where we defined $E_i=E(D_i)$. The only function that fulfills the last
equation for all $E_1,E_2\in{\mathbb{R}}$ is the exponential function
$f(E) = \exp\lbrace -\beta E \rbrace$ with an arbitrary coefficient
$\beta$ (the minus sign again being a convention, and $Z_0 = Z_1
Z_2$). Boiling this down to an individual element $x\in D$, we have

$$\begin{aligned}
p(x) = \textstyle\frac 1Z \exp\lbrace-\beta E(x)\rbrace ~,\quad\beta E(x) = -\log p(x) - \log Z ~.\end{aligned}$$
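
As a minimal numerical illustration (not part of the derivation, with
arbitrary example energies): on a finite state space we can turn any
energy function into a Boltzmann distribution by exponentiating and
normalizing, and recover $\beta E(x) = -\log p(x) - \log Z$ numerically.

```python
import numpy as np

# Arbitrary example energies over a finite state space {0,1,2,3}
E = np.array([0.5, 1.0, 2.0, 4.0])
beta = 1.0

Z = np.sum(np.exp(-beta * E))    # partition function (a global sum)
p = np.exp(-beta * E) / Z        # Boltzmann distribution

print(p, p.sum())                                     # normalized probabilities
print(np.allclose(beta * E, -np.log(p) - np.log(Z)))  # beta*E(x) = -log p(x) - log Z
```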

Partition Function
------------------

The normalization constant $Z$ is called the partition function. *Knowing
the partition function means knowing probabilities instead of only
energies.* This makes an important difference:

Energies $E(x)$ are typically assumed to be known (in physics: from
first principles, in AI: can be point-wise evaluated for each $x$).
However, the partition function $Z$ is typically a priori unknown, and
evaluating it is a hard global problem: The number $Z$ is a global
property of the function $E(\cdot)$, while the energy $E(x)$ of a single
state $x$ can be evaluated from properties of $x$ only.
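
To sketch this asymmetry concretely (with a made-up energy function over
binary vectors): evaluating $E(x)$ touches only $x$, whereas computing
$Z$ exactly requires summing over all $2^n$ states.

```python
import itertools
import numpy as np

def energy(x):
    """Made-up example energy of a binary vector x (local: depends only on x)."""
    x = np.asarray(x, dtype=float)
    return -np.sum(x[:-1] * x[1:]) + 0.5 * np.sum(x)

n = 12                                  # already 2**12 = 4096 states
x = np.random.randint(0, 2, size=n)
print(energy(x))                        # cheap, local evaluation

# Exact partition function: a global sum over the whole state space
Z = sum(np.exp(-energy(s)) for s in itertools.product([0, 1], repeat=n))
print(Z)                                # cost grows as 2**n
```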

To appreciate the importance of the partition function, suppose we want
to sample from a distribution $p(x)$ and we generate random samples $x$.
For each $x$ we can evaluate $p(x)$ and decide whether it is "good or
bad", e.g., whether we reject it in rejection sampling with probability
$1-p(x)$. Now consider the same situation, but we are missing the
partition function $Z$. That is, we only have access to the energy
$E(x)$, but not the probability $p(x)$. Sampling now becomes much
harder, as we have no absolute reference about which energy $E(x)$ is
actually "good or bad". Samplers in this case therefore need to resort
to relative comparison between energies at different positions, e.g.,
before and after a random step and use the Metropolis Hasting ratio to
decide whether the step was "good or bad". However, such methods are
local: the chance to randomly step from one mode to a distant other mode
may be very low (very long mixing times). If you knew the partition
function, you would have an absolute scale to estimate the probability
mass of each mode/cluster, and the *total* probability mass of all the
modes you have yet found; but without the partition function there is no
way at all to tell whether out there there might be other modes, other
clusters with even lower energies that might, in absolute terms, attract
a lot more probability mass. In this view, knowing the partition
function is fundamentally global information about the absolute scaling
of probabilities.
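
The following is a minimal Metropolis sketch (a simplified
Metropolis-Hastings with a symmetric proposal) for an example
double-well energy: it uses only energy *differences*, never $Z$, which
is exactly why it can stay stuck in one mode for a long time.

```python
import numpy as np

def energy(x):
    """Example double-well energy with two distant modes near x = -3 and x = +3."""
    return 0.1 * (x**2 - 9.0)**2

rng = np.random.default_rng(0)
x = 3.0                                  # start in one mode
samples = []
for _ in range(10000):
    x_new = x + rng.normal(scale=0.5)    # symmetric random-walk proposal
    # Metropolis acceptance: only the energy difference E(x_new) - E(x) matters
    if rng.random() < np.exp(-(energy(x_new) - energy(x))):
        x = x_new
    samples.append(x)

# Visits to the distant mode near x = -3 are rare (long mixing time)
print(np.mean(np.array(samples) > 0))
```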

Concerning the word "partition function": We clarified that knowing
$Z$ means knowing probabilities $p(x)$ instead of only energies $E(x)$.
We can say the same for partitions of the full state space
(e.g.&nbsp;modes, or energy levels): If you want to know how much
probability mass $p(i)$ is in each partition $i$, you need the partition
function (w.r.t.&nbsp;$i$). In statistical physics, partitions are often
defined as sets of states with the same energy, and the question of how
populated they are is highly relevant. This is where the word has its
origin.
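
A small sketch of this (with invented energy levels and degeneracies):
the mass in partition $i$ is $p(i) = g_i \exp\lbrace-\beta E_i\rbrace /
Z$ with degeneracy $g_i$ (the number of states in that level), and
computing it requires the same global normalization $Z$.

```python
import numpy as np

# Invented energy levels E_i and their degeneracies g_i
E_levels = np.array([0.0, 1.0, 2.0, 3.0])
g = np.array([1, 3, 5, 2])
beta = 1.0

weights = g * np.exp(-beta * E_levels)   # unnormalized mass per partition
Z = weights.sum()                        # partition function w.r.t. the levels i
p_level = weights / Z                    # probability mass in each partition i
print(p_level)
```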

Energy & Probability in Physics
-------------------------------

The set $D$ in the proof above is a set of i.i.d.&nbsp;random variables.
In physics, the analogy is a system composed of particles: Each particle
has its own state. The full-system state is combinatorial in the
particle states, and full-system state probabilities are multiplicative.
Axiomatically, each particle also has a quantity called energy, which is
additive. Now, an interesting question is why that particular quantity
that physicists call "energy" has a one-to-one relation $f$ to
probabilities &ndash; details can be found under the keyword "derivation
of canonical ensemble" (e.g.&nbsp;Wikipedia), but two brief comments may
clarify the essentials:

First, the one-to-one relation between energy and probability (in
physics) only holds in thermodynamic equilibrium. Second, systems
(such as the canonical ensemble) have a fixed total energy, which is the
sum of energies of all particles. Perhaps one can imagine that if one
particle has particularly high energy (e.g.&nbsp;almost all energy),
there is not much energy left for the other particles, which means that
they need to populate low energy states. If only discrete energy levels
exist, one can count the combinatorics of how many full-system states
exist depending on which energy levels are populated &ndash; in the
previous example the combinatorics is low because many particles are
confined to low energy states, showing that the probability for this one
particle to populate such a high energy level is low. In general,
computing these combinatorics explicitly leads to the so-called
Boltzmann distribution $p(i) \propto \exp\lbrace-\beta E_i\rbrace$ of a
particle to be in energy level $i$: Lower energy states have higher
probability, intuitively because for limited total energy more particles
can be in these states, leading to higher combinatorics of microstates.
In other words, in thermodynamic equilibrium all feasible microscopic
full-system states are equally likely, but the probability of one
particle of the ensemble to be in energy level $i$ follows the
Boltzmann distribution.
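
To make the counting argument concrete, here is a small sketch (with
invented numbers, not from the note): count the microstates of $N$
particles on discrete levels $0,\dots,L$ with fixed total energy $U$,
treat all microstates as equally likely, and read off the probability
that one tagged particle sits in level $e$. The resulting occupancies
decay roughly geometrically, i.e., Boltzmann-like.

```python
import numpy as np

N, L, U = 30, 4, 30   # invented: 30 particles, levels 0..4, fixed total energy 30

def count_states(m, s, L):
    """Number of ways m particles on levels 0..L can have total energy 0..s (exact integers)."""
    ways = np.zeros(s + 1, dtype=object)
    ways[0] = 1
    for _ in range(m):
        new = np.zeros(s + 1, dtype=object)
        for total in range(s + 1):
            new[total] = sum(ways[total - e] for e in range(L + 1) if total - e >= 0)
        ways = new
    return ways

total_states = count_states(N, U, L)[U]
# P(tagged particle in level e) = #microstates of the other N-1 particles with energy U-e
rest = count_states(N - 1, U, L)
p = np.array([rest[U - e] / total_states for e in range(L + 1)], dtype=float)
print(p)                # decays roughly like exp(-beta*e)
print(p[1:] / p[:-1])   # approximately constant ratio between neighboring levels
```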

The coefficient $\beta$ tells us how exactly energies translate to
probabilities, and should intuitively depend on the total energy of the
full system: if the full system has little total energy (is "cold"),
higher energy states should become less likely. In physics, the factor
is $\beta=\textstyle\frac{1}{k_B T}$ with temperature $T$ and the
Boltzmann constant $k_B$, which translates the system temperature to the
average (thermal) particle energy $k_B T$. That is, the word
"temperature" roughly means average particle energy, and in AI it is
often a freely chosen scaling factor for energies.
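
A quick sketch of the role of $\beta$ (equivalently, of temperature), on
made-up energies: large $\beta$ ("cold") concentrates the Boltzmann
distribution on the lowest-energy state, small $\beta$ ("hot") flattens
it; this is the same mechanism as the temperature parameter of a softmax.

```python
import numpy as np

E = np.array([0.0, 1.0, 2.0, 5.0])       # made-up energies

def boltzmann(E, beta):
    w = np.exp(-beta * E)
    return w / w.sum()                   # softmax of -beta*E

for beta in [0.1, 1.0, 10.0]:            # hot -> cold
    print(beta, boltzmann(E, beta))
```
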
15 changes: 15 additions & 0 deletions docs/notes/energy.md
@@ -0,0 +1,15 @@
---
layout: home
title: "Probabilities, Energy, Boltzmann & Partition Function"
date: 2024-08-19
tags: note
---

*[Marc Toussaint](https://www.user.tu-berlin.de/mtoussai/), Learning &
Intelligent Systems Lab, TU Berlin, {{ page.date | date: '%B %d, %Y' }}*

[[pdf version](../pdfs/energy.pdf)]

{% include_relative energy.inc %}

{% include note-footer.md %}
55 changes: 36 additions & 19 deletions docs/notes/entropy.inc
@@ -3,25 +3,26 @@ Entropy

Entropy is *expected neg-log-likelihood*, which can also be described as
expected surprise, expected code length, or expected error. The notes
below are to explain this.
below are to explain this, as well as clarify the relation between ML
objectives and minimal description length.

Let $p(x)$ be the probability or likelihood of a sample $x$.

Consider $x \in \{A,B,C,D\}$. If the distribution over these four
symbols was $p(x) = [.7, .1, .1, .1]$, then you would be less surprised
to see a sample $x=A$ than a sample $x=B$. If you want an optimal
encoding/compression of samples from this distribution, you would
allocate a shorter code for the frequent symbol $A$, and a longer code
for the infrequent symbols $B,C,D$ (cf.&nbsp;Huffman code).
symbols was $p(\cdot) = [.7, .1, .1, .1]$, then you would be less
surprised to see a sample $x=A$ than a sample $x=B$. If you want an
optimal encoding/compression of samples from this distribution, you
would allocate a shorter code for the frequent symbol $A$, and a longer
code for the infrequent symbols $B,C,D$ (cf.&nbsp;Huffman code).

The neg-log-likelihood $-\log p(x)$ is also called *surprise*. It
provides the optimal *code length* you should assign to a symbol $x$
(modulo rounding): E.g., for $p(x) = [.5, .25, .25, .0]$, $-\log_2 p(x)$
equals 1 bit for symbol $A$, 2 bits for symbols $B,C$, and $D$ never
appears. When using $\log_2$, bits is the unit of neg-log-likelihood and
can also be thought of as unit of *information*: a sample $A$ surprises
you less and only gives 1 bit of information, a sample $B$ surprises you
more and gives you 2 bits of information.
(modulo rounding): E.g., for $p(\cdot) = [.5, .25, .25, .0]$,
$-\log_2 p(x)$ equals 1 bit for symbol $x=A$, 2 bits for symbols $B$ and
$C$, and $D$ never appears. When using $\log_2$, the unit of
neg-log-likelihood is bits, which can also be thought of as a unit of *information*:
a sample $A$ surprises you less and only gives 1 bit of information, a
sample $B$ surprises you more and gives you 2 bits of information.

The entropy

@@ -30,10 +31,21 @@ H(p) &= - \sum_x p(x) \log p(x) = \mathbb{E}_{p(x)}\!\left\{-\log p(x)\right\}\end{aligned}$$

is the expected neg-log-likelihood. It is a measure of the distribution
$p$ itself, not of a specific sample. It measures how much *on average*
samples of $p(x)$ surprise you. Or: What is the average coding length of
samples of $p(x)$. Or: What is the average information given by samples
of $p(x)$. The unit of entropy can be thought of as bits (referring to
$\log_2$).
samples of $p$ surprise you. Or: What is the average coding length of
samples of $p$. Or: What is the average information given by samples of
$p$. The unit of entropy is bits (when using $\log_2$). That is, we can
specifically say "the average information given by a sample of $p$ is as
much as the information given by $H(p)$ uniform binary variables".

Sometimes another unit of information is used: The *perplexity* of $p$
is defined as $P\mkern-1pt{}P(p) = 2^{H(p)}$ (when using $\log_2$),
which is one-to-one with entropy but uses different units of
information. Note that a discrete uniform random variable of cardinality
$P\mkern-1pt{}P$ has entropy $H = \log_2 P\mkern-1pt{}P$, which explains
the definition of perplexity. Therefore, we can now say "the average
information given by a sample of $p$ is as much as the information given
by a uniform discrete random variable of cardinality
$P\mkern-1pt{}P(p)$".
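
As a small numerical sketch of these definitions (using the example
distributions from above plus a uniform one for reference):

```python
import numpy as np

def entropy_bits(p):
    """Entropy in bits; 0*log(0) is treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

for p in ([.7, .1, .1, .1], [.5, .25, .25, .0], [.25, .25, .25, .25]):
    H = entropy_bits(p)
    print(p, "entropy:", H, "bits, perplexity:", 2**H)
# The uniform distribution over 4 symbols has entropy 2 bits and perplexity 4.
```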

Given a gaussian distribution
$p(x) \propto \exp\{-{\frac{1}{2}}(x-\mu)^2/\sigma^2\}$, the
@@ -53,7 +65,7 @@ $$\begin{aligned}
H(p,q) = - \sum_x p(x) \log q(x) = \mathbb{E}_{p(x)}\!\left\{-\log q(x)\right\}\end{aligned}$$

is also an expected neg-log-likelihood, but expectation is
w.r.t.&nbsp;$p$, while the nnl is w.r.t.&nbsp;$q$. This corresponds to
w.r.t.&nbsp;$p$, while the nll is w.r.t.&nbsp;$q$. This corresponds to
the situation that samples are actually drawn from $p$, but you model or
encode them using $q$. So your measure of surprise or code length is
relative to $q$. In ML, $q$ is typically a learned distribution or
@@ -64,8 +76,8 @@ In ML, cross-entropy is often used as a classification loss function
$\ell(\bar y, q_\theta(y|x))$ between the true class label $\bar y$ and
the predicted class distribution $q_\theta(y|x)$ (for some input $x$).
For each individual data point, the true class label $\bar y$ is not
probabilistic, but we can use it to defines a one-hot-encoding
$p_{\bar y}(y) = [0,..,0,1,0,..,0]$ with $1$ for $y=\bar y$. The
probabilistic, but we can use it to define a one-hot-encoding
$p_{\bar y}(\cdot) = [0,..,0,1,0,..,0]$ with $1$ for $y=\bar y$. The
cross-entropy is then nothing but the neg-log-likelihood of the true
class label under the learned model:

@@ -123,3 +135,8 @@ than this entropy. In this view, the KL-divergence
$D\big(p\,\big\Vert\,q_\theta\big)$ is equal to the cross-entropy of
your model, but subtracting $H(p)$ as the baseline. It measures only
your error above the minimal achievable error.

The above establishes the close relation between Machine Learning and
compression or Minimal Description Length: When ML minimizes a
cross-entropy or KL-divergence, it also minimizes the expected code
length when encoding $p$-samples (data) using the learned $q$-model.
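
As a closing sketch (with made-up distributions $p$ and $q$): the
cross-entropy $H(p,q)$ is the expected code length when encoding
$p$-samples with a $q$-code, and the KL-divergence is the overhead above
the optimal $H(p)$ bits.

```python
import numpy as np

p = np.array([.7, .1, .1, .1])        # true data distribution (made up)
q = np.array([.25, .25, .25, .25])    # model used for encoding (made up)

H_p = -np.sum(p * np.log2(p))     # optimal expected code length (entropy)
H_pq = -np.sum(p * np.log2(q))    # expected code length using the q-code (cross-entropy)
KL = H_pq - H_p                   # overhead = D(p || q)

print(H_p, H_pq, KL)              # KL >= 0: the q-code never beats the p-optimal code
```
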
7 changes: 6 additions & 1 deletion docs/notes/entropy.md
@@ -1,10 +1,15 @@
---
layout: home
title: "Entropy, Information, Cross-Entropy & KL-Divergence"
title: "Entropy, Information, Cross-Entropy & ML as Minimal Description Length"
date: 2024-08-15
tags: note
---

*[Marc Toussaint](https://www.user.tu-berlin.de/mtoussai/), Learning &
Intelligent Systems Lab, TU Berlin, {{ page.date | date: '%B %d, %Y' }}*

[[pdf version](../pdfs/entropy.pdf)]

{% include_relative entropy.inc %}

{% include note-footer.md %}
7 changes: 5 additions & 2 deletions docs/notes/latex-macros.inc
@@ -174,7 +174,7 @@
%\newcommand{\argmax}[1]{{\rm arg}\!\max_{#1}}
%\newcommand{\argmin}[1]{{\rm arg}\!\min_{#1}}
\DeclareMathOperator*{\argmax}{argmax}
\DeclareMathOperator*{\argmin}{argmin}
%\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator{\sign}{sign}
\DeclareMathOperator{\acos}{acos}
\DeclareMathOperator{\unifies}{unifies}
@@ -497,6 +497,7 @@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\newcommand{\notetitle}{\begin{document}}
\usepackage{hyperref}
\newcommand{\doclink}[2]{#1 [<#2>]}
\newcommand{\codetip}[1]{\begin{shaded} Code tutorials: #1 \end{shaded}}
@@ -519,4 +520,6 @@
%% %% %\index{#1@{\hyperref[key:#1]{#1 (\arabic{mysec}:\arabic{mypage})}}|phantom}
%% %% \addtocounter{mypage}{-1}
%% }
%

%%% !!! the last line % is important! It comments the first line of the included file...
%
15 changes: 8 additions & 7 deletions docs/notes/make.sh
@@ -1,9 +1,10 @@
for input in ./*.tex
for input in ../../notes/*.tex
do
echo "=============== ${input}"
cat latex-macros.inc ${input} > z.tex
pandoc z.tex --ascii -o z.md
cat z.md > ${input%.*}.inc
rm z.md
rm z.tex
filename="${input##*/}"
echo "=============== ${input} ${filename}"
cat latex-macros.inc ${input} > z.tex
pandoc z.tex --ascii -o z.md
cat z.md > ${filename%.*}.inc
rm z.md
rm z.tex
done
11 changes: 6 additions & 5 deletions latex/macros.tex
@@ -134,8 +134,8 @@
\newcommand{\comma}{~,\quad}
\newcommand{\period}{~.\quad}
\newcommand{\del}{\partial}
\newcommand{\Del}[1]{\frac{\del}{\del #1}}
\newcommand{\Hes}[1]{\frac{\del^2}{\del #1^2}}
\newcommand{\Del}[1]{\textstyle\frac{\del}{\del #1}}
\newcommand{\Hes}[1]{\textstyle\frac{\del^2}{\del #1^2}}
% \newcommand{\quabla}{\Delta}
\newcommand{\point}{$\bullet~~$}
@@ -208,10 +208,11 @@
\newcommand{\seqq}[1]{\textsf{#1}}
\newcommand{\floor}[1]{\lfloor#1\rfloor}
\newcommand{\Exp}[2][]{\mathbb{E}_{#1}\!\left\{#2\right\}}
\newcommand{\Expno}[1][]{\mathbb{E}_{#1}}
\newcommand{\Var}[2][]{\text{Var}_{#1}\{#2\}}
\newcommand{\cov}[2][]{\text{cov}_{#1}\{#2\}}
%\newcommand{\Exp}[2]{\left\langle{#2}\right\rangle_{#1}}
%\newcommand{\Exp}[2]{\left\langle{#2}\right\rangle_{#1}}
\newcommand{\ex}{\setminus}
\providecommand{\href}[2]{{\color{blue}USE PDFLATEX!}}
@@ -302,6 +303,8 @@
\newcommand{\zT}{{\underline z}}
\newcommand{\Sum}{\textstyle\sum}
\newcommand{\Int}{\textstyle\int}
\newcommand{\Frac}{\textstyle\frac}
\newcommand{\Max}{\textstyle\max}
\newcommand{\Prod}{\textstyle\prod}
@@ -494,8 +497,6 @@
\newcommand{\step}{{\delta}}
\newcommand{\lsstop}{\r_{\text{ls}}}
\newcommand{\stepmax}{\step_{\text{max}}}
\newcommand{\Min}{\text{min}}
\newcommand{\Max}{\text{max}}
\definecolor{boxcol}{rgb}{.85,.9,.92}