
Commit

energy note

MarcToussaint committed Aug 19, 2024
1 parent 605fce4 commit 2f0d593

Showing 16 changed files with 406 additions and 86 deletions.
5 changes: 3 additions & 2 deletions docs/index.md
@@ -21,10 +21,11 @@ parts:
## Lecture Notes

<ul>
{% for page in site.pages %}
{% assign sorted = site.pages | sort: 'date' %}
{% for page in sorted %}
{% if page.tags == 'note' %}
<li>
<a href="{{site.baseurl}}{{page.url}}">{{ page.title }} ({{ page.date }})</a>
<a href="{{site.baseurl}}{{page.url}}">{{ page.title }} ({{ page.date | date: '%B %d, %Y' }})</a>
</li>
{% endif %}
{% endfor %}
10 changes: 0 additions & 10 deletions docs/notes/_hide/energy.md

This file was deleted.

155 changes: 155 additions & 0 deletions docs/notes/energy.inc
@@ -0,0 +1,155 @@
Probabilities & Energy
----------------------

Given a density $p(x)$, we call

$$\begin{aligned}
E(x) = -\log p(x) + c ~,\end{aligned}$$

(for any choice of offset $c\in{\mathbb{R}}$) an energy function.
Conversely, given an energy function $E(x)$, the corresponding density
(called Boltzmann distribution) is

$$\begin{aligned}
p(x) = \textstyle\frac 1Z \exp\lbrace-E(x)\rbrace ~,\end{aligned}$$

where $Z$ is the normalization constant to ensure $\int_x p(x) = 1$, and
$c=-\log Z$. From the perspective of physics, one can motivate why
$E(x)$ is called "energy" and derive these relations from other
principles, as mentioned below. Here we first motivate the relation more
minimalistically as follows:

Probabilities are *multiplicative* (axiomatically): E.g., the likelihood
of i.i.d.&nbsp;data $D = \lbrace x_i \rbrace_{i=1}^n$ is the product

$$\begin{aligned}
P(D) = \prod_{i=1}^n p(x_i) ~.\end{aligned}$$

We often want to rewrite this with a log to have an *additive*
expression

$$\begin{aligned}
E(D) = -\log P(D) = \sum_{i=1}^n E(x_i) ~,\quad E(x_i) = -\log p(x_i) ~.\end{aligned}$$

The minus is a convention so that we can call the quantity $E(D)$ a
*loss* or *error* &ndash; something we want to minimize instead of
maximize. We can show that whenever we want to define a quantity $E(x)$
that is a function of probabilities (i.e., $E(x) =
f(p(x))$ for some $f$) and that is additive, then it *needs* to be
defined as $E(x) = -\log p(x)$ (modulo a constant, where the minus is
just a convention):

Let $P(D)$, $E(D)$ be two functions over a space of sets $D$, with
properties (1) $P(D)$ is multiplicative, $P(D_1 \cup D_2) = P(D_1)
P(D_2)$, (2) $E(D)$ is additive $E(D_1\cup D_2) = E(D_1) + E(D_2)$, and
(3) there is a mapping $P(D) = \textstyle\frac 1Z f(E(D))$ between both.
Then it follows that

$$\begin{aligned}
P(D_1\cup D_2)
&= P(D_1)~ P(D_2) = \textstyle\frac 1{Z_1} f(E(D_1))~ \textstyle\frac 1{Z_2} f(E(D_2)) \\
P(D_1\cup D_2)
&= \textstyle\frac 1{Z_0} f(E(D_1\cup D_2)) = \textstyle\frac 1{Z_0} f(E(D_1) + E(D_2)) \\
\Rightarrow\quad \textstyle\frac 1{Z_1} f(E_1) \textstyle\frac 1{Z_2} f(E_2)
&= \textstyle\frac 1{Z_0} f(E_1+E_2) ~, \label{exp}\end{aligned}$$

where we defined $E_i=E(D_i)$. The only function that fulfills the last
equation for all $E_1,E_2\in{\mathbb{R}}$ is the exponential function
$f(E) = \exp\lbrace -\beta E \rbrace$ with an arbitrary coefficient
$\beta$ (the minus sign again being a convention, and $Z_0 = Z_1
Z_2$). Boiling this down to an individual element $x\in D$, we have

$$\begin{aligned}
p(x) = \textstyle\frac 1Z \exp\lbrace-\beta E(x)\rbrace ~,\quad\beta E(x) = -\log p(x) - \log Z ~.\end{aligned}$$
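
As a minimal numerical illustration (not part of the derivation, with
arbitrary example energies): on a finite state space we can turn any
energy function into a Boltzmann distribution by exponentiating and
normalizing, and recover $\beta E(x) = -\log p(x) - \log Z$ numerically.

```python
import numpy as np

# Arbitrary example energies over a finite state space {0,1,2,3}
E = np.array([0.5, 1.0, 2.0, 4.0])
beta = 1.0

Z = np.sum(np.exp(-beta * E))    # partition function (a global sum)
p = np.exp(-beta * E) / Z        # Boltzmann distribution

print(p, p.sum())                                     # normalized probabilities
print(np.allclose(beta * E, -np.log(p) - np.log(Z)))  # beta*E(x) = -log p(x) - log Z
```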

Partition Function
------------------

The normalization constant $Z$ is called the partition function. *Knowing
the partition function means knowing probabilities instead of only
energies.* This makes an important difference:

Energies $E(x)$ are typically assumed to be known (in physics: from
first principles, in AI: can be point-wise evaluated for each $x$).
However, the partition function $Z$ is typically a priori unknown, and
evaluating it is a hard global problem: The number $Z$ is a global
property of the function $E(\cdot)$, while the energy $E(x)$ of a single
state $x$ can be evaluated from properties of $x$ only.
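
To sketch this asymmetry concretely (with a made-up energy function over
binary vectors): evaluating $E(x)$ touches only $x$, whereas computing
$Z$ exactly requires summing over all $2^n$ states.

```python
import itertools
import numpy as np

def energy(x):
    """Made-up example energy of a binary vector x (local: depends only on x)."""
    x = np.asarray(x, dtype=float)
    return -np.sum(x[:-1] * x[1:]) + 0.5 * np.sum(x)

n = 12                                  # already 2**12 = 4096 states
x = np.random.randint(0, 2, size=n)
print(energy(x))                        # cheap, local evaluation

# Exact partition function: a global sum over the whole state space
Z = sum(np.exp(-energy(s)) for s in itertools.product([0, 1], repeat=n))
print(Z)                                # cost grows as 2**n
```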

To appreciate the importance of the partition function, suppose we want
to sample from a distribution $p(x)$ and we generate random samples $x$.
For each $x$ we can evaluate $p(x)$ and decide whether it is "good or
bad", e.g., whether we reject it in rejection sampling with probability
$1-p(x)$. Now consider the same situation, but we are missing the
partition function $Z$. That is, we only have access to the energy
$E(x)$, but not the probability $p(x)$. Sampling now becomes much
harder, as we have no absolute reference about which energy $E(x)$ is
actually "good or bad". Samplers in this case therefore need to resort
to relative comparison between energies at different positions, e.g.,
before and after a random step and use the Metropolis Hasting ratio to
decide whether the step was "good or bad". However, such methods are
local: the chance to randomly step from one mode to a distant other mode
may be very low (very long mixing times). If you knew the partition
function, you would have an absolute scale to estimate the probability
mass of each mode/cluster, and the *total* probability mass of all the
modes you have yet found; but without the partition function there is no
way at all to tell whether out there there might be other modes, other
clusters with even lower energies that might, in absolute terms, attract
a lot more probability mass. In this view, knowing the partition
function is fundamentally global information about the absolute scaling
of probabilities.
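
The following is a minimal Metropolis sketch (a simplified
Metropolis-Hastings with a symmetric proposal) for an example
double-well energy: it uses only energy *differences*, never $Z$, which
is exactly why it can stay stuck in one mode for a long time.

```python
import numpy as np

def energy(x):
    """Example double-well energy with two distant modes near x = -3 and x = +3."""
    return 0.1 * (x**2 - 9.0)**2

rng = np.random.default_rng(0)
x = 3.0                                  # start in one mode
samples = []
for _ in range(10000):
    x_new = x + rng.normal(scale=0.5)    # symmetric random-walk proposal
    # Metropolis acceptance: only the energy difference E(x_new) - E(x) matters
    if rng.random() < np.exp(-(energy(x_new) - energy(x))):
        x = x_new
    samples.append(x)

# Visits to the distant mode near x = -3 are rare (long mixing time)
print(np.mean(np.array(samples) > 0))
```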

Concerning the word "partition function": We clarified that knowing
$Z$ means knowing probabilities $p(x)$ instead of only energies $E(x)$.
We can say the same for partitions of the full state space
(e.g.&nbsp;modes, or energy levels): If you want to know how much
probability mass $p(i)$ is in each partition $i$, you need the partition
function (w.r.t.&nbsp;$i$). In statistical physics, partitions are often
defined as sets of states with the same energy, and the question of how
populated they are is highly relevant. This is where the word has its
origin.
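
A small sketch of this (with invented energy levels and degeneracies):
the mass in partition $i$ is $p(i) = g_i \exp\lbrace-\beta E_i\rbrace /
Z$ with degeneracy $g_i$ (the number of states in that level), and
computing it requires the same global normalization $Z$.

```python
import numpy as np

# Invented energy levels E_i and their degeneracies g_i
E_levels = np.array([0.0, 1.0, 2.0, 3.0])
g = np.array([1, 3, 5, 2])
beta = 1.0

weights = g * np.exp(-beta * E_levels)   # unnormalized mass per partition
Z = weights.sum()                        # partition function w.r.t. the levels i
p_level = weights / Z                    # probability mass in each partition i
print(p_level)
```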

Energy & Probability in Physics
-------------------------------

The set $D$ in the proof above is a set of i.i.d.&nbsp;random variables.
In physics, the analogy is a system composed of particles: Each particle
has its own state. The full-system state is combinatorial in the
particle states, and full-system state probabilities are multiplicative.
Axiomatically, each particle also has a quantity called energy, which is
additive. Now, an interesting question is why that particular quantity
that physicists call "energy" has a one-to-one relation $f$ to
probabilities &ndash; details can be found under the keyword "derivation
of canonical ensemble" (e.g.&nbsp;Wikipedia), but two brief comments may
clarify the essentials:

First, the one-to-one relation between energy and probability (in
physics) only holds in thermodynamic equilibrium. Second, systems
(such as the canonical ensemble) have a fixed total energy, which is the
sum of energies of all particles. Perhaps one can imagine that if one
particle has particularly high energy (e.g.&nbsp;almost all energy),
there is not much energy left for the other particles, which means that
they need to populate low energy states. If only discrete energy levels
exist, one can count the combinatorics of how many full-system states
exist depending on which energy levels are populated &ndash; in the
previous example the combinatorics is low because many particles are
confined to low energy states, showing that the probability for this one
particle to populate such a high energy level is low. In general,
computing these combinatorics explicitly leads to the so-called
Boltzmann distribution $p(i) \propto \exp\lbrace-\beta E_i\rbrace$ of a
particle to be in energy level $i$: Lower energy states have higher
probability, intuitively because for limited total energy more particles
can be in these states, leading to higher combinatorics of microstates.
In other words, in thermodynamic equilibrium all feasible microscopic
full-system states are equally likely, but the probability of one
particle of the ensemble to be in energy level $i$ follows the
Boltzmann distribution.
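
To make the counting argument concrete, here is a small sketch (with
invented numbers, not from the note): count the microstates of $N$
particles on discrete levels $0,\dots,L$ with fixed total energy $U$,
treat all microstates as equally likely, and read off the probability
that one tagged particle sits in level $e$. The resulting occupancies
decay roughly geometrically, i.e., Boltzmann-like.

```python
import numpy as np

N, L, U = 30, 4, 30   # invented: 30 particles, levels 0..4, fixed total energy 30

def count_states(m, s, L):
    """Number of ways m particles on levels 0..L can have total energy 0..s (exact integers)."""
    ways = np.zeros(s + 1, dtype=object)
    ways[0] = 1
    for _ in range(m):
        new = np.zeros(s + 1, dtype=object)
        for total in range(s + 1):
            new[total] = sum(ways[total - e] for e in range(L + 1) if total - e >= 0)
        ways = new
    return ways

total_states = count_states(N, U, L)[U]
# P(tagged particle in level e) = #microstates of the other N-1 particles with energy U-e
rest = count_states(N - 1, U, L)
p = np.array([rest[U - e] / total_states for e in range(L + 1)], dtype=float)
print(p)                # decays roughly like exp(-beta*e)
print(p[1:] / p[:-1])   # approximately constant ratio between neighboring levels
```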

The coefficient $\beta$ tells us how exactly energies translate to
probabilities, and should intuitively depend on the total energy of the
full system: if the full system has little total energy (is "cold"),
higher energy states should become less likely. In physics, the factor
is $\beta=\textstyle\frac{1}{k_B T}$ with temperature $T$ and the
Boltzmann constant $k_B$, which translates the system temperature to the
average (thermal) particle energy $k_B T$. That is, the word
"temperature" roughly means average particle energy, and in AI it is
often a freely chosen scaling factor for energies.
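
A quick sketch of the role of $\beta$ (equivalently, of temperature), on
made-up energies: large $\beta$ ("cold") concentrates the Boltzmann
distribution on the lowest-energy state, small $\beta$ ("hot") flattens
it; this is the same mechanism as the temperature parameter of a softmax.

```python
import numpy as np

E = np.array([0.0, 1.0, 2.0, 5.0])       # made-up energies

def boltzmann(E, beta):
    w = np.exp(-beta * E)
    return w / w.sum()                   # softmax of -beta*E

for beta in [0.1, 1.0, 10.0]:            # hot -> cold
    print(beta, boltzmann(E, beta))
```
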
15 changes: 15 additions & 0 deletions docs/notes/energy.md
@@ -0,0 +1,15 @@
---
layout: home
title: "Probabilities, Energy, Boltzmann & Partition Function"
date: 2024-08-19
tags: note
---

*[Marc Toussaint](https://www.user.tu-berlin.de/mtoussai/), Learning &
Intelligent Systems Lab, TU Berlin, {{ page.date | date: '%B %d, %Y' }}*

[[pdf version](../pdfs/energy.pdf)]

{% include_relative energy.inc %}

{% include note-footer.md %}
55 changes: 36 additions & 19 deletions docs/notes/entropy.inc
@@ -3,25 +3,26 @@ Entropy

Entropy is *expected neg-log-likelihood*, which can also be described as
expected surprise, expected code length, or expected error. The notes
below are to explain this.
below are to explain this, as well as clarify the relation between ML
objectives and minimal description length.

Let $p(x)$ be the probability or likelihood of a sample $x$.

Consider $x \in \{A,B,C,D\}$. If the distribution over these four
symbols was $p(x) = [.7, .1, .1, .1]$, then you would be less surprised
to see a sample $x=A$ than a sample $x=B$. If you want an optimal
encoding/compression of samples from this distribution, you would
allocate a shorter code for the frequent symbol $A$, and a longer code
for the infrequent symbols $B,C,D$ (cf.&nbsp;Huffman code).
symbols was $p(\cdot) = [.7, .1, .1, .1]$, then you would be less
surprised to see a sample $x=A$ than a sample $x=B$. If you want an
optimal encoding/compression of samples from this distribution, you
would allocate a shorter code for the frequent symbol $A$, and a longer
code for the infrequent symbols $B,C,D$ (cf.&nbsp;Huffman code).

The neg-log-likelihood $-\log p(x)$ is also called *surprise*. It
provides the optimal *code length* you should assign to a symbol $x$
(modulo rounding): E.g., for $p(x) = [.5, .25, .25, .0]$, $-\log_2 p(x)$
equals 1 bit for symbol $A$, 2 bits for symbols $B,C$, and $D$ never
appears. When using $\log_2$, bits is the unit of neg-log-likelihood and
can also be thought of as unit of *information*: a sample $A$ surprises
you less and only gives 1 bit of information, a sample $B$ surprises you
more and gives you 2 bits of information.
(modulo rounding): E.g., for $p(\cdot) = [.5, .25, .25, .0]$,
$-\log_2 p(x)$ equals 1 bit for symbol $x=A$, 2 bits for symbols $B$ and
$C$, and $D$ never appears. When using $\log_2$, the unit of
neg-log-likelihood is bits, which can also be thought of as a unit of *information*:
a sample $A$ surprises you less and only gives 1 bit of information, a
sample $B$ surprises you more and gives you 2 bits of information.

The entropy

@@ -30,10 +31,21 @@ H(p) &= - \sum_x p(x) \log p(x) = \mathbb{E}_{p(x)}\!\left\{-\log p(x)\right\}\end{aligned}$$

is the expected neg-log-likelihood. It is a measure of the distribution
$p$ itself, not of a specific sample. It measures how much *on average*
samples of $p(x)$ surprise you. Or: What is the average coding length of
samples of $p(x)$. Or: What is the average information given by samples
of $p(x)$. The unit of entropy can be thought of as bits (referring to
$\log_2$).
samples of $p$ surprise you. Or: What is the average coding length of
samples of $p$. Or: What is the average information given by samples of
$p$. The unit of entropy is bits (when using $\log_2$). That is, we can
specifically say "the average information given by a sample of $p$ is as
much as the information given by $H(p)$ uniform binary variables".

Sometimes another unit of information is used: The *perplexity* of $p$
is defined as $P\mkern-1pt{}P(p) = 2^{H(p)}$ (when using $\log_2$),
which is one-to-one with entropy but uses different units of
information. Note that a discrete uniform random variable of cardinality
$P\mkern-1pt{}P$ has entropy $H = \log_2 P\mkern-1pt{}P$, which explains
the definition of perplexity. Therefore, we can now say "the average
information given by a sample of $p$ is as much as the information given
by a uniform discrete random variable of cardinality
$P\mkern-1pt{}P(p)$".
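
As a small numerical sketch of these definitions (using the example
distributions from above plus a uniform one for reference):

```python
import numpy as np

def entropy_bits(p):
    """Entropy in bits; 0*log(0) is treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

for p in ([.7, .1, .1, .1], [.5, .25, .25, .0], [.25, .25, .25, .25]):
    H = entropy_bits(p)
    print(p, "entropy:", H, "bits, perplexity:", 2**H)
# The uniform distribution over 4 symbols has entropy 2 bits and perplexity 4.
```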

Given a gaussian distribution
$p(x) \propto \exp\{-{\frac{1}{2}}(x-\mu)^2/\sigma^2\}$, the
@@ -53,7 +65,7 @@ $$\begin{aligned}
H(p,q) = - \sum_x p(x) \log q(x) = \mathbb{E}_{p(x)}\!\left\{-\log q(x)\right\}\end{aligned}$$

is also an expected neg-log-likelihood, but expectation is
w.r.t.&nbsp;$p$, while the nnl is w.r.t.&nbsp;$q$. This corresponds to
w.r.t.&nbsp;$p$, while the nll is w.r.t.&nbsp;$q$. This corresponds to
the situation that samples are actually drawn from $p$, but you model or
encode them using $q$. So your measure of surprise or code length is
relative to $q$. In ML, $q$ is typically a learned distribution or
@@ -64,8 +76,8 @@ In ML, cross-entropy is often used as a classification loss function
$\ell(\bar y, q_\theta(y|x))$ between the true class label $\bar y$ and
the predicted class distribution $q_\theta(y|x)$ (for some input $x$).
For each individual data point, the true class label $\bar y$ is not
probabilistic, but we can use it to defines a one-hot-encoding
$p_{\bar y}(y) = [0,..,0,1,0,..,0]$ with $1$ for $y=\bar y$. The
probabilistic, but we can use it to define a one-hot-encoding
$p_{\bar y}(\cdot) = [0,..,0,1,0,..,0]$ with $1$ for $y=\bar y$. The
cross-entropy is then nothing but the neg-log-likelihood of the true
class label under the learned model:

@@ -123,3 +135,8 @@ than this entropy. In this view, the KL-divergence
$D\big(p\,\big\Vert\,q_\theta\big)$ is equal to the cross-entropy of
your model, but subtracting $H(p)$ as the baseline. It measures only
your error above the minimal achievable error.

The above establishes the close relation between Machine Learning and
compression or Minimal Description Length: When ML minimizes a
cross-entropy or KL-divergence, it also minimizes the expected code
length when encoding $p$-samples (data) using the learned $q$-model.
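
As a closing sketch (with made-up distributions $p$ and $q$): the
cross-entropy $H(p,q)$ is the expected code length when encoding
$p$-samples with a $q$-code, and the KL-divergence is the overhead above
the optimal $H(p)$ bits.

```python
import numpy as np

p = np.array([.7, .1, .1, .1])        # true data distribution (made up)
q = np.array([.25, .25, .25, .25])    # model used for encoding (made up)

H_p = -np.sum(p * np.log2(p))     # optimal expected code length (entropy)
H_pq = -np.sum(p * np.log2(q))    # expected code length using the q-code (cross-entropy)
KL = H_pq - H_p                   # overhead = D(p || q)

print(H_p, H_pq, KL)              # KL >= 0: the q-code never beats the p-optimal code
```
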
7 changes: 6 additions & 1 deletion docs/notes/entropy.md
@@ -1,10 +1,15 @@
---
layout: home
title: "Entropy, Information, Cross-Entropy & KL-Divergence"
title: "Entropy, Information, Cross-Entropy & ML as Minimal Description Length"
date: 2024-08-15
tags: note
---

*[Marc Toussaint](https://www.user.tu-berlin.de/mtoussai/), Learning &
Intelligent Systems Lab, TU Berlin, {{ page.date | date: '%B %d, %Y' }}*

[[pdf version](../pdfs/entropy.pdf)]

{% include_relative entropy.inc %}

{% include note-footer.md %}
7 changes: 5 additions & 2 deletions docs/notes/latex-macros.inc
@@ -174,7 +174,7 @@
%\newcommand{\argmax}[1]{{\rm arg}\!\max_{#1}}
%\newcommand{\argmin}[1]{{\rm arg}\!\min_{#1}}
\DeclareMathOperator*{\argmax}{argmax}
\DeclareMathOperator*{\argmin}{argmin}
%\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator{\sign}{sign}
\DeclareMathOperator{\acos}{acos}
\DeclareMathOperator{\unifies}{unifies}
@@ -497,6 +497,7 @@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\newcommand{\notetitle}{\begin{document}}
\usepackage{hyperref}
\newcommand{\doclink}[2]{#1 [<#2>]}
\newcommand{\codetip}[1]{\begin{shaded} Code tutorials: #1 \end{shaded}}
@@ -519,4 +520,6 @@
%% %% %\index{#1@{\hyperref[key:#1]{#1 (\arabic{mysec}:\arabic{mypage})}}|phantom}
%% %% \addtocounter{mypage}{-1}
%% }
%

%%% !!! the last line % is important! It comments the first line of the included file...
%
15 changes: 8 additions & 7 deletions docs/notes/make.sh
@@ -1,9 +1,10 @@
for input in ./*.tex
for input in ../../notes/*.tex
do
echo "=============== ${input}"
cat latex-macros.inc ${input} > z.tex
pandoc z.tex --ascii -o z.md
cat z.md > ${input%.*}.inc
rm z.md
rm z.tex
filename="${input##*/}"
echo "=============== ${input} ${filename}"
cat latex-macros.inc ${input} > z.tex
pandoc z.tex --ascii -o z.md
cat z.md > ${filename%.*}.inc
rm z.md
rm z.tex
done
11 changes: 6 additions & 5 deletions latex/macros.tex
@@ -134,8 +134,8 @@
\newcommand{\comma}{~,\quad}
\newcommand{\period}{~.\quad}
\newcommand{\del}{\partial}
\newcommand{\Del}[1]{\frac{\del}{\del #1}}
\newcommand{\Hes}[1]{\frac{\del^2}{\del #1^2}}
\newcommand{\Del}[1]{\textstyle\frac{\del}{\del #1}}
\newcommand{\Hes}[1]{\textstyle\frac{\del^2}{\del #1^2}}
% \newcommand{\quabla}{\Delta}
\newcommand{\point}{$\bullet~~$}
@@ -208,10 +208,11 @@
\newcommand{\seqq}[1]{\textsf{#1}}
\newcommand{\floor}[1]{\lfloor#1\rfloor}
\newcommand{\Exp}[2][]{\mathbb{E}_{#1}\!\left\{#2\right\}}
\newcommand{\Expno}[1][]{\mathbb{E}_{#1}}
\newcommand{\Var}[2][]{\text{Var}_{#1}\{#2\}}
\newcommand{\cov}[2][]{\text{cov}_{#1}\{#2\}}
%\newcommand{\Exp}[2]{\left\langle{#2}\right\rangle_{#1}}
%\newcommand{\Exp}[2]{\left\langle{#2}\right\rangle_{#1}}
\newcommand{\ex}{\setminus}
\providecommand{\href}[2]{{\color{blue}USE PDFLATEX!}}
@@ -302,6 +303,8 @@
\newcommand{\zT}{{\underline z}}
\newcommand{\Sum}{\textstyle\sum}
\newcommand{\Int}{\textstyle\int}
\newcommand{\Frac}{\textstyle\frac}
\newcommand{\Max}{\textstyle\max}
\newcommand{\Prod}{\textstyle\prod}
@@ -494,8 +497,6 @@
\newcommand{\step}{{\delta}}
\newcommand{\lsstop}{\r_{\text{ls}}}
\newcommand{\stepmax}{\step_{\text{max}}}
\newcommand{\Min}{\text{min}}
\newcommand{\Max}{\text{max}}
\definecolor{boxcol}{rgb}{.85,.9,.92}