# Representation Learning {#sec-representation_learning}
## Introduction
In this book we have seen many ways to represent visual signals: in the
spatial domain versus frequency domain, with pyramids and filter
responses, and more. We have seen that the choice of representation is
critical: each type of representation makes some operations easy and
others hard. We also saw that deep neural networks can be thought of as
transforming the data, layer by layer, from one representation into
another and another. If representations are so important, why not set
our objective to be "come up with the best representation possible." In
this chapter, we will try to do exactly that.
This chapter is primarily an investigation of *objectives*. We will
identify different objective functions that capture properties of what
it means to be a good representation. Then we will train deep nets to
optimize these objectives, and investigate the resulting learned
representations.
## Problem Setting
Before diving in, let us present the basic problem statement and the
notation we will be using throughout this chapter.
The goal of representation learning is to learn to map from datapoints,
$\mathbf{x} \in \mathcal{X}$, to abstract representations,
$\mathbf{z} \in \mathcal{Z}$, as schematized in
@fig-representation_learning-rep_learning_schematic:
![The goal in this chapter is to learn to map from datapoints to abstract representations, which are typically simpler and more useful than the raw data.](figures/representation_learning/rep_learning_schematic.png){width="52%" #fig-representation_learning-rep_learning_schematic}
We call this mapping an **encoding** and learn a function
$f: \mathcal{X} \rightarrow \mathcal{Z}$. Typically both $\mathbf{x}$
and $\mathbf{z}$ are high-dimensional vectors, and $\mathbf{z}$ is
called a **vector embedding** of $\mathbf{x}$. The mapping $f$ is trained so that
$\mathbf{z}$ has certain desirable properties: common desiderata include
that $\mathbf{z}$ be lower dimensional than $\mathbf{x}$; that the
distribution of $\mathbf{z}$ values, that is, $p(\mathbf{z})$, has a
simple structure (such as being the unit normal distribution); and that
the dimensions of $\mathbf{z}$ be independent factors of variation, that
is, the representation should be
**disentangled** @bengio2013representation. In this way $\mathcal{Z}$ is
a simpler, more abstracted, or better organized representational space
than $\mathcal{X}$, as reflected in the shapes in
@fig-representation_learning-rep_learning_schematic.
## What Makes for a Good Representation?
A good representation is one that makes subsequent problem solving
easier. Below we discuss several desiderata of a good representation.
### Compression
A good representation is one that is parsimonious and captures just the
essential characteristics of the data necessary for tasks of human
interest. There are at least three senses in which compressed
representations are desirable.
1. Compressed representations require less memory to store.
2. Compression is a way to achieve invariance to nuisance factors.
Depending on what the representation will be used for, nuisances
might include camera noise, lighting conditions, and viewing angle.
If we can factor out those nuisances and discard them from our
representation, then the representation will be more useful for
tasks like object recognition where camera and lighting may change
without affecting the semantics of the observed object.
3. Compression is an embodiment of *Occam's razor*: among competing
hypotheses that all explain the data equally well, the simplest is
most likely to be true. As we will see below, and in the next
chapter, many representation learning algorithms seek simple
representations from which you can regenerate the raw data (e.g.,
autoencoders). Such representations explain the data in the formal
sense that they assign high likelihood to the data. This is the same
sense in which the idea of Bayesian Occam's razor applies (@sec-problem_of_generalization-star_problem_revisited);
see @mackay2003information for a mathematical treatment of this
idea. Therefore, we can state Occam's razor for representation
learning: among two representations that fit the data equally well,
give preference to the more compressed.
:::{.column-margin}
This version of Occam's razor is also known as the **minimum description length principle** @grunwald2007minimum.
:::
Many representation learning algorithms capture these goals by
penalizing or constraining the complexity of the learned representation.
Such a penalty must be paired with some other objective or else it will
lead to a degenerate solution: just make the representation contain
nothing at all. Usually we try to compress the signal in certain ways
while preserving other aspects of the information it carries.
*Autoencoders* and *contrastive learning*, which we will describe
subsequently, are two representation learning algorithms based on the
idea of compression.
### Prediction
The whole purpose of having a visual system is to be able to take
actions that achieve desirable future outcomes. Predicting the future is
therefore very important, and we often want representations that act as
a good substrate for prediction.
The idea of prediction can be generalized to involve more than just
predicting the *future*. Instead we may want to predict the past (i.e.,
given what I'm seeing today, what happened yesterday?), or we may want
to predict gaps in our measurements, a problem known as **imputation**.
Filling in missing pixels in a corrupted image is an example of
imputation, as is superresolving a movie to play at a higher framerate.
We may even want to perform more abstract kinds of predictions, like
predicting what some other agent is thinking about (a problem called
theory of mind). In general, prediction can refer to the inference of
any arbitrary property of the world given observed data. Facilitating
the prediction of important world properties---the future, the past,
mental states, cause and effect, and so on---is perhaps the defining
property of a good representation.
Most representation learning algorithms in vision are about learning
compressed encodings of the world that are also predictive of the future
(and therefore a good substrate for decision making). Beyond just
compression and prediction, some works have explored other goals,
including that a representation be
disentangled @bengio2013representation, interpretable @koh2020concept,
and actionable @soatto2013actionable (you can use it as an effective
substrate for control). All these goals---compression, prediction,
interpretability, and so on---are not in contrast to each other but
rather overlap; for example, more compressed representations may be
simpler and hence easier for a human to interpret.
### Types of Representation Learners
Representation learning algorithms are mostly differentiated by the
objective function they optimize, as well as the constraints on their
hypothesis space. We explore a variety of common objectives in the
following sections, and summarize how these relate to the core
principles of compression and prediction in @tbl-representation_learning-types_of_representation_learners.
:::{#tbl-representation_learning-types_of_representation_learners}
| **Learning Method** | **Learning Principle** | **Short Summary** |
|----------------------|------------------------|--------------------------------------------------------|
| Autoencoding | Compression | Remove redundant information |
| Contrastive | Compression | Achieve invariance to viewing transformations |
| Clustering | Compression | Quantize continuous data into discrete categories |
| Future prediction | Prediction | Predict the future |
| Imputation | Prediction | Predict missing data |
| Pretext tasks | Prediction | Predict abstract properties of your data |
One way to categorize different representation learning methods.
:::
## Autoencoders {#sec-representation_learning-autoencoders}
**Autoencoders** are one of the oldest and most common kinds of representation
learners @rumelhart1985learning, @ballard1987modular. An autoencoder is
a function that maps data back to itself (hence the "auto"), but via a
low-dimensional representational **bottleneck**, as shown in
@fig-representation_learning-autoencoder_diagram:
![(left) An autoencoder maps from points in data space to points in representation space, and back. (right) An example of running an autoencoder on an input image of a bird.](figures/representation_learning/autoencoder_diagram.png){width="100%" #fig-representation_learning-autoencoder_diagram}
On the right of this figure is an example of an autoencoder applied to
an image. At first glance this might not seem very useful: the output is
the same as the input! The key trick is to impose constraints on the
intermediate representation $\mathbf{z}$, so that it becomes useful. The
most common constraint is compression: $\mathbf{z}$ is a
low-dimensional, compressed representation of $\mathbf{x}$.
To be precise, an autoencoder $F$ consists of two parts, an **encoder**
$f$ and a **decoder** $g$, with $F = g \circ f$. The encoder,
$f: \mathbb{R}^N \rightarrow \mathbb{R}^M$, maps high-dimensional data
$\mathbf{x} \in \mathbb{R}^N$ to a vector embedding
$\mathbf{z} \in \mathbb{R}^M$. Typically, the key property is that
$M < N$, that is, we have performed **dimensionality reduction** (although autoencoders can also be
constructed without this property, in which case they are not doing
dimensionality reduction but instead may place other explicit or
implicit constraints on $\mathbf{z}$). The decoder,
$g: \mathbb{R}^M \rightarrow \mathbb{R}^N$ performs the inverse mapping
to $f$, and ideally $g$ is exactly the inverse function $f^{-1}$.
Because it may be impossible to perfectly invert $f$, we use a loss
function to penalize how far we are from perfect inversion, and a
typical choice is the squared error **reconstruction loss**,
$\mathcal{L}_{\texttt{recon}}(\hat{\mathbf{x}}, \mathbf{x}) = \left\lVert\hat{\mathbf{x}} - \mathbf{x}\right\rVert_2^2$.
Then the learning problem is:
$$f^*,g^* = \mathop{\mathrm{arg\,min}}_{f,g} \mathbb{E}_{\mathbf{x}} \left\lVert g(f(\mathbf{x})) - \mathbf{x}\right\rVert_2^2 $${#eq-representation_learning-autoencoder_learning_problem}
The idea is to find a lower dimensional representation of the data from
which we are able to reconstruct the data. An autoencoder is fine with
throwing away any redundant features of the raw data but is not happy
with actually losing information. The particular loss function
determines what kind of information is preferred when the bottleneck
cannot support preserving *all* the information and a hard choice has to
be made. The $L_2$ loss decomposes as
$\left\lVert g(f(\mathbf{x})) - \mathbf{x}\right\rVert_2^2 = \sum_i (g(f(\mathbf{x}))_i - x_i)^2$,
that is, a sum over individual pixel errors. This means that it only
cares about matching each individual pixel intensity
$g(f(\mathbf{x}))_i$ to the ground truth $x_i$ and does not directly
penalize patch-level statistics of $\mathbf{x}$ (i.e., statistics
$\phi(\mathbf{x})$ that do not factorize as a sum $\sum_i \psi_i(x_i)$
over per-pixel functions $\psi_i$ for any possible set of functions
$\{\psi_i\}$). Autoencoders can also be constructed using loss functions
that penalize higher-order statistics of $\mathbf{x}$, and this allows,
for example, penalizing errors in the reconstruction of edges, textures,
and other perceptual structures beyond just pixels @snell2017learning.
The learning diagram for the basic $L_2$ autoencoder looks like this:
![](figures/representation_learning/autoencoder_learning_diagram.png){width="100%"}
The optimizer is arbitrary but a typical choice would be gradient
descent. The functional form of $f$ and $g$ are typically deep neural
nets. The output is a learned data encoder $f$. We also get a learned
decoder $g$ as a byproduct, which has its own uses, but, for the purpose
of representation learning, is usually discarded and only used as
scaffolding for training what we really care about, namely $f$.
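To make this concrete, here is a minimal sketch of an $L_2$ autoencoder and one training step, written in PyTorch. The fully connected architecture, dimensions, and optimizer settings are illustrative assumptions, not the models used later in this chapter.

```python
# Minimal sketch of an L2 autoencoder (illustrative architecture and sizes).
import torch
import torch.nn as nn

N, M = 784, 32  # data dimensionality and bottleneck dimensionality (M < N)

f = nn.Sequential(nn.Linear(N, 256), nn.ReLU(), nn.Linear(256, M))  # encoder
g = nn.Sequential(nn.Linear(M, 256), nn.ReLU(), nn.Linear(256, N))  # decoder

opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)

def train_step(x):                                 # x: batch of flattened images, [B, N]
    x_hat = g(f(x))                                # reconstruct via the bottleneck
    loss = ((x_hat - x) ** 2).sum(dim=1).mean()    # squared-error reconstruction loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```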
Closely related to autoencoders are other dimensionality reduction
algorithms like **principal components analysis** (**PCA**). In fact,
an $L_2$ autoencoder, for which both $f$ and $g$ are linear functions,
learns an $M$-dimensional embedding that spans the same subspace as a
PCA projection to $M$ dimensions @bourlard1988auto.
:::{.column-margin}
Autoencoders may not seem like much at first glance, but they actually appear all over the place, and many methods in this book can be considered to be special kinds of autoencoders, if you squint. Two examples: the steerable pyramid from @sec-image_pyramids is an autoencoder, and the CycleGAN algorithm from @sec-generative_models is also an autoencoder. See if you can find more examples.
:::
### Experiment: Do Autoencoders Learn Useful Representations?
It is clear from the above that autoencoders will learn a *compressed*
representation of the data, but do they learn a *useful* representation?
Of course the answer to this question depends on what we will use the
representation for. Let's explore how autoencoders work on a simple data
domain, consisting just of colored circles, triangles, and squares. The
data consists of 64,000 images, samples of which are shown in
@fig-representation_learning-shapes_dataset_random_samples.
![A sample from the toy dataset we will work with in this chapter.](figures/representation_learning/shapes_dataset_random_samples.png){width="27.5%" #fig-representation_learning-shapes_dataset_random_samples}
Each shape has a randomized size, position, and rotation. Each shape's
color is sampled from one of eight color classes (orange, green, purple,
etc.) plus a small random perturbation.
For the autoencoder architecture we use a convolutional encoder and
decoder, each with six convolutional layers interspersed with ReLU
nonlinearities and a 128-dimensional bottleneck (i.e., $M=128$). We
train this autoencoder for 20,000 steps of stochastic gradient descent,
using the Adam optimizer @kingma2014adam with a batch size of 128.
After training, does this autoencoder obtain a good representation of
the data? To answer this question, we need ways of evaluating the
quality of a representation. There are many ways and indeed how to
evaluate representations is an open area of research. But here we will
stick with a very simple approach: see if the nearest neighbors, in
representational space, are meaningful.
We can test this in two ways: (1) for a given query, visualize the
images in the dataset whose embeddings are nearest neighbors to the
query's embedding, and (2) measure the accuracy of a
one-nearest-neighbor classifier in embedding space. Below, in
@fig-representation_learning-AE_results_shapes_dataset, we show both
these analyses.
:::{#fig-representation_learning-AE_results_shapes_dataset layout-ncol=2}
![](figures/representation_learning/AE_NN_viz.png){ #fig-representation_learning-AE_results_shapes_dataset-a}
![](figures/representation_learning/AE_NN_probe.png){ #fig-representation_learning-AE_results_shapes_dataset-b}
(a) Nearest neighbors, in autoencoder embedding space, to a set of query images. (b) Classification accuracy of a one-nearest-neighbor classifier of color and shape using the embeddings at each layer of the autoencoder's encoder.
:::
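The one-nearest-neighbor probe in (b) is simple to implement; here is a sketch assuming NumPy arrays of embeddings and attribute labels (the variable names are placeholders).

```python
# Sketch of a one-nearest-neighbor probe in embedding space.
import numpy as np

def one_nn_accuracy(emb_train, y_train, emb_test, y_test):
    # Pairwise squared Euclidean distances between test and train embeddings.
    d2 = ((emb_test[:, None, :] - emb_train[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)                  # index of each test point's nearest neighbor
    return (y_train[nearest] == y_test).mean()   # fraction of matching labels
```

The same function can be applied to the embeddings at any layer of the encoder (layer 0 being the raw pixels) to produce the accuracy curves in the figure.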
Recall that every layer of a neural net can be considered as an
embedding (representation) of the data. On the left we show the nearest
neighbors to a set of query images, using the layer 6 embeddings of the
data as the feature space in which to measure distance (i.e., nearness).
Since we have a six-layer encoder, layer 6 is the bottleneck layer, the
output of the full encoder $f$. Notice that the neighbors are indeed
similar to the queries in terms of their colors, shapes, positions, and
rotations; it seems the autoencoder produced a meaningful representation
of the data!
Next we will probe a bit deeper, and ask, how effective are these
embeddings at classifying key properties of the data? On the right we
show the accuracy of a one-nearest-neighbor color classifier (between
the eight color classes) and a shape classifier (circle vs. triangle vs.
square) applied on embedding vectors at each layer of the autoencoder's
encoder. The zeroth layer corresponds to measuring distance in the raw
pixel space; interestingly, the color classifier does its best on this
raw representation of the data. That's because the pixels are a more or
less direct representation of color. Color classification performance
gets worse and worse as we go deeper in the encoder. Conversely, shape
is not explicit in raw pixels, so measuring distance in pixel-space does
not yield good shape nearest neighbors. Deeper layers of the encoder
give representations that are increasingly sensitive to shape
similarity, and shape classification performance gets better. In
general, there is no one representation that is universally the best.
Each is good at capturing some properties of the data and bad at
capturing others. As we go deeper into an autoencoder, the embeddings
tend to become more abstracted and therefore better at capturing
abstract properties like shape and worse at capturing superficial
properties like color. For many tasks---object recognition, geometry
understanding, future prediction---the salient information is rather
abstracted compared to the raw data, and therefore for these tasks
deeper embeddings tend to work better.
## Predictive Encodings
We have already seen many kinds of predictive learning, indeed almost any function can be thought of as making a prediction. In predictive representation learning, the goal is not to make predictions per se, but to use prediction as a way to train a useful representation. The way to do this is first encode the data into an embedding $\mathbf{z}$, then map from the embedding to your target prediction. This gives a composition of functions, just like with the autoencoder, that can be trained with a prediction task and yield a good data encoder. The prediction task is a **pretext task** for learning good representations.
:::{.column-margin}
Neuroscientists think the brain also uses prediction to better encode sensory signals, but focus on a different part of the problem. The idea of **predictive coding** states that the sensory cortex only transmits the difference between its predictions and the actual observed signal @huang2011predictive. This chapter presents how to learn a representation that can make good predictions in the first place. Predictive coding focuses on one thing you can do with such a representation: use it to compress future signals by just transmitting the surprises.
:::
Different kinds of prediction tasks have been proposed for learning good
representations, and depending on the properties you want in your
representation different prediction tasks will be best. Examples include
predicting future frames in a video @recasens2021broaden and predicting
the next pixel in an image given a sequence of preceding pixels
@chen2020generative. It is also possible to use an image's semantic
class as the prediction target. In that case, the prediction problem is
identical to training an image classifier, but the goal is very
different. Rather than obtaining a good classifier at the end, our goal
is instead to obtain a good image encoding (which is predictive of
semantics) @donahue2014decaf. These three examples are
visualized below
(@fig-representation_learning-predictive_learning_examples).
![Examples of different pretext tasks.](figures/representation_learning/predictive_learning_examples.png){width="95%" #fig-representation_learning-predictive_learning_examples}
:::{.column-margin}
These tasks look a lot like supervised learning. We
avoid calling it "supervised" because that connotes that we have
examples of input-output pairs on the target task. Here that would be
input data and exemplar output representations. But we don't have that.
The supervision in this setup is a pretext task that we hope induces
good representations.
:::
Let us now describe the predictive learning problem more formally. Let
$\mathbf{y}$ be the prediction target. Then predictive representation
learning looks like this:
![](figures/representation_learning/predictive_learning_diagram.png){width="100%"}
where $D$ is some distance function, for example, $L_2$. Just like with
the autoencoder, $f$ and $g$ are usually neural nets but may be any
family of functions; often $g$ is just a single linear layer. Unlike
with autoencoders, there is no standard setting for the relative
dimensionalities of $N$, $M$, and $K$; instead it depends on the
prediction task. If the task is image classification, then $N$ will be
the (large) dimensionality of the input pixels, $M$ will usually be much
lower dimensional, and $K$ will be the number of image classes (to
output $K$-dimensional class probability vectors).
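As a concrete sketch, here is what this setup looks like for a classification pretext task in PyTorch; the architecture, dimensions, and the use of cross-entropy in the role of the distance $D$ are illustrative assumptions.

```python
# Sketch of predictive representation learning with a classification pretext task.
import torch
import torch.nn as nn

N, M, K = 784, 128, 10                  # input dim, embedding dim, number of classes

f = nn.Sequential(nn.Linear(N, 256), nn.ReLU(), nn.Linear(256, M))  # encoder
g = nn.Linear(M, K)                     # prediction head (often just one linear layer)

opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()         # plays the role of the distance D

def train_step(x, y):                   # x: [B, N] inputs, y: [B] integer targets
    loss = loss_fn(g(f(x)), y)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()                  # after training, keep f and discard g
```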
### Object Detectors Emerge from Scene-Level Supervision
The real power of these pretext tasks lies not in solving the tasks
themselves but in acquiring useful image embeddings $\mathbf{z}$ as a
byproduct of solving the pretext task. The amazing thing that ends up
happening is that $z$-space may have *emergent structure* that was not
explicit in either the raw training data or the pretext task. As a case
study, consider the work of @zhou2014object. They trained a
convolutional neural net (CNN) to perform scene classification, which
asks whether the image shows a living room, bedroom, kitchen, and so on.
The net did wonderfully at that task, but that wasn't the point. The
researchers instead peeled apart the net and looked at which input
images were causing different neurons within the net to fire. What they
found was that there were neurons, on hidden layers in the net, that
fired selectively whenever the input image was of a specific object
class. For example, one particular neuron would fire when the input was
a staircase, and another neuron would fire predominantly for inputs that
were rafts. The subsequent images
(@fig-representation_learning-obj_detectors_emerge) show four of the top
images that activate these two particular neurons. These particular
neurons are on convolutional layers, so really each is a filter
response; the highlighted regions indicate where the feature map for
that filter exceeds a threshold.
![Visualizing two neural receptive fields in a scene classifier neural net.](figures/representation_learning/obj_detectors_emerge1.png){width="100%" #fig-representation_learning-obj_detectors_emerge}
What this shows is that *object* detectors, that is, neurons that
selectively fire when they see a particular object class, emerge in the
hidden layers of a CNN trained only to perform *scene* classification.
This makes sense in retrospect---how else would the net recognize scenes
if not first by identifying their constituent objects?---but it was
quite a shock for the community to see it for the first time. It gave
some evidence that the way we humans parse and recognize scenes may
match the way CNNs also internally parse and recognize scenes.
## Self-Supervised Learning
Predictive learning is great when we have good prediction targets that
induce good representations. What if we don't have labeled targets
provided to us? Instead we could try to cook up targets out of the raw
data itself; for example, we could decide that the top right pixel's
color will be the "label" of the image. This idea is called
**self-supervision**. It looks like this:
![](figures/representation_learning/self_supervised_learning_diagram.png){width="100%"}
where $V_1$ and $V_2$ are two different functions of the *full data
tensor* $\mathbf{X}$. For example, $V_1$ might be the left side of the
image $\mathbf{X}$ and $V_2$ could be the right side, so the pretext
task is to predict the right side of an image from its left side. In
fact, several of the examples we gave previously for predictive learning
are of the self-supervised variety: supervision for predicting a future
frame, or a next pixel, can be cooked up just by splitting a video into
past and future frames, or splitting an image into previous and next
pixels in a raster-order sequence.
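A minimal sketch of cooking up such views, assuming the data tensor is a batch of images of shape `[B, C, H, W]`:

```python
# Sketch: split each image into two views, so the pretext task is to
# predict the right half (the target) from the left half (the input).
def make_views(X):
    W = X.shape[-1]
    v1 = X[..., : W // 2]    # V1(X): left half of each image
    v2 = X[..., W // 2 :]    # V2(X): right half of each image
    return v1, v2
```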
## Imputation
**Imputation** is a special case of self-supervised learning, where the prediction
targets are missing elements of the input data. For example, predicting
missing pixels is an imputation problem, as is colorizing a black and
white photo (i.e., predicting missing color channels).
@fig-representation_learning-imputation_examples gives several examples
of these imputation tasks.
![Many common pretext tasks are special cases of imputation on missing values in the data tensor.](figures/representation_learning/imputation_examples.png){width="100%" #fig-representation_learning-imputation_examples}
Imputation---whether over spatial masks or missing channels---can result
in effective visual
representations @vincent2008extracting, @pathak2016context, @he2022masked, @zhang2016colorful, @larsson2016learning, @zhang2017split.
Notice that predicting future frames and next pixels (our examples from
@fig-representation_learning-predictive_learning_examples) are also
imputation problems.
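As one example, a masked-pixel imputation target can be constructed in a few lines; the patch size and masking scheme below are assumptions, not the settings of any particular method.

```python
# Sketch: corrupt each image by zeroing out a random square patch; the
# pretext task is to reconstruct the original pixels inside the patch.
import torch

def mask_patch(X, size=16):
    # X: image batch [B, C, H, W]; returns the corrupted input and the mask.
    B, C, H, W = X.shape
    mask = torch.ones_like(X)
    ys = torch.randint(0, H - size, (B,))
    xs = torch.randint(0, W - size, (B,))
    for b in range(B):
        y0, x0 = int(ys[b]), int(xs[b])
        mask[b, :, y0:y0 + size, x0:x0 + size] = 0
    return X * mask, mask     # train a net to predict X where mask == 0
```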
Above we described how object detectors emerge as a
byproduct of training a net to perform scene classification. What do you
think emerges as a byproduct of training a net to perform colorization?
It may surprise you to find that the answer is object detectors once
again! This certainly surprised us when we saw the results in
@fig-representation_learning-obj_detectors_in_colorization, which are
taken from @zhang2016colorful.
![Visualizing two neural receptive fields in the colorization model from @zhang2016colorful. Images generated by Andrew Owens and Richard Zhang.](figures/representation_learning/obj_detectors_in_colorization1.png){width="100%" #fig-representation_learning-obj_detectors_in_colorization}
In fact, object detectors emerge in CNNs for just about any reasonable
pretext task: scene recognition, colorization, inpainting missing
pixels, and more. What may be going on is that these things we call
"objects" are not just a peculiarity of human perception but rather map
onto some fundamentally useful structure out there in the world, and any
visual system tasked with understanding our world would arrive at a
similar representation, carving up the sensory array into objects and
other kinds of **perceptual groups**. This idea will be explored in
greater detail in the next chapter.
## Abstract Pretext Tasks
Other varieties of self-supervised learning set up more abstract
prediction problems, rather than just aiming to predict missing data.
For example, we may try to predict if an image has been rotated 90
degrees @komodakis2018unsupervised, or we may aim to predict the
relative position of two image patches given their
appearance @doersch2015unsupervised. These pretext tasks can induce
effective visual representations because solving them requires learning
about semantic and geometric regularities in the world, such as that
clouds tend to appear near the top of an image or that the trunk of a
tree tends to appear below its branches.
## Clustering
One way to compress a signal is dimensionality reduction, which we saw
an example of previously with the autoencoder. Another way is to
quantize the signal into discrete categories, an operation also known as
**clustering**. Mathematically, clustering is a function
$f: \{\mathbf{x}^{(i)}\}_{i=1}^N \rightarrow \{1,\dots,k\}$, that is, a
mapping from the members of a dataset $\{\mathbf{x}^{(i)}\}_{i=1}^N$ to
$k$ integer classes ($k$ can potentially be unbounded). Representing
integers with one-hot codes, clustering is shown in
@fig-representation_learning-clustering_f_diagram.
:::{.column-margin}
Another name for clustering, more common in the representation learning literature, is **vector quantization**.
:::
![You can think of clustering as being just like image labeling, except that the labels are self-discovered rather than being predefined.](figures/representation_learning/clustering_f_diagram.png){width="40%" #fig-representation_learning-clustering_f_diagram}
Clustering follows from the principle of compression: if we can well
summarize a signal with just a discrete category label, then this
summary can serve as a lighter weight and more abstracted substrate for
further reasoning. You will already be familiar with clustering because
we it in our natural language everyday. For example, consider the words
"antelope," "giraffe," and "zebra." Those words are discrete category
labels (i.e., integers) that summarize huge conceptual sets (just think
of all the individual lives and richly diverse personalities you are
lumping together with the simple word "antelope"). Words, then, are
clusters! They are mappings from data to integers, and clustering is the
problem of making up new words for things.
:::{.column-margin}
Words, of course, are given additional structure when used in a language (grammar, connotations, etc.) beyond just being a set of clusters. The same kind
of structure can be added on top of visual clusters.
:::
Many clustering algorithms not only partition the data but also compute
a representation of the data within each cluster; this representation is
sometimes called a **code vector**. The most common code vector is the
**cluster center**, $\mu$, that is, the mean value of all datapoints
assigned to the cluster. Clusters can also be represented by other
statistics of the data assigned to them, such as the variance of this
data or some arbitrary embedding vector, but this is less common. The
set of cluster centers, $\{\mu_i\}_{i=1}^k$, is a representation of a
whole data*set*. They summarize the main modes of behavior in the
dataset.
### $K$-Means
There are many types of clustering algorithm but we will illustrate the
basic principles with just one example, perhaps the most popular
clustering algorithm of them all, $\mathbf{k}$**-means**. $K$-means is a
clustering algorithm that partitions the data into $k$ clusters. Each
datapoint $\mathbf{x}^{(i)}$ is assigned to a cluster indexed by an
integer $a_i \in \{1, \ldots, k\}$. Each cluster is given a code vector
$\mathbf{z} \in \mathbb{R}^M$. The $k$-means objective is to minimize
the distance between each datapoint and the code vector of the cluster
it is assigned to: $$\begin{aligned}
J_{\texttt{kmeans}} &= \sum_{i=1}^N \left\lVert\mathbf{z}_{a_i} - \mathbf{x}^{(i)}\right\rVert^2_2 \quad\quad \triangleleft \quad\text{$k$-means objective}
\end{aligned}$$ This way, the code vector assigned to each datapoint
will be a good approximation to the value of that datapoint, and we will
have a faithful, but compressed, representation of the dataset.
There are two free parameter sets to optimize over: the code vectors and
the cluster assignments. The cluster assignments can be represented with
a clustering function
$f: \{\mathbf{x}^{(i)}\}_{i=1}^N \rightarrow \{1,\dots,k\}$:
$$\begin{aligned}
a_i = f(\mathbf{x}^{(i)})
\end{aligned}$$ We will represent the code vectors for each cluster with
a function $g: \{1,\ldots,k\} \rightarrow \mathbb{R}^M$, where the data
dimensionality is $M$: $$\begin{aligned}
\mathbf{z}_{a_i} = g(a_i)
\end{aligned}$$ Both $f$ and $g$ can be implemented as lookup tables,
since for both the input is a countable set. The $k$-means algorithm
amounts to just filling in these two lookup tables.
Now we are ready to present the full $k$-means algorithm, viewed as a
learning algorithm. As you will see below, the learning diagram looks
almost the same as for an autoencoder! The key differences are (1) the
bottleneck is discrete integers rather than continuous vectors, and (2)
the optimizer is slightly different (we will delve into it
subsequently).
![](figures/representation_learning/kmeans_learning_diagram.png){width="100%"}
To recap, there are two functions we are optimizing over: the encoder
$f$ and the decoder $g$. The $f$ is parameterized by a set of integers
$\{a_i\}_{i=1}^N$, $a_i \in \{1,\ldots,k\}$, which specify the cluster
assignment for each datapoint. The decoder $g$ is parameterized by a set
of $k$ code vectors $\{\mathbf{z}_j\}_{j=1}^k$,
$\mathbf{z}_j \in \mathbb{R}^{M}$, one for each cluster, so we have
$F(\mathbf{x}^{(i)}) = g(f(\mathbf{x}^{(i)})) = g(a_i) = \mathbf{z}_{a_i}$.
The $g$ is differentiable with respect to its parameters but $f$ is not,
because the parameters of $f$ are discrete variables. This means that
gradient descent will not be a suitable optimization algorithm (the
gradient $\frac{\partial f}{\partial a_i}$ is undefined). Instead, we
will use the optimization strategy described next.
#### Optimizing $k$-means
$K$-means uses an optimization algorithm called **block coordinate
descent**. This algorithm splits up the optimization parameters into
multiple subsets (blocks). Then it alternates between *fully* minimizing
the objective with respect to each block of parameters. In $k$-means
there are two parameter blocks: $\{a_i\}_{i=1}^N$ and
$\{\mathbf{z}_j\}_{j=1}^k$. So, we need to derive update rules that find
the minimizer of the $k$-means objective with respect to each block. It
turns out there are simple solutions for both:
$$\begin{aligned}
a_i &\leftarrow \mathop{\mathrm{arg\,min}}_{j \in \{1,\ldots,k\}} \left\lVert\mathbf{z}_j - \mathbf{x}^{(i)}\right\rVert_2^2 \quad\quad \triangleleft\quad\text{Assign datapoint to nearest cluster}
\end{aligned}$${#eq-representation_learning-kmeans_update_assignments}
$$\begin{aligned}
\mathbf{z}_j &\leftarrow \frac{\sum_{i=1}^N \mathbb{1}(a_i = j)\,\mathbf{x}^{(i)}}{\sum_{i=1}^N \mathbb{1}(a_i = j)} \quad\quad \triangleleft\quad\text{Set code vector to cluster center}
\end{aligned}$${#eq-representation_learning-kmeans_update_means}
These steps are repeated until a fixed point is reached, which occurs
when all the datapoints that are assigned to each cluster are closest to
that cluster's code vector.
Why are these the correct updates? @eq-representation_learning-kmeans_update_assignments is
straightforward: it's just a brute force enumeration of all $k$ possible
assignments, from which we select the one that minimizes the $k$-means
objective for that datapoint, given the current set of code vectors
$\{\mathbf{z}_j\}_{j=1}^k$.
@eq-representation_learning-kmeans_update_means is
slightly trickier. It finds the optimal codes $\{\mathbf{z}_j\}_{j=1}^k$
given the current set of assignments $\{a_i\}_{i=1}^N$. To see why
@eq-representation_learning-kmeans_update_means is optimal, first
rewrite the $k$-means objective as follows:
$$\begin{aligned} J_{\texttt{kmeans}} &= \sum_{i=1}^N \left\lVert\mathbf{z}_{a_i} - \mathbf{x}^{(i)}\right\rVert^2_2\\
&= \sum_{j=1}^k \sum_{i=1}^N \mathbb{1}(a_i = j) \left\lVert\mathbf{z}_j - \mathbf{x}^{(i)}\right\rVert^2_2
\end{aligned}$$
Looking at one code vector (the $j$-th) in isolation, we are seeking:
$$\begin{aligned}
\mathop{\mathrm{arg\,min}}_{\mathbf{z}_j} \sum_{i^{\prime} \text{ s.t. } a_{i^\prime}=j} \left\lVert\mathbf{z}_j - \mathbf{x}^{(i^{\prime})}\right\rVert^2_2
\end{aligned}$$
where $i^{\prime}$ enumerates all the datapoints for
which $a_{i^\prime} = j$. Recall that the point that minimizes the sum
of squared distances from a dataset is the mean of the dataset. This
yields our update in @eq-representation_learning-kmeans_update_means:
just set each code to be the mean of all datapoints assigned to that
code's cluster.
Now we can see where the name $k$-means comes from: the optimal code
vectors are the cluster *means*. The algorithm is very simple: compute
the cluster means given current data assignments; reassign each
datapoint to the nearest cluster mean; recompute the means; and so on until
convergence. An example of applying this algorithm to a simple dataset
of two-dimensional (2D) points is given in
@fig-representation_learning-kmeans_ex.
:::{#fig-representation_learning-kmeans_ex layout-ncol="4"}
![](figures/representation_learning/kmeans_ex_step1.jpg){#fig-representation_learning-kmeans_ex_step1}
![](figures/representation_learning/kmeans_ex_step2.jpg){#fig-representation_learning-kmeans_ex_step2}
![](figures/representation_learning/kmeans_ex_step3.jpg){#fig-representation_learning-kmeans_ex_step3}
![](figures/representation_learning/kmeans_ex_step4.jpg){#fig-representation_learning-kmeans_ex_step4}
Iterations of $k$-means applied to a simple 2D dataset, with $k=5$. The code vectors assigned to each cluster are marked with an x. (a) Initialization. (b) Update code vectors to be the cluster means. (c) Update assignments. (d) Converged solution (which occurs here after four updates of both codes and assignments).
:::
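Putting the two updates together, here is a minimal NumPy sketch of the full $k$-means loop; the initialization and stopping rule are simple assumptions.

```python
# Sketch of k-means via block coordinate descent.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    z = X[rng.choice(len(X), size=k, replace=False)]          # initial code vectors
    for _ in range(n_iters):
        # Assignment step: assign each datapoint to its nearest code vector.
        d2 = ((X[:, None, :] - z[None, :, :]) ** 2).sum(-1)   # [N, k] squared distances
        a = d2.argmin(axis=1)
        # Update step: set each code vector to the mean of its assigned datapoints.
        new_z = np.array([X[a == j].mean(axis=0) if np.any(a == j) else z[j]
                          for j in range(k)])
        if np.allclose(new_z, z):                              # fixed point reached
            break
        z = new_z
    return z, a
```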
### $K$-Means from Multiple Perspectives
This section has presented $k$-means as a representation learning
algorithm. In other texts you may encounter other views on $k$-means,
for example, as a way of interpreting your data or as a simple
generative model. These views are all complementary and all
simultaneously true. Here we showed that $k$-means is like an
autoencoder with a discrete bottleneck. In the next chapters we will
encounter generative models, including one called a Gaussian mixture
model (GMM). $K$-means can also be viewed as a vanilla form of a GMM.
Later we will show that another kind of autoencoder, called a
variational autoencoder (VAE), is a continuous version of a GMM. So we
have a rich tapestry of relationships: autoencoders and VAEs are
continuous versions of $k$-means and GMMs, respectively. GMMs and VAEs
are formal probabilistic models that, respectively, extend $k$-means and
autoencoders (which do not come with probabilistic semantics). This is
just one set of connections that can be made. In this book, be on the
lookout for more connections like this. It can be confusing at first to
see multiple different perspectives on the same method: Which one is
correct? But rarely is there a single correct perspective. It is useful
to identify the delta between each new model you encounter and all the
models you already know. Usually there is a small delta with respect to
*many* things you already know, and rarely is there no meaningful
connection between any two arbitrary models. As an exercise, try picking
a random concept on a random page in this book (or, for a challenge, any
book on your bookshelf). What is the delta between that concept and
$k$-means, or between that concept and a different concept on any other
random page? In our experience, it will often be surprisingly small (but
sometimes it takes a lot of effort to see the connection).
### Clustering in Vision
In vision, the problem of clustering is related to the idea of
**perceptual grouping**, which we cover in detail in @sec-perceptual_organization. We humans see the world as
organized into different levels of perceptual structure: contours and
surfaces, objects and events. These structures are groupings, or
clusters, of related visual elements: a contour is a group of points
that form a line, an object is a group of parts that form a cohesive,
nameable whole, and so on. Algorithms like $k$-means, in a suitable
feature space, can discover them.
## Contrastive Learning {#sec-representation_learning-contrastive_learning}
Dimensionality reduction and clustering algorithms learn compressed representations by creating an **information bottleneck**, that is, by constraining the number of bits available in the representation. An alternative compression strategy is to *supervise* what information should be thrown away. **Contrastive learning** is one such approach where a representation is supervised to be *invariant* to certain viewing transformations, resulting in a compressed representation that only captures the properties that are common between the different data **views**. Two different data views could correspond to two different cameras looking at the same scene or two different imaging modalities, such as color and depth, and we will see more examples subsequently.
:::{.column-margin}
Contrastive learning is actually more closely related to clustering than it may at first seem. Contrastive learning maps similar datapoints to similar embeddings. Clustering is just the extreme version of this where there are only $k$ distinct embeddings and similar datapoints get mapped to the *exact same* embedding.
:::
Learning invariant representations is a classic goal of computer vision.
Recall that this was one of the reasons we used convolutional image
filters: convolution is equivariant with camera translation, and
invariance can then be achieved simply by pooling over filter responses.
CNNs, through their convolutional architecture, bake translation
invariance into the *hypothesis space*. In this section we will see how
to incentivize invariances instead through the *objective* function.
The idea is to simply penalize deviations from the invariance we want.
Suppose $T$ is a transformation we wish our representation to be
invariant to. Then we may use a loss of the form
$\left\lVert f(T(\mathbf{x}))-f(\mathbf{x})\right\rVert_2^2$ to learn an
encoder $f$ that is invariant to $T$. We call such a loss an
**alignment** loss @wang2020hypersphere.
That seems easy enough, but you may have noticed a flaw: What if $f$
just learns to output the zero vector all the time? Trivial alignment
can be achieved when there is representational collapse, and all
datapoints get mapped to the same arbitrary vector.
Contrastive learning fixes this issue by coupling an alignment loss with
a second loss that pushes apart embeddings of datapoints for which we do
not want an invariant representation. The supervision for contrastive
learning comes in the form of *positive pairs* and *negative pairs*.
Positive pairs are two datapoints we wish to align in $z$-space; if we
wish for invariance to $T$ then a positive pair should be constructed as
$\{\mathbf{x}, \mathbf{x}^{+}\}$ with $\mathbf{x}^{+} = T(\mathbf{x})$.
Negative pairs, $\{\mathbf{x},\mathbf{x}^-\}$, are two datapoints that
should be represented differently in $z$-space. Commonly, negative pairs
are randomly constructed by sampling two datapoints independently and
identically from the same data distribution, that is,
$\mathbf{x} \sim p_{\texttt{data}}(\mathbf{x})$ and
$\mathbf{x}^{-} \sim p_{\texttt{data}}(\mathbf{x})$. Given such data
pairings, the objective is to pull together the positive pairs and push
apart the negative pairs, as illustrated in
@fig-representation_learning-contrastive_learning_diagram.
![Contrastive learning.](figures/representation_learning/contrastive_learning_diagram.png){width="75%" #fig-representation_learning-contrastive_learning_diagram}
This kind of contrastive learning results in an embedding that is
invariant to a transformation $T$. Extending this to achieve invariance
to a *set* of transformations $\{T_1, \ldots, T_n\}$ is straightforward:
just apply the same loss for each of $T_1, \ldots, T_n$.
A second kind of contrastive learning is based on co-occurrence, where
the goal is to learn a common representation of all co-occurring
signals. This form of contrastive learning is useful for learning, for
example, an audiovisual representation where the embedding of an image
matches the embedding of the sound for that same scene. Or, returning to
our colorization example, we can learn an image representation where the
embedding of the grayscale channels matches the embedding of the color
channels. In both these cases we are learning to align co-occurring
sensory signals. This kind of contrastive learning is schematized in
@fig-representation_learning-contrastive_learning_colorization.
![Contrastive learning from multiple views of the data. Figure inspired by @tian2020contrastive.](figures/representation_learning/contrastive_learning_colorization.png){width="53%" #fig-representation_learning-contrastive_learning_colorization}
In @fig-representation_learning-contrastive_learning_colorization, we
refer to the two co-occurring signals---color and grayscale---as two
different **views** of the total data tensor $\mathbf{X}$, just like we
did in the previous sections: $\mathbf{x} = V_1(\mathbf{X})$,
$\mathbf{y} = V_2(\mathbf{X})$. You can think of these views either as
resulting from sensory co-occurrences or as two transformations of
$\mathbf{X}$, where the transformation in the color example is channel
dropping. Thus, the two kinds of contrastive learning we have presented
are really one and the same: any two signals can be considered
transformations of a combined total signal, and any signal and its
transformation can be considered two co-occurring ways of measuring the
underlying world.
Nonetheless, it is often easiest to conceptualize these two approaches
separately, and next we give learning diagrams for each:
![](figures/representation_learning/contrastive_learning_transformations.png){width="100%"}
![](figures/representation_learning/contrastive_learning_cooccurrence_diagram.png){width="100%"}
In these diagrams, $D$ is a distance function. Above we give just one
simple form for the contrastive objective; many variations have been
proposed. Three of the most popular are (1) Hadsell et al.'s
"constrastive loss" @hadsell2006dimensionality (an older definition of
the term, now overloaded with our more general notion of a contrastive
loss being the broader family of any loss that pulls together positive
samples and pushes apart negative samples), (2) the **triplet loss**
@chechik2010large, and (3) the **InfoNCE loss** @oord2018representation.
Hadsell et al.'s contrastive loss and the triplet loss add the concept
of a **margin** to the vanilla formulation: they only push/pull when the
distance is less than a specified amount $m$ (called the margin),
otherwise points are considered far enough apart (or close enough
together). The InfoNCE loss is a variation that treats establishing a
contrast as a classification problem: it tries to move points apart
until you can classify the positive sample, for a given anchor,
separately from all the negatives. The general formulation of these
losses takes as input an anchor $\mathbf{x}$, a positive example
$\mathbf{x}^+$, and one or more negative examples $\mathbf{x}^-$. The
positive and negative may be defined based on transformations,
co-occurrences, or something else. The full learning objective is to sum
over many samples of anchors, positives, and negatives, producing a
sampled set evaluated according to the losses as follows:
$$\begin{aligned}
\mathcal{L}(\mathbf{x}, \mathbf{x}^+, \mathbf{x}^-) &= \max(D(f(\mathbf{x}), f(\mathbf{x}^+)) - m_{\texttt{pos}},0) + \nonumber \\ &\quad \max(m_{\texttt{neg}} - D(f(\mathbf{x}), f(\mathbf{x}^-)),0) \quad \triangleleft \quad \text{Hadsell et al. ``contrastive''}\\
\mathcal{L}(\mathbf{x}, \mathbf{x}^+, \mathbf{x}^-) &= \max(D(f(\mathbf{x}), f(\mathbf{x}^+)) - D(f(\mathbf{x}), f(\mathbf{x}^-)) + m, 0) \quad\quad \triangleleft \quad \text{triplet}\\
\mathcal{L}(\mathbf{x}, \mathbf{x}^+, \{\mathbf{x}_i^-\}_{i=1}^N) &= -\log \frac{e^{f(\mathbf{x})^\mathsf{T}f(\mathbf{x}^+)/\tau}}{e^{f(\mathbf{x})^\mathsf{T}f(\mathbf{x}^+)/\tau} + \sum_i e^{f(\mathbf{x})^\mathsf{T}f(\mathbf{x}_i^-)/\tau}} \quad\quad\quad\quad\quad\quad\triangleleft \quad \text{InfoNCE}
\end{aligned}$${#eq-representation_learning-infonce} Notice that the InfoNCE loss is a log softmax
over a vector of scores $f(\mathbf{x})^\mathsf{T}f(c)/\tau$ with
$c \in \{\mathbf{x}^+, \mathbf{x}_1^-, \ldots, \mathbf{x}_N^-\}$; you
can therefore think of this loss as corresponding to a classification
problem where the ground truth class is $\mathbf{x}^+$ and the other
possible classes are $\{\mathbf{x}_1^-, \ldots, \mathbf{x}_N^-\}$ (refer
to @sec-intro_to_learning to revisit softmax
classification).
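To make the classification view explicit, here is one possible PyTorch sketch of the InfoNCE loss; it assumes the embeddings have already been $L_2$-normalized, and the names and shapes are illustrative.

```python
# Sketch of the InfoNCE loss as softmax classification of the positive
# against N negatives, for unit-normalized embeddings.
import torch
import torch.nn.functional as F

def info_nce(z_anchor, z_pos, z_negs, tau=0.1):
    # z_anchor: [B, M], z_pos: [B, M], z_negs: [B, N, M]
    pos = (z_anchor * z_pos).sum(-1, keepdim=True) / tau      # [B, 1] positive scores
    neg = torch.einsum('bm,bnm->bn', z_anchor, z_negs) / tau  # [B, N] negative scores
    logits = torch.cat([pos, neg], dim=1)                     # positive is class 0
    labels = torch.zeros(len(logits), dtype=torch.long)
    return F.cross_entropy(logits, labels)                    # log softmax over the scores
```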
### Alignment and Uniformity
Wang and Isola @wang2020hypersphere showed that the contrastive loss
(specifically the InfoNCE form) encourages two simple properties of the
embeddings: alignment and uniformity. We have already seen that
alignment is the property that two views in a positive pair will map to
the same point in embedding space, that is, the mapping is invariant to
the difference between the views. **Uniformity** comes from the negative
term, which encourages embeddings to spread out and tend toward an
evenly spread, uniform distribution. Importantly, for this to work out
mathematically, the embeddings must be *normalized*, that is, each
embedding vector must be a unit vector. Otherwise, the negative term can
push embeddings toward being infinitely far apart from each other.
Fortunately, it is standard practice in contrastive learning (and many
other forms of representation learning) to apply $L_2$ normalization to
the embeddings. The result is that the embeddings will tend toward a
uniform distribution over the surface of the $M$-dimensional
hypersphere, where $M$ is the dimensionality of the embeddings. See
theorem 1 in @wang2020hypersphere for a formal statement of this fact.
A result of this analysis is that we may explicitly decompose contrastive
learning into one loss for alignment and another for uniformity, with
the following forms: $$\begin{aligned}
\mathcal{L}_{\texttt{align}}(f;\alpha) &= \mathbb{E}_{(\mathbf{x},\mathbf{x}^+) \sim p_{\texttt{pos}}} [\left\lVert f(\mathbf{x}) - f(\mathbf{x}^+)\right\rVert_2^{\alpha}]\\
\mathcal{L}_{\texttt{unif}}(f;t) &= \log \mathbb{E}_{\mathbf{x} \sim p_{\texttt{data}}, \, \mathbf{x}^- \sim p_{\texttt{data}}} [e^{-t\left\lVert f(\mathbf{x}) - f(\mathbf{x}^-)\right\rVert_2^2}]\\
\mathcal{L}(f;\alpha,t,\lambda) &= \mathcal{L}_{\texttt{align}}(f;\alpha) + \lambda \mathcal{L}_{\texttt{unif}}(f;t)
\end{aligned}$$ where $p_{\texttt{pos}}$ is the distribution of positive
pairs and $\alpha$, $t$, and $\lambda$ are hyperparameters of the
losses.
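These two losses are short enough to write directly; a sketch for unit-normalized embeddings follows (the hyperparameter defaults are assumptions).

```python
# Sketch of the alignment and uniformity losses for a batch of
# unit-normalized embeddings z (and their positives z_pos).
import torch

def align_loss(z, z_pos, alpha=2):
    return (z - z_pos).norm(p=2, dim=1).pow(alpha).mean()

def unif_loss(z, t=2):
    # Log of the mean Gaussian potential over all distinct pairs in the batch.
    return torch.pdist(z, p=2).pow(2).mul(-t).exp().mean().log()

def contrastive_loss(z, z_pos, alpha=2, t=2, lam=1.0):
    return align_loss(z, z_pos, alpha) + lam * unif_loss(z, t)
```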
### Experiment: Designing Embeddings with Contrastive Learning {#sec-representation_learning-expt_designing_embeddings_with_contrastive_learning}
Using the alignment and uniformity objective defined previously, we will
now illustrate how one can design embeddings with desired invariances.
We will use the shape dataset described in @sec-representation_learning-autoencoders, and will use the
same encoder architecture and optimizer as from that section (a CNN with
six layers, Adam optimizer, 20,000 iterations of stochastic gradient
descent, batch size of 128). In contrast to the autoencoder experiment,
however, we will set the dimensionality of the embedding to $M=2$, so
that we can visualize it in a 2D plot.
![Contrastive learning on colored shapes using two different transformations for creating positive pairs. The choice of transformation controls which features the embedding becomes sensitive to and which it becomes invariant to.](figures/representation_learning/align_unif_results_shapes_dataset.png){width="100%" #fig-representation_learning-align_unif_results_shapes_dataset}
Suppose we wish to obtain an embedding that is sensitive to color and
invariant to shape. Then we should choose a view transformation
$T_{\texttt{c}}$ that preserves color while changing shape. A simple
choice that turns out to work is for $T_{\texttt{c}}(\mathbf{x})$ to
simply output a crop from the image $\mathbf{x}$ (this transformation
does not really change the object's shape but still ends up resulting in
shape-invariant embeddings because color is a much more obvious cue for
the CNN to pick up on and use for solving the contrastive problem). The
result of contrastive learning, using
$L_{\texttt{align}} + L_{\texttt{unif}}$ on data generated from
$T_{\texttt{c}}$ is shown on the top row of
@fig-representation_learning-align_unif_results_shapes_dataset. Notice
that the trained $f$ maps colors to be clustered into the color classes
of this toy data, and that these classes become spread out more or less
uniformly across the surface of a circle in embedding space (a
one-dimensional hypersphere, because the embeddings are 2D and
normalized).
We can repeat the same experiment but with a view transformation
$T_{\texttt{s}}$ designed to be invariant to color. To do so we simply
use the same transformation as for $T_{\texttt{c}}$ (i.e., cropping)
plus add a random shift in hue, brightness, and saturation. Training on
views generated from $T_{\texttt{s}}$ results in the embeddings on the
bottom row of
@fig-representation_learning-align_unif_results_shapes_dataset. Now the
embeddings become invariant to color but cluster shapes and spread out
the three shape types to be roughly uniformly spaced around the
hypersphere.
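A sketch of how the two view transformations could be implemented with torchvision; the exact crop and jitter parameters below are assumptions, not the settings used to produce the figure.

```python
# Sketch of the two view transformations used to create positive pairs.
from torchvision import transforms

# T_c: cropping only; color is preserved, so the embedding stays color-sensitive.
T_c = transforms.RandomResizedCrop(64, scale=(0.5, 1.0))

# T_s: cropping plus a random shift in hue, brightness, and saturation,
# which encourages the embedding to become invariant to color.
T_s = transforms.Compose([
    transforms.RandomResizedCrop(64, scale=(0.5, 1.0)),
    transforms.ColorJitter(brightness=0.5, saturation=0.5, hue=0.5),
])
```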
## Concluding Remarks
This chapter, like much of this book, is about representations.
Different representations make different problems easy, or hard. It is
important to remember that every representation involves a set of
tradeoffs: a good representation for one task may be a bad
representation for another task. One of the current goals of computer
vision research is to find *general-purpose representations*, and what
this means is not that the representation will be good for all tasks but
that it will be good for a wide variety of the tasks that humans care
about, which is a very tiny subset of all possible tasks.