class: middle, center, title-slide
Lecture 5: Convolutional networks
Prof. Gilles Louppe
[email protected]
???
R: Put back the slide on the hierarchical composition of patterns.
R: At the same time, explain why we typically increase the number of filters as we go deeper in the network.
R: Use figure 10.21 of UDL for model vs. performance on ImageNet.
count: false class: middle
How to make neural networks see?
- Visual perception
- Convolutions
- Pooling
- Convolutional networks
class: middle
In 1959-1962, David Hubel and Torsten Wiesel identify the neural basis of information processing in the visual system. They are awarded the Nobel Prize in Physiology or Medicine in 1981 for their discovery.
.grid.center[
.kol-4-5.center[.width-80[]]
.kol-1-5[
.width-100.circle[].width-100.circle[]]
]
class: middle, black-slide
.center[
<iframe width="640" height="480" src="https://www.youtube.com/embed/IOHayh06LJ4?&loop=1&start=0" frameborder="0" volume="0" allowfullscreen></iframe>]
class: middle, black-slide
.center[
<iframe width="640" height="480" src="https://www.youtube.com/embed/OGxVfKJqX5E?&loop=1" frameborder="0" volume="0" allowfullscreen></iframe>]
???
During their recordings, they noticed a few interesting things:
- the neurons fired only when the line was in a particular place on the retina,
- the activity of these neurons changed depending on the orientation of the line, and
- sometimes the neurons fired only when the line was moving in a particular direction.
Can we equip neural networks with inductive biases tailored for vision?
- Locality (as in simple cells)
- Translation invariance (as in complex cells)
- Hierarchical compositionality (as in hypercomplex cells)
class: middle
.center[Invariance and equivariance to translation.]
.footnote[Credits: Simon J.D. Prince, Understanding Deep Learning, 2023.]
???
- The classification of the shifted image should be the same as the classification of the original image.
- The segmentation of the shifted image should be the shifted segmentation of the original image.
class: middle
In 1980, Fukushima proposes a direct neural network implementation of the hierarchy model of the visual nervous system of Hubel and Wiesel.
.grid[ .kol-2-3.width-90.center[] .kol-1-3[
- Built upon convolutions, enabling the composition of a feature hierarchy.
- Biologically-inspired training algorithm, which proves to be largely inefficient.
] ]
.footnote[Credits: Kunihiko Fukushima, Neocognitron: A Self-organizing Neural Network Model, 1980.]
class: middle
In the 1980-90s, LeCun trains a convolutional network by backpropagation. He advocates for end-to-end feature learning in image classification.
.center.width-70[![](figures/lec5/lenet-1990.png)]
.footnote[Credits: LeCun et al, Handwritten Digit Recognition with a Back-Propagation Network, 1990.]
class: middle
A convolutional layer applies the same linear transformation locally everywhere while preserving the signal structure.
.center.width-65[![](figures/lec5/1d-conv.gif)]
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
???
Draw vertically.
class: middle
For the one-dimensional input $\mathbf{x} \in \mathbb{R}^W$ and the convolutional kernel $\mathbf{u} \in \mathbb{R}^w$, the discrete convolution $\mathbf{x} \circledast \mathbf{u}$ is a vector of size $W - w + 1$ such that
$$(\mathbf{x} \circledast \mathbf{u})[i] = \sum_{m=0}^{w-1} \mathbf{x}_{m+i} \mathbf{u}_m.$$

.italic[
Technically, $\circledast$ denotes the cross-correlation operator. However, most machine learning libraries call it convolution.
]
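As a sanity check, a minimal sketch (all values are illustrative) comparing this definition against `torch.nn.functional.conv1d`, which computes the same cross-correlation:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([1., 2., 3., 4., 5.])  # input, W = 5
u = torch.tensor([1., 0., -1.])         # kernel, w = 3

# Direct implementation of (x ⊛ u)[i] = sum_m x[m+i] * u[m].
out = torch.stack([(x[i:i + 3] * u).sum() for i in range(5 - 3 + 1)])

# conv1d expects shapes (batch, channels, width).
ref = F.conv1d(x.view(1, 1, -1), u.view(1, 1, -1)).view(-1)
assert torch.allclose(out, ref)
```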
class: middle
Convolutions can implement differential operators:
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
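For instance, a sketch (on an arbitrary random image) of finite-difference kernels implemented as convolutions:

```python
import torch
import torch.nn.functional as F

img = torch.rand(1, 1, 32, 32)  # a random grayscale image

# Finite-difference kernels approximating horizontal and vertical derivatives.
dx = torch.tensor([[[[-1., 0., 1.]]]])      # shape (1, 1, 1, 3)
dy = torch.tensor([[[[-1.], [0.], [1.]]]])  # shape (1, 1, 3, 1)

edges_x = F.conv2d(img, dx)  # responds to vertical edges
edges_y = F.conv2d(img, dy)  # responds to horizontal edges
```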
class: middle
For the 2d input tensor $\mathbf{x} \in \mathbb{R}^{H \times W}$ and the 2d convolutional kernel $\mathbf{u} \in \mathbb{R}^{h \times w}$, the discrete convolution $\mathbf{x} \circledast \mathbf{u}$ is a matrix of size $(H-h+1) \times (W-w+1)$ such that
$$(\mathbf{x} \circledast \mathbf{u})[j,i] = \sum_{n=0}^{h-1} \sum_{m=0}^{w-1} \mathbf{x}_{n+j,m+i} \mathbf{u}_{n,m}.$$
???
Draw: Explain the intuition behind the sum of element-wise products which reduces to an inner product between the kernel and a region of the input.
class: middle
The 2d convolution can be extended to tensors with multiple channels.

For the 3d input tensor $\mathbf{x} \in \mathbb{R}^{C \times H \times W}$ and the 3d convolutional kernel $\mathbf{u} \in \mathbb{R}^{C \times h \times w}$, the discrete convolution $\mathbf{x} \circledast \mathbf{u}$ is a matrix of size $(H-h+1) \times (W-w+1)$ such that
$$(\mathbf{x} \circledast \mathbf{u})[j,i] = \sum_{c=0}^{C-1} \sum_{n=0}^{h-1} \sum_{m=0}^{w-1} \mathbf{x}_{c,n+j,m+i} \mathbf{u}_{c,n,m}.$$
class: middle
A convolutional layer is defined by a set of $K$ kernels $\mathbf{u}_k$, each of size $C \times h \times w$. Applied to an input tensor $\mathbf{x}$ of size $C \times H \times W$, the layer produces an output tensor $\mathbf{o}$ of size $K \times (H-h+1) \times (W-w+1)$ whose feature maps are $\mathbf{o}_k = \mathbf{x} \circledast \mathbf{u}_k + b_k$, where $b_k$ is a bias term.
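In PyTorch, for instance, these quantities map directly onto `torch.nn.Conv2d` (a sketch with illustrative sizes):

```python
import torch
import torch.nn as nn

# K = 16 kernels of size C × h × w = 3 × 5 × 5,
# i.e., a weight tensor of shape (16, 3, 5, 5) plus 16 biases.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5)

x = torch.rand(1, 3, 32, 32)  # input of size C × H × W = 3 × 32 × 32
o = conv(x)
print(o.shape)  # (1, 16, 28, 28), i.e., K × (H-h+1) × (W-w+1)
```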
class: middle
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
Convolutions have three additional parameters:
- The padding specifies the size of a zeroed frame added around the input.
- The stride specifies a step size when moving the kernel across the signal.
- The dilation modulates the expansion of the filter without adding weights.
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
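A quick way to see the effect of each parameter on the output size (a sketch with illustrative values):

```python
import torch
import torch.nn as nn

x = torch.rand(1, 1, 32, 32)

print(nn.Conv2d(1, 1, 3, padding=1)(x).shape)   # (1, 1, 32, 32): size preserved
print(nn.Conv2d(1, 1, 3, stride=2)(x).shape)    # (1, 1, 15, 15): downsampled
print(nn.Conv2d(1, 1, 3, dilation=2)(x).shape)  # (1, 1, 28, 28): 5x5 receptive field
```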
class: middle
Padding is useful to control the spatial dimension of the output feature map, for example to keep it constant across layers.
.center[ .width-45[] .width-45[] ]
.footnote[Credits: Dumoulin and Visin, A guide to convolution arithmetic for deep learning, 2016.]
class: middle
Stride is useful to reduce the spatial dimension of the feature map by a constant factor.
.footnote[Credits: Dumoulin and Visin, A guide to convolution arithmetic for deep learning, 2016.]
class: middle
The dilation modulates the expansion of the kernel support by adding rows and columns of zeros between coefficients.
Having a dilation coefficient greater than one increases the unit's receptive field size without increasing the number of parameters.
.footnote[Credits: Dumoulin and Visin, A guide to convolution arithmetic for deep learning, 2016.]
class: middle
Formally, a function $f$ is equivariant to $g$ if $f(g(\mathbf{x})) = g(f(\mathbf{x}))$.
Parameter sharing used in a convolutional layer causes the layer to be equivariant to translation.
.caption[If an object moves in the input image, its representation will move the same amount in the output.]
.footnote[Credits: LeCun et al, Gradient-based learning applied to document recognition, 1998.]
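A quick numerical check of this property (a minimal sketch; circular padding is used so that the equality holds exactly, without border effects):

```python
import torch
import torch.nn.functional as F

x = torch.rand(1, 1, 16, 16)
u = torch.rand(1, 1, 3, 3)

# Shifting the input and then convolving equals
# convolving and then shifting the output.
a = F.conv2d(F.pad(torch.roll(x, 2, dims=3), (1, 1, 1, 1), mode="circular"), u)
b = torch.roll(F.conv2d(F.pad(x, (1, 1, 1, 1), mode="circular"), u), 2, dims=3)
assert torch.allclose(a, b, atol=1e-6)
```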
???
- Equivariance is useful when we know some local function is useful everywhere (e.g., edge detectors).
- Convolution is not equivariant to other operations such as change in scale or rotation.
As a guiding example, let us consider the convolution of single-channel tensors $\mathbf{x} \in \mathbb{R}^{4 \times 4}$ and $\mathbf{u} \in \mathbb{R}^{3 \times 3}$:
$$\mathbf{x} \circledast \mathbf{u} = \begin{pmatrix} 4 & 5 & 8 & 7 \\ 1 & 8 & 8 & 8 \\ 3 & 6 & 6 & 4 \\ 6 & 5 & 7 & 8 \end{pmatrix} \circledast \begin{pmatrix} 1 & 4 & 1 \\ 1 & 4 & 3 \\ 3 & 3 & 1 \end{pmatrix} = \begin{pmatrix} 122 & 148 \\ 126 & 134 \end{pmatrix}$$
???
Do this on the tablet for 1D convolutions. Draw the MLP and Wx product.
class: middle
The convolution operation can be equivalently re-expressed as a single matrix multiplication:
- the convolutional kernel $\mathbf{u}$ is rearranged as a sparse Toeplitz circulant matrix, called the convolution matrix:
$$\mathbf{U} = \begin{pmatrix}
1 & 4 & 1 & 0 & 1 & 4 & 3 & 0 & 3 & 3 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 4 & 1 & 0 & 1 & 4 & 3 & 0 & 3 & 3 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 4 & 1 & 0 & 1 & 4 & 3 & 0 & 3 & 3 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 4 & 1 & 0 & 1 & 4 & 3 & 0 & 3 & 3 & 1
\end{pmatrix}$$
- the input $\mathbf{x}$ is flattened row by row, from top to bottom:
$$v(\mathbf{x}) = \begin{pmatrix} 4 & 5 & 8 & 7 & 1 & 8 & 8 & 8 & 3 & 6 & 6 & 4 & 6 & 5 & 7 & 8 \end{pmatrix}^T$$

Then,
$$\mathbf{U}v(\mathbf{x}) = \begin{pmatrix} 122 & 148 & 126 & 134 \end{pmatrix}^T,$$
which we can reshape to a $2 \times 2$ matrix to retrieve $\mathbf{x} \circledast \mathbf{u}$.
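These numbers can be checked directly; a minimal sketch reproducing the example with NumPy:

```python
import numpy as np

x = np.array([[4, 5, 8, 7],
              [1, 8, 8, 8],
              [3, 6, 6, 4],
              [6, 5, 7, 8]])
u = np.array([[1, 4, 1],
              [1, 4, 3],
              [3, 3, 1]])

# Direct 2d cross-correlation: one inner product per output element.
out = np.array([[(x[j:j + 3, i:i + 3] * u).sum() for i in range(2)]
                for j in range(2)])
print(out)  # [[122 148]
            #  [126 134]]
```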
class: middle
The same procedure generalizes to the convolution of any input $\mathbf{x} \in \mathbb{R}^{H \times W}$ with a kernel $\mathbf{u} \in \mathbb{R}^{h \times w}$:
- the convolutional kernel is rearranged as a sparse Toeplitz circulant matrix $\mathbf{U}$ of shape $(H-h+1)(W-w+1) \times HW$, where
  - each row $i$ identifies an element of the output feature map,
  - each column $j$ identifies an element of the input feature map,
  - the value $\mathbf{U}_{i,j}$ corresponds to the kernel value that element $j$ is multiplied with in output $i$;
- the input $\mathbf{x}$ is flattened into a column vector $v(\mathbf{x})$ of shape $HW \times 1$;
- the output feature map $\mathbf{x} \circledast \mathbf{u}$ is obtained by reshaping the $(H-h+1)(W-w+1) \times 1$ column vector $\mathbf{U}v(\mathbf{x})$ as a $(H-h+1) \times (W-w+1)$ matrix.

Therefore, a convolutional layer is a special case of a fully connected layer:
$$\mathbf{h} = \mathbf{x} \circledast \mathbf{u} \Leftrightarrow v(\mathbf{h}) = \mathbf{U}v(\mathbf{x}).$$
???
Insist on how inductive biases are enforced through architecture:
- locality is enforced through sparsity and band structure
- equivariance is enforced through replication and weight sharing
Training:
- The backward pass is not implemented naively as the backward pass of a fully connected layer.
- The backward pass is also a convolution!
class: middle
.center[Fully connected vs convolutional layers.]
.footnote[Credits: Simon J.D. Prince, Understanding Deep Learning, 2023.]
class: middle
class: middle
When the input volume is large, pooling layers can be used to reduce the input dimension while preserving its global structure, in a way similar to a down-scaling operation.
Consider a pooling area of size $h \times w$ and a 3d input tensor $\mathbf{x} \in \mathbb{R}^{C \times rh \times sw}$.
- Max-pooling produces a tensor $\mathbf{o} \in \mathbb{R}^{C \times r \times s}$ such that
$$\mathbf{o}_{c,j,i} = \max_{n < h, m < w} \mathbf{x}_{c,hj+n,wi+m}.$$
- Average pooling produces a tensor $\mathbf{o} \in \mathbb{R}^{C \times r \times s}$ such that
$$\mathbf{o}_{c,j,i} = \frac{1}{hw} \sum_{n=0}^{h-1} \sum_{m=0}^{w-1} \mathbf{x}_{c,hj+n,wi+m}.$$
Pooling is very similar in its formulation to convolution.
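As a sketch with illustrative sizes, both operations are a single call in PyTorch:

```python
import torch
import torch.nn.functional as F

x = torch.rand(1, 8, 32, 32)  # C = 8 channels

# Non-overlapping h × w = 2 × 2 pooling areas halve each spatial dimension.
print(F.max_pool2d(x, kernel_size=2).shape)  # (1, 8, 16, 16)
print(F.avg_pool2d(x, kernel_size=2).shape)  # (1, 8, 16, 16)
```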
class: middle
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
Formally, a function $f$ is invariant to $g$ if $f(g(\mathbf{x})) = f(\mathbf{x})$.
Pooling layers provide invariance to any permutation inside one cell, which results in (pseudo-)invariance to local translations.
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
???
This is helpful if we care more about the presence of a pattern than its exact position.
class: middle
class: middle
A convolutional network is generically defined as a composition of convolutional layers ($\texttt{CONV}$), pooling layers ($\texttt{POOL}$) and fully connected layers ($\texttt{FC}$).
class: middle
The most common convolutional network architecture follows the pattern:

$$\texttt{INPUT} \to [[\texttt{CONV} \to \texttt{ReLU}]\texttt{*}N \to \texttt{POOL?}]\texttt{*}M \to [\texttt{FC} \to \texttt{ReLU}]\texttt{*}K \to \texttt{FC}$$

where:
- $\texttt{*}$ indicates repetition;
- $\texttt{POOL?}$ indicates an optional pooling layer;
- $N \geq 0$ (and usually $N \leq 3$), $M \geq 0$, $K \geq 0$ (and usually $K < 3$);
- the last fully connected layer holds the output (e.g., the class scores).
class: middle
Some common architectures for convolutional networks following this pattern include:
- $\texttt{INPUT} \to \texttt{FC}$, which implements a linear classifier ($N=M=K=0$);
- $\texttt{INPUT} \to [\texttt{FC} \to \texttt{ReLU}]\texttt{*}K \to \texttt{FC}$, which implements a $K$-layer MLP;
- $\texttt{INPUT} \to \texttt{CONV} \to \texttt{ReLU} \to \texttt{FC}$;
- $\texttt{INPUT} \to [\texttt{CONV} \to \texttt{ReLU} \to \texttt{POOL}]\texttt{*2} \to \texttt{FC} \to \texttt{ReLU} \to \texttt{FC}$ (see the sketch below);
- $\texttt{INPUT} \to [[\texttt{CONV} \to \texttt{ReLU}]\texttt{*2} \to \texttt{POOL}]\texttt{*3} \to [\texttt{FC} \to \texttt{ReLU}]\texttt{*2} \to \texttt{FC}$.
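As an illustration, a minimal sketch of the fourth pattern above, for $32 \times 32$ RGB inputs and 10 classes (all sizes are illustrative choices):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 16 × 16 × 16
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 32 × 8 × 8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 128), nn.ReLU(),
    nn.Linear(128, 10),  # class scores
)
```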
???
Note that for the last architecture, two $\texttt{CONV}$ layers are stacked before every $\texttt{POOL}$ layer. This is generally a good idea for larger and deeper networks, because stacked convolutional layers can develop more complex features of the input before the destructive pooling operation.
class: center, middle, black-slide
class: middle, center
(demo)
???
https://poloclub.github.io/cnn-explainer/
Notebook lec5
.footnote[Credits: Bianco et al, 2018.]
class: middle
Composition of two convolutional and pooling layers, followed by an MLP.
.footnote[Credits: Dive Into Deep Learning, 2020.]
class: middle, black-slide
.center[
<iframe width="640" height="480" src="https://www.youtube.com/embed/FwFduRA_L6Q?&loop=1&start=0" frameborder="0" volume="0" allowfullscreen></iframe>]
.center[LeNet-1 (LeCun et al, 1993)]
class: middle
.grid[
.kol-3-5[
Composition of an 8-layer network: 5 convolutional layers followed by a 3-layer MLP.

The original implementation was split into two parts so that it could fit on two GPUs. ] .kol-2-5.center[.width-100[] .caption[LeNet vs. AlexNet] ] ]
.footnote[Credits: Dive Into Deep Learning, 2020.]
class: middle
.grid[
.kol-2-5[
Composition of 5 VGG blocks, each made of convolutional layers followed by a pooling layer, and ending in an MLP.

The network depth increased up to 19 layers, while the kernel sizes were reduced to $3 \times 3$. ] .kol-3-5.center[.width-100[] .caption[AlexNet vs. VGG] ] ]
.footnote[Credits: Dive Into Deep Learning, 2020.]
class: middle
The .bold[effective receptive field] is the part of the visual input that affects a given unit indirectly through previous convolutional layers.
- It grows linearly with depth when chaining convolutional layers (e.g., two stacked $3 \times 3$ convolutions have a $5 \times 5$ effective receptive field).
- It grows exponentially with depth when pooling layers (or strided convolutions) are interleaved with convolutional layers.
class: middle
.footnote[Credits: Simon J.D. Prince, Understanding Deep Learning, 2023.]
exclude: true class: middle
.grid[ .kol-4-5[
Composition of two convolutional and pooling layers, a stack of 9 inception blocks, and a global average pooling layer.
Each inception block is itself defined as a convolutional network with 4 parallel paths.
.center.width-80[] .caption[Inception block] ] .kol-1-5.center[.width-100[]] ]
.footnote[Credits: Dive Into Deep Learning, 2020.]
class: middle
.grid[ .kol-4-5[
Composition of convolutional and pooling layers organized in a stack of residual blocks. Extensions consider more residual blocks, up to a total of 152 layers (ResNet-152).
.center.width-80[]
.center.caption[Regular ResNet block vs. ResNet block with $1 \times 1$ convolution.]
.footnote[Credits: Dive Into Deep Learning, 2020.]
class: middle
Training networks of this depth is made possible because of the .bold[skip connections] in the residual blocks. They allow the gradients to shortcut the layers and pass through without vanishing.
.footnote[Credits: Dive Into Deep Learning, 2020.]
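A minimal sketch of a residual block (simplified: identity shortcut only, no batch normalization):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        # The skip connection x + ... lets gradients bypass the convolutions.
        return torch.relu(x + self.conv2(torch.relu(self.conv1(x))))
```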
class: middle
class: middle
class: middle
.footnote[Credits: Tan and Le, 2019.]
???
We empirically observe that different scaling dimensions are not independent. Intuitively, for higher resolution images, we should increase network depth, such that the larger receptive fields can help capture similar features that include more pixels in bigger images. Correspondingly, we should also increase network width when resolution is higher, in order to capture more fine-grained patterns with more pixels in high resolution images. These intuitions suggest that we need to coordinate and balance different scaling dimensions rather than conventional single-dimension scaling.
exclude: true class: middle
.footnote[Credits: Tan and Le, 2019.]
class: middle
class: middle
Understanding what is happening in deep neural networks after training is complex and the tools we have are limited.
In the case of convolutional neural networks, we can look at:
- the network's kernels as images (see the sketch below)
- internal activations on a single sample as images
- distributions of activations on a population of samples
- derivatives of the response with respect to the input
- maximum-response synthetic samples
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
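For instance, a minimal sketch of the first approach, here using torchvision's pretrained AlexNet (any trained network would do):

```python
import matplotlib.pyplot as plt
from torchvision.models import alexnet

model = alexnet(weights="DEFAULT")
kernels = model.features[0].weight.detach()  # first conv layer: (64, 3, 11, 11)

fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for k, ax in enumerate(axes.flat):
    w = kernels[k]
    w = (w - w.min()) / (w.max() - w.min())  # rescale to [0, 1] for display
    ax.imshow(w.permute(1, 2, 0))            # C × h × w  ->  h × w × C
    ax.axis("off")
plt.show()
```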
LeNet's first convolutional layer, all filters.
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
LeNet's second convolutional layer, first 32 filters.
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
AlexNet's first convolutional layer, first 20 filters.
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
Convolutional networks can be inspected by looking for synthetic input images $\mathbf{x}$ that maximize the activation $\mathbf{h}_{\ell,d}(\mathbf{x})$ of a chosen convolutional kernel $d$ at layer $\ell$.

These samples can be found by gradient ascent on the input space:
$$\begin{aligned}
\mathcal{L}_{\ell,d}(\mathbf{x}) &= ||\mathbf{h}_{\ell,d}(\mathbf{x})||_2 \\
\mathbf{x}_0 &\sim U[0,1]^{C \times H \times W} \\
\mathbf{x}_{t+1} &= \mathbf{x}_t + \gamma \nabla_{\mathbf{x}} \mathcal{L}_{\ell,d}(\mathbf{x}_t)
\end{aligned}$$

Here, $\gamma$ is the step size of the ascent.
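A minimal sketch of this procedure, assuming `features` is a sequential feature extractor such as `torchvision.models.vgg16(weights="DEFAULT").features`:

```python
import torch

def maximize_activation(features, layer, d, steps=50, lr=1.0):
    """Gradient ascent on the input to maximize ||h_{layer,d}(x)||_2."""
    x = torch.rand(1, 3, 128, 128, requires_grad=True)  # x_0 ~ U[0,1]^(C×H×W)
    for _ in range(steps):
        h = features[:layer + 1](x)  # activations at the chosen layer
        loss = h[0, d].norm()        # L(x) = ||h_{layer,d}(x)||_2
        loss.backward()
        with torch.no_grad():
            x += lr * x.grad         # x_{t+1} = x_t + γ ∇_x L(x_t)
            x.grad.zero_()
    return x.detach()
```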
class: middle
.center[VGG-16, convolutional layer 1-1, a few of the 64 filters]
.footnote[Credits: Francois Chollet, How convolutional neural networks see the world, 2016.]
class: middle count: false
.center[VGG-16, convolutional layer 2-1, a few of the 128 filters]
.footnote[Credits: Francois Chollet, How convolutional neural networks see the world, 2016.]
class: middle count: false
.center[VGG-16, convolutional layer 3-1, a few of the 256 filters]
.footnote[Credits: Francois Chollet, How convolutional neural networks see the world, 2016.]
class: middle count: false
.center[VGG-16, convolutional layer 4-1, a few of the 512 filters]
.footnote[Credits: Francois Chollet, How convolutional neural networks see the world, 2016.]
class: middle count: false
.center[VGG-16, convolutional layer 5-1, a few of the 512 filters]
.footnote[Credits: Francois Chollet, How convolutional neural networks see the world, 2016.]
class: middle
The network appears to learn a hierarchical composition of patterns:
- The first layers of the network seem to encode basic features such as direction and color.
- These basic features are then combined to form more complex textures, such as grids and spots.
- Finally, these textures are further combined to create increasingly intricate patterns.
exclude: true
What if we build images that maximize the activation of a chosen class output?
--
exclude: true count: false
The left image is predicted with 99.9% confidence as a magpie!
.grid[ .kol-1-2.center[] .kol-1-2.center[] ]
.footnote[Credits: Francois Chollet, How convolutional neural networks see the world, 2016.]
exclude: true class: middle, black-slide
.center[
<iframe width="600" height="400" src="https://www.youtube.com/embed/SCE-QeDfXtA?&loop=1&start=0" frameborder="0" volume="0" allowfullscreen></iframe>]
.bold[Deep Dream.] Start from an image $\mathbf{x}_t$, offset it by a random jitter, enhance some layer activation at multiple scales, zoom in, and repeat on the produced image $\mathbf{x}_{t+1}$.
.italic["Deep hierarchical neural networks are beginning to transform neuroscientists’ ability to produce quantitatively accurate computational models of the sensory systems, especially in higher cortical areas where neural response properties had previously been enigmatic."]
.footnote[Credits: Yamins et al, Using goal-driven deep learning models to understand sensory cortex, 2016.]
class: end-slide, center count: false
The end.
???
Checklist: You want to classify cats and dogs.
- What kind of network would you use?
- What architecture would you use? (list of layers, output size, etc)
- What size of kernels would you use? How many kernels per layer?
- What kind of pooling would you use?
- What kind of activation function would you use?
- What loss function would you use?
- How would you initialize the weights?
- How would you train the network?