# Gradient-Based Learning Algorithms {#sec-gradient_descent}
## Introduction
Once you have specified a learning problem (loss function, hypothesis space, parameterization), the next step is to find the parameters that minimize the loss. This is an optimization problem, and the most common optimization algorithm we will use is **gradient descent**. Gradient descent is like a skier making their way down a snowy mountain, where the shape of the mountain is the loss function.
There are many varieties of gradient descent, and we will call this whole family **gradient-based learning algorithms**. All share the same basic idea: at some operating point, calculate the direction of steepest descent, then use this direction to find a new operating point with lower loss.
:::{.column-margin}
We use the term **operating point** to refer to a particular point (setting of the parameters) where we are currently evaluating the loss.
:::
## Technical Setting
In this chapter, we consider the task of minimizing a cost function $J: \cdot \rightarrow \mathbb{R}$, which is a function that maps some arbitrary input to a scalar cost.
In learning problems, the domain of $J$ is the training data and the parameters $\theta$. We will often consider the training data to be fixed and only denote the objective as a function of the parameters, $J(\theta)$. Our goal is to solve:
$$
\theta^* = \arg\min_{\theta} J(\theta)
$$
Pretty much all optimizers work by an iterative process: they repeatedly update the parameters so that the cost gets lower and lower. Different optimizers differ in how the parameter update function works. The update function gets to view some information about the loss landscape, then uses that information to update the parameters, as shown in @fig-gradient_descent-optimization_schematic.
![General optimization loop.](./figures/gradient_descent/optimization_schematic.png){#fig-gradient_descent-optimization_schematic width=50%}
In the simplest setting, called **zeroth-order optimization**, the update function only gets to observe the value $J(\theta)$. The only way, then, to find $\theta$'s that minimize the loss is to sample different values for $\theta$ and move toward the values that are lower.
For **gradient-based optimization**, also called **first-order optimization**, the update function takes as input the gradient of the cost with respect to the parameters at the current operating point, $\nabla_{\theta}J(\theta)$. This reveals hugely useful information about the loss that directly tells us how to minimize it: just move in the direction of steepest descent, that is, the negative gradient direction.
Higher-order optimization methods observe higher-order derivatives of the loss, such as the Hessian $H$, which tells you how the landscape is locally curving. The Hessian is costly to compute, but many methods use approximations to the Hessian, or other properties related to loss curvature, and these are growing in popularity~\cite{martens2015optimizing,foret2020sharpness}.
## Basic Gradient Descent
The simplest version of gradient descent just takes a step in the negative gradient direction, with length proportional to the gradient magnitude. This algorithm is described in @alg-gradient_descent_basic_gradient_descent.
![Gradient descent `GD`. Optimizing a cost function $J: \theta \rightarrow \mathbb{R}$ by descending the gradient $\nabla_{\theta} J$.](./figures/gradient_descent/alg1.png){#alg-gradient_descent_basic_gradient_descent}
This algorithm has two hyperparameters, the **learning rate** $\eta$, which controls the step size (learning rate times gradient magnitude), and the number of steps $K$. If the learning rate is sufficiently small and the initial parameter vector $\theta^0$ is random, then this algorithm will almost surely converge to a local minimum of $J$ as $K \rightarrow \infty$~\cite{lee2016gradient}. However, to descend more quickly, it can be useful to set the learning rate to a higher value.
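To make the update concrete, here is a minimal Python sketch of `GD`. The function `grad_J` (the gradient of the cost) and the quadratic example are hypothetical stand-ins for illustration, not part of the algorithm box above.

```python
import numpy as np

def gd(grad_J, theta0, eta=0.1, K=100):
    """Basic gradient descent: theta <- theta - eta * grad_J(theta), repeated K times."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(K):
        theta = theta - eta * grad_J(theta)
    return theta

# Example: J(theta) = (theta - 3)^2 has gradient 2 * (theta - 3),
# so gradient descent should drive theta toward 3.
theta_star = gd(lambda th: 2.0 * (th - 3.0), theta0=0.0)
print(theta_star)  # close to 3.0
```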
## Learning Rate Schedules
A generally useful strategy is to start with a high value for $\eta$ and then decay it until convergence according to a **learning rate schedule**. Researchers have come up with innumerable schedules, and they generally work by calling some function $\texttt{lr}(\eta^0,k)$ to get the learning rate on each iteration of descent:
$$
\eta^{k} = \texttt{lr}(\eta^0,k)
$$
Generally, we want an update rule where $\eta^{k+1} < \eta^k$, so that we take smaller steps as we approach the minimizer. A few simple and popular approaches are given below:
$$
\begin{aligned}
\texttt{lr}(\eta^0,k) &= \beta^{-k} \eta^0 &\quad\quad \triangleleft\quad \text{exponential decay}\\
\texttt{lr}(\eta^0,k) &= \beta^{-\lfloor k/M \rfloor} \eta^0 &\quad\quad \triangleleft\quad \text{stepwise exponential decay}\\
\texttt{lr}(\eta^0,k) &= \frac{(K - k)}{K} \eta^0 &\quad\quad \triangleleft\quad \text{linear decay}
\end{aligned}
$$
:::{.column-margin}
One downside of linear decay is that it depends on the total number of steps $K$. This makes it hard to compare optimization runs of different lengths. This is something to also be aware of in more advanced learning rate schedules, such as cosine decay~\cite{loshchilov2016sgdr}, which also have different behavior for different settings of $K$.
:::
Here, $\beta$ and $M$ are additional hyperparameters of these schedules. The general approach of learning rate decay is summarized in @alg-gradient_descent_gradient_descent_with_lr_decay.
![Gradient descent with learning rate decay algorithm.](./figures/gradient_descent/alg2.png){#alg-gradient_descent_gradient_descent_with_lr_decay}
Variations on this algorithm include only decaying the learning rate when a plateau is reached (i.e., when the loss is not decreasing for many iterations in a row), or decaying the learning rate according to more complex nonlinear schedules, such as one shaped like a cosine function~\cite{loshchilov2016sgdr}.
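For concreteness, the three schedules above can be written as small Python functions; the argument names mirror the hyperparameters $\beta$, $M$, and $K$ in the equations, while the default values are arbitrary placeholders rather than recommendations.

```python
def lr_exponential(eta0, k, beta=1.01):
    # exponential decay: eta^k = beta^(-k) * eta^0, with beta > 1
    return beta ** (-k) * eta0

def lr_stepwise(eta0, k, beta=2.0, M=100):
    # stepwise exponential decay: divide the learning rate by beta every M steps
    return beta ** (-(k // M)) * eta0

def lr_linear(eta0, k, K=1000):
    # linear decay: shrink the learning rate to zero over K total steps
    return (K - k) / K * eta0
```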
## Momentum
Could we do a smarter update than just taking a step along the negative gradient? Of the countless ideas that have been proposed, one of the few that has stuck is **momentum**~\cite{polyak1964some,sutskever2013importance}. Momentum makes the analogy to skiing even more precise: momentum is like the inertia of the skier, carrying them over the little bumps and imperfections in the ski slope and increasing their speed as they descend along a straight path. In math, momentum just means that we set the parameter update to be a direction $\mathbf{v}^{k+1}$, given by a weighted combination of the previous update direction, $\mathbf{v}^{k}$, and the current negative gradient:
$$
\mathbf{v}^{k+1} = \mu \mathbf{v}^{k} - \eta\nabla_{\theta} J(\theta^k)
$$
The weight $\mu$ in this combination is a new hyperparameter, sometimes simply called the momentum. The full algorithm is given in @alg-gradient_descent_gradient_descent_with_momentum.
![Gradient descent with momentum algorithm.](./figures/gradient_descent/alg3.png){#alg-gradient_descent_gradient_descent_with_momentum}
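A minimal Python sketch of the momentum update follows; as before, `grad_J` is an assumed gradient function and the defaults are placeholders, not recommended settings.

```python
import numpy as np

def gd_momentum(grad_J, theta0, eta=0.01, mu=0.9, K=500):
    """Gradient descent with momentum (heavy-ball update)."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(K):
        v = mu * v - eta * grad_J(theta)  # v^{k+1} = mu * v^k - eta * grad J(theta^k)
        theta = theta + v                 # theta^{k+1} = theta^k + v^{k+1}
    return theta
```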
@fig-gradient_descent-momentum_out1 shows how momentum affects gradient descent for a simple objective $J = \texttt{abs}(\theta)$ (absolute value of $\theta$). As can be seen in the figure, some momentum can help the convergence rate ($\mu = 0.5$), but too much momentum causes the trajectory to overshoot the optimum; even when the optimal loss is reached, the trajectory might not stop (@fig-gradient_descent-momentum_out1, $\mu = 0.95$).
![(left) A simple loss function $J = \texttt{abs}(\theta)$. (right) Optimization trajectory for three different settings of momentum $\mu$. White line indicates value of the parameter at each iteration of optimization, starting at top and progressing to bottom. Color is value of the loss. Red dot is location where loss first reaches within $0.01$ of optimal value.](./figures/gradient_descent/momentum_out1.png){#fig-gradient_descent-momentum_out1}
It is also possible to come up with other kinds of momentum, which bias the update direction based on some accumulated information from previous updates. Two popular alternatives, which you can read up on elsewhere, are Nesterov's accelerated gradient~\cite{nesterov1983method} and Adam~\cite{kingma2014adam}.
## What Kinds of Functions Can Be Minimized with Gradient Descent?
What if a function is not differentiable? Can we still use gradient descent? Sometimes! The property we need is that we can get a meaningful signal as to how to perturb the function's parameters in order to reduce the loss. This property is *not* the same as differentiability defined in math textbooks. A function may be differentiable but not give useful gradients (e.g., if the gradient is zero everywhere), and a function may be nondifferentiable (at certain points) but still allow for meaningful gradient-based updates (e.g., $\texttt{abs}$).
@fig-gradient_descent-grad_descent_simple_examples gives examples of different types of functions being minimized with gradient descent. @fig-gradient_descent-grad_descent_simple_examples (b) and @fig-gradient_descent-grad_descent_simple_examples (d) are cases where the function is discontinuous, and the analytical derivative is undefined at the discontinuity. Surprisingly, in @fig-gradient_descent-grad_descent_simple_examples (b), this is not a problem for gradient descent. This is because the gradient descent algorithm we are using here (the one used by PyTorch~\cite{paszke2019pytorch}) uses a **one-sided derivative** at the discontinuity, that is, we set the gradient at the discontinuity to be equal to the gradient value an infinitesimal step away from the discontinuity in a fixed arbitrary direction. Under the hood, for each atomic discontinuous operation, PyTorch requires that we define its gradients at the discontinuities, and the one-sided gradient is a standard choice. This is why it can be fine in deep learning to use functions like rectified linear units (ReLUs), which are common in deep networks and have discontinuous gradients.
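A quick way to see this behavior is to ask PyTorch for the gradient of $\texttt{relu}$ exactly at its kink. The snippet below is a small check; the particular subgradient value returned is an implementation choice and could differ across versions.

```python
import torch

x = torch.tensor(0.0, requires_grad=True)
y = torch.relu(x)  # relu is nondifferentiable at x = 0
y.backward()
print(x.grad)      # PyTorch returns a fixed choice of (sub)gradient here, not an error
```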
@fig-gradient_descent-grad_descent_simple_examples (c) and (e) give cases where the function is continuous, but the gradients are not well-behaved. In @fig-gradient_descent-grad_descent_simple_examples (c) we have a gradient that has nearly **vanished**, that is, it is near zero everywhere, and gradient descent, with a fixed learning rate, will therefore be slow. @fig-gradient_descent-grad_descent_simple_examples (e) shows the opposite scenario: the gradient at the minimizer goes to infinity; we call this an **exploding gradient**, and this leads to failures of convergence.
Finally, @fig-gradient_descent-grad_descent_simple_examples (f) shows one more problematic case: when there are multiple minima, gradient descent can get stuck in a suboptimal minimum. Which minimum we arrive at will depend on where we initialized $x$.
![**How gradient descent behaves on various functions.** In each subplot, the left shows the function $J$, with the red point representing the solution found by gradient descent (GD) with $\eta=0.01$ and $\mu=0.9$. The right shows the trajectory of $x$ values over iterations of GD, plotted on top of $J$ at each iteration. (a) As $\eta$ goes to zero, GD converges for convex functions. (b) Discontinuities pose no essential problem, as long as the gradient is defined on either side. (c) A nearly flat function will exhibit very slow descent. (d) Piecewise constant functions are problematic because the gradient completely vanishes. (e) For the function $J=\texttt{sqrt}(\texttt{abs}(\theta))-0.25$, the gradient goes to infinity at the minimizer, causing instability. (f) When $J$ has multiple local minima, we may not find the global minimum.](./figures/gradient_descent/grad_descent.png){#fig-gradient_descent-grad_descent_simple_examples}
### Gradient-Like Optimization for Functions without Good Gradients {#sec-gradient_descent-zeroth_order}
What about minimizing functions like @fig-gradient_descent-grad_descent_simple_examples (d), where the gradient is zero almost everywhere? This is a case where gradient descent truly struggles. However, it is often possible to transform such a problem into one that can be treated with gradient descent. Remember that the key property of a gradient, from the perspective of optimization, is that it is a locally loss-minimizing direction in parameter space. Most gradient-based optimizers don't really need true gradients; instead, their update functions are compatible with a broader family of local loss-minimizing directions, $\mathbf{v}$.
Besides the true gradient, what are some other good choices for $\mathbf{v}$? One common idea is to set $\mathbf{v}$ to be the gradient of a **surrogate loss** function, which is a function, $J_{\texttt{surr}}$, with meaningful (non-zero) gradients that approximates $J$. An example might be a smoothed version of $J$. Another way to get $\mathbf{v}$ is to compute it by sampling perturbations of $\theta$, and seeing which perturbation leads to lower loss. In this strategy, we evaluate $J(\theta+\epsilon)$ for a set of perturbations $\epsilon$, then move toward the $\epsilon$'s that decreased the loss. Approaches of this kind are sometimes called **evolution strategies**~\cite{beyer2002evolution, salimans2017evolution}, and a basic version of this algorithm is given in @alg-gradient_descent_ES.
![Evolution strategy algorithm.](./figures/gradient_descent/alg4.png){#alg-gradient_descent_ES}
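Below is a rough Python sketch of one evolution-strategies update on a scalar parameter. The exact weighting and normalization in @alg-gradient_descent_ES may differ, so treat this as an illustration of the sampling idea rather than a faithful transcription.

```python
import numpy as np

def es_step(J, theta, eta=0.02, sigma=1.0, M=10, rng=None):
    """One evolution-strategies update: sample M perturbations of theta and
    move toward the perturbations that lower the (possibly gradient-free) loss J."""
    if rng is None:
        rng = np.random.default_rng()
    eps = rng.normal(scale=sigma, size=M)                           # random perturbations
    improvement = J(theta) - np.array([J(theta + e) for e in eps])  # > 0 means lower loss
    v = np.mean(improvement * eps)                                  # loss-weighted average direction
    return theta + eta * v
```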
As shown in @fig-gradient_descent-sampling_out1, this algorithm can successfully minimize the function in @fig-gradient_descent-grad_descent_simple_examples (d).
![Using @alg-gradient_descent_ES to minimize a nondifferentiable (zero-gradient) loss, using $\sigma=1$, $M=10$, and $\eta=0.02$.](./figures/gradient_descent/sampling_out1.png){#fig-gradient_descent-sampling_out1 width=80%}
### Gradient Clipping
What about @fig-gradient_descent-grad_descent_simple_examples (e), where the gradient explodes near the optimum? Is there anything we can do to improve the optimization of this function? To combat exploding gradients, a useful trick is **gradient clipping**, which just means clamping the magnitude of the gradient to some maximum value. @alg-gradient_descent_grad_clipping describes this approach.
![Gradient clipping algorithm.](./figures/gradient_descent/alg5.png){#alg-gradient_descent_grad_clipping}
:::{.column-margin}
`clip` is the "clipping" function: $\texttt{clip}(v, -m, m) = \max(\min(v,m),-m)$
:::
This algorithm indeed successfully minimizes our example of the exploding gradient, as can be seen in @fig-gradient_descent-clipped_out1.
![Using `GD` with clipping to minimize a loss with exploding gradients, using $m=0.1$.](./figures/gradient_descent/clipped_out1.png){width=80% #fig-gradient_descent-clipped_out1}
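As a concrete illustration, a clipped gradient step might look like the following sketch, where `grad_J` is again an assumed gradient function and $m$ is the clipping threshold applied elementwise, as in the margin definition of `clip`.

```python
import numpy as np

def gd_clipped(grad_J, theta0, eta=0.01, m=0.1, K=1000):
    """Gradient descent where each gradient entry is clamped to [-m, m] before the step."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(K):
        g = np.clip(grad_J(theta), -m, m)  # clip(v, -m, m) = max(min(v, m), -m)
        theta = theta - eta * g
    return theta
```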
## Stochastic Gradient Descent {#sec-gradient_descent-SGD}
One problem with the gradient-based methods we have seen so far is that the gradient may in fact be very expensive to compute, and this is often the case for learning problems. This is because learning problems typically have the form that $J$ is the average of losses incurred on each training datapoint. Computing $\nabla_{\theta}J(\theta)$ requires computing the gradient for each element in the average, that is, the gradient of the loss evaluated at each datapoint in the training set. If we train on a big dataset, say 1 million training points, then performing just *one* step of gradient descent requires computing 1 million gradients!
To make this clear, we will write out $J$ as an explicit function of the training data $\{\mathbf{x}^{(i)}, \mathbf{y}^{(i)}\}_{i=1}^N$. For typical learning problems, $\nabla_{\theta} J(\theta, \{\mathbf{x}^{(i)}, \mathbf{y}^{(i)}\}_{i=1}^N)$ decomposes as follows:
$$
\begin{align}
\nabla_{\theta} J(\theta, \{\mathbf{x}^{(i)}, \mathbf{y}^{(i)}\}_{i=1}^N) &=
\nabla_{\theta} \frac{1}{N}\sum_{i=1}^N \mathcal{L}(f_{\theta}(\mathbf{x}^{(i)}), \mathbf{y}^{(i)})\\
&= \frac{1}{N}\sum_{i=1}^N \nabla_{\theta} \mathcal{L}(f_{\theta}(\mathbf{x}^{(i)}), \mathbf{y}^{(i)})
\end{align}
$$
For large $N$, computing this sum is very expensive. Suppose instead we randomly subsample (without replacement) a *batch* of terms from this sum, $\{\mathbf{x}^{(b)}, \mathbf{y}^{(b)}\}_{b=1}^B$, where $B$ is the **batch size**. We then compute an *estimate* of the total gradient as the average gradient over this batch as follows:
$$
\begin{align}
\tilde{\mathbf{g}} = \frac{1}{B}\sum_{b=1}^B \nabla_{\theta} \mathcal{L}(f_{\theta}(\mathbf{x}^{(b)}), \mathbf{y}^{(b)})
\end{align}
$$
If we sample a large batch, where $B$ is almost as large as $N$, then the average over the $B$ terms should be roughly the same as the average over all $N$ terms. If we sample a smaller batch, then our estimate of the gradient will be less accurate but faster to compute. Therefore we have a tradeoff between accuracy and speed, and we can navigate this tradeoff with the hyperparameter $B$.
The variant of gradient descent that uses this idea is called **stochastic gradient descent** (SGD), because each iteration of descent uses a different randomly (stochastically) sampled batch of training data to estimate the gradient.
The full description of SGD is given in @alg-gradient_descent_SGD.
![Stochastic gradient descent algorithm.](./figures/gradient_descent/alg6.png){#alg-gradient_descent_SGD}
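A minimal Python sketch of `SGD` is given below; `grad_L` stands for the per-example gradient $\nabla_{\theta} \mathcal{L}(f_{\theta}(\mathbf{x}), \mathbf{y})$, which is assumed to be provided (e.g., by automatic differentiation), and the defaults are placeholders.

```python
import numpy as np

def sgd(grad_L, theta0, X, Y, eta=0.1, B=32, K=1000, rng=None):
    """Stochastic gradient descent: each step estimates the full-data gradient
    from a randomly sampled mini-batch of B training examples."""
    if rng is None:
        rng = np.random.default_rng()
    theta = np.asarray(theta0, dtype=float)
    N = len(X)
    for _ in range(K):
        idx = rng.choice(N, size=B, replace=False)  # sample a batch without replacement
        g = np.mean([grad_L(theta, X[i], Y[i]) for i in idx], axis=0)  # batch gradient estimate
        theta = theta - eta * g
    return theta
```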
`SGD` has a number of useful properties beyond just being faster to compute than `GD`. Because each step of descent is somewhat random, $\texttt{SGD}$ can jump over small bumps in the loss landscape, as long as those bumps disappear for some randomly sampled batches. Another important property is that $\texttt{SGD}$ can implicitly regularize the learning problem. For example, for linear problems (i.e., when $f_\theta$ is linear), if there are multiple parameter settings that minimize the loss, $\texttt{SGD}$ will often converge to the solution with minimum parameter norm~\cite{zhang2021understanding}.
## Concluding Remarks
The study of optimization can fill dozens of textbooks and thousands of academic papers. But fortunately for us, modern machine learning has converged on just a few very simple optimization methods that are used in practice. We will soon encounter deep learning, which is the main kind of machine learning used for computer vision. In deep learning, gradient-based optimization is the workhorse. Believe it or not, the handful of algorithms described above are enough to train most state-of-the-art deep learning models. Every year there are new elaborations on these ideas, and second-order methods are ever on the horizon, yet the basic concepts remain quite simple: compute a local estimate of the shape of the loss landscape, then, based on this shape, take a small step toward a lower loss.