Gradient descent

  • Minimize some objective function J(theta).
  • Update the parameters theta in the opposite direction of the gradient of J with respect to theta, scaled by the learning rate eta.
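The update rule above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the toy objective J(theta) = (theta - 3)^2, the function names, and the values of `eta` and `steps` are all assumptions chosen for the example.

```python
def gradient_descent(grad, theta, eta=0.1, steps=100):
    """Repeatedly step against the gradient: theta <- theta - eta * grad(theta)."""
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta


# Toy objective (assumed for illustration): J(theta) = (theta - 3)^2
def grad_J(theta):
    return 2.0 * (theta - 3.0)


theta_min = gradient_descent(grad_J, theta=0.0)
print(theta_min)  # approaches the minimiser theta = 3
```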

Variants:

  • Batch gradient descent: the original; computes the gradient over the whole training dataset for each update.
  • Stochastic gradient descent: one update per training example.
  • Mini-batch gradient descent: one update per small batch of training examples.
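The three variants differ only in how many examples feed each update, which a single `batch_size` parameter can express. The sketch below assumes a one-parameter linear model y = w * x with a mean squared error loss and hypothetical noise-free data; `train`, `grad_mse`, and the hyperparameter values are made up for illustration.

```python
import random


def grad_mse(w, batch):
    """Gradient of the mean squared error of y = w * x over a batch of (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def train(data, batch_size, eta=0.5, epochs=50, seed=0):
    """batch_size = len(data): batch GD; 1: SGD; anything in between: mini-batch."""
    rng = random.Random(seed)
    data = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a new random order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w -= eta * grad_mse(w, batch)  # one parameter update per batch
    return w


# Hypothetical noise-free data generated from y = 2x
data = [(k / 10.0, 2.0 * k / 10.0) for k in range(1, 11)]

w_batch = train(data, batch_size=len(data))  # 1 update per epoch
w_sgd = train(data, batch_size=1)            # 10 updates per epoch
w_mini = train(data, batch_size=5)           # 2 updates per epoch
```

On this noise-free toy problem all three settings recover w close to 2; the differences in update variance discussed in these notes only show up on noisy, larger datasets.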

Batch gradient descent:

  • Guaranteed to converge to the global minimum for convex error surfaces, given a suitable learning rate.
  • Converges to a local minimum for non-convex surfaces.

Stochastic gradient descent:

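  • Stochastic means random: each update uses a randomly picked training example.
  • SGD performs a parameter update for each training example.
  • Advantage: the noisy updates enable it to jump to a new and potentially better local minimum.
  • Side effect: the same noise complicates convergence to the exact minimum, as SGD keeps overshooting.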

Mini-batch gradient descent:

  • Best of both worlds.
  • Reduces variance of parameter updates.
  • More stable convergence.
  • Algorithm of choice for NN training.
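  • Can use some neat optimisations: highly optimised matrix operations on the GPU.

Challenges of plain mini-batch gradient descent:

  • Learning rate "schedules" reduce eta at preset times, so they cannot adapt to the characteristics of the dataset.
  • Not all the features have the same variance or frequency. A single universal learning rate for all parameters doesn't make sense here.
  • Saddle points are challenging: the gradient is close to zero in every direction around them, which makes them hard to escape. They are arguably a bigger problem than local minima.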
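As an illustration of a learning rate schedule that reduces eta at preset times, here is a minimal step-decay sketch. The function name `step_decay` and the values of `drop` and `every` are assumptions for the example, not something these notes prescribe.

```python
def step_decay(eta0, epoch, drop=0.5, every=10):
    """Multiply the initial learning rate by `drop` once every `every` epochs."""
    return eta0 * drop ** (epoch // every)


# Learning rate at a few epochs, starting from eta0 = 0.1
etas = [step_decay(0.1, epoch) for epoch in (0, 9, 10, 25)]
```

Because the drop times are fixed in advance, the schedule is blind to how training is actually progressing.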

Gradient descent optimization:

  • Momentum
  • Adagrad
  • Adadelta
  • RMSprop
  • Adam
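To make the idea of a per-parameter learning rate concrete, here is a rough sketch of Adagrad, which divides each parameter's step by the square root of its accumulated squared gradients. The badly scaled objective, the function names, and the hyperparameters are assumptions chosen for illustration.

```python
import math


def adagrad(grad, theta, eta=0.5, eps=1e-8, steps=500):
    """Adagrad: scale each parameter's step by 1 / sqrt(sum of its past squared gradients)."""
    theta = list(theta)
    g2_sum = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(theta)
        for i in range(len(theta)):
            g2_sum[i] += g[i] ** 2
            theta[i] -= eta * g[i] / (math.sqrt(g2_sum[i]) + eps)
    return theta


# Hypothetical badly scaled objective: J(a, b) = 50 * a^2 + 0.5 * b^2
def grad_J(theta):
    a, b = theta
    return [100.0 * a, 1.0 * b]


theta_adagrad = adagrad(grad_J, [1.0, 1.0])
```

With a single shared learning rate, plain gradient descent would need eta below 0.02 to stay stable along `a`, which makes progress along `b` very slow; the per-parameter scaling sidesteps that trade-off on this toy problem.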

Momentum optimisation:

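  • Plain SGD has problems with "ravines", i.e. parts of the space where the surface curves much more steeply in one dimension than in another: it oscillates across the ravine and makes slow progress along it.
  • Momentum adds a fraction gamma of the previous update to the current one, like a ball rolling downhill that accumulates velocity. Keeping gamma < 1 acts like air resistance, so the velocity stays bounded.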
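A minimal sketch of the momentum update on a toy ravine follows. It assumes the common formulation v <- gamma * v + eta * grad, theta <- theta - v; the objective, the function names, and the hyperparameters are hypothetical.

```python
def momentum_gd(grad, theta, eta=0.01, gamma=0.9, steps=300):
    """Gradient descent with momentum; gamma = 0 recovers plain gradient descent."""
    theta = list(theta)
    v = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(theta)
        for i in range(len(theta)):
            v[i] = gamma * v[i] + eta * g[i]  # decaying sum of past gradient steps
            theta[i] -= v[i]
    return theta


# Hypothetical ravine: J(a, b) = 0.5 * (a^2 + 50 * b^2), steep in b and shallow in a
def grad_ravine(theta):
    a, b = theta
    return [a, 50.0 * b]


theta_momentum = momentum_gd(grad_ravine, [10.0, 1.0])
theta_plain = momentum_gd(grad_ravine, [10.0, 1.0], gamma=0.0)
```

With the same learning rate and number of steps, the momentum run ends much closer to the minimum along the shallow `a` direction than the plain run does.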

Adam optimisation:

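  • Adaptive Moment Estimation (Adam): keeps exponentially decaying averages of past gradients and of past squared gradients, and uses them to set a per-parameter step size.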
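The Adam update can be sketched as below, using the commonly quoted defaults beta1 = 0.9, beta2 = 0.999 and eps = 1e-8. The toy objective, the function names, and the choice of `eta` and `steps` are assumptions for illustration only.

```python
import math


def adam(grad, theta, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Adam: per-parameter steps from bias-corrected first and second moment estimates."""
    theta = list(theta)
    m = [0.0] * len(theta)  # decaying average of gradients (first moment)
    v = [0.0] * len(theta)  # decaying average of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad(theta)
        for i in range(len(theta)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] ** 2
            m_hat = m[i] / (1.0 - beta1 ** t)  # correct the bias towards zero
            v_hat = v[i] / (1.0 - beta2 ** t)
            theta[i] -= eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Hypothetical badly scaled objective: J(a, b) = 50 * a^2 + 0.5 * b^2
def grad_J(theta):
    a, b = theta
    return [100.0 * a, 1.0 * b]


theta_adam = adam(grad_J, [1.0, 1.0])
```

The first moment plays the role of momentum, while dividing by the root of the second moment gives each parameter its own effective learning rate.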