An overview of gradient descent optimization algorithms

Sebastian Ruder (2017)

Key points

  • Batch GD:
    • One update per pass over the whole training set --> very slow, intractable for datasets that don't fit in memory!
    • Can't train online
    • Guaranteed to converge to the global minimum (convex objective) or a local minimum (non-convex)
  • SGD:
    • Update for every training sample --> allows online training
    • Much higher variance: the objective fluctuates heavily because each update uses only one example
    • Much faster (avoids redundant gradient computations on similar examples)
    • Fluctuations can help jump to a better local minimum, but exact convergence needs a decreasing learning rate to prevent overshooting!
  • Mini-batch GD:
    • Mini-batch of n examples: reduces the variance of the updates (more stable convergence) and exploits highly optimized matrix operations --> best of both worlds! (a basic training loop is sketched after this list)
    • Typical n: 50-256
    • Still needs a decreasing learning rate (good convergence is otherwise not guaranteed)
  • Challenges of mini-batch SGD:
    • Choosing a proper learning rate
    • Learning rate schedules are defined in advance and can't adapt to the data; the same rate is also applied to every parameter, even though infrequent features may warrant larger updates
    • Non-convex objective function: the main difficulty is saddle points surrounded by plateaus (gradient close to zero in all dimensions), not poor local minima
  • Momentum:
    • Adds a decaying fraction of past updates to the current one: the step grows in dimensions whose gradients consistently point the same way and shrinks in dimensions whose gradients keep changing sign (see the sketch after this list)
  • Nesterov accelerated gradient (NAG):
    • Gives momentum foresight: it slows down before the slope goes up again
    • Computes the gradient at the approximate future position (after the momentum step) --> prevents overshooting
  • Adagrad:
    • Different learning rate for every parameter at every timestep
    • Larger updates for infrequent parameters, smaller ones for frequent
    • Accumulates the sum of squared gradients in the denominator, so the effective learning rate keeps shrinking and eventually becomes vanishingly small (see the Adagrad/RMSprop sketch after this list)
  • Adadelta:
    • Fixes Adagrad's problem of the learning rate shrinking toward zero
    • Keeps the squared gradients as an exponentially decaying moving average (EDMA) instead of a full sum
    • No default learning rate needed (uses an EDMA of squared parameter updates in its place)
  • RMSprop:
    • Like Adadelta's first update rule, but a default learning rate still has to be set
    • Divides the learning rate by the RMS of an EDMA of squared gradients (dimensions with large, fluctuating gradients get a large denominator and thus smaller steps)
  • Adam:
    • Keeps an EDMA of squared gradients (like Adadelta/RMSprop) and an EDMA of past gradients (like momentum)
    • Bias-corrects both estimates, which are biased toward zero during the initial steps (see the Adam sketch after this list)
    • Performs better than the other methods in practice!
  • AdaMax:
    • Scales the update with the L∞ norm (max of current and decayed past gradients) instead of Adam's L2-based second moment
    • The max-based estimate is not biased toward zero --> no bias correction needed
  • Nadam:
    • Adam + NAG (NAG usually beats plain momentum)
    • Takes a more accurate step in the gradient direction by applying the look-ahead momentum step when updating the parameters, rather than the outdated one
  • What to use?
    • Sparse data, or fast convergence on a deep/complex net? --> use an adaptive learning-rate method
    • Adam seems best overall (its bias correction helps toward the end of optimization as gradients become sparser)
  • Other strategies:
    • Shuffling, curriculum learning (presenting examples in a meaningful order)
    • Batch norm: makes training less sensitive to initialization and reduces the need for dropout
    • Early stopping
    • Gradient noise (add Gaussian noise to each gradient; makes training more robust to poor initialization)
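
The mini-batch SGD loop referenced above, as a minimal NumPy sketch. The names (`minibatch_sgd`, `grad`, `data`) and the decay schedule are placeholders for illustration, not taken from the paper.

```python
import numpy as np

def minibatch_sgd(theta, data, grad, lr=0.01, batch_size=50, n_epochs=10):
    """Plain mini-batch SGD: shuffle each epoch, one update per mini-batch,
    with a simple decaying learning rate to help convergence."""
    for _ in range(n_epochs):
        np.random.shuffle(data)                      # shuffling (see "Other strategies")
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            theta = theta - lr * grad(theta, batch)  # one update per mini-batch
        lr *= 0.9                                    # illustrative decay schedule
    return theta
```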
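A minimal sketch of the momentum and NAG update rules summarized above, assuming a `grad` callable that returns the mini-batch gradient; function names and defaults (gamma = 0.9) are illustrative.

```python
import numpy as np  # theta, v are assumed to be NumPy arrays

def momentum_step(theta, v, grad, lr=0.01, gamma=0.9):
    """Momentum: v is a decaying accumulation of past gradients, so consistent
    gradient directions build up speed and oscillating ones cancel out."""
    v = gamma * v + lr * grad(theta)
    return theta - v, v

def nag_step(theta, v, grad, lr=0.01, gamma=0.9):
    """NAG: evaluate the gradient at the look-ahead point theta - gamma * v,
    which lets the update slow down before the slope rises again."""
    v = gamma * v + lr * grad(theta - gamma * v)
    return theta - v, v
```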
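Adagrad and RMSprop side by side, to make the difference noted above concrete: a running sum of squared gradients versus an exponentially decaying average. Names and defaults are assumptions for the sketch.

```python
import numpy as np

def adagrad_step(theta, G, grad, lr=0.01, eps=1e-8):
    """Adagrad: G accumulates the sum of squared gradients per parameter,
    so per-parameter step sizes shrink monotonically over time."""
    g = grad(theta)
    G = G + g ** 2
    return theta - lr * g / (np.sqrt(G) + eps), G

def rmsprop_step(theta, Eg2, grad, lr=0.001, gamma=0.9, eps=1e-8):
    """RMSprop: same scaling idea, but with an exponentially decaying average
    of squared gradients so the effective learning rate does not vanish."""
    g = grad(theta)
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2
    return theta - lr * g / (np.sqrt(Eg2) + eps), Eg2
```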
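The Adam update with its bias correction, as summarized above. The defaults (beta1 = 0.9, beta2 = 0.999, eps = 1e-8) are the values suggested in the paper; the function signature itself is only a sketch.

```python
import numpy as np

def adam_step(theta, m, v, t, grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: EDMA of gradients (m, mean-like) and of squared gradients
    (v, variance-like), with bias correction because both start at zero.
    t is the 1-based step counter."""
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g           # first-moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # counteract zero-bias of m
    v_hat = v / (1 - beta2 ** t)              # counteract zero-bias of v
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```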