Sebastian Ruder (2017), "An overview of gradient descent optimization algorithms"
- Batch GD:
- Computes the gradient over the entire training set for a single update --> very slow, and intractable if the data doesn't fit in memory
- Can't train online
- Guaranteed to converge to the global minimum for convex objectives, and to a local minimum otherwise
- SGD:
- Update for every training sample --> allows online training
- Much higher variance in the updates --> objective function fluctuates heavily
- Much faster (fewer repeated gradient computations)
- Fluctuations can help jump to a better local minimum, but exact convergence requires slowly decreasing the learning rate to prevent overshooting
- Mini-batch GD:
- Mini-batch of n examples: reduces update variance (more stable convergence) and can use highly optimized matrix operations per batch --> best of both worlds! (all three variants sketched below)
- Typical n: 50-256
- Also needs a decreasing learning rate; good convergence is not guaranteed otherwise
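A minimal sketch of all three variants in NumPy, assuming a user-supplied `grad(theta, batch)` that returns the average gradient over a batch (function names and defaults here are illustrative, not from the paper):

```python
import numpy as np

def gd_epoch(theta, data, grad, lr=0.01, batch_size=64):
    """One epoch of mini-batch gradient descent.
    batch_size=1 gives SGD; batch_size=len(data) gives batch GD."""
    np.random.shuffle(data)                      # reshuffle every epoch
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        theta = theta - lr * grad(theta, batch)  # vanilla update step
    return theta
```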
- Challenges of mini-batch SGD:
- Choosing proper learning rate
- Learning rate schedules are defined in advance, so they can't adapt to the dataset's characteristics; and the same rate applies to every parameter, even though infrequent features may need larger updates
- Non-convex objective functions: the main difficulty comes from saddle points rather than local minima (gradient is close to 0 in all directions, hard for SGD to escape)
- Momentum:
- Adds a fraction gamma of the previous update vector to the current one --> updates grow along dimensions whose gradients keep pointing the same way and shrink along dimensions whose gradients keep changing sign (less oscillation, faster convergence)
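A sketch of one momentum step (gamma = 0.9 is the typical value mentioned in the paper; the state handling is illustrative):

```python
def momentum_step(theta, v, g, lr=0.01, gamma=0.9):
    """Momentum: v is a decaying accumulation of past gradients. Dimensions whose
    gradients keep the same sign build up speed; oscillating ones partly cancel."""
    v = gamma * v + lr * g
    theta = theta - v
    return theta, v
```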
- Nesterov accelerated gradient (NAG):
- Gives momentum a notion of where it is heading, so it can slow down before the slope goes up again
- Computes the gradient at the approximate future position (after the momentum step) --> prevents overshooting
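A sketch of one NAG step; `grad(theta)` is assumed to return the gradient at a given point:

```python
def nag_step(theta, v, grad, lr=0.01, gamma=0.9):
    """NAG: evaluate the gradient at the look-ahead point theta - gamma * v,
    roughly where the momentum term is about to move the parameters."""
    g_ahead = grad(theta - gamma * v)   # gradient at the predicted next position
    v = gamma * v + lr * g_ahead
    theta = theta - v
    return theta, v
```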
- Adagrad:
- Different learning rate for every parameter at every timestep
- Larger updates for infrequent parameters, smaller ones for frequent
- Weakness: the squared gradients accumulated in the denominator only ever grow, so the effective learning rate eventually becomes vanishingly small and learning stalls
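A sketch of one Adagrad step (G is the per-parameter running sum of squared gradients; eps only avoids division by zero):

```python
import numpy as np

def adagrad_step(theta, G, g, lr=0.01, eps=1e-8):
    """Adagrad: per-parameter step size lr / sqrt(G + eps). Rarely-updated
    parameters keep a large effective rate, frequent ones are damped.
    G only grows, so the effective rate eventually shrinks toward zero."""
    G = G + g ** 2
    theta = theta - lr * g / np.sqrt(G + eps)
    return theta, G
```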
- Adadelta:
- Solves adagrad's issue of the learning rate shrinking toward zero
- Replaces the sum of squared gradients with an exponentially decaying moving average (EDMA)
- No default learning rate needed: the step is scaled by an EDMA of past squared parameter updates instead
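A sketch of one Adadelta step following the paper's update rule (rho around 0.9 is a common choice, not a value from these notes):

```python
import numpy as np

def adadelta_step(theta, Eg2, Edx2, g, rho=0.9, eps=1e-8):
    """Adadelta: Eg2 is an EDMA of squared gradients (replaces adagrad's growing
    sum); Edx2 is an EDMA of squared past updates and plays the role of the
    learning rate, so no default learning rate is needed."""
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2
    theta = theta + dx
    return theta, Eg2, Edx2
```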
- RMSprop:
- Like adadelta, but needs default learning rate to be set
- Divides the learning rate by the root of an EDMA of squared gradients (recent large fluctuations --> smaller effective step)
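A sketch of one RMSprop step (lr = 0.001 and rho = 0.9 are Hinton's suggested defaults):

```python
import numpy as np

def rmsprop_step(theta, Eg2, g, lr=0.001, rho=0.9, eps=1e-8):
    """RMSprop: divide the learning rate by the RMS of an EDMA of squared
    gradients; recent large fluctuations shrink the effective step."""
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2
    theta = theta - lr * g / (np.sqrt(Eg2) + eps)
    return theta, Eg2
```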
- Adam:
- EDMA of squared gradients (second moment, like adadelta/RMSprop) plus an EDMA of the gradients themselves (first moment, like momentum)
- Both estimates start at zero and are biased toward zero during the first steps, so they are bias-corrected
- Empirically tends to perform better than the other adaptive methods!
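A sketch of one Adam step with the paper's proposed defaults (t counts from 1; m and v start at zero):

```python
import numpy as np

def adam_step(theta, m, v, g, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: m is an EDMA of gradients (first moment), v an EDMA of squared
    gradients (second moment). Both start at zero and are biased toward zero
    early on, hence the bias-corrected m_hat and v_hat."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```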
- AdaMax:
- Instead of scaling the update with the L2-based second moment (adam), use the L^inf norm: an elementwise max of past gradient magnitudes
- The max-based estimate is not biased toward zero --> no bias correction needed for it
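A sketch of one AdaMax step (lr = 0.002 is the default suggested in the paper; the eps in the denominator is my addition for numerical safety):

```python
import numpy as np

def adamax_step(theta, m, u, g, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """AdaMax: u is an infinity-norm accumulator (elementwise max of past
    gradient magnitudes) instead of adam's L2-based v; u is not biased toward
    zero, so only the first moment m needs correction."""
    m = beta1 * m + (1 - beta1) * g
    u = np.maximum(beta2 * u, np.abs(g))
    theta = theta - (lr / (1 - beta1 ** t)) * m / (u + eps)  # eps not in the paper
    return theta, m, u
```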
- Nadam:
- Adam + NAG (NAG usually beats plain momentum)
- Gets a more accurate step in the gradient direction by applying the momentum (look-ahead) step before computing the gradient, i.e. the current momentum term is used directly in the current update
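A sketch of one Nadam step using the simplified update given in the paper (it ignores the per-step schedule on beta1):

```python
import numpy as np

def nadam_step(theta, m, v, g, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Nadam: like adam, but the update mixes the current momentum estimate
    (the look-ahead) with the bias-corrected current gradient, i.e. Nesterov-
    style momentum instead of plain momentum."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    nesterov_m = beta1 * m_hat + (1 - beta1) * g / (1 - beta1 ** t)
    theta = theta - lr * nesterov_m / (np.sqrt(v_hat) + eps)
    return theta, m, v
```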
- What to use?
- Sparse input data, or you want fast convergence when training a deep or complex net? --> use an adaptive-learning-rate method
- Adam seems the best overall choice (its bias correction helps it edge out RMSprop as gradients become sparser toward the end of training)
- Other strategies:
- Shuffling (avoid presenting examples in a meaningful but biased order) and curriculum learning (present examples in an order of increasing difficulty)
- Batch norm: re-normalize activations for each mini-batch --> allows higher learning rates and reduces (sometimes removes) the need for dropout
- Early stopping
- Gradient noise: add annealed Gaussian noise to each gradient update, which helps with very deep nets and poor initialization (see the sketch below)
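A sketch of annealed gradient noise (Neelakantan et al., as summarized in the paper); the eta and gamma defaults come from their suggested values, and the function name is mine:

```python
import numpy as np

def noisy_gradient(g, t, eta=0.3, gamma=0.55, rng=None):
    """Add zero-mean Gaussian noise with variance eta / (1 + t)**gamma to the
    gradient; the variance is annealed so the noise fades as training progresses."""
    if rng is None:
        rng = np.random.default_rng()
    sigma = np.sqrt(eta / (1 + t) ** gamma)
    return g + rng.normal(0.0, sigma, size=np.shape(g))
```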