
WIP

First Things First


There are three important notes about the optimizers.

  • You can press the three dots (...) to open the settings for that optimizer.
  • The default settings are the optimizer defaults, and are likely not the best for training diffusion models.
  • OneTrainer stores the last settings you set for each optimizer; you can recall them by pressing "Load Defaults".

The standard optimizers:

  • Adagrad - Adaptive Gradient Algorithm
  • ADAM - Adaptive Moment Estimation
  • ADAMW - Adaptive Moment Estimation with weight decay
  • LAMB - Layer-wise Adaptive Moments optimizer for Batch training
  • LARS - Layer-wise Adaptive Rate Scaling
  • LION - EvoLved Sign Momentum
  • RMSPROP - Root Mean Square Propagation
  • SGD - Stochastic Gradient Descent

All standard optimizers are also available in 8bit.

If you aren't put off by complex mathematical formulas, a good introductory technical video that discusses some of these optimizers is here

Note: The 8-bit versions save VRAM by using 8-bit quantization. There is a quality trade-off for doing this.
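
As a rough illustration (assuming the bitsandbytes implementations, which is what 8-bit optimizers usually are in this ecosystem), selecting an 8-bit optimizer in plain PyTorch code amounts to something like:

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(16, 16)  # stand-in for the network being trained

# The 8-bit variant keeps optimizer state quantized to 8 bits,
# trading a small amount of precision for a large VRAM saving.
# The values below are illustrative, not OneTrainer defaults.
optimizer = bnb.optim.AdamW8bit(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    weight_decay=1e-2,
)
```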


The DAdapt Family of Optimizers

These optimizers come from D-Adaptation (learning-rate-free learning): adaptive versions of the standard optimizers that estimate the learning rate on their own. A minimal usage sketch follows the list below.

  • DADAPT_ADA_GRAD
  • DADAPT_ADAM
  • DADAPT_ADAN
  • DADAPT_LION
  • DADAPT_SGD
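
A minimal usage sketch, assuming the facebookresearch/dadaptation package (OneTrainer exposes these through its UI, so this is only for orientation):

```python
import torch
import dadaptation

model = torch.nn.Linear(16, 16)  # stand-in for the network being trained

# D-Adaptation estimates the step size itself; lr is left at 1.0
# and acts only as a multiplier on the estimated rate.
optimizer = dadaptation.DAdaptAdam(model.parameters(), lr=1.0)
```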

The Unique Adaptive Optimizer

  • Prodigy - A special optimizer and a favorite: it is easy to use, but it can also be hard to get the exact results you are looking for. While pure ADAM or ADAMW can give better results, Prodigy is a good starting point if you have no idea what learning rate to use or how many steps are needed. Prodigy is VRAM intensive (especially for 1024x models) and is likely only usable for LoRA or embedding training unless you have massive amounts of VRAM. Under the hood, Prodigy is ADAMW. Tensorboard can be used to see what learning rate Prodigy stabilizes at, which can then be reused with AdamW or AdamW 8bit.

The Special

  • Adafactor - Adafactor can be used either as a low-memory ADAMW with a constant (or other) scheduler, or as an adaptive optimizer with the Adafactor scheduler.

Prodigy Specific Settings Details

First and foremost, note that Prodigy comes from https://github.com/konstmish/prodigy; the top-level README there has excellent details on using it.

OneTrainer default values are listed below, with recommended values in parentheses where applicable.

  • Beta1: 0.9
  • Beta2: 0.999 (recommended: 0.99 to 0.999)
  • Beta3: Blank
  • EPS: 1e-08
  • Weight Decay: 0.0 (recommended: 0.001 to 0.01)
  • Decouple: True
  • Bias Correction: False (recommended: True, unless using a constant scheduler/SDXL?)
  • Safeguard Warmup: False (recommended: True?)
  • Initial D: 1e-06
  • D coefficient: 1.0 (recommended: see note below)
  • Growth Rate: inf
  • FSDP in use: False
  • Learning rate: 1

Prodigy Notes: The learning rate should be 1 and must be the same for both text encoders and the unet; otherwise the LoRA will do nothing. The Learning Rate Scaler option should also be left at the default value of NONE. Consider not using any training-level warmup steps, as Prodigy has its own warmup method.

For Safeguard Warmup, True tends to be better for short training runs, but either True or False works for longer training (e.g. 100 epochs). For Bias Correction, sadly this is another setting where the best answer is really "it depends, try both" :(

To adjust Prodigy you should use the D coefficient instead; it acts as an LR multiplier. A D coefficient of 0.5 will slow Prodigy down, while a D coefficient of 2 will speed it up. Using a COSINE scheduler can help force Prodigy to learn details at the end of a training run, since Prodigy will never decrease the learning rate on its own. A Beta2 of 0.99 has been shown to help Prodigy converge better, and weight decay can help Prodigy not overtrain. Other techniques include dropout for LoRAs (found on the LoRA tab). Enabling Bias Correction makes Prodigy act more like ADAM.


D coeff effect on LR with Cosine

  • Blue: 1.2
  • Black: 1
  • Yellow: 0.5
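
The Prodigy settings above map directly onto the constructor arguments of the prodigyopt package. A minimal sketch reflecting the recommendations above (the exact values are examples, not requirements):

```python
import torch
from prodigyopt import Prodigy

model = torch.nn.Linear(16, 16)  # stand-in for the network being trained

# lr stays at 1 because Prodigy estimates the step size itself;
# d_coef is the knob for slowing it down (<1) or speeding it up (>1).
optimizer = Prodigy(
    model.parameters(),
    lr=1.0,
    betas=(0.9, 0.99),         # Beta2 = 0.99 per the recommendation above
    weight_decay=0.01,         # helps against overtraining
    decouple=True,
    use_bias_correction=True,  # makes Prodigy act more like ADAM
    safeguard_warmup=True,     # often helpful for shorter runs
    d_coef=1.0,                # LR multiplier: 0.5 slows Prodigy, 2 speeds it up
)

# Pair with a cosine schedule so the effective LR still decays at the end of training.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)
```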

Adafactor Specific Settings Details (Adaptive)

Default values are listed below, with recommended values in parentheses where applicable.

  • EPS: 1e-30
  • EPS2: 0.001
  • Clip Threshold: 1.0
  • Decay Rate: -0.8
  • Beta 1: Blank (leave blank unless you really know what this does)
  • Weight Decay: 0.0
  • Scale Parameter: True
  • Relative Step: True
  • Warmup Initialization: False
  • Stochastic Rounding: True
  • Fused Back Pass: False

Adafactor in adaptive mode notes:

  • It will ignore any LR value.
  • Adafactor with Relative Step and Scale Parameter enabled sets a learning rate that is too high for a fine tune. If you want to use Adafactor for fine tunes, see the next section.
  • Use the Adafactor scheduler along with these settings for adaptive learning.
  • Adafactor with Relative Step is slower, as it needs to do more calculations to determine its next move.
  • Entering a value for Beta 1 will significantly increase VRAM usage due to an EMA-type function being enabled.
  • Enabling "Fused Back Pass" can significantly reduce VRAM usage without any loss in quality, but this option is not compatible with gradient accumulation.
  • A rough equivalent configuration is sketched below.
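
As a rough equivalent outside OneTrainer, the Hugging Face transformers Adafactor in adaptive mode looks like the sketch below (OneTrainer's own implementation additionally offers stochastic rounding and a fused back pass, which this sketch does not cover):

```python
import torch
from transformers.optimization import Adafactor, AdafactorSchedule

model = torch.nn.Linear(16, 16)  # stand-in for the network being trained

# Adaptive mode: Relative Step and Scale Parameter stay on, no external LR.
optimizer = Adafactor(
    model.parameters(),
    lr=None,               # ignored in adaptive mode
    scale_parameter=True,
    relative_step=True,
    warmup_init=False,
    weight_decay=0.0,
)

# The matching schedule reads the internally computed learning rate.
scheduler = AdafactorSchedule(optimizer)
```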

Adafactor Specific Settings Details (Low Memory ADAMW)

Default values are listed below, with recommended values in parentheses where applicable.

  • EPS: 1e-30
  • EPS2: 0.001
  • Clip Threshold: 1.0
  • Decay Rate: -0.8
  • Beta 1: Blank (leave blank unless you really know what this does)
  • Weight Decay: 0.0 (recommended: 0.0 to 0.01)
  • Scale Parameter: True (required to be False in this mode)
  • Relative Step: True (required to be False in this mode)
  • Warmup Initialization: False
  • Stochastic Rounding: True
  • Fused Back Pass: False

Adafactor in constant mode notes:

  • A standard scheduler (constant, cosine, etc.) must be used with Adafactor in this mode.
  • Use a learning rate similar to AdamW, or the default for your training task.
  • Adafactor in this mode uses less VRAM than ADAMW (though more than ADAMW 8bit) without the 8-bit quantization quality loss.
  • Entering a value for Beta 1 will significantly increase VRAM usage due to an EMA-type function being enabled.
  • Enabling "Fused Back Pass" can significantly reduce VRAM usage without any loss in quality, but this option is not compatible with gradient accumulation.
  • A rough equivalent configuration is sketched below.
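
Analogously, the low-memory-ADAMW style configuration looks roughly like this (again a transformers-based sketch, not OneTrainer's exact implementation):

```python
import torch
from transformers.optimization import Adafactor

model = torch.nn.Linear(16, 16)  # stand-in for the network being trained

# Constant mode: switch off the adaptive machinery and give an AdamW-like LR.
optimizer = Adafactor(
    model.parameters(),
    lr=1e-4,                # use a learning rate similar to what you would give AdamW
    scale_parameter=False,  # must be False in this mode
    relative_step=False,    # must be False in this mode
    warmup_init=False,
    weight_decay=0.01,
)

# Any standard scheduler works here; a constant schedule is the simplest choice.
scheduler = torch.optim.lr_scheduler.ConstantLR(optimizer, factor=1.0)
```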


ADAM(W) Specific Settings Details


All The Options

  • adam_w_mode: Whether to use weight decay correction for the Adam optimizer. type: bool
  • alpha: Smoothing parameter for RMSprop and others. type: float
  • amsgrad: Whether to use the AMSGrad variant for Adam. type: bool
  • Beta1: Momentum term; coefficient for the running average of the gradient. type: float
  • Beta2: Coefficient for the running average of the squared gradient. type: float
  • Beta3: Coefficient for computing the Prodigy stepsize. type: float
  • bias_correction: Whether to use bias correction in optimization algorithms like Adam. type: bool
  • block_wise: Whether to perform block-wise model update. type: bool
  • capturable: Whether some property of the optimizer can be captured. type: bool
  • centered: Whether to center the gradient before scaling. Great for stabilizing the training process. type: bool
  • clip_threshold: Clipping value for gradients. type: float
  • Initial D: Initial D estimate for D-adaptation. type: float
  • D Coefficient: Coefficient in the expression for the estimate of d. type: float
  • Dampening: Dampening for momentum. type: float
  • Decay Rate: Rate of decay for moment estimation. type: float
  • Decouple: Use AdamW-style decoupled weight decay. type: bool
  • Differentiable: Whether the optimization function is differentiable. type: bool
  • EPS: A small value to prevent division by zero. type: float
  • EPS 2: A small value to prevent division by zero. type: float
  • ForEach: Whether to use a foreach implementation if available. This implementation is usually faster. type: bool
  • FSDP in Use: Flag for using sharded parameters. type: bool
  • Fused: Whether to use a fused implementation if available. This implementation is usually faster and requires less memory. type: bool
  • Growth Rate: Limit for D estimate growth rate. type: float
  • Initial Accumulator Value: Initial value for Adagrad optimizer. type: float
  • Is Paged: Whether the optimizer's internal state should be paged to CPU. type: bool
  • Log Every: Intervals at which logging should occur. type: int
  • LR Decay: Rate at which the learning rate decreases. type: float
  • Max Unorm: Maximum value for gradient clipping by norms. type: float
  • Maximize: Whether to maximize the optimization objective. type: bool
  • Min 8bit Size: Minimum tensor size for 8-bit quantization. type: int
  • Optimizer Momentum: Factor to accelerate SGD in relevant direction. type: float
  • Nesterov: Whether to enable Nesterov momentum. type: bool
  • No Prox: Whether to use proximity updates or not. type: bool
  • Optim Bits: Number of bits used for optimization. type: int
  • Percentile Clipping: Gradient clipping based on percentile values. type: float
  • Relative Step: Whether to use a relative step size. type: bool
  • Safeguard Warmup: Avoid issues during warm-up stage. type: bool
  • Scale Parameter: Whether to scale the parameter or not. type: bool
  • Stochastic Rounding: Stochastic rounding for weight updates. Improves quality when using bfloat16 weights. type: bool
  • Bias Correction: Turn on Adam's bias correction. type: bool
  • Use Triton: Whether Triton optimization should be used. type: bool
  • Warmup Initialization: Whether to warm-up the optimizer initialization. type: bool
  • Weight Decay: Regularization to prevent overfitting. type: float
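
To see how a handful of these options map onto a concrete optimizer, here is a plain torch.optim.AdamW call using the corresponding argument names (values are illustrative only):

```python
import torch

model = torch.nn.Linear(16, 16)  # stand-in for the network being trained

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),  # Beta1, Beta2
    eps=1e-8,            # EPS
    weight_decay=1e-2,   # Weight Decay
    amsgrad=False,       # amsgrad
    foreach=None,        # ForEach (None lets PyTorch decide)
    fused=None,          # Fused (None lets PyTorch decide)
)
```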