Training
Note: the screenshot may not be up to date, as OneTrainer receives continuous enhancements; a tooltip is available for each parameter.
![training](https://private-user-images.githubusercontent.com/129741936/292671123-972e8598-38f5-4eb4-b6e4-ffac0cfd03ae.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk3MTMxNTMsIm5iZiI6MTczOTcxMjg1MywicGF0aCI6Ii8xMjk3NDE5MzYvMjkyNjcxMTIzLTk3MmU4NTk4LTM4ZjUtNGViNC1iNmU0LWZmYWMwY2ZkMDNhZS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjE2JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIxNlQxMzM0MTNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1hYjlmYTU2NzAwZGFiYjM2Mjg3NjY5ZjUyMGFlZDMyOTA2NDU0ZmZlYzUwYTdmMTE0MWMxODY3OGNmNTBiMzJhJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.REVsb5dDihy2Pn3ud6A_Iqs3pBacuC6zvzbd4VkR73k)
Additional info on optimizers can be found here
The text encoder LR overrides the base LR if set.
SDXL includes 2 text encoders (TENC1 - CLIP-ViT/L and TENC2 - OpenCLIP-ViT/G). It has been suggested that TENC1 works better with tags and TENC2 works better with natural language, but this is not proven and is based more on testing observations and feel. Determining how to make the text encoders act in concert to get the result you want is one of the biggest challenges of SDXL finetuning. Most success stories have had access to commercial-grade hardware, not consumer-grade. For reference, here is the hardware information for the original CLIP models.
The UNet LR overrides the base LR if set.
With masked training, you can instruct the model to focus on certain parts of your training images. For example, if you want to train a subject but put less emphasis on the background, this setting will help. To enable masked training, you need to add a mask for every training image. A mask is a black and white image file, where white regions mark what should get the focus and black regions what should get less. The mask files need the same name as their images, with an added "-masklabel.png" extension, and must be in PNG format. Masks can be created in the Tools section, either automatically or using a painting function. When doing masked training, OneTrainer effectively sees nothing where the mask is black.
Note: also have a look at this discussion on masked training.
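For the file naming convention, here is a minimal sketch that creates an all-white starter mask next to every image in a folder (assuming Pillow is installed; `make_blank_mask` and the `dataset` folder are illustrative, not part of OneTrainer):

```python
from pathlib import Path
from PIL import Image

def make_blank_mask(image_path: Path) -> None:
    """Create an all-white mask (everything in focus) next to an image.

    "photo.jpg" gets a companion "photo-masklabel.png"; paint regions
    black afterwards to exclude them from the focus.
    """
    with Image.open(image_path) as img:
        mask = Image.new("L", img.size, 255)  # single channel, all white
    mask.save(image_path.with_name(image_path.stem + "-masklabel.png"))

for path in Path("dataset").glob("*.jpg"):
    make_blank_mask(path)
```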
The options available for masked training are:
- Unmasked Probability: the fraction of steps that will be run without the mask. Should be thought of as a percentage. Value: 0 to 1. Default: 0.1.
- Unmasked Weight: the loss weight applied to the unmasked (black) area during training. Should be thought of as a percentage. Value: 0 to 1. Default: 0.1.
- Normalize Masked Area Loss: a toggle (on/off) that should be used when the masked area is very large (example: jewelry). For smaller masks, this will increase the smooth loss, which is likely unwanted. For example, runs with this on have made the smooth loss go from 0.06 to 0.12 with masks covering less than 50% of the image. A sketch of how these options combine follows.
Validation is a technical way to determine when your training starts to overfit. A prediction is performed on every image in the validation concept, and the loss is averaged within the concept. More info on loss and what validation does can be found here.
Steps to enable it:
- Enable validation in the general tab and set the interval for the validation calculation (as for sampling).
- Add validation concept(s), flagged as validation concepts and captioned as you would in any image generation program.
- For validation concepts, don't use your training images; it seems obvious but is worth saying. For very small datasets where you can't get more images, a crop is fine. You can also use totally different images in another concept to get additional information (their validation loss should rise slowly).
- Validation concepts (flagged as such) don't impact the training.
- Results can be seen in the Tensorboard: a validation loss graph for each validation concept.
The most important settings you'll want to adapt are the Learning Rate, Learning Rate Min Factor, Scheduler and Optimizer. Then come the epochs, batch size and accumulation steps. The others can be kept at their defaults as a starting point.
The training resolution is set by default depending on the base model. You can train multiple resolutions at the same time by specifying a comma-separated list of numbers in the resolution field; each step will then train on a randomly selected resolution from the list (sketched below). Multi-resolution training has been observed to greatly improve model quality.
Example multi-resolution values (the idea is to step up and down by 64 px):
- SD1.5: 384,448,512,576,640
- SDXL: 896,960,1024,1088,1152
See also Multi Resolution Training for more information, and optionally use the resolution override on the concepts.
Note: you can also train at a specific resolution like 896x1152, but in that case only a single training resolution is supported.
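Mechanically, the comma-separated list behaves roughly like this per-step selection (a sketch only; the parsing and function name are illustrative, not OneTrainer internals):

```python
import random

# the value typed into the resolution field
resolutions = [int(r) for r in "896,960,1024,1088,1152".split(",")]

def pick_resolution() -> int:
    # each training step draws one resolution from the list at random
    return random.choice(resolutions)
```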
Gradient checkpointing reduces memory usage but increases training time (a conceptual sketch follows the lists below).
There are three options available:
- Off: Highest VRAM usage but fastest training speed.
- On: Reduces VRAM usage with some impact on training speed.
- CPU Offloaded: reduces VRAM usage by offloading to system memory, with some impact on training speed.
To enable CPU Offloaded mode, set "Gradient checkpointing" to CPU_OFFLOADED, then set the "Layer offload fraction" to a value between 0 and 1. Higher values will use more system RAM instead of VRAM.
Important considerations:
- VRAM usage is not reduced much when training UNet models like SD1.5 or SDXL
- VRAM usage is still suboptimal when training Flux or SD3.5-M and using an offloading fraction near 0.5
- Accumulation steps must be set to 1 when Fine Tuning with CPU Offloading, as Fused Back Pass does not support Accumulation Steps
Fine tuning with CPU Offloading enabled requires an optimizer that supports Fused Back Pass:
- ADAMW
- ADAM
- ADAFACTOR
- CAME
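For intuition on the trade-off, gradient checkpointing recomputes activations during the backward pass instead of storing them. A minimal PyTorch sketch of the standard technique (this is not OneTrainer's code, and it does not show the CPU offloading part):

```python
import torch
from torch.utils.checkpoint import checkpoint

# a toy block standing in for a model layer
block = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.GELU(), torch.nn.Linear(512, 512)
)
x = torch.randn(4, 512, requires_grad=True)

# Activations inside `block` are not kept; they are recomputed on backward,
# trading extra compute (slower steps) for lower memory usage.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```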
A Reddit publication on this feature is available; the technical name is RAM offloading if you search for more info on the Discord server. Technical info is on GitHub.
Refer to this page.
The scheduler method calculates the learning rate progression based on the initial learning rate value (set in the Learning Rate field).
The Learning Rate Min Factor determines the value of the final learning rate. For example, if set to 0.1 the final LR will be 10% of the initial LR; when set to 0 it doesn't change the calculated LR.
Example with Cosine: the blue curve is with Learning Rate Min Factor = 0, the red one with Learning Rate Min Factor = 0.3.
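As a sketch of the arithmetic (assuming the min factor simply sets a floor that the schedule decays toward; the exact formula OneTrainer uses is not shown here):

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float, min_factor: float) -> float:
    # cosine decay from base_lr down to base_lr * min_factor
    min_lr = base_lr * min_factor
    progress = step / total_steps
    return min_lr + (base_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * progress))

# e.g. base_lr=1e-4, min_factor=0.3 -> the red curve ends at 3e-5 instead of 0
print(cosine_lr(step=1000, total_steps=1000, base_lr=1e-4, min_factor=0.3))
```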
Note: there are several learning rate fields in addition to the base learning rate (simply called Learning Rate). These are for the text encoder(s), UNet and embeddings (for additional embeddings); if they are empty they use the base LR, and if set they override the base LR value.
- Constant: fixed learning rate.
- Linear: linear learning rate decay from the initial learning rate to 0.
![linear](https://private-user-images.githubusercontent.com/129741936/263031730-95237662-3f87-4383-b7f7-850a4da54e76.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk3MTMxNTMsIm5iZiI6MTczOTcxMjg1MywicGF0aCI6Ii8xMjk3NDE5MzYvMjYzMDMxNzMwLTk1MjM3NjYyLTNmODctNDM4My1iN2Y3LTg1MGE0ZGE1NGU3Ni5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjE2JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIxNlQxMzM0MTNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1hZmUyNDc0ODBhNGNiYWQ4ZjNkY2UyZjMyYThhMGEwZTU5N2Y5MmIzZmM2MjNkYjIzMTU1YjI5NGRhOTRiZTQ3JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.5228lUMQ4o3P1azpm3FoBoy-n8owCn5GpM61w0HpV-k)
- Cosine: this scheduler decays slowly at first, fastest in the middle, and slowly again at the end.
![cosine](https://private-user-images.githubusercontent.com/129741936/263034319-1dbcd622-5964-4421-b9e8-4270554d8ed6.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk3MTMxNTMsIm5iZiI6MTczOTcxMjg1MywicGF0aCI6Ii8xMjk3NDE5MzYvMjYzMDM0MzE5LTFkYmNkNjIyLTU5NjQtNDQyMS1iOWU4LTQyNzA1NTRkOGVkNi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjE2JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIxNlQxMzM0MTNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1mNzQyMjU1NmQwYzJhNGY3ZWQ3ZTNlY2EyOTBlZjEyNTdlZGYxYmZjNDg3ZjBjNGEyZmEyMGU3MjUyYjk2ZDYyJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.klTZMH7bg40iluDNgDUI5-6FodagbYb6bMii4RzUt8M)
- Cosine with restarts
- Cosine with hard restarts
- REX: reverse exponential learning rate decay, starting from the initial learning rate and ending at 0. Also known as the "brute", it can perform very fast training in some cases. Don't set any learning rate warmup steps when using it.
![REX](https://private-user-images.githubusercontent.com/129741936/263020696-6c6d1ed9-4983-4dd4-aaea-44ae64a279ff.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk3MTMxNTMsIm5iZiI6MTczOTcxMjg1MywicGF0aCI6Ii8xMjk3NDE5MzYvMjYzMDIwNjk2LTZjNmQxZWQ5LTQ5ODMtNGRkNC1hYWVhLTQ0YWU2NGEyNzlmZi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjE2JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIxNlQxMzM0MTNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT04NjVlODJmYTM0YzY5OTgwOTgwY2Q2NWZhNjk0YjY5NWFlYWZhNzE0ZjA3MmYzYzYwZWQwZTcxYjYyYmQ0NThkJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.E-UKr9L4qAvSnxRJiMbLPdInoERf65Do5Q2V5dwpSt0)
- Custom Scheduler
Custom Schedule Information can be found here: https://github.com/Nerogar/OneTrainer/wiki/Custom-Scheduler
Internally, this sets the mixed precision data type used for the forward pass through the model. The setting trades precision for speed during training, and not all data types are supported on all GPUs. In practice, float16 only slightly reduces quality while providing a significant speed boost; for best quality use float32 (at reduced speed).
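Conceptually this corresponds to PyTorch autocast around the forward pass. A minimal sketch, assuming a CUDA GPU; this is not OneTrainer's actual training loop:

```python
import torch

model = torch.nn.Linear(512, 512).cuda()
x = torch.randn(4, 512, device="cuda")

# eligible ops in the forward pass run in float16; the weights themselves
# keep their own dtype
with torch.autocast(device_type="cuda", dtype=torch.float16):
    y = model(x)

print(y.dtype)  # torch.float16
```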
- Epochs: an epoch is a cycle in which all your images are trained once. Depending on the batch size and dataset it can take one or more steps.
- Batch size: the number of images sent to the GPU for processing at once.
- Accumulation steps: a multiplier of the batch size. For example, if you want an effective batch of 16 but are limited to batch 4, set accumulation steps to 4 (as shown in the sketch below).
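A sketch of how accumulation steps multiply the effective batch, using a simplified PyTorch loop with toy data (not OneTrainer's loop):

```python
import torch

model = torch.nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.MSELoss()
# stand-in data loader: 8 mini-batches of 4 samples each
loader = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(8)]

accumulation_steps = 4  # effective batch = 4 * 4 = 16
optimizer.zero_grad()
for i, (inputs, targets) in enumerate(loader):
    # scale so the summed gradients average over the whole effective batch
    loss = loss_fn(model(inputs), targets) / accumulation_steps
    loss.backward()                      # gradients add up across mini-batches
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()                 # one weight update per 16 images
        optimizer.zero_grad()
```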
The attention implementation has a big effect on speed and memory consumption.
- Windows: prefer XFORMERS
- Linux: prefer SDP
A moving average is a statistical tool for determining the direction of a trend. An Exponential Moving Average (EMA) is a type of moving average that applies more weight to the most recent data points than to older ones.
EMA is only useful for bigger datasets with multiple concepts, as it reduces diversity; more EMA means less diversity. If your dataset isn't too complex or widely varied, leave it OFF. If you can't get good results with EMA OFF, then try enabling it. For datasets of hundreds or thousands of images, set EMA Decay to 0.9999. For smaller datasets, set it to 0.999 or even 0.998.
There are some claims that it also needs to be adjusted based on the LR: https://arxiv.org/pdf/2312.02696
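The underlying update rule is simple; a sketch of it on a single value (OneTrainer would apply something like this per model weight):

```python
def ema_update(ema_value: float, new_value: float, decay: float = 0.999) -> float:
    # a higher decay makes the average move more slowly,
    # smoothing out the most recent training steps
    return decay * ema_value + (1.0 - decay) * new_value
```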
Displays the progression of the training: epochs, and steps within each epoch.