Training
Note: the screenshot may not be up to date, as OneTrainer receives continuous enhancements; a tooltip is available for each parameter.
![training](https://private-user-images.githubusercontent.com/129741936/292671123-972e8598-38f5-4eb4-b6e4-ffac0cfd03ae.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk3MTMxNTMsIm5iZiI6MTczOTcxMjg1MywicGF0aCI6Ii8xMjk3NDE5MzYvMjkyNjcxMTIzLTk3MmU4NTk4LTM4ZjUtNGViNC1iNmU0LWZmYWMwY2ZkMDNhZS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjE2JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIxNlQxMzM0MTNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1hYjlmYTU2NzAwZGFiYjM2Mjg3NjY5ZjUyMGFlZDMyOTA2NDU0ZmZlYzUwYTdmMTE0MWMxODY3OGNmNTBiMzJhJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.REVsb5dDihy2Pn3ud6A_Iqs3pBacuC6zvzbd4VkR73k)
Additional info on optimizers can be found here
The text encoder LR overrides the base LR if set.
SDXL includes 2 text encoders (TENC1 - CLIP-ViT/L and TENC2 - OpenCLIP-ViT/G). It has been suggested that TENC1 works better with tags and TENC2 works better with natural language, but this is not proven and is based more on testing observations and feel. Determining how to make the text encoders act in concert to get the result you want is one of the biggest challenges of SDXL finetuning. Most success stories have had access to commercial-grade hardware, not consumer-grade. For reference, here is the hardware information for the original CLIP models.
The UNet LR overrides the base LR if set.
With masked training, you can instruct the model to focus on certain parts of your training images. For example, if you want to train a subject but put less emphasis on the background, this setting will help. To enable masked training, you need to add a mask for every training image. A mask is a black and white image file, where white regions mark what should get the focus and black regions what should get less. The mask files need the same name as their images, with an added "-masklabel.png" extension, and must be in PNG format. Masks can be created in the Tools section, either automatically or using a painting function. When doing masked training, OneTrainer effectively sees nothing where the mask is black.
Note: also have a look at this discussion on masked training.
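For the file naming convention, here is a minimal sketch that creates an all-white starter mask next to every image in a folder (assuming Pillow is installed; `make_blank_mask` and the `dataset` folder are illustrative, not part of OneTrainer):

```python
from pathlib import Path
from PIL import Image

def make_blank_mask(image_path: Path) -> None:
    """Create an all-white mask (everything in focus) next to an image.

    "photo.jpg" gets a companion "photo-masklabel.png"; paint regions
    black afterwards to exclude them from the focus.
    """
    with Image.open(image_path) as img:
        mask = Image.new("L", img.size, 255)  # single channel, all white
    mask.save(image_path.with_name(image_path.stem + "-masklabel.png"))

for path in Path("dataset").glob("*.jpg"):
    make_blank_mask(path)
```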
The options available for masked training are:
- Unmasked Probability: the fraction of steps that will be run without the mask. Should be thought of as a percentage. Value: 0 to 1. Default: 0.1.
- Unmasked Weight: the loss weight applied to the unmasked (black) area during training. Should be thought of as a percentage. Value: 0 to 1. Default: 0.1.
- Normalize Masked Area Loss: a toggle (on/off) that should be used when the masked area is very large (example: jewelry). For smaller masks, this will increase the smooth loss, which is likely unwanted. For example, runs with this on have made the smooth loss go from 0.06 to 0.12 with masks covering less than 50% of the image. A sketch of how these options combine follows.
Validation is a technical way to determine when your training starts to overfit. A prediction is performed on every image in the validation concept, and the loss is averaged within the concept. More info on loss and what validation does can be found here.
Steps to enable it:
- Enable validation in the general tab and set the interval for the validation calculation (as for sampling).
- Add validation concept(s), flagged as validation concepts and captioned as you would in any image generation program.
- For validation concepts, don't use your training images; it seems obvious but is worth saying. For very small datasets where you can't get more images, a crop is fine. You can also use totally different images in another concept to get additional information (their validation loss should rise slowly).
- Validation concepts (flagged as such) don't impact the training.
- Results can be seen in the Tensorboard: a validation loss graph for each validation concept.
The most important settings you'll want to adapt are the Learning Rate, Learning Rate Min Factor, Scheduler and Optimizer. Then come the epochs, batch size and accumulation steps. The others can be kept at their defaults as a starting point.
The training resolution is set by default depending on the base model. You can train multiple resolutions at the same time by specifying a comma-separated list of numbers in the resolution field; each step will then train on a randomly selected resolution from the list (sketched below). Multi-resolution training has been observed to greatly improve model quality.
Example multi-resolution values (the idea is to step up and down by 64 px):
- SD1.5: 384,448,512,576,640
- SDXL: 896,960,1024,1088,1152
See also Multi Resolution Training for more information, and optionally use the resolution override on the concepts.
Note: you can also train at a specific resolution like 896x1152, but in that case only a single training resolution is supported.
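Mechanically, the comma-separated list behaves roughly like this per-step selection (a sketch only; the parsing and function name are illustrative, not OneTrainer internals):

```python
import random

# the value typed into the resolution field
resolutions = [int(r) for r in "896,960,1024,1088,1152".split(",")]

def pick_resolution() -> int:
    # each training step draws one resolution from the list at random
    return random.choice(resolutions)
```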
Gradient checkpointing reduces memory usage but increases training time (a conceptual sketch follows the lists below).
There are three options available:
- Off: Highest VRAM usage but fastest training speed.
- On: Reduces VRAM usage with some impact on training speed.
- CPU Offloaded: reduces VRAM usage by offloading to system memory, with some impact on training speed.
To enable CPU Offloaded mode, set "Gradient checkpointing" to CPU_OFFLOADED, then set the "Layer offload fraction" to a value between 0 and 1. Higher values will use more system RAM instead of VRAM.
Important considerations:
- VRAM usage is not reduced much when training UNet models like SD1.5 or SDXL
- VRAM usage is still suboptimal when training Flux or SD3.5-M and using an offloading fraction near 0.5
- Accumulation steps must be set to 1 when Fine Tuning with CPU Offloading, as Fused Back Pass does not support Accumulation Steps
Fine tuning with CPU Offloading enabled requires an optimizer that supports Fused Back Pass:
- ADAMW
- ADAM
- ADAFACTOR
- CAME
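For intuition on the trade-off, gradient checkpointing recomputes activations during the backward pass instead of storing them. A minimal PyTorch sketch of the standard technique (this is not OneTrainer's code, and it does not show the CPU offloading part):

```python
import torch
from torch.utils.checkpoint import checkpoint

# a toy block standing in for a model layer
block = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.GELU(), torch.nn.Linear(512, 512)
)
x = torch.randn(4, 512, requires_grad=True)

# Activations inside `block` are not kept; they are recomputed on backward,
# trading extra compute (slower steps) for lower memory usage.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```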
A Reddit publication on this feature is available; the technical name is RAM offloading if you search for more info on the Discord server. Technical info is on GitHub.
Refer to this page.
The scheduler method calculates the learning rate progression based on the initial learning rate value (set in the Learning Rate field).
The Learning Rate Min Factor determines the value of the final learning rate. For example, if set to 0.1 the final LR will be 10% of the initial LR; when set to 0 it doesn't change the calculated LR.
Example with Cosine: the blue curve is with Learning Rate Min Factor = 0, the red one with Learning Rate Min Factor = 0.3.
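As a sketch of the arithmetic (assuming the min factor simply sets a floor that the schedule decays toward; the exact formula OneTrainer uses is not shown here):

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float, min_factor: float) -> float:
    # cosine decay from base_lr down to base_lr * min_factor
    min_lr = base_lr * min_factor
    progress = step / total_steps
    return min_lr + (base_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * progress))

# e.g. base_lr=1e-4, min_factor=0.3 -> the red curve ends at 3e-5 instead of 0
print(cosine_lr(step=1000, total_steps=1000, base_lr=1e-4, min_factor=0.3))
```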
Note: there are several learning rate fields in addition to the base learning rate (simply called Learning Rate). These are for the text encoder(s), UNet and embeddings (for additional embeddings); if they are empty they use the base LR, and if set they override the base LR value.
- Constant: fixed learning rate.
- Linear: linear learning rate decay from the initial learning rate to 0.
![linear](https://private-user-images.githubusercontent.com/129741936/263031730-95237662-3f87-4383-b7f7-850a4da54e76.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk3MTMxNTMsIm5iZiI6MTczOTcxMjg1MywicGF0aCI6Ii8xMjk3NDE5MzYvMjYzMDMxNzMwLTk1MjM3NjYyLTNmODctNDM4My1iN2Y3LTg1MGE0ZGE1NGU3Ni5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjE2JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIxNlQxMzM0MTNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1hZmUyNDc0ODBhNGNiYWQ4ZjNkY2UyZjMyYThhMGEwZTU5N2Y5MmIzZmM2MjNkYjIzMTU1YjI5NGRhOTRiZTQ3JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.5228lUMQ4o3P1azpm3FoBoy-n8owCn5GpM61w0HpV-k)
- Cosine: this scheduler decays slowly at first, fastest in the middle, and slowly again at the end.
![cosine](https://private-user-images.githubusercontent.com/129741936/263034319-1dbcd622-5964-4421-b9e8-4270554d8ed6.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk3MTMxNTMsIm5iZiI6MTczOTcxMjg1MywicGF0aCI6Ii8xMjk3NDE5MzYvMjYzMDM0MzE5LTFkYmNkNjIyLTU5NjQtNDQyMS1iOWU4LTQyNzA1NTRkOGVkNi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjE2JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIxNlQxMzM0MTNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1mNzQyMjU1NmQwYzJhNGY3ZWQ3ZTNlY2EyOTBlZjEyNTdlZGYxYmZjNDg3ZjBjNGEyZmEyMGU3MjUyYjk2ZDYyJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.klTZMH7bg40iluDNgDUI5-6FodagbYb6bMii4RzUt8M)
- Cosine with restarts
- Cosine with hard restarts
- REX: reverse exponential learning rate decay, starting from the initial learning rate and ending at 0. Also known as the "brute", it can perform very fast training in some cases. Don't set any learning rate warmup steps when using it.
![REX](https://private-user-images.githubusercontent.com/129741936/263020696-6c6d1ed9-4983-4dd4-aaea-44ae64a279ff.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk3MTMxNTMsIm5iZiI6MTczOTcxMjg1MywicGF0aCI6Ii8xMjk3NDE5MzYvMjYzMDIwNjk2LTZjNmQxZWQ5LTQ5ODMtNGRkNC1hYWVhLTQ0YWU2NGEyNzlmZi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjE2JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIxNlQxMzM0MTNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT04NjVlODJmYTM0YzY5OTgwOTgwY2Q2NWZhNjk0YjY5NWFlYWZhNzE0ZjA3MmYzYzYwZWQwZTcxYjYyYmQ0NThkJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.E-UKr9L4qAvSnxRJiMbLPdInoERf65Do5Q2V5dwpSt0)
- Custom Scheduler
Custom Schedule Information can be found here: https://github.com/Nerogar/OneTrainer/wiki/Custom-Scheduler
Internally, this sets the mixed precision data type used for the forward pass through the model. The setting trades precision for speed during training, and not all data types are supported on all GPUs. In practice, float16 only slightly reduces quality while providing a significant speed boost; for best quality use float32 (at reduced speed).
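Conceptually this corresponds to PyTorch autocast around the forward pass. A minimal sketch, assuming a CUDA GPU; this is not OneTrainer's actual training loop:

```python
import torch

model = torch.nn.Linear(512, 512).cuda()
x = torch.randn(4, 512, device="cuda")

# eligible ops in the forward pass run in float16; the weights themselves
# keep their own dtype
with torch.autocast(device_type="cuda", dtype=torch.float16):
    y = model(x)

print(y.dtype)  # torch.float16
```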
- Epochs: an epoch is a cycle in which all your images are trained once. Depending on the batch size and dataset it can take one or more steps.
- Batch size: the number of images sent to the GPU for processing at once.
- Accumulation steps: a multiplier of the batch size. For example, if you want an effective batch of 16 but are limited to batch 4, set accumulation steps to 4 (as shown in the sketch below).
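A sketch of how accumulation steps multiply the effective batch, using a simplified PyTorch loop with toy data (not OneTrainer's loop):

```python
import torch

model = torch.nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.MSELoss()
# stand-in data loader: 8 mini-batches of 4 samples each
loader = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(8)]

accumulation_steps = 4  # effective batch = 4 * 4 = 16
optimizer.zero_grad()
for i, (inputs, targets) in enumerate(loader):
    # scale so the summed gradients average over the whole effective batch
    loss = loss_fn(model(inputs), targets) / accumulation_steps
    loss.backward()                      # gradients add up across mini-batches
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()                 # one weight update per 16 images
        optimizer.zero_grad()
```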
The attention implementation has a big effect on speed and memory consumption.
- Windows: prefer XFORMERS
- Linux: prefer SDP
A moving average is a statistical tool for determining the direction of a trend. An Exponential Moving Average (EMA) is a type of moving average that applies more weight to the most recent data points than to older ones.
EMA is only useful for bigger datasets with multiple concepts, as it reduces diversity; more EMA means less diversity. If your dataset isn't too complex or widely varied, leave it OFF. If you can't get good results with EMA OFF, then try enabling it. For datasets of hundreds or thousands of images, set EMA Decay to 0.9999. For smaller datasets, set it to 0.999 or even 0.998.
There are some claims that it also needs to be adjusted based on the LR: https://arxiv.org/pdf/2312.02696
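The underlying update rule is simple; a sketch of it on a single value (OneTrainer would apply something like this per model weight):

```python
def ema_update(ema_value: float, new_value: float, decay: float = 0.999) -> float:
    # a higher decay makes the average move more slowly,
    # smoothing out the most recent training steps
    return decay * ema_value + (1.0 - decay) * new_value
```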
Displays the progression of the training: epochs, and steps within each epoch.