merge dev to main #1879
- `--loraplus_ratio` added for both TE and UNet
- Add log output for LoRA+
- When trying to load stored latents, if an error occurs, this change reports which file failed to load. Currently it just says that something failed without naming the file.
- This can be used to train away from a group of images you don't want. As this moves the model away from a point instead of towards it, the change in the model is unbounded, so don't set it too low; -4e-7 seemed to work well.
- If a latent file fails to load, print the path and the error, then return false so it is regenerated.
- Add LoRA+ support
- Adafactor fused backward pass and optimizer step; lowers SDXL (at 1024 resolution) VRAM usage to about 10 GB with BF16 and 16.4 GB with FP32
- Bug fix: `alpha_mask` load
- Make timesteps work in the standard way when Huber loss is used
- New optimizers: AdEMAMix8bit and PagedAdEMAMix8bit
- 1) Updates the debiased estimation loss function for v-prediction. 2) Prevents the now-deprecated scaling of the loss when ztSNR is enabled.
- Different model architectures, such as SDXL, can take advantage of v-prediction, so it doesn't make sense to include these warnings anymore.
- Update debiased estimation loss function to accommodate v-pred
- Remove v-pred warnings
Impressive update, thanks to all the contributors!
@kohya-ss amazing work. I tested Fused backward pass on SDXL with Adafactor and it reduced VRAM usage to as low as 10200 MB. I also tried Fused optimizer groups = 10 and it was about 10500 MB. However, when enabling Fused backward pass + block swaps it didn't make any more difference. Can I reduce VRAM usage any further for SDXL training at 1024x1024? I can train FLUX dev below 8 GB with block swaps.
Important: The dependent libraries are updated. Please see the Upgrade section and update the libraries.
- Fixed a bug where the loss weight was incorrect when `--debiased_estimation_loss` was specified with `--v_parameterization`. PR #1715 Thanks to catboxanon! See the PR for details.
- The v-prediction warnings are no longer shown when `--v_parameterization` is specified in SDXL and SD1.5. PR #1717
- There was a bug where `min_bucket_reso`/`max_bucket_reso` in the dataset configuration did not create the correct resolution bucket if it was not divisible by `bucket_reso_steps`. These values now trigger a warning and are automatically rounded to a divisible value. Thanks to Maru-mee for raising the issue. Related PR #1632
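For context, the adjustment simply snaps the configured value to a multiple of `bucket_reso_steps`. The helper below is a hypothetical illustration; the script performs this automatically and its exact rounding behavior may differ.

```python
# Hypothetical illustration only: snap a configured bucket resolution to a
# multiple of bucket_reso_steps. The training script warns and adjusts the
# value automatically; its exact rounding direction may differ.
def round_to_steps(reso: int, bucket_reso_steps: int = 64) -> int:
    return max(bucket_reso_steps, round(reso / bucket_reso_steps) * bucket_reso_steps)

print(round_to_steps(250))  # -> 256 with the default step of 64
```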
- `bitsandbytes` is updated to 0.44.0. Now you can use `AdEMAMix8bit` and `PagedAdEMAMix8bit` in the training script. PR #1640 Thanks to sdbds! Specify `--optimizer_type bitsandbytes.optim.AdEMAMix8bit` (not `bnb` but `bitsandbytes`).
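For reference, the same optimizers can also be constructed directly in Python. This is a minimal sketch assuming the standard bitsandbytes optimizer constructor; the model and learning rate are placeholders, not recommendations.

```python
# Minimal sketch: constructing the new 8-bit AdEMAMix optimizers directly from
# bitsandbytes >= 0.44.0 (assumes the standard optimizer constructor signature).
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(128, 128).cuda()          # 8-bit optimizers expect CUDA params
optimizer = bnb.optim.AdEMAMix8bit(model.parameters(), lr=1e-4)
# optimizer = bnb.optim.PagedAdEMAMix8bit(model.parameters(), lr=1e-4)  # paged variant

loss = model(torch.randn(4, 128, device="cuda")).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```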
- Fixed a bug in the cache of latents. When `flip_aug`, `alpha_mask`, and `random_crop` are different in multiple subsets in the dataset configuration file (.toml), the settings of the last subset were used instead of being reflected correctly.
- Fixed an issue where the timesteps in the batch were all the same when using Huber loss. PR #1628 Thanks to recris!
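Conceptually, the fix means every sample in a batch gets its own independently sampled timestep, as in the non-Huber code path. A minimal illustration; names and values are illustrative, not the repository's code.

```python
import torch

# Illustrative values; the real number of timesteps comes from the noise scheduler config.
batch_size = 4
num_train_timesteps = 1000

# One independently sampled timestep per batch element, rather than a single
# timestep broadcast to the whole batch.
timesteps = torch.randint(0, num_train_timesteps, (batch_size,), dtype=torch.long)
print(timesteps)  # e.g. tensor([ 12, 874, 430, 999])
```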
Improvements in OFT (Orthogonal Finetuning) Implementation
These changes have made the OFT implementation more efficient and accurate, potentially leading to improved model performance and training stability.
Additional Information
Recommended α value for the OFT constraint: We recommend using α values between 1e-4 and 1e-2. This differs slightly from the original implementation, which uses `α * out_dim * out_dim`; our implementation uses `α * out_dim`, hence we recommend higher values than the 1e-5 suggested in the original implementation.
Performance Improvement: Training speed has been improved by approximately 30%.
Inference Environment: This implementation is compatible with and operates within Stable Diffusion web UI (SD1/2 and SDXL).
- The `INVERSE_SQRT`, `COSINE_WITH_MIN_LR`, and `WARMUP_STABLE_DECAY` learning rate schedules are now available via the transformers library. See PR #1393 for details. Thanks to sdbds!
  - `--lr_warmup_steps` and `--lr_decay_steps` can now be specified as a ratio of the number of training steps, not just the step value. Example: `--lr_warmup_steps=0.1` or `--lr_warmup_steps=10%`, etc.
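Roughly, a ratio is resolved against the total number of training steps. Purely for illustration; the script and library perform this conversion internally.

```python
# Illustration only: how a warmup ratio relates to concrete steps.
max_train_steps = 10_000
lr_warmup_ratio = 0.1                                     # --lr_warmup_steps=0.1 (or "10%")
lr_warmup_steps = int(max_train_steps * lr_warmup_ratio)  # -> 1000
```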
- When enlarging images in the script (when the size of the training image is small and `bucket_no_upscale` is not specified), it has been changed to use Pillow's resize and LANCZOS interpolation instead of OpenCV2's resize and Lanczos4 interpolation. The quality of the image enlargement may be slightly improved. PR #1426 Thanks to sdbds!
- Sample image generation during training now works on non-CUDA devices. PR #1433 Thanks to millie-v!
- `--v_parameterization` is available in `sdxl_train.py`. The results are unpredictable, so use with caution. PR #1505 Thanks to liesened!
- Fused optimizer is available for SDXL training. PR #1259 Thanks to 2kpr!
  - Specify the `--fused_backward_pass` option in `sdxl_train.py`. At this time, only AdaFactor is supported. Gradient accumulation is not available.
  - Setting mixed precision to `no` seems to use less memory than `fp16` or `bf16`.
  - If you specify the `--full_bf16` option, you can further reduce the memory usage (but the accuracy will be lower). With the same memory usage as before, you can increase the batch size.
  - The feature relies on the PyTorch API `Tensor.register_post_accumulate_grad_hook(hook)`, so PyTorch 2.1 or later is required.
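A minimal sketch of the idea behind the fused backward pass, assuming PyTorch 2.1+ and a generic optimizer. This is an illustration of the technique, not sd-scripts' implementation; AdamW stands in for the AdaFactor optimizer the script actually supports.

```python
# Minimal sketch of a fused backward pass / optimizer step (PyTorch >= 2.1).
import torch

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.Linear(64, 64))

# One tiny optimizer per parameter so each parameter can be stepped independently.
optimizers = {p: torch.optim.AdamW([p], lr=1e-4) for p in model.parameters()}

def step_when_grad_is_ready(param: torch.Tensor) -> None:
    # Runs right after param.grad has been fully accumulated during backward.
    optimizers[param].step()
    optimizers[param].zero_grad(set_to_none=True)  # free the gradient immediately

for p in model.parameters():
    p.register_post_accumulate_grad_hook(step_when_grad_is_ready)

loss = model(torch.randn(8, 64)).pow(2).mean()
loss.backward()  # parameters are updated during backward; no separate optimizer.step()
```

Because each gradient is consumed and freed as soon as it is accumulated, the full set of gradients never has to be held in memory at once.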
- The optimizer groups feature is added to SDXL training. PR #1319
  - Specify the number of groups like `--fused_optimizer_groups 10` in `sdxl_train.py`. Increasing the number of groups reduces memory usage but slows down training. Since the effect is limited beyond a certain number, specifying 4-10 is recommended.
  - `--fused_optimizer_groups` cannot be used with `--fused_backward_pass`. When using AdaFactor, the memory usage is slightly larger than with the fused optimizer. PyTorch 2.1 or later is required.
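Optimizer groups apply the same principle per group of parameters rather than per parameter. A rough sketch under the same PyTorch 2.1+ assumption; the group count, optimizer, and model are placeholders, not the repository's implementation.

```python
# Rough sketch of optimizer groups: parameters are split into N groups and each
# group is stepped as soon as all of its gradients are ready during backward.
import torch

model = torch.nn.Sequential(*[torch.nn.Linear(64, 64) for _ in range(8)])
num_groups = 4

params = list(model.parameters())
groups = [params[i::num_groups] for i in range(num_groups)]     # round-robin split
optimizers = [torch.optim.AdamW(g, lr=1e-4) for g in groups]

group_of = {p: gi for gi, group in enumerate(groups) for p in group}
remaining = [len(group) for group in groups]

def on_grad_ready(param: torch.Tensor) -> None:
    gi = group_of[param]
    remaining[gi] -= 1
    if remaining[gi] == 0:                  # the whole group has its gradients
        optimizers[gi].step()
        optimizers[gi].zero_grad(set_to_none=True)
        remaining[gi] = len(groups[gi])     # reset the counter for the next step

for p in params:
    p.register_post_accumulate_grad_hook(on_grad_ready)

loss = model(torch.randn(8, 64)).pow(2).mean()
loss.backward()
```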
- LoRA+ is supported. PR #1233 Thanks to rockerBOO!
  - Specify `loraplus_lr_ratio` with `--network_args`. Example: `--network_args "loraplus_lr_ratio=16"`
  - `loraplus_unet_lr_ratio` and `loraplus_lr_ratio` can be specified separately for U-Net and Text Encoder.
    - Example: `--network_args "loraplus_unet_lr_ratio=16" "loraplus_text_encoder_lr_ratio=4"` or `--network_args "loraplus_lr_ratio=16" "loraplus_text_encoder_lr_ratio=4"` etc.
  - `network_module` `networks.lora` and `networks.dylora` are available.
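Conceptually, LoRA+ gives the LoRA "up" (B) matrices a learning rate equal to the base learning rate multiplied by `loraplus_lr_ratio`, via separate optimizer parameter groups. A simplified sketch; the parameter-name heuristic and helper are illustrative, not the repository's code.

```python
# Simplified sketch of LoRA+ style parameter groups: "up"/lora_B weights get
# base_lr * loraplus_lr_ratio, everything else gets base_lr. The name heuristic
# below is illustrative; naming conventions differ between implementations.
import torch

def build_loraplus_param_groups(network: torch.nn.Module, base_lr: float, loraplus_lr_ratio: float):
    plus_params, base_params = [], []
    for name, param in network.named_parameters():
        if not param.requires_grad:
            continue
        is_plus = "lora_up" in name or "lora_B" in name
        (plus_params if is_plus else base_params).append(param)
    return [
        {"params": base_params, "lr": base_lr},
        {"params": plus_params, "lr": base_lr * loraplus_lr_ratio},
    ]

# usage: optimizer = torch.optim.AdamW(build_loraplus_param_groups(network, 1e-4, 16))
```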
- The feature to use the transparency (alpha channel) of the image as a mask in the loss calculation has been added. PR #1223 Thanks to u-haru!
  - Specify the `--alpha_mask` option in the training script or specify `alpha_mask = true` in the dataset configuration file.
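Conceptually, the alpha channel acts as a per-pixel weight on the loss. A simplified sketch; shapes and resizing details are illustrative, and the training script handles those itself.

```python
# Simplified sketch of an alpha-masked loss: transparent pixels (alpha == 0)
# contribute nothing, opaque pixels (alpha == 1) contribute fully.
import torch
import torch.nn.functional as F

def masked_mse(model_pred: torch.Tensor, target: torch.Tensor, alpha_mask: torch.Tensor) -> torch.Tensor:
    # model_pred/target: (B, C, H, W); alpha_mask: (B, 1, H, W) with values in [0, 1]
    per_pixel = F.mse_loss(model_pred, target, reduction="none")
    weighted = per_pixel * alpha_mask
    # Weighted mean over the non-transparent area.
    return weighted.sum() / alpha_mask.sum().clamp(min=1.0) / per_pixel.shape[1]
```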
- LoRA training in SDXL now supports block-wise learning rates and block-wise dim (rank). PR #1331
- Negative learning rates can now be specified during SDXL model training. PR #1277 Thanks to Cauldrath!
  - When specifying a negative value on the command line, append it with `=`, like `--learning_rate=-1e-7`.
- Training scripts can now output training settings to wandb or TensorBoard logs. Specify the `--log_config` option. PR #1285 Thanks to ccharest93, plucked, rockerBOO, and VelocityRa!
- The ControlNet training script `train_controlnet.py` for SD1.5/2.x was not working, but it has been fixed. PR #1284 Thanks to sdbds!
- `train_network.py` and `sdxl_train_network.py` now restore the order/position of data loading from DataSet when resuming training. PR #1353 #1359 Thanks to KohakuBlueleaf!
  - Specify the `--skip_until_initial_step` option to skip data loading until the specified step. If not specified, data loading starts from the beginning of the DataSet (same as before).
  - If `--resume` is specified, the step saved in the state is used.
  - Specify the `--initial_step` or `--initial_epoch` option to skip data loading until the specified step or epoch. Use these options in conjunction with `--skip_until_initial_step`. These options can be used without `--resume` (use them when resuming training with `--network_weights`).
- An option `--disable_mmap_load_safetensors` is added to disable memory mapping when loading the model's .safetensors in SDXL. PR #1266 Thanks to Zovjsra!
  - Available in `sdxl_train.py`, `sdxl_train_network.py`, `sdxl_train_textual_inversion.py`, and `sdxl_train_control_net_lllite.py`.
- When there is an error in the cached latents file on disk, the file name is now displayed. PR #1278 Thanks to Cauldrath!
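This matches the commit described at the top of this PR: catch the load error, report the offending path, and let the caller regenerate that file. A hypothetical sketch of the pattern:

```python
# Hypothetical sketch of the reporting pattern: if a cached latents file cannot
# be read, log the file path and the error, and return None so the caller
# regenerates that file instead of aborting with an anonymous failure.
import logging

import numpy as np

logger = logging.getLogger(__name__)

def try_load_cached_latents(npz_path: str):
    try:
        return np.load(npz_path)
    except Exception as e:
        logger.error(f"failed to load cached latents: {npz_path}: {e}")
        return None  # treated as "regenerate this cache file"
```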
- Fixed an error that occurred when specifying `--max_dataloader_n_workers` in `tag_images_by_wd14_tagger.py` when ONNX is not used. PR #1291, issue #1290 Thanks to frodo821!
- Fixed a bug where `caption_separator` could not be specified in a subset in the dataset settings .toml file. #1312 and #1313 Thanks to rockerBOO!
- Fixed a potential bug in ControlNet-LLLite training. PR #1322 Thanks to aria1th!
- Fixed some bugs when using DeepSpeed. Related #1247
- Added a prompt option `--f` to `gen_imgs.py` to specify the file name when saving. Also, Diffusers-based keys for LoRA weights are now supported.