merge dev to main #1879

Merged: 140 commits merged into main from dev on Jan 17, 2025

Conversation

kohya-ss
Owner

  • Important: The dependent libraries have been updated. Please see the Upgrade section and update the libraries.

    • bitsandbytes, transformers, accelerate and huggingface_hub are updated.
    • If you encounter any issues, please report them.
  • Fixed a bug where the loss weight was incorrect when --debiased_estimation_loss was specified with --v_parameterization. PR #1715 Thanks to catboxanon! See the PR for details.

    • Removed the warning when --v_parameterization is specified in SDXL and SD1.5. PR #1717
  • There was a bug where min_bucket_reso/max_bucket_reso in the dataset configuration did not create the correct resolution buckets when they were not divisible by bucket_reso_steps. A warning is now shown and the values are automatically rounded to a divisible value. Thanks to Maru-mee for raising the issue. Related PR #1632
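
A minimal sketch of the rounding described above; the helper name, the warning message, and rounding to the nearest multiple are illustrative assumptions, not the script's actual code.

```python
# Hypothetical illustration of rounding a bucket resolution bound to a value divisible
# by bucket_reso_steps; the rounding direction here is an assumption for illustration.
import logging

logger = logging.getLogger(__name__)

def round_to_steps(reso: int, bucket_reso_steps: int = 64) -> int:
    if reso % bucket_reso_steps == 0:
        return reso
    rounded = round(reso / bucket_reso_steps) * bucket_reso_steps
    logger.warning(
        f"bucket resolution {reso} is not divisible by bucket_reso_steps={bucket_reso_steps}, "
        f"rounding to {rounded}"
    )
    return rounded
```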

  • bitsandbytes is updated to 0.44.0. Now you can use AdEMAMix8bit and PagedAdEMAMix8bit in the training script. PR #1640 Thanks to sdbds!

    • There is no abbreviation, so please specify the full class path, e.g. --optimizer_type bitsandbytes.optim.AdEMAMix8bit (note: bitsandbytes, not bnb).
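
As an aside, resolving a fully qualified optimizer class like this can be sketched with a dynamic import; this is only an illustration of the idea, not the training script's actual resolution logic.

```python
# Illustrative sketch of resolving a fully qualified optimizer class such as
# "bitsandbytes.optim.AdEMAMix8bit"; not the training script's actual code.
import importlib

def resolve_optimizer_class(optimizer_type: str):
    module_name, class_name = optimizer_type.rsplit(".", 1)
    module = importlib.import_module(module_name)   # e.g. bitsandbytes.optim
    return getattr(module, class_name)              # e.g. AdEMAMix8bit

# usage (assumes bitsandbytes>=0.44.0 is installed):
# optimizer_class = resolve_optimizer_class("bitsandbytes.optim.AdEMAMix8bit")
# optimizer = optimizer_class(params, lr=1e-4)
```
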
  • Fixed a bug in the latents cache: when flip_aug, alpha_mask, and random_crop differed across multiple subsets in the dataset configuration file (.toml), the settings of the last subset were applied to all subsets instead of each subset's own settings.

  • Fixed an issue where the timesteps in the batch were the same when using Huber loss. PR #1628 Thanks to recris!
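
For context, drawing an independent timestep for every sample in the batch (rather than one shared value) can be sketched as follows; this is a generic illustration, not the PR's exact implementation.

```python
# Generic illustration: sample an independent timestep per batch element instead of
# one shared timestep for the whole batch; not the PR's exact code.
import torch

def sample_timesteps(batch_size: int, num_train_timesteps: int, device) -> torch.Tensor:
    # shape (batch_size,): each sample gets its own timestep
    return torch.randint(0, num_train_timesteps, (batch_size,), device=device, dtype=torch.long)
```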

  • Improvements in OFT (Orthogonal Finetuning) Implementation

    1. Optimization of Calculation Order:
      • Changed the calculation order in the forward method from (Wx)R to W(xR).
      • This has improved computational efficiency and processing speed.
    2. Correction of Bias Application:
      • In the previous implementation, R was incorrectly applied to the bias.
      • The new implementation now correctly handles bias by using F.conv2d and F.linear.
    3. Efficiency Enhancement in Matrix Operations:
      • Introduced einsum in both the forward and merge_to methods.
      • This has optimized matrix operations, resulting in further speed improvements.
    4. Proper Handling of Data Types:
      • Improved to use torch.float32 during calculations and convert results back to the original data type.
      • This maintains precision while ensuring compatibility with the original model.
    5. Unified Processing for Conv2d and Linear Layers:
      • Implemented a consistent method for applying OFT to both layer types.
    • These changes have made the OFT implementation more efficient and accurate, potentially leading to improved model performance and training stability. A rough sketch of the revised forward pass is given after this section.

    • Additional Information

      • Recommended α value for OFT constraint: We recommend using α values between 1e-4 and 1e-2. This differs slightly from the original implementation of "(α*out_dim*out_dim)". Our implementation uses "(α*out_dim)", hence we recommend higher values than the 1e-5 suggested in the original implementation.

      • Performance Improvement: Training speed has been improved by approximately 30%.

      • Inference Environment: This implementation is compatible with and operates within Stable Diffusion web UI (SD1/2 and SDXL).
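
To make the above concrete, here is a minimal sketch of a forward pass for a Linear layer in the W(xR) order described above, with the orthogonal matrix applied via einsum, float32 used for the intermediate computation, and the original bias added by F.linear so it is never rotated. The function name and tensor shapes are illustrative assumptions, not the repository's actual implementation.

```python
# Illustrative sketch only: OFT-style forward for a Linear layer in the W(xR) order,
# with float32 intermediates and the frozen bias handled by F.linear. Names and shapes assumed.
import torch
import torch.nn.functional as F

def oft_linear_forward(x: torch.Tensor, org_weight: torch.Tensor,
                       org_bias: torch.Tensor, R: torch.Tensor) -> torch.Tensor:
    org_dtype = x.dtype
    # do the rotation in float32 for precision, then convert back at the end
    x32 = x.to(torch.float32)
    R32 = R.to(torch.float32)
    # apply the orthogonal matrix via einsum
    x_rot = torch.einsum("...i,ij->...j", x32, R32)
    # the frozen weight and bias are applied by F.linear, so the bias is not rotated
    out = F.linear(x_rot.to(org_weight.dtype), org_weight, org_bias)
    return out.to(org_dtype)
```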

  • The INVERSE_SQRT, COSINE_WITH_MIN_LR, and WARMUP_STABLE_DECAY learning rate schedules from the transformers library are now available. See PR #1393 for details. Thanks to sdbds!

    • See the transformers documentation for details on each scheduler.
    • --lr_warmup_steps and --lr_decay_steps can now be specified as a ratio of the number of training steps, not just as a step count. Example: --lr_warmup_steps=0.1 or --lr_warmup_steps=10%, etc. A sketch of the conversion is shown below.
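
A possible interpretation of the ratio form, shown as a sketch; accepting a % suffix and treating values below 1 as ratios are assumptions for illustration, not the script's exact parsing.

```python
# Sketch of converting a warmup/decay specification into an absolute step count.
# Accepting "0.1", "10%", or "500" is an assumption for illustration.
def resolve_steps(value: str, max_train_steps: int) -> int:
    if value.endswith("%"):
        ratio = float(value[:-1]) / 100.0
        return int(max_train_steps * ratio)
    num = float(value)
    if num < 1.0:                      # treat values below 1 as a ratio of total steps
        return int(max_train_steps * num)
    return int(num)                    # otherwise it is already a step count

# e.g. resolve_steps("0.1", 10000) == 1000 and resolve_steps("10%", 10000) == 1000
```
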
  • When the script enlarges images (when the training image is small and bucket_no_upscale is not specified), it now uses Pillow's resize with LANCZOS interpolation instead of OpenCV's resize with Lanczos4 interpolation. The quality of the enlargement may be slightly improved. PR #1426 Thanks to sdbds!
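
For reference, the Pillow call for this kind of upscale looks roughly like the following; the file path and target size are placeholders.

```python
# Minimal example of a Pillow-based upscale; the path and target size are placeholders.
from PIL import Image

img = Image.open("train_image.png")
# Pillow's LANCZOS resampling replaces cv2.resize(..., interpolation=cv2.INTER_LANCZOS4)
upscaled = img.resize((1024, 1024), Image.Resampling.LANCZOS)
```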

  • Sample image generation during training now works on non-CUDA devices. PR #1433 Thanks to millie-v!

  • --v_parameterization is available in sdxl_train.py. The results are unpredictable, so use with caution. PR #1505 Thanks to liesened!

  • Fused optimizer is available for SDXL training. PR #1259 Thanks to 2kpr!

    • Memory usage during training is significantly reduced by integrating the optimizer's backward pass with its step. The training results are the same as before, but if you have plenty of memory, this option will be slower than the normal path.
    • Specify the --fused_backward_pass option in sdxl_train.py. At this time, only AdaFactor is supported. Gradient accumulation is not available.
    • Setting mixed precision to no seems to use less memory than fp16 or bf16.
    • Training is possible with a memory usage of about 17GB with a batch size of 1 and fp32. If you specify the --full_bf16 option, you can further reduce the memory usage (but the accuracy will be lower). With the same memory usage as before, you can increase the batch size.
    • PyTorch 2.1 or later is required because it uses the new API Tensor.register_post_accumulate_grad_hook(hook).
    • Mechanism: Normally, backward is computed for all parameters and then step is performed, so every gradient must be kept in memory at once. Fusing backward and step performs the optimizer step for each parameter as soon as its gradient is ready and frees it immediately, which lowers the memory peak. The more parameters there are, the greater the effect, so it is not effective in other training scripts (LoRA, etc.) where the memory usage peak is elsewhere, and there are no plans to implement it in those training scripts. A minimal sketch of the hook-based mechanism is shown below.
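
Below is a minimal, self-contained sketch of the fuse-backward-and-step idea using Tensor.register_post_accumulate_grad_hook (PyTorch 2.1+). The toy model and per-parameter SGD optimizers are illustrative assumptions; as noted above, the script itself currently supports only AdaFactor.

```python
# Sketch only: step each parameter as soon as its gradient is accumulated, then free the
# gradient, using Tensor.register_post_accumulate_grad_hook (PyTorch 2.1+).
# The toy model and per-parameter SGD optimizers are illustrative, not the script's code.
import torch

model = torch.nn.Linear(128, 128)
optimizers = {}

def make_hook(param):
    def hook(_tensor):  # called right after this parameter's gradient is accumulated
        optimizers[param].step()
        optimizers[param].zero_grad(set_to_none=True)  # release the gradient immediately
    return hook

for p in model.parameters():
    if p.requires_grad:
        optimizers[p] = torch.optim.SGD([p], lr=1e-4)  # one small optimizer per parameter
        p.register_post_accumulate_grad_hook(make_hook(p))

loss = model(torch.randn(4, 128)).sum()
loss.backward()  # parameters are stepped and their grads freed during the backward pass
```
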
  • Optimizer groups feature is added to SDXL training. PR #1319

    • Memory usage is reduced by the same principle as Fused optimizer. The training results and speed are the same as Fused optimizer.
    • Specify the number of groups like --fused_optimizer_groups 10 in sdxl_train.py. Increasing the number of groups reduces memory usage but slows down training. Since the effect is limited to a certain number, it is recommended to specify 4-10.
    • Any optimizer can be used, but optimizers that automatically calculate the learning rate (such as D-Adaptation and Prodigy) cannot be used. Gradient accumulation is not available.
    • --fused_optimizer_groups cannot be used with --fused_backward_pass. When using AdaFactor, the memory usage is slightly larger than with Fused optimizer. PyTorch 2.1 or later is required.
    • Mechanism: While the fused optimizer performs backward/step for individual parameters within the optimizer, optimizer groups reduce memory usage by splitting the parameters into groups and creating one optimizer per group, performing backward/step per group. The fused optimizer requires support on the optimizer side, while optimizer groups are implemented only on the training script side. A sketch of the grouping is shown below.
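
Under the same caveat (illustrative names, not the script's code), the grouping itself might look like this:

```python
# Sketch only: split trainable parameters into N groups and create one optimizer per group,
# so each group's gradients can be stepped and released separately.
import torch

def build_optimizer_groups(model: torch.nn.Module, num_groups: int = 10, lr: float = 1e-4):
    params = [p for p in model.parameters() if p.requires_grad]
    group_size = (len(params) + num_groups - 1) // num_groups  # ceil division
    groups = [params[i:i + group_size] for i in range(0, len(params), group_size)]
    return [torch.optim.AdamW(group, lr=lr) for group in groups]
```
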
  • LoRA+ is supported. PR #1233 Thanks to rockerBOO!

    • LoRA+ is a method that improves training speed by increasing the learning rate of the UP side (LoRA-B) of LoRA. Specify the multiplier; the original paper recommends 16, but adjust as needed. Please see the PR for details.
    • Specify loraplus_lr_ratio with --network_args. Example: --network_args "loraplus_lr_ratio=16"
    • loraplus_unet_lr_ratio and loraplus_text_encoder_lr_ratio can be specified separately for the U-Net and the Text Encoder.
      • Example: --network_args "loraplus_unet_lr_ratio=16" "loraplus_text_encoder_lr_ratio=4" or --network_args "loraplus_lr_ratio=16" "loraplus_text_encoder_lr_ratio=4" etc.
    • The network_module options networks.lora and networks.dylora are supported. A sketch of the parameter grouping idea is shown below.
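
As an illustration of the LoRA+ idea only (the matching on the "lora_up" parameter name and the helper name are assumptions, not the network module's actual code):

```python
# Illustrative sketch of LoRA+: give the "up" (B) matrices a higher learning rate than
# the "down" (A) matrices; the parameter-name matching here is an assumption.
import torch

def loraplus_param_groups(network: torch.nn.Module, base_lr: float = 1e-4,
                          loraplus_lr_ratio: float = 16.0):
    up_params, down_params = [], []
    for name, p in network.named_parameters():
        if not p.requires_grad:
            continue
        (up_params if "lora_up" in name else down_params).append(p)
    return [
        {"params": down_params, "lr": base_lr},
        {"params": up_params, "lr": base_lr * loraplus_lr_ratio},
    ]
```
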
  • The feature to use the transparency (alpha channel) of the image as a mask in the loss calculation has been added. PR #1223 Thanks to u-haru!

    • The transparent parts are ignored during training. Specify the --alpha_mask option in the training script, or set alpha_mask = true in the dataset configuration file. A sketch of the masking is shown below.
    • See About masked loss for details.
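
Conceptually, the per-pixel loss is weighted by the alpha channel resized to the loss resolution; the sketch below is a generic illustration, not the script's exact implementation.

```python
# Generic sketch: weight the per-element loss by the image's alpha channel so transparent
# regions contribute nothing; resizing the mask to the loss resolution is assumed here.
import torch
import torch.nn.functional as F

def apply_alpha_mask(loss: torch.Tensor, alpha_mask: torch.Tensor) -> torch.Tensor:
    # loss: (B, C, h, w) per-element loss; alpha_mask: (B, 1, H, W) with values in [0, 1]
    mask = F.interpolate(alpha_mask, size=loss.shape[-2:], mode="bilinear", align_corners=False)
    return (loss * mask).mean()
```
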
  • LoRA training in SDXL now supports block-wise learning rates and block-wise dim (rank). PR #1331

  • Negative learning rates can now be specified during SDXL model training. PR #1277 Thanks to Cauldrath!

    • The model is trained to move away from the training images, so it can easily collapse. Use with caution. A value close to 0 is recommended.
    • When specifying from the command line, use = like --learning_rate=-1e-7.
  • Training scripts can now output their training settings to wandb or TensorBoard logs. Specify the --log_config option. PR #1285 Thanks to ccharest93, plucked, rockerBOO, and VelocityRa!

    • Some settings, such as API keys and directory specifications, are not output due to security issues.
  • The ControlNet training script train_controlnet.py for SD1.5/2.x was not working, but it has been fixed. PR #1284 Thanks to sdbds!

  • train_network.py and sdxl_train_network.py now restore the order/position of data loading from DataSet when resuming training. PR #1353 #1359 Thanks to KohakuBlueleaf!

    • This resolves the issue where the order of data loading from DataSet changes when resuming training.
    • Specify the --skip_until_initial_step option to skip data loading until the specified step. If not specified, data loading starts from the beginning of the DataSet (same as before).
    • If --resume is specified, the step saved in the state is used.
    • Specify the --initial_step or --initial_epoch option to skip data loading until the specified step or epoch. Use these options in conjunction with --skip_until_initial_step. These options can be used without --resume (use them when resuming training with --network_weights).
  • An option --disable_mmap_load_safetensors is added to disable memory mapping when loading the model's .safetensors in SDXL. PR #1266 Thanks to Zovjsra!

    • Model file loading seems to be faster in WSL environments and similar setups.
    • Available in sdxl_train.py, sdxl_train_network.py, sdxl_train_textual_inversion.py, and sdxl_train_control_net_lllite.py. A sketch of the non-mmap loading idea is shown below.
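
For illustration only (the helper name and flag are assumptions): disabling mmap can be approximated with the safetensors library by reading the file into memory and deserializing from bytes instead of using the default memory-mapped load_file.

```python
# Sketch only: in-memory safetensors load versus the default mmap-based load.
from safetensors.torch import load, load_file

def load_state_dict(path: str, disable_mmap: bool = False):
    if disable_mmap:
        with open(path, "rb") as f:
            return load(f.read())  # deserialize from bytes held in memory (no mmap)
    return load_file(path)         # default: memory-mapped load
```
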
  • When there is an error in the cached latents file on disk, the file name is now displayed. PR #1278 Thanks to Cauldrath!

  • Fixed an error that occurs when specifying --max_dataloader_n_workers in tag_images_by_wd14_tagger.py when Onnx is not used. PR #1291 issue #1290 Thanks to frodo821!

  • Fixed a bug where caption_separator could not be specified per subset in the dataset settings .toml file. #1312 and #1313 Thanks to rockerBOO!

  • Fixed a potential bug in ControlNet-LLLite training. PR #1322 Thanks to aria1th!

  • Fixed some bugs when using DeepSpeed. Related #1247

  • Added a prompt option --f to gen_imgs.py to specify the file name when saving. Also, Diffusers-based keys for LoRA weights are now supported.

rockerBOO and others added 30 commits April 1, 2024 15:38
`--loraplus_ratio` added for both TE and UNet
Add log for lora+
When trying to load stored latents, if an error occurs, this change will tell you what file failed to load
Currently it will just tell you that something failed without telling you which file
This can be used to train away from a group of images you don't want
As this moves the model away from a point instead of towards it, the change in the model is unbounded
So, don't set it too low. -4e-7 seemed to work well.
If a latent file fails to load, print out the path and the error, then return false to regenerate it
Adafactor fused backward pass and optimizer step, lowers SDXL (@ 1024 resolution) VRAM usage to BF16(10GB)/FP32(16.4GB)
kohya-ss and others added 26 commits September 19, 2024 21:15
Make timesteps work in the standard way when Huber loss is used
New optimizer: AdEMAMix8bit and PagedAdEMAMix8bit
1) Updates debiased estimation loss function for V-pred.
2) Prevents now-deprecated scaling of loss if ztSNR is enabled.
Different model architectures, such as SDXL, can take advantage of
v-pred. It doesn't make sense to include these warnings anymore.
Update debiased estimation loss function to accommodate V-pred
@kohya-ss kohya-ss merged commit 6e3c1d0 into main Jan 17, 2025
2 checks passed
@RalFingerLP

Impressive update, thanks to all the contributors!

@FurkanGozukara

@kohya-ss amazing work

I tested the fused backward pass on SDXL with Adafactor and it reduced VRAM usage to as low as 10200 MB.

I also tried fused optimizer groups = 10 and it was around 10500 MB.

However, when enabling the fused backward pass together with block swaps it didn't make any further difference.

Can I reduce VRAM usage any further? SDXL training at 1024x1024.

I can train FLUX dev on GPUs with less than 8 GB using block swaps.
