merge dev to main #1879

Merged: 140 commits merged into main from dev on Jan 17, 2025

Conversation

kohya-ss
Owner

  • Important: The dependent libraries have been updated. Please see the Upgrade section and update the libraries.

    • bitsandbytes, transformers, accelerate and huggingface_hub are updated.
    • If you encounter any issues, please report them.
  • Fixed a bug where the loss weight was incorrect when --debiased_estimation_loss was specified with --v_parameterization. PR #1715 Thanks to catboxanon! See the PR for details.

    • Removed the warning when --v_parameterization is specified in SDXL and SD1.5. PR #1717
  • There was a bug where min_bucket_reso/max_bucket_reso in the dataset configuration did not create the correct resolution buckets when they were not divisible by bucket_reso_steps. A warning is now shown and the values are automatically rounded to a divisible value. Thanks to Maru-mee for raising the issue. Related PR #1632
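
A minimal sketch of the rounding described above; the helper name, the warning message, and rounding to the nearest multiple are illustrative assumptions, not the script's actual code.

```python
# Hypothetical illustration of rounding a bucket resolution bound to a value divisible
# by bucket_reso_steps; the rounding direction here is an assumption for illustration.
import logging

logger = logging.getLogger(__name__)

def round_to_steps(reso: int, bucket_reso_steps: int = 64) -> int:
    if reso % bucket_reso_steps == 0:
        return reso
    rounded = round(reso / bucket_reso_steps) * bucket_reso_steps
    logger.warning(
        f"bucket resolution {reso} is not divisible by bucket_reso_steps={bucket_reso_steps}, "
        f"rounding to {rounded}"
    )
    return rounded
```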

  • bitsandbytes is updated to 0.44.0. Now you can use AdEMAMix8bit and PagedAdEMAMix8bit in the training script. PR #1640 Thanks to sdbds!

    • There is no abbreviation, so please specify the full class path, e.g. --optimizer_type bitsandbytes.optim.AdEMAMix8bit (note: bitsandbytes, not bnb).
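
As an aside, resolving a fully qualified optimizer class like this can be sketched with a dynamic import; this is only an illustration of the idea, not the training script's actual resolution logic.

```python
# Illustrative sketch of resolving a fully qualified optimizer class such as
# "bitsandbytes.optim.AdEMAMix8bit"; not the training script's actual code.
import importlib

def resolve_optimizer_class(optimizer_type: str):
    module_name, class_name = optimizer_type.rsplit(".", 1)
    module = importlib.import_module(module_name)   # e.g. bitsandbytes.optim
    return getattr(module, class_name)              # e.g. AdEMAMix8bit

# usage (assumes bitsandbytes>=0.44.0 is installed):
# optimizer_class = resolve_optimizer_class("bitsandbytes.optim.AdEMAMix8bit")
# optimizer = optimizer_class(params, lr=1e-4)
```
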
  • Fixed a bug in the latents cache: when flip_aug, alpha_mask, and random_crop differed across multiple subsets in the dataset configuration file (.toml), the settings of the last subset were applied to all subsets instead of each subset's own settings.

  • Fixed an issue where the timesteps in the batch were the same when using Huber loss. PR #1628 Thanks to recris!
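
For context, drawing an independent timestep for every sample in the batch (rather than one shared value) can be sketched as follows; this is a generic illustration, not the PR's exact implementation.

```python
# Generic illustration: sample an independent timestep per batch element instead of
# one shared timestep for the whole batch; not the PR's exact code.
import torch

def sample_timesteps(batch_size: int, num_train_timesteps: int, device) -> torch.Tensor:
    # shape (batch_size,): each sample gets its own timestep
    return torch.randint(0, num_train_timesteps, (batch_size,), device=device, dtype=torch.long)
```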

  • Improvements in OFT (Orthogonal Finetuning) Implementation

    1. Optimization of Calculation Order:
      • Changed the calculation order in the forward method from (Wx)R to W(xR).
      • This has improved computational efficiency and processing speed.
    2. Correction of Bias Application:
      • In the previous implementation, R was incorrectly applied to the bias.
      • The new implementation now correctly handles bias by using F.conv2d and F.linear.
    3. Efficiency Enhancement in Matrix Operations:
      • Introduced einsum in both the forward and merge_to methods.
      • This has optimized matrix operations, resulting in further speed improvements.
    4. Proper Handling of Data Types:
      • Improved to use torch.float32 during calculations and convert results back to the original data type.
      • This maintains precision while ensuring compatibility with the original model.
    5. Unified Processing for Conv2d and Linear Layers:
      • Implemented a consistent method for applying OFT to both layer types.
    • These changes have made the OFT implementation more efficient and accurate, potentially leading to improved model performance and training stability. A rough sketch of the revised forward pass is given after this section.

    • Additional Information

      • Recommended α value for OFT constraint: We recommend using α values between 1e-4 and 1e-2. This differs slightly from the original implementation of "(α*out_dim*out_dim)". Our implementation uses "(α*out_dim)", hence we recommend higher values than the 1e-5 suggested in the original implementation.

      • Performance Improvement: Training speed has been improved by approximately 30%.

      • Inference Environment: This implementation is compatible with and operates within Stable Diffusion web UI (SD1/2 and SDXL).
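
To make the above concrete, here is a minimal sketch of a forward pass for a Linear layer in the W(xR) order described above, with the orthogonal matrix applied via einsum, float32 used for the intermediate computation, and the original bias added by F.linear so it is never rotated. The function name and tensor shapes are illustrative assumptions, not the repository's actual implementation.

```python
# Illustrative sketch only: OFT-style forward for a Linear layer in the W(xR) order,
# with float32 intermediates and the frozen bias handled by F.linear. Names and shapes assumed.
import torch
import torch.nn.functional as F

def oft_linear_forward(x: torch.Tensor, org_weight: torch.Tensor,
                       org_bias: torch.Tensor, R: torch.Tensor) -> torch.Tensor:
    org_dtype = x.dtype
    # do the rotation in float32 for precision, then convert back at the end
    x32 = x.to(torch.float32)
    R32 = R.to(torch.float32)
    # apply the orthogonal matrix via einsum
    x_rot = torch.einsum("...i,ij->...j", x32, R32)
    # the frozen weight and bias are applied by F.linear, so the bias is not rotated
    out = F.linear(x_rot.to(org_weight.dtype), org_weight, org_bias)
    return out.to(org_dtype)
```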

  • The INVERSE_SQRT, COSINE_WITH_MIN_LR, and WARMUP_STABLE_DECAY learning rate schedules from the transformers library are now available. See PR #1393 for details. Thanks to sdbds!

    • See the transformers documentation for details on each scheduler.
    • --lr_warmup_steps and --lr_decay_steps can now be specified as a ratio of the number of training steps, not just as a step count. Example: --lr_warmup_steps=0.1 or --lr_warmup_steps=10%, etc. A sketch of the conversion is shown below.
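
A possible interpretation of the ratio form, shown as a sketch; accepting a % suffix and treating values below 1 as ratios are assumptions for illustration, not the script's exact parsing.

```python
# Sketch of converting a warmup/decay specification into an absolute step count.
# Accepting "0.1", "10%", or "500" is an assumption for illustration.
def resolve_steps(value: str, max_train_steps: int) -> int:
    if value.endswith("%"):
        ratio = float(value[:-1]) / 100.0
        return int(max_train_steps * ratio)
    num = float(value)
    if num < 1.0:                      # treat values below 1 as a ratio of total steps
        return int(max_train_steps * num)
    return int(num)                    # otherwise it is already a step count

# e.g. resolve_steps("0.1", 10000) == 1000 and resolve_steps("10%", 10000) == 1000
```
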
  • When the script enlarges images (when the training image is small and bucket_no_upscale is not specified), it now uses Pillow's resize with LANCZOS interpolation instead of OpenCV's resize with Lanczos4 interpolation. The quality of the enlargement may be slightly improved. PR #1426 Thanks to sdbds!
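
For reference, the Pillow call for this kind of upscale looks roughly like the following; the file path and target size are placeholders.

```python
# Minimal example of a Pillow-based upscale; the path and target size are placeholders.
from PIL import Image

img = Image.open("train_image.png")
# Pillow's LANCZOS resampling replaces cv2.resize(..., interpolation=cv2.INTER_LANCZOS4)
upscaled = img.resize((1024, 1024), Image.Resampling.LANCZOS)
```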

  • Sample image generation during training now works on non-CUDA devices. PR #1433 Thanks to millie-v!

  • --v_parameterization is available in sdxl_train.py. The results are unpredictable, so use with caution. PR #1505 Thanks to liesened!

  • Fused optimizer is available for SDXL training. PR #1259 Thanks to 2kpr!

    • Memory usage during training is significantly reduced by integrating the optimizer's backward pass with its step. The training results are the same as before, but if you have plenty of memory, this option will be slower than the normal path.
    • Specify the --fused_backward_pass option in sdxl_train.py. At this time, only AdaFactor is supported. Gradient accumulation is not available.
    • Setting mixed precision to no seems to use less memory than fp16 or bf16.
    • Training is possible with a memory usage of about 17GB with a batch size of 1 and fp32. If you specify the --full_bf16 option, you can further reduce the memory usage (but the accuracy will be lower). With the same memory usage as before, you can increase the batch size.
    • PyTorch 2.1 or later is required because it uses the new API Tensor.register_post_accumulate_grad_hook(hook).
    • Mechanism: Normally, backward is computed for all parameters and then step is performed, so every gradient must be kept in memory at once. Fusing backward and step performs the optimizer step for each parameter as soon as its gradient is ready and frees it immediately, which lowers the memory peak. The more parameters there are, the greater the effect, so it is not effective in other training scripts (LoRA, etc.) where the memory usage peak is elsewhere, and there are no plans to implement it in those training scripts. A minimal sketch of the hook-based mechanism is shown below.
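
Below is a minimal, self-contained sketch of the fuse-backward-and-step idea using Tensor.register_post_accumulate_grad_hook (PyTorch 2.1+). The toy model and per-parameter SGD optimizers are illustrative assumptions; as noted above, the script itself currently supports only AdaFactor.

```python
# Sketch only: step each parameter as soon as its gradient is accumulated, then free the
# gradient, using Tensor.register_post_accumulate_grad_hook (PyTorch 2.1+).
# The toy model and per-parameter SGD optimizers are illustrative, not the script's code.
import torch

model = torch.nn.Linear(128, 128)
optimizers = {}

def make_hook(param):
    def hook(_tensor):  # called right after this parameter's gradient is accumulated
        optimizers[param].step()
        optimizers[param].zero_grad(set_to_none=True)  # release the gradient immediately
    return hook

for p in model.parameters():
    if p.requires_grad:
        optimizers[p] = torch.optim.SGD([p], lr=1e-4)  # one small optimizer per parameter
        p.register_post_accumulate_grad_hook(make_hook(p))

loss = model(torch.randn(4, 128)).sum()
loss.backward()  # parameters are stepped and their grads freed during the backward pass
```
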
  • Optimizer groups feature is added to SDXL training. PR #1319

    • Memory usage is reduced by the same principle as Fused optimizer. The training results and speed are the same as Fused optimizer.
    • Specify the number of groups like --fused_optimizer_groups 10 in sdxl_train.py. Increasing the number of groups reduces memory usage but slows down training. Since the effect is limited to a certain number, it is recommended to specify 4-10.
    • Any optimizer can be used, but optimizers that automatically calculate the learning rate (such as D-Adaptation and Prodigy) cannot be used. Gradient accumulation is not available.
    • --fused_optimizer_groups cannot be used with --fused_backward_pass. When using AdaFactor, the memory usage is slightly larger than with Fused optimizer. PyTorch 2.1 or later is required.
    • Mechanism: While the fused optimizer performs backward/step for individual parameters within the optimizer, optimizer groups reduce memory usage by splitting the parameters into groups and creating one optimizer per group, performing backward/step per group. The fused optimizer requires support on the optimizer side, while optimizer groups are implemented only on the training script side. A sketch of the grouping is shown below.
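
Under the same caveat (illustrative names, not the script's code), the grouping itself might look like this:

```python
# Sketch only: split trainable parameters into N groups and create one optimizer per group,
# so each group's gradients can be stepped and released separately.
import torch

def build_optimizer_groups(model: torch.nn.Module, num_groups: int = 10, lr: float = 1e-4):
    params = [p for p in model.parameters() if p.requires_grad]
    group_size = (len(params) + num_groups - 1) // num_groups  # ceil division
    groups = [params[i:i + group_size] for i in range(0, len(params), group_size)]
    return [torch.optim.AdamW(group, lr=lr) for group in groups]
```
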
  • LoRA+ is supported. PR #1233 Thanks to rockerBOO!

    • LoRA+ is a method that improves training speed by increasing the learning rate of the UP side (LoRA-B) of LoRA. Specify the multiplier; the original paper recommends 16, but adjust as needed. Please see the PR for details.
    • Specify loraplus_lr_ratio with --network_args. Example: --network_args "loraplus_lr_ratio=16"
    • loraplus_unet_lr_ratio and loraplus_text_encoder_lr_ratio can be specified separately for the U-Net and the Text Encoder.
      • Example: --network_args "loraplus_unet_lr_ratio=16" "loraplus_text_encoder_lr_ratio=4" or --network_args "loraplus_lr_ratio=16" "loraplus_text_encoder_lr_ratio=4" etc.
    • The network_module options networks.lora and networks.dylora are supported. A sketch of the parameter grouping idea is shown below.
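
As an illustration of the LoRA+ idea only (the matching on the "lora_up" parameter name and the helper name are assumptions, not the network module's actual code):

```python
# Illustrative sketch of LoRA+: give the "up" (B) matrices a higher learning rate than
# the "down" (A) matrices; the parameter-name matching here is an assumption.
import torch

def loraplus_param_groups(network: torch.nn.Module, base_lr: float = 1e-4,
                          loraplus_lr_ratio: float = 16.0):
    up_params, down_params = [], []
    for name, p in network.named_parameters():
        if not p.requires_grad:
            continue
        (up_params if "lora_up" in name else down_params).append(p)
    return [
        {"params": down_params, "lr": base_lr},
        {"params": up_params, "lr": base_lr * loraplus_lr_ratio},
    ]
```
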
  • The feature to use the transparency (alpha channel) of the image as a mask in the loss calculation has been added. PR #1223 Thanks to u-haru!

    • The transparent parts are ignored during training. Specify the --alpha_mask option in the training script, or set alpha_mask = true in the dataset configuration file. A sketch of the masking is shown below.
    • See About masked loss for details.
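
Conceptually, the per-pixel loss is weighted by the alpha channel resized to the loss resolution; the sketch below is a generic illustration, not the script's exact implementation.

```python
# Generic sketch: weight the per-element loss by the image's alpha channel so transparent
# regions contribute nothing; resizing the mask to the loss resolution is assumed here.
import torch
import torch.nn.functional as F

def apply_alpha_mask(loss: torch.Tensor, alpha_mask: torch.Tensor) -> torch.Tensor:
    # loss: (B, C, h, w) per-element loss; alpha_mask: (B, 1, H, W) with values in [0, 1]
    mask = F.interpolate(alpha_mask, size=loss.shape[-2:], mode="bilinear", align_corners=False)
    return (loss * mask).mean()
```
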
  • LoRA training in SDXL now supports block-wise learning rates and block-wise dim (rank). PR #1331

  • Negative learning rates can now be specified during SDXL model training. PR #1277 Thanks to Cauldrath!

    • The model is trained to move away from the training images, so it can easily collapse. Use with caution. A value close to 0 is recommended.
    • When specifying from the command line, use = like --learning_rate=-1e-7.
  • Training scripts can now output their training settings to wandb or TensorBoard logs. Specify the --log_config option. PR #1285 Thanks to ccharest93, plucked, rockerBOO, and VelocityRa!

    • Some settings, such as API keys and directory specifications, are not output due to security issues.
  • The ControlNet training script train_controlnet.py for SD1.5/2.x was not working, but it has been fixed. PR #1284 Thanks to sdbds!

  • train_network.py and sdxl_train_network.py now restore the order/position of data loading from DataSet when resuming training. PR #1353 #1359 Thanks to KohakuBlueleaf!

    • This resolves the issue where the order of data loading from DataSet changes when resuming training.
    • Specify the --skip_until_initial_step option to skip data loading until the specified step. If not specified, data loading starts from the beginning of the DataSet (same as before).
    • If --resume is specified, the step saved in the state is used.
    • Specify the --initial_step or --initial_epoch option to skip data loading until the specified step or epoch. Use these options in conjunction with --skip_until_initial_step. These options can be used without --resume (use them when resuming training with --network_weights).
  • An option --disable_mmap_load_safetensors is added to disable memory mapping when loading the model's .safetensors in SDXL. PR #1266 Thanks to Zovjsra!

    • Model file loading seems to be faster in WSL environments and similar setups.
    • Available in sdxl_train.py, sdxl_train_network.py, sdxl_train_textual_inversion.py, and sdxl_train_control_net_lllite.py. A sketch of the non-mmap loading idea is shown below.
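
For illustration only (the helper name and flag are assumptions): disabling mmap can be approximated with the safetensors library by reading the file into memory and deserializing from bytes instead of using the default memory-mapped load_file.

```python
# Sketch only: in-memory safetensors load versus the default mmap-based load.
from safetensors.torch import load, load_file

def load_state_dict(path: str, disable_mmap: bool = False):
    if disable_mmap:
        with open(path, "rb") as f:
            return load(f.read())  # deserialize from bytes held in memory (no mmap)
    return load_file(path)         # default: memory-mapped load
```
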
  • When there is an error in the cached latents file on disk, the file name is now displayed. PR #1278 Thanks to Cauldrath!

  • Fixed an error that occurs when specifying --max_dataloader_n_workers in tag_images_by_wd14_tagger.py when Onnx is not used. PR #1291 issue #1290 Thanks to frodo821!

  • Fixed a bug where caption_separator could not be specified per subset in the dataset settings .toml file. #1312 and #1313 Thanks to rockerBOO!

  • Fixed a potential bug in ControlNet-LLLite training. PR #1322 Thanks to aria1th!

  • Fixed some bugs when using DeepSpeed. Related #1247

  • Added a prompt option --f to gen_imgs.py to specify the file name when saving. Also, Diffusers-based keys for LoRA weights are now supported.

rockerBOO and others added 30 commits April 1, 2024 15:38
`--loraplus_ratio` added for both TE and UNet
Add log for lora+
When trying to load stored latents, if an error occurs, this change will tell you what file failed to load
Currently it will just tell you that something failed without telling you which file
This can be used to train away from a group of images you don't want
As this moves the model away from a point instead of towards it, the change in the model is unbounded
So, don't set it too low. -4e-7 seemed to work well.
If a latent file fails to load, print out the path and the error, then return false to regenerate it
Adafactor fused backward pass and optimizer step, lowers SDXL (@ 1024 resolution) VRAM usage to BF16(10GB)/FP32(16.4GB)
kohya-ss and others added 26 commits September 19, 2024 21:15
Make timesteps work in the standard way when Huber loss is used
New optimizer: AdEMAMix8bit and PagedAdEMAMix8bit
1) Updates debiased estimation loss function for V-pred.
2) Prevents now-deprecated scaling of loss if ztSNR is enabled.
Different model architectures, such as SDXL, can take advantage of
v-pred. It doesn't make sense to include these warnings anymore.
Update debiased estimation loss function to accommodate V-pred
@kohya-ss kohya-ss merged commit 6e3c1d0 into main Jan 17, 2025
2 checks passed
@RalFingerLP

Impressive update, thanks to all the contributors!

@FurkanGozukara

@kohya-ss amazing work

I tested the fused backward pass on SDXL with Adafactor and it reduced VRAM usage to as low as 10200 MB.

I also tried fused optimizer groups = 10 and it was around 10500 MB.

However, when enabling the fused backward pass together with block swaps it didn't make any further difference.

Can I reduce VRAM usage any further? SDXL training at 1024x1024.

I can train FLUX dev on GPUs with less than 8 GB using block swaps.
