Added --blocks_to_swap_while_sampling, may allow faster sample image generation #2056


Open
wants to merge 1 commit into sd3

Conversation

@araleza commented Apr 20, 2025

I've recently switched over to doing Flux full fine-tuning instead of LoRA training, but I've found that sample image generation while training is very slow. I'm using --blocks_to_swap 35, which lets me use a batch size of 5, and this block swapping persists during sample inference, increasing sample image generation time.

Block swapping is useful while training because it saves a lot of VRAM, leaving room for a larger batch size and for the optimizer's state (e.g. momentum). Neither of these is needed during sample image inference: if I open up nvtop, I can see that my VRAM is mostly unused during this time.

This new option lets the number of blocks to swap be set to a lower value while generating sample images, which may allow faster image generation. For example, on my current setup with 50 sampling steps per image, setting --blocks_to_swap_while_sampling 2 reduces the time per image from around 3 minutes to around 1 minute 48 seconds. That might not sound like a big difference at first, but over the roughly 100 sample images I generate in a run, it saves around 2 hours in total.
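For illustration, a minimal sketch of the kind of wiring involved; this is not the actual diff, and everything apart from the two flag names (--blocks_to_swap and --blocks_to_swap_while_sampling) is an assumption:

```python
# Sketch only: the two argument names come from this PR; the parser object,
# the default handling, and the fallback logic below are assumptions.
parser.add_argument(
    "--blocks_to_swap_while_sampling",
    type=int,
    default=None,
    help="number of blocks to swap during sample image generation "
    "(falls back to --blocks_to_swap when not set)",
)

# At sample-generation time, use the lower value if one was given.
blocks_to_swap_for_sampling = (
    args.blocks_to_swap_while_sampling
    if args.blocks_to_swap_while_sampling is not None
    else args.blocks_to_swap
)
```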

@rockerBOO (Contributor)

Maybe it would be better not to have blocks_to_swap_while_sampling in the forward, and instead add a model method that is called to change the block swap value specifically. That way the forward doesn't gain new parameters that could cause side effects later, with callers having to pass extra arguments down to the model that aren't really model inputs.
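A hedged sketch of that suggestion (the method name set_blocks_to_swap, the class name, and the offloader attribute below are hypothetical, not the actual sd-scripts code):

```python
# Hypothetical model-side API in the spirit of the suggestion above: a dedicated
# method changes the swap count, so forward() keeps its original signature and
# callers never thread a non-input argument through it.
class FluxFineTuneModel:
    def __init__(self, offloader, blocks_to_swap: int):
        self.offloader = offloader            # assumed offloader object
        self.blocks_to_swap = blocks_to_swap

    def set_blocks_to_swap(self, num_blocks: int) -> None:
        """Change how many blocks are swapped; forward()'s inputs are untouched."""
        self.blocks_to_swap = num_blocks
        self.offloader.blocks_to_swap = num_blocks  # assumed attribute

    def forward(self, latents, timesteps, **cond):
        # unchanged forward pass; block swapping is handled by the offloader
        ...
```

The training loop could then lower the value before generating samples and restore it afterwards, e.g. model.set_blocks_to_swap(args.blocks_to_swap_while_sampling) before sampling and model.set_blocks_to_swap(args.blocks_to_swap) once sampling finishes.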

@kohya-ss (Owner)

I'm sorry it took me so long to check.

rockerBOO has a point.

I think there may be a way to extend ModelOffloader and switch the number of blocks depending on whether the model is in train or eval mode. submit_move_blocks, wait_for_block, and prepare_block_devices_before_forward may be able to adjust the number of blocks by receiving the model as an additional argument.
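A minimal sketch of that idea, under assumed signatures; only the class and method names come from the comment above, and the real implementations in sd-scripts look different:

```python
# Sketch: ModelOffloader picks the swap count from the model's train/eval state.
# Constructor arguments, the helper, and the method bodies are assumptions.
class ModelOffloader:
    def __init__(self, blocks_to_swap_train: int, blocks_to_swap_eval: int):
        self.blocks_to_swap_train = blocks_to_swap_train
        self.blocks_to_swap_eval = blocks_to_swap_eval

    def _current_blocks_to_swap(self, model) -> int:
        # model.training is the standard torch.nn.Module flag toggled by
        # model.train() / model.eval().
        return self.blocks_to_swap_train if model.training else self.blocks_to_swap_eval

    def prepare_block_devices_before_forward(self, model, blocks):
        n = self._current_blocks_to_swap(model)
        ...  # place the last n blocks on CPU, the rest on GPU

    def submit_move_blocks(self, model, blocks, block_index):
        n = self._current_blocks_to_swap(model)
        ...  # asynchronously swap blocks according to n

    def wait_for_block(self, model, block_index):
        ...  # wait until the block needed next is resident on the GPU
```

With something along these lines, calling model.eval() before sample generation and model.train() afterwards would be enough to switch the swap count, without adding a new forward parameter.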

@araleza (Author) commented Apr 27, 2025

Thanks for the reviews, @rockerBOO and @kohya-ss. I'll take a look at the code soon and try to include your suggestions for improvement. :)
