Added --blocks_to_swap_while_sampling, may allow faster sample image generation #2056
I've recently switched from LoRA training to full Flux fine-tuning, but I've found that sample image generation during training is very slow. I'm using
--blocks_to_swap 35
which lets me use a batch size of 5. This block swapping persists during sample inference, increasing sample image generation time. Block swapping is useful during training because it saves a lot of VRAM, both allowing a larger batch size and making room for the optimizer's state (e.g. momentum). Neither is needed during sample image inference; if I open nvtop, I can see that my VRAM is mostly unused during this time.
This new option allows the number of blocks to swap to be lowered while generating sample images, which can make generation faster. For example, on my current setup with 50 sampling steps per image, setting
--blocks_to_swap_while_sampling 2
reduces the time per image from around 3 minutes to around 1 minute 48 seconds. That might not sound like a big difference at first, but over a run that generates around 100 images it saves around 2 hours in total.
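For illustration only, here is a minimal sketch of the idea behind the option: temporarily lower the swap count around sample generation and restore the training value afterwards. The `BlockSwapModel` class and `swap_fewer_blocks` context manager are hypothetical stand-ins, not the actual sd-scripts implementation.

```python
from contextlib import contextmanager


class BlockSwapModel:
    """Toy stand-in for a model that offloads transformer blocks to CPU.

    Hypothetical: the real training script manages block swapping
    internally; this only illustrates the shape of the new option.
    """

    def __init__(self, blocks_to_swap):
        self.blocks_to_swap = blocks_to_swap


@contextmanager
def swap_fewer_blocks(model, blocks_to_swap_while_sampling):
    """Temporarily reduce the number of swapped blocks for sampling,
    then restore the training value when sampling finishes."""
    if blocks_to_swap_while_sampling is None:
        # Option not set: keep the training-time swap behaviour.
        yield model
        return
    original = model.blocks_to_swap
    model.blocks_to_swap = blocks_to_swap_while_sampling
    try:
        yield model
    finally:
        model.blocks_to_swap = original


model = BlockSwapModel(blocks_to_swap=35)   # --blocks_to_swap 35
with swap_fewer_blocks(model, 2) as m:      # --blocks_to_swap_while_sampling 2
    # Sample images are generated here with only 2 blocks swapped,
    # since the optimizer state doesn't occupy VRAM during inference.
    print(m.blocks_to_swap)                 # 2 during sampling
print(model.blocks_to_swap)                 # 35 restored for training
```

The `try`/`finally` matters: even if sampling raises, the training swap count is restored so the next training step behaves as before.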