Prior model preservation #505

Draft · wants to merge 2 commits into master

Conversation

dxqbYD (Contributor) commented Oct 11, 2024

This code can be used to preserve the prior model on prompts other than the trained captions. After several more tests, I think this is worth implementing as a fairly generic feature:

  • It does not require any regularization image data. It works even when using the same training data for the reg steps as for the regular training steps.
  • It does not require a regularization caption. An empty caption for the reg steps works, indicating that this can preserve all kinds of concepts, whatever you train on.
  • Additionally, it might improve training results on the trained captions, but I am not sure about this yet.

Let me know if I should provide more details here; for now they can be found on the OT Discord.
There is a feature request for SimpleTuner here: bghira/SimpleTuner#1031

This is a draft PR only to gauge interest in a full PR. It currently works only with batch size one, only for Flux, only for LoRA, and only for the transformer.

It could be implemented generically for all LoRAs. With major effort, it could also be implemented for full fine-tuning, but to avoid holding the full model in VRAM twice, the reg-step predictions would have to be pre-generated.
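For reference, the core of a reg step is simple. A minimal sketch, with illustrative names only (not the actual OneTrainer API) and assuming a plain MSE training loss:

```python
import torch
import torch.nn.functional as F

def step_loss(model, lora_modules, batch, is_reg_step: bool):
    if is_reg_step:
        # Prior preservation: the target is the prediction of the unmodified base
        # model, obtained with the LoRA temporarily switched off.
        with torch.no_grad():
            for lora in lora_modules:
                lora.enabled = False          # hypothetical per-module switch
            target = model(batch["noisy_latents"], batch["timestep"], batch["text_embeds"])
            for lora in lora_modules:
                lora.enabled = True
    else:
        target = batch["target"]              # the usual noise / flow-matching target

    prediction = model(batch["noisy_latents"], batch["timestep"], batch["text_embeds"])
    return F.mse_loss(prediction, target)
```

The only change compared to a normal step is where the target comes from; the loss itself stays the same.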

@FurkanGozukara

@dxqbYD can you add examples? Your examples are great. Even though I couldn't make it work, maybe it will once it is properly implemented :D

So, comparison examples and how you set up your concepts.

dxqbYD (Contributor, Author) commented Oct 14, 2024

Samples can be found in these SimpleTuner release notes: https://www.reddit.com/r/StableDiffusion/comments/1g2i13s/simpletuner_v112_now_with_masked_loss_training/

dxqbYD (Contributor, Author) commented Oct 19, 2024

kohya implementation: kohya-ss/sd-scripts#1710

Nerogar (Owner) commented Oct 20, 2024

This sounds like a really good idea to add as an option, but it definitely needs a more generic implementation. There are two issues to solve:

Dataset

How do we select the regularization samples during training? This also needs to work with a higher batch size than 1. Ideally it would mix regularization samples and normal training samples within the same batch.
"It does not require a regularization caption" I don't think this is strictly true. You need some kind of conditioning for the model. Not conditioning the model at all will probably significantly reduce the effect of this training method.
What do you think about adding a new flag to concepts that toggles this loss calculation for specific training samples? Then the user can decide whether to include captions or not, and which images to use.
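For illustration, a concept entry with such a flag could look roughly like this (shown as a Python dict; the field names are hypothetical and not OneTrainer's actual concept schema):

```python
# Hypothetical concept entry with the proposed toggle; field names are illustrative.
reg_concept = {
    "name": "regularization",
    "path": "training_images/",       # may even reuse the normal training images
    "prompt_source": "none",          # empty captions are reportedly enough
    "repeats": 0.5,                   # balance reg steps against normal steps
    "prior_preservation": True,       # proposed flag: loss is computed against the prior prediction
}
```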

Unhooking the LoRA

Each model has different sub-modules. So we need a generic method of disabling the LoRA for the prior result. A function in the model class to enable/disable all LoRAs could work well.
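A rough sketch of what that toggle could look like (class and method names are made up here, not existing OneTrainer code):

```python
from contextlib import contextmanager

class LoRAToggleMixin:
    def all_lora_modules(self):
        # Each model class returns whatever sub-modules carry LoRA weights
        # (transformer, text encoders, ...).
        raise NotImplementedError

    @contextmanager
    def lora_disabled(self):
        # Temporarily switch off every LoRA module, restoring them afterwards.
        modules = list(self.all_lora_modules())
        for m in modules:
            m.set_enabled(False)   # hypothetical per-module switch
        try:
            yield
        finally:
            for m in modules:
                m.set_enabled(True)
```

A training step could then wrap the prior prediction in `with model.lora_disabled(): ...` regardless of which model family is being trained.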

bghira commented Oct 20, 2024

How do you intend to mix regularisation and training samples in a single batch, @Nerogar? That seems non-trivial: the actual target is changed.

Nerogar (Owner) commented Oct 20, 2024

The only difference between prior preservation and normal training is the prediction target. So what I would do is basically this:

  1. Find the samples in the batch where the prior_preservation flag is set to True
  2. Calculate the prior prediction without the LoRA for those samples
  3. Replace the target of the batch in those samples with the prior prediction
  4. Calculate the loss without any modification
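A sketch of those four steps, with illustrative tensor and method names and assuming a plain MSE loss:

```python
import torch
import torch.nn.functional as F

def compute_loss(model, batch):
    target = batch["target"].clone()
    reg_mask = batch["prior_preservation"]                  # bool tensor, shape [batch_size]

    if reg_mask.any():
        # Steps 1-3: the prior prediction without the LoRA becomes the target for flagged samples.
        with torch.no_grad(), model.lora_disabled():        # generic toggle as discussed above
            prior = model(batch["noisy_latents"][reg_mask],
                          batch["timestep"][reg_mask],
                          batch["text_embeds"][reg_mask])
        target[reg_mask] = prior

    # Step 4: one forward pass and an unmodified loss over the whole mixed batch.
    prediction = model(batch["noisy_latents"], batch["timestep"], batch["text_embeds"])
    return F.mse_loss(prediction, target)
```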

bghira commented Oct 20, 2024

Yes, unfortunately it just doesn't have the same regularisation effect when done that way. Having an entire batch pull back toward the model is what works.

dxqbYD (Contributor, Author) commented Oct 20, 2024

> Yes, unfortunately it just doesn't have the same regularisation effect when done that way. Having an entire batch pull back toward the model is what works.

What are you basing this on?

What Nerogar describes above is what kohya has implemented. So if true, that would mean kohya's implementation doesn't work (as well).

bghira commented Oct 20, 2024

basing it on numerous tests we've run on a cluster of H100s over the last week

dxqbYD (Contributor, Author) commented Oct 20, 2024

> How do we select the regularization samples during training? This also needs to work with a higher batch size than 1. Ideally it would mix regularization samples and normal training samples within the same batch. "It does not require a regularization caption" I don't think this is strictly true. You need some kind of conditioning for the model. Not conditioning the model at all will probably significantly reduce the effect of this training method.

It isn't obvious that this would work without captions, but it does. You can see samples in the reddit link above. The right-most column is without captions.

> What do you think about adding a new flag to concepts that toggles this loss calculation for specific training samples? Then the user can decide whether to include captions or not, and which images to use.

Yes, agreed. Beyond captions, there are more use cases in favor of having it as a separate concept, for example balancing the regularisation using the number of repeats. In some of my tests, a 1:1 ratio was too much.

@bghira has also found, using his implementation in SimpleTuner, that even though it works with no external data, it works better with high-quality external data.

dxqbYD (Contributor, Author) commented Oct 20, 2024

> basing it on numerous tests we've run on a cluster of H100s over the last week

Okay, thanks. Any theory on why that would be? I don't see a theoretical reason for your finding that it works better in a separate batch:

  • reg gradients are tiny.
  • the regularisation described in the DreamBooth paper was always implemented in the same batch in the early scripts.
  • you could even argue that this type of contrastive training should work better in the same batch.

O-J1 (Collaborator) commented Oct 21, 2024

> basing it on numerous tests we've run on a cluster of H100s over the last week

Could you please provide some evidence of this? I.e. a large enough number of samples that you aren't falling victim to seed RNG.

It's important to get this right.

dxqbYD (Contributor, Author) commented Oct 21, 2024

>> basing it on numerous tests we've run on a cluster of H100s over the last week

> Could you please provide some evidence of this? I.e. a large enough number of samples that you aren't falling victim to seed RNG.

> It's important to get this right.

If this turns out to be right, I'd recommend implementing a feature in the OT concepts like:
"try to keep this concept separate from concept Y in batches"
and
"try to combine this concept with concept Y in batches"

It would influence how the batches are built, and the first option would be how ST builds batches.

This could be a useful feature on its own. For example, if you train 2 concepts, it can be beneficial to have 1 image of each concept in a batch, instead of the same concept twice, especially if the images in a concept are very similar.
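As a rough illustration of the "combine" option (just a sketch, nothing OneTrainer-specific): a batch builder could round-robin over concepts so that a batch mixes different concepts instead of repeating the same one.

```python
import random
from collections import defaultdict

def build_batches(samples, batch_size):
    """samples: list of (concept_name, sample) pairs."""
    pools = defaultdict(list)
    for concept, sample in samples:
        pools[concept].append(sample)
    for pool in pools.values():
        random.shuffle(pool)

    batches, batch = [], []
    while any(pools.values()):
        # Take at most one sample per concept before revisiting a concept.
        for concept in list(pools):
            if pools[concept]:
                batch.append((concept, pools[concept].pop()))
                if len(batch) == batch_size:
                    batches.append(batch)
                    batch = []
    if batch:
        batches.append(batch)
    return batches
```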

bghira commented Oct 21, 2024

I don't have time, sorry; do it however works best for your codebase.

@DriveHabits

Any update on this, @dxqbYD?

dxqbYD (Contributor, Author) commented Nov 13, 2024

> Any update on this, @dxqbYD?

Nothing usable for OneTrainer users yet.
There have been more interesting experiments beyond just preserving the prior knowledge of a separate prompt as above: it appears this can also be very useful when training a concept, to control what you don't want it to learn. The concept can then be mixed in by prompting, and even mixing with other independently trained LoRAs seems to work better that way.

I should mention that a paper proposing this technique was apparently published in April of this year; I just didn't know about it: https://arxiv.org/pdf/2404.07554
The authors pointed this out on the PR for kohya's implementation. They coined the term "Contrastive Adapter Training".

@FurkanGozukara

@dxqbYD so do we have it in kohya at the moment? I couldn't find it.
