New Training Method: Masked Score Estimation #96
Replies: 9 comments 9 replies
-
That's a great idea. Would be nice to build this into a general augmentation pipeline, or even to allow user-defined masks. I wonder if we could apply it to finetune the inpainting model?
-
Also wonder how much of this could be achieved with captioning, e.g., "TOKEN wearing blue and black striped jacket" to get the model to dissociate those features from the target token.
-
Can you please describe the second formula (for M) in detail? What are S and delta, and how do we get them?
-
Throwing in a bunch of results I've shared on Twitter, just for reference. All used face segmentation.
-
For a work project I messed around with this (which is JAX, but based on this, which is PyTorch, if you are sticking with that library) to demonstrate how attention models apply to product photography in digital advertising (we found a correlation between click-to-view rates and ROI and the layout and concentration of these attention maps). I think it would be somewhat helpful here as well, both for visualizing what the attention layers are "looking at" and for auto-defining those regions as masks (I bet you could combine the concept with CLIP, such that the self-attention layers "focus" on the image areas matching a text prompt).
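A minimal, framework-agnostic sketch of the reduction hinted at above: turn an attention score matrix into a per-pixel mask by averaging the attention each spatial location receives. All names, shapes, and the threshold are my own assumptions; real diffusion U-Nets do this per head and per layer.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_to_mask(q, k, h, w, threshold=0.5):
    """Reduce attention over h*w spatial keys to a binary (h, w) mask.

    q: (n_queries, d) query projections (e.g. text-token queries).
    k: (h*w, d) key projections for the spatial locations.
    Hypothetical shapes -- a schematic of the reduction only.
    """
    # Scaled dot-product attention weights, one row per query.
    scores = softmax(q @ k.T / np.sqrt(k.shape[1]), axis=-1)  # (n_queries, h*w)
    # Average attention each location receives, reshaped to an image grid.
    amap = scores.mean(axis=0).reshape(h, w)
    amap = amap / amap.max()  # normalize to [0, 1]
    return (amap >= threshold).astype(np.float32)
```

In practice you would upsample this map to the latent resolution and feed it in as the mask M; thresholding could also be skipped to keep it soft.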
-
Awesome idea! Do we have any plan to implement this into some stable diffusion webui plugin, e.g., https://github.com/d8ahazard/sd_dreambooth_extension ? |
-
Great! Will this help speed up training, reduce memory usage, improve training quality, or all of the above?
-
Hi! I have a question regarding masked training.
-
Hi all! Can anyone point me at the process that was used to do this? I love the idea behind it and I'd love to implement it. Also, why do we have to add the mask latents onto the model_pred input? Haha, a lot to parse through...
-
Basically copy-paste from my twitter thread, but sharing here as well:
Consider yourself fine-tuning on the following two images. As a neural network, you have no idea whether the blue cloth is something intrinsic to . But what if you modified the score estimation objective to consider only the region of interest, in our case facial features?
Explicitly, we can define the region of interest via an appropriate mask function given the image x_t. In my experiments, I used the output of a face recognition model and scaled the masks so that each carries equal importance.
$$
\begin{align}
L &= \mathbb{E}_{t \sim [1, \lambda T]} \left\| \left( \epsilon_t - \epsilon_\theta(x_t, t, c) \right) \cdot M(x_t) \right\|^2 \\
M(x_t) &= \frac{S(x_t) + \delta}{s(x_t)} \\
\lambda &\in [0, 1]
\end{align}
$$
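To make the objective concrete, here is a minimal NumPy sketch of the masked loss. The function names, the default delta, and reading s(x_t) as mean-normalization (so each mask "carries equal importance") are my assumptions, not the author's code:

```python
import numpy as np

def soft_mask(seg, delta=0.1):
    """M(x_t) = (S(x_t) + delta) / s(x_t), with s(x_t) taken as the mean.

    seg: a [0, 1] segmentation score map S(x_t) (e.g. from a face model).
    delta keeps a small gradient signal outside the region of interest;
    dividing by the mean makes the mask average to 1, so every image
    carries equal total weight.
    """
    weighted = seg + delta
    return weighted / weighted.mean()

def masked_score_loss(eps_true, eps_pred, seg, delta=0.1):
    """||(eps_t - eps_theta) * M(x_t)||^2, averaged over pixels."""
    m = soft_mask(seg, delta)
    return float(np.mean(((eps_true - eps_pred) * m) ** 2))
```

In a real training loop, eps_true and eps_pred would be the sampled noise and the U-Net prediction at the latent resolution, with seg downsampled to match.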
One way to make sense of this simple trick is to see it as "identification" (or projection) with respect to an equivalence relation: if the regions of interest are the same, the images are treated as the same. Here is the ablation study. The left 4 pictures contain "school-uniformness" and "blueness" in their clothing, whereas the right 4 pictures have none of them. Seeds from 0 to 3.
Here are results with random prompts from lexica, NOT CHERRYPICKED.
Currently merged into #88.