minRF/advanced/mmdit.py, line 161 (commit 261859e)
minRF/advanced/mmdit.py, line 86 (commit 72feb0c)

Not used in the last layer; this should be moved into an `if not last` statement. Unused parameters make some distributed algos slow and sad: https://pytorch.org/docs/stable/notes/ddp.html#internal-design
Edit: Also, (unless I misread your code) you seem to only put the timestep embedding in the AdaLN scale/shift thingy, but the SD3 paper also puts a vector made from the image description in there. Did you find the former worked better?
Edit 2: Also also, did your muP optimization lead that far from a 1e-4 learning rate? Can you share the results of your hparam search?
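The `if not last` fix suggested above would look roughly like the following. This is a minimal sketch with made-up names (`Block`, `adaln`), not minRF's actual classes, and it uses an AdaLN projection purely as a stand-in for whatever the unused module at those lines is; the point is that the projection is only created in blocks that will actually use it, so DDP never has to track parameters that receive no gradient.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Hypothetical DiT-style block (illustrative only, not minRF's actual class)."""

    def __init__(self, dim: int, is_last: bool = False):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Only non-final blocks get the AdaLN projection, so the last block
        # carries no parameters that never receive a gradient.
        self.adaln = nn.Linear(dim, 2 * dim) if not is_last else None

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        if self.adaln is not None:
            # cond: (B, dim) -> shift/scale: (B, 1, dim), broadcast over tokens
            shift, scale = self.adaln(cond).unsqueeze(1).chunk(2, dim=-1)
            h = h * (1 + scale) + shift
        return x + self.mlp(h)
```

With the guard in place, wrapping the model in `torch.nn.parallel.DistributedDataParallel` works with the default `find_unused_parameters=False`; the alternative of setting it to `True` adds per-step overhead.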
> Edit: Also, (unless I misread your code) you seem to only put the timestep embedding in the AdaLN scale/shift thingy, but the SD3 paper also puts a vector made from the image description in there. Did you find the former worked better?
I just don't find the CLIP embedding useful when I run inference with it. Kinda my personal thing.
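For context on the difference the Edit asks about, here is a small sketch with hypothetical names (`CondEmbedder`, `time_mlp`, `text_mlp`), not the actual minRF or SD3 code: the SD3 recipe builds the AdaLN conditioning vector from the timestep embedding plus a pooled caption embedding, while conditioning on the timestep alone simply drops the second term.

```python
import torch
import torch.nn as nn
from typing import Optional

class CondEmbedder(nn.Module):
    """Sketch of the two conditioning choices (illustrative, not the repo's code)."""

    def __init__(self, dim: int, pooled_text_dim: int = 768, use_text: bool = True):
        super().__init__()
        self.time_mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))
        # SD3-style: also project a pooled caption embedding (e.g. a CLIP
        # pooled output) and add it to the timestep embedding.
        self.text_mlp = (
            nn.Sequential(nn.Linear(pooled_text_dim, dim), nn.SiLU(), nn.Linear(dim, dim))
            if use_text else None
        )

    def forward(self, t_emb: torch.Tensor, pooled_text: Optional[torch.Tensor] = None) -> torch.Tensor:
        cond = self.time_mlp(t_emb)
        if self.text_mlp is not None and pooled_text is not None:
            cond = cond + self.text_mlp(pooled_text)
        return cond  # this vector feeds every block's AdaLN scale/shift projection
```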
Because muP divides the global learning rate by the input dimension, it's actually more like 1e-4 in practice for the fat layers.
For biases or the input layer, it's much larger, which is the rationale behind muP.
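As a rough illustration of that scaling rule, here is a sketch with assumed grouping heuristics and placeholder numbers; it is not the `mup` package's API or this repo's actual optimizer setup. Matrix-like hidden ("fat") weights get their learning rate divided by fan-in relative to a base width, while biases and other 1-D parameters keep the larger base rate.

```python
import torch
import torch.nn as nn

def mup_style_param_groups(model: nn.Module, base_lr: float = 5e-3, base_width: int = 256):
    """Rough muP-flavoured grouping (assumed heuristics, not the mup library):
    matrix-like hidden weights get base_lr * base_width / fan_in, everything
    else (biases, norm gains, other 1-D params) keeps the base learning rate."""
    vector_like, by_fan_in = [], {}
    for p in model.parameters():
        if p.ndim >= 2:
            by_fan_in.setdefault(p.shape[-1], []).append(p)  # fan-in = last dim for Linear weights
        else:
            vector_like.append(p)
    groups = [{"params": vector_like, "lr": base_lr}]
    for fan_in, params in by_fan_in.items():
        groups.append({"params": params, "lr": base_lr * base_width / fan_in})
    return groups

# With these placeholder numbers, a layer with fan-in 1024 trains at
# 5e-3 * 256 / 1024 = 1.25e-3, while its bias stays at 5e-3.
model = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 1024))
opt = torch.optim.Adam(mup_style_param_groups(model), betas=(0.9, 0.95))
```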