Very low avg loss on LoRA training / Trying to understand TensorBoard logs #237

Kalerindel asked this question in Q&A · Unanswered
I was wondering if someone could please have a look at my settings and logs, as I'm having trouble interpreting my training results.

Default settings/setup were as follows:
CLI command

```
--num_cpu_threads_per_process 8 train_network.py \
  --pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5 \
  --train_data_dir=/training_data/testproject_v1 \
  --logging_dir=/output/lora/testproject_v1/log/ \
  --output_dir=/output/lora/testproject_v1 \
  --output_name=testproject_v1 \
  --caption_extension=.txt \
  --unet_lr=0.0001 \
  --text_encoder_lr=0.00005 \
  --max_train_epochs=1 \
  --network_dim=128 \
  --network_alpha=128 \
  --resolution=512,512 \
  --train_batch_size=4 \
  --gradient_accumulation_steps=1 \
  --save_every_n_epochs=1 \
  --enable_bucket \
  --bucket_reso_steps=64 \
  --random_crop \
  --optimizer_type=AdamW8bit \
  --xformers \
  --mixed_precision=fp16 \
  --save_precision=fp16 \
  --save_model_as=safetensors \
  --clip_skip=1 \
  --lr_scheduler=cosine_with_restarts \
  --seed=1234 \
  --network_module=networks.lora
```
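One setting worth flagging when comparing loss curves against guides (an editorial aside, not something from the thread): as I understand kohya's networks.lora, the learned update is scaled by network_alpha / network_dim, so alpha effectively acts as a learning-rate multiplier. With the command above that scale works out to 1.0, while many guides train with alpha = dim / 2, which can produce visibly different curves at the same learning rate. A minimal sketch of the arithmetic, using only the values from the command:

```python
# LoRA adds its learned update scaled by alpha / dim, so alpha effectively
# rescales the learning rate of the LoRA weights.
network_dim = 128
network_alpha = 128
scale = network_alpha / network_dim  # 1.0 here; 0.5 with the common alpha=dim/2
print(f"effective LoRA scale: {scale}")
```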
[Graph: first 50 steps, showing a sharp drop]

[Graph: full run]

[Graph: run #15 at 1500 steps]
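For reference, the numbers behind graphs like these can be read straight out of the event files that land in --logging_dir, which makes runs easier to compare than eyeballing the UI. A minimal sketch using TensorBoard's EventAccumulator; the scalar tag name is an assumption (sd-scripts versions log different tags), so the sketch prints the available tags first:

```python
# Minimal sketch: read loss scalars out of a TensorBoard event file.
# Assumes the `tensorboard` package is installed; the scalar tag
# ("loss/average") is a guess -- check the printed tags to see what
# your sd-scripts version actually logs.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

log_dir = "/output/lora/testproject_v1/log/"  # --logging_dir from the command above
acc = EventAccumulator(log_dir)
acc.Reload()

print(acc.Tags()["scalars"])  # list the scalar tags that were actually logged

events = acc.Scalars("loss/average")  # assumed tag name; adjust to your logs
for e in events[:10]:
    print(e.step, e.value)
```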

Notes from the tests:

In YT videos and misc guides I keep seeing people land just under 0.2 avg loss, and their logs make sense for identifying the rapid-learn phase turning into churn and then into frying. I'm struggling to make sense of my own logs, since the margin between a decent result and something completely unusable seems to be such a small number. For instance:

I was trying to take run #9 or #12 and push the avg loss higher to see what the model looks like, but I can't figure out how to do that without reducing steps/repeats so training stops earlier, which seems to go against all the guides/recommendations. Is there a way to reduce the drop after the rapid-learning phase, or am I going about this the wrong way?

I've tried reinstalling, and I've noticed the same kind of results across different datasets.
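One caveat when comparing avg-loss numbers against the ~0.2 figures in guides: TensorBoard's UI applies an exponential-moving-average smoothing slider, so the curve in a screenshot depends on that setting as much as on the run itself. A minimal sketch of the same kind of smoothing, useful for putting two runs on equal footing (the 0.9 weight is a typical slider value, not something from this thread):

```python
def ema_smooth(values, weight=0.9):
    """Exponential moving average, similar to TensorBoard's smoothing slider.

    weight=0 reproduces the raw curve; values close to 1 flatten it, which
    can make two runs with very different raw losses look deceptively alike.
    """
    smoothed = []
    last = values[0]
    for v in values:
        last = weight * last + (1 - weight) * v
        smoothed.append(last)
    return smoothed

# Example: compare a raw loss trace against its smoothed version.
raw = [0.35, 0.30, 0.12, 0.05, 0.04, 0.05, 0.03]
print(ema_smooth(raw))
```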
Thanks!
Replies: 2 comments

- I like your approach. Did you find some answers?

- I've been reading up a lot, though I've been having limited success on SDXL. I did find this YT video particularly useful: https://www.youtube.com/watch?v=wJX4bBtDr9Y