
Estimated Number of Epochs Required for Training #9

Closed
WeiyunJiang opened this issue May 14, 2024 · 6 comments

@WeiyunJiang

Hi Zheng and Mengqi,

Thank you for your amazing work and for making your code public! I wonder if you could kindly provide some insights on my experiments below.

I am training on a truncated CelebA dataset, which has only 5000 256x256 images. I am using the CLIP embedding, and my batch size is 96.

  1. How many epochs would it take for train.py to converge?
  2. How many epochs would it take for train_latent.py to converge?
  3. Should I run train.py or train_latent.py for longer when the generated images have grid-like artifacts?
@WeiyunJiang (Author) commented May 15, 2024

For train.py, I am at 12M steps (batch size of 96 × number of iterations), and the sampled training images look pretty good:
Sampled: [image]

Real: [image]

However, the unconditionally generated images do not make sense. For train_latent.py, I am at 36M steps (batch size of 1 × number of iterations). I used a batch size of 96 (same as the original GitHub code).
Loss: [image]

Testing: [images]
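As a rough sanity check, the step counts quoted above can be translated into epochs over the 5000-image subset as follows (a back-of-the-envelope sketch only, using the numbers quoted in this comment):

# Rough epoch count implied by the quoted "steps" (batch size x iterations = images seen)
images_seen_train = 12_000_000          # 12M "steps" quoted for train.py
dataset_size = 5_000                    # truncated CelebA subset
epochs_train = images_seen_train / dataset_size
print(epochs_train)                     # 2400.0 passes over the truncated dataset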

@WeiyunJiang (Author) commented May 15, 2024

I did not intend to close the issue. Any suggestions or insights would be appreciated! Thank you! @zh-ding @Mq-Zhang1

@zh-ding (Contributor) commented May 15, 2024

Hello, thanks for your interest. Can you share more details on how you train the latent stage? It looks a bit odd to me, since the training and testing images should look similar. I can see from your results that the latent training stage doesn't learn the data's latent distribution at all, so I'm wondering if there is a mismatch somewhere in this process.

BTW, why is the batch size only 1 for the latent training?

@WeiyunJiang (Author)

Hi @zh-ding,

Thank you so much for the prompt reply. Yes, I agree that the latent code was not sampled properly to match the training latent distribution. Sorry for the confusion: I did not use a batch size of 1. The batch size during the latent stage was 128 (the same as in the original GitHub code).

For the latent training stage, I set model_path to the last checkpoint from the first training stage, using the following command:
python train_latent.py --model_path ./checkpoints/exp_clip/last.ckpt --name train_latent_clip

I keep most of the code the same. However, I did modify the code at line 432 of experiment.py to suppress the error mentioned in #8 (comment):

# In latent training mode there is no image batch, so skip the image indexing
# (and the image logging below) instead of reading from `batch`.
if self.conf.train_mode.require_dataset_infer():
    imgs = None
    idx = None
else:
    imgs = batch['img']
    idx = batch["index"]

    # log sample images only when an image batch is actually available
    self.log_sample(x_start = imgs, step = self.global_step, idx = idx)

After this modification, TensorBoard no longer logs images during the latent stage.
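(An alternative to moving the call would be an early-return guard inside log_sample itself, so TensorBoard image logging still works for the non-latent stages; a minimal sketch, assuming log_sample's keyword arguments match the call shown above:)

def log_sample(self, x_start, step, idx):    # signature as implied by the call above
    # Hypothetical guard: in latent training there is no image batch, so skip
    # logging quietly instead of failing on x_start=None.
    if x_start is None:
        return
    # ... rest of the original method unchanged ...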

Thus, I used test.py for inference to generate the testing images shown earlier.
The command I used for testing is as follows:
python test.py --batch_size 1 --patch_size 64 --output_dir ./output_images_clip --image_size 256x256 --img_num 5 --full_path ./checkpoints/train_latent_clip/last.ckpt

P.S. I wonder if the number of training images is simply not enough for the latent code sampler to learn the latent distribution. I only used 5000 256x256 CelebA images for training. Have you ever trained with around 5000 images and had any luck? In your paper, I believe the smallest dataset you used (Nature 21K) has 21K images.

Please let me know if you have any insights or suggestions. I really appreciate it, thank you!

@Mq-Zhang1 (Collaborator)

Hi @WeiyunJiang,

Thank you for all the details provided!

The problem may be caused by conf.latent_znormalize being set to True, so the model learns a normalized latent distribution instead of the original one. Adding conf.latent_znormalize = False in train_latent.py should fix the problem.
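For concreteness, a minimal sketch of where that override could go (the config constructor name below is a placeholder; use whatever config object train_latent.py actually builds in your copy):

conf = build_latent_config()         # hypothetical helper standing in for the script's own config setup
conf.latent_znormalize = False       # learn the raw latent distribution rather than the z-normalized one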
We didn't experiment on smaller datasets. The approximate number of training epochs for Nature 21K is 1882, but we could already observe reasonable results at around epoch 235 (with the batch size set to 256).
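(For reference, a quick back-of-the-envelope conversion of those epoch counts into iterations at batch size 256, assuming roughly 21,000 images as stated in the paper:)

dataset_size = 21_000
batch_size = 256
iters_per_epoch = dataset_size / batch_size      # ~82 iterations per epoch
print(235 * iters_per_epoch)                     # ~19.3k iterations for first reasonable samples
print(1882 * iters_per_epoch)                    # ~154k iterations for the full ~1882-epoch schedule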

I will fix this normalization config issue for latent training and sampling right away to make it clearer. Thanks for raising this! Please feel free to let me know if the results still look weird.

@WeiyunJiang (Author)

Hi @Mq-Zhang1 and @zh-ding ,

Thanks again for the prompt response. After the fix, it works like a charm. THANK YOU! Fantastic work! :)
