Running out of memory... best number of samples for custom data sets? #8

pgtinsley opened this issue Mar 19, 2024 · 6 comments

@pgtinsley

pgtinsley commented Mar 19, 2024

Hello!

I was wondering if you have any intuition on how many training samples are required to get good results, and how much memory is required to train the unconditional VQVAE?

I have about 200k grayscale images at 256x256, which was obviously too much, so I scaled back to 70 images just to see if it would start training, but it didn't; it still throws an out-of-memory error.

Is this something batch size can fix, or do I need to adjust a bunch of other parameters? I only changed the im_channels and save_latents parameters from their defaults.

Thank you!

@explainingai-code
Owner

Hello @pgtinsley,
The out-of-memory error would not have anything to do with the number of images in your dataset. You can keep it at 200K, but if training is taking too long and you want to speed it up, you can try training on 50K images (though it depends on how much variation there is between the images).
Could you tell me what GPU you are using?
The parameters you would essentially play with to get rid of this error are batch size, down/mid channels, and the number of down/mid/up layers.
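To give a rough sense of why those are the right knobs: activation memory for a single convolutional feature map grows linearly with batch size, channel count, and spatial resolution. A back-of-the-envelope sketch in plain Python (illustrative only, not code from this repo; a real training step also stores activations for every layer plus gradients and optimizer state, so actual totals are far larger). For example, a 64-channel feature map at 256x256 resolution:

# Rough memory estimate for one float32 feature map of shape (N, C, H, W).
def feature_map_megabytes(batch_size, channels, height, width, bytes_per_elem=4):
    return batch_size * channels * height * width * bytes_per_elem / 1024**2

for bs in (4, 2, 1):
    mb = feature_map_megabytes(bs, 64, 256, 256)
    print(f"batch {bs}: ~{mb:.0f} MB for a single 64x256x256 activation map")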

@pgtinsley
Author

Hi @explainingai-code ,
I'm using some older hardware -- 4 GTX Titan X's. Could that also be a problem?
Thank you!
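For reference, a GTX Titan X has 12 GB of memory per card. A quick, generic way to confirm what PyTorch sees on each card, using only standard torch.cuda calls (nothing specific to this repo):

import torch

# Print the name and total memory of every visible CUDA device.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i} {props.name}: {props.total_memory / 1024**3:.1f} GB total")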

@pgtinsley
Author

Here is the config file:

dataset_params:
  im_path: 'data/combined_cropped_256x256'
  im_channels : 1
  im_size : 256
  name: 'combined_cropped_256x256'

diffusion_params:
  num_timesteps : 1000
  beta_start : 0.0015
  beta_end : 0.0195

ldm_params:
  down_channels: [ 256, 384, 512, 768 ]
  mid_channels: [ 768, 512 ]
  down_sample: [ True, True, True ]
  attn_down : [True, True, True]
  time_emb_dim: 512
  norm_channels: 32
  num_heads: 16
  conv_out_channels : 128
  num_down_layers : 2
  num_mid_layers : 2
  num_up_layers : 2

autoencoder_params:
  z_channels: 3
  codebook_size : 8192
  down_channels : [64, 128, 256, 256]
  mid_channels : [256, 256]
  down_sample : [True, True, True]
  attn_down : [False, False, False]
  norm_channels: 32
  num_heads: 4
  num_down_layers : 2
  num_mid_layers : 2
  num_up_layers : 2


train_params:
  seed : 1111
  task_name: 'combined_cropped_256x256'
  ldm_batch_size: 16
  autoencoder_batch_size: 4
  disc_start: 15000
  disc_weight: 0.5
  codebook_weight: 1
  commitment_beta: 0.2
  perceptual_weight: 1
  kl_weight: 0.000005
  ldm_epochs: 100
  autoencoder_epochs: 20
  num_samples: 1
  num_grid_rows: 1
  ldm_lr: 0.000005
  autoencoder_lr: 0.00001
  autoencoder_acc_steps: 4
  autoencoder_img_save_steps: 64
  save_latents : True
  vae_latent_dir_name: 'vae_latents'
  vqvae_latent_dir_name: 'vqvae_latents'
  ldm_ckpt_name: 'ddpm_ckpt.pth'
  vqvae_autoencoder_ckpt_name: 'vqvae_autoencoder_ckpt.pth'
  vae_autoencoder_ckpt_name: 'vae_autoencoder_ckpt.pth'
  vqvae_discriminator_ckpt_name: 'vqvae_discriminator_ckpt.pth'
  vae_discriminator_ckpt_name: 'vae_discriminator_ckpt.pth'
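As a side note on this config: the latent resolution follows from im_size and the autoencoder's down_sample flags, with the spatial size halved once per True entry. A small standalone sketch of that arithmetic (plain Python, not part of the repo):

# Latent spatial size: im_size halved once for each True in down_sample.
def latent_size(im_size, down_sample):
    for do_downsample in down_sample:
        if do_downsample:
            im_size //= 2
    return im_size

# Values from the config above (dataset_params / autoencoder_params).
side = latent_size(256, [True, True, True])
print(f"latent shape per image: 3 x {side} x {side}")  # 3 x 32 x 32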

@explainingai-code
Owner

explainingai-code commented Mar 19, 2024

Yeah.
But that's fine; let's first try to reduce the memory required without reducing the network parameters.
Can you try changing these two parameters:
autoencoder_batch_size: 2
autoencoder_acc_steps: 8

and see if it runs. If not, then also try:
autoencoder_batch_size: 1
autoencoder_acc_steps: 16

Just to add: when I was training on CelebHQ with 256x256 RGB images, I was using an Nvidia V100.
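For context on why the batch size and accumulation steps move together: gradient accumulation keeps the effective batch size at batch_size × acc_steps (4×4, 2×8, and 1×16 all give 16) while only one micro-batch is held in GPU memory at a time. A minimal, generic PyTorch sketch of the pattern (the repo's actual training loop may differ):

def train_autoencoder_epoch(model, loader, optimizer, loss_fn, acc_steps, device="cuda"):
    # Generic gradient-accumulation loop; assumes `loader` yields image batches
    # and `model` returns reconstructions of its input.
    model.train()
    optimizer.zero_grad()
    for step, images in enumerate(loader):
        images = images.to(device)
        recon = model(images)
        # Scale the loss so the accumulated gradient averages over micro-batches.
        loss = loss_fn(recon, images) / acc_steps
        loss.backward()
        # Step only every `acc_steps` micro-batches:
        # effective batch size = loader batch size * acc_steps.
        if (step + 1) % acc_steps == 0:
            optimizer.step()
            optimizer.zero_grad()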

@pgtinsley
Author

Still no luck with either of those options... I'll try to get some GPUs with more memory... thank you!

@explainingai-code
Owner

@pgtinsley, yes, that should solve this problem.
A couple of other things that you can try in case you are unable to get a higher-memory GPU:
num_down_layers : 1 (instead of 2 in autoencoder_params)
First, maybe just try this with (autoencoder_batch_size: 1, autoencoder_acc_steps: 16).

Then, lastly, modify the downsample parameter and have the autoencoder work with 128x128 images:
down_sample : [True, True, False] (instead of [True, True, True] in autoencoder_params)
im_size : 128 (instead of 256 in dataset_params)
This will build an autoencoder that takes 128x128 images to 32x32 latent images.
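As a quick check of that arithmetic: with down_sample : [True, True, False] only two stages halve the resolution, so 128 / 2 / 2 = 32, matching the 32x32 latent size above (the same halving-per-True rule as in the earlier latent-size sketch).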
