CUDA Out of Memory Errors w Batch Size of 1 on 16GB V100 #27

Open
jordanparker6 opened this issue Aug 14, 2022 · 4 comments
Labels: question (Further information is requested)

Comments

@jordanparker6

Using the default FeatureExtractor settings for the HuggingFace port of YOLOS, I am consistently running into CUDA OOM errors on a 16GB V100 (even with a training batch size of 1).

I would like to train YOLOS on publaynet and ideally use 4-8 V100s.

Is there a way to lower CUDA memory usage while training YOLOS besides reducing the batch size (while preserving accuracy and still leveraging the pretrained models)?

I see that other models (e.g. DiT) use image sizes of 224x224. However, is it fair to assume that such a small image size would not be appropriate for object detection, since too much information is lost? In the DiT case, document image classification was the objective.
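
In case it helps anyone reproducing this, the other memory lever I can see besides batch size is the resolution the feature extractor resizes to. A minimal sketch follows; the size/max_size values are illustrative guesses, the defaults shown follow the DETR-style feature extractors, and the exact kwargs depend on your transformers version:

    from transformers import YolosFeatureExtractor

    # The defaults resize the shorter edge to 800 and cap the longer edge at 1333,
    # which is what drives activation memory up. Smaller values trade detection
    # accuracy for memory; the numbers below are only an example.
    feature_extractor = YolosFeatureExtractor.from_pretrained(
        "hustvl/yolos-base",
        size=512,      # target for the shorter edge (default 800)
        max_size=864,  # cap on the longer edge (default 1333)
    )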

jordanparker6 added the question label on Aug 14, 2022
@Yuxin-CV
Member

Hi, for the memory issue, please refer to #5 (comment)

@jordanparker6
Author

Ahh, that's great! Thank you.

@jordanparker6
Author

For those interested, I found that the HF implementation supports gradient checkpointing.

Enable it with:

    from transformers import YolosForObjectDetection

    # Inside a pytorch-lightning LightningModule's __init__:
    self.model = YolosForObjectDetection.from_pretrained(
        self.hparams.pretrained_model_name_or_path,
        config=config,
        ignore_mismatched_sizes=True,
    )
    # Trade compute for memory: re-compute activations during the backward pass.
    self.model.gradient_checkpointing_enable()

I was able to increase the batch size from 1 to 8 using this on a T4 with ddp_sharded in pytorch-lightning. It shaved about 35 minutes off each epoch, reducing the per-epoch time from 165 minutes to 130 minutes. My pytorch-lightning config:

    model:
      pretrained_model_name_or_path: "hustvl/yolos-base"
      learning_rate: 2e-5
    data:
      data_dir: "/datastores/doclaynet/images"
      train_batch_size: 8
      val_batch_size: 8
      num_workers: 4
    trainer:
      resume_from_checkpoint: null
      accelerator: "gpu"
      num_nodes: 1
      strategy: "ddp_sharded"
      max_epochs: 10
      min_epochs: 3
      max_steps: -1
      val_check_interval: 1.0
      check_val_every_n_epoch: 1
      gradient_clip_val: 1.0
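
For completeness, here is roughly how that trainer block maps onto a pytorch-lightning Trainer. This is a sketch: `devices` and `accumulate_grad_batches` are additions I did not use in the run above, the latter being how you would layer true gradient accumulation on top of checkpointing:

    import pytorch_lightning as pl

    # Sketch of the trainer section above as a Trainer object.
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=1,                   # illustrative; set to your GPU count
        num_nodes=1,
        strategy="ddp_sharded",      # shards optimizer state/gradients across ranks
        max_epochs=10,
        min_epochs=3,
        max_steps=-1,
        val_check_interval=1.0,
        check_val_every_n_epoch=1,
        gradient_clip_val=1.0,
        accumulate_grad_batches=1,   # raise this to add gradient accumulation as well
    )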

@Yuxin-CV
Member

> For those interested, I found that the HF implementation supports gradient checkpointing. [...]
Awesome! 🥰🥰🥰
