-
Describe the bug
Hello, I am experimenting with multiple models for binary classification on a dataset of 330x330 images, using a Tesla T4 with 15 GB of memory. For models like EfficientAD or FastFlow, training runs without problems, but with PatchCore I only get to 77% of an epoch before hitting an out-of-memory error. It does not matter which batch size I use; the GPU memory in use grows with each batch. What could the issue be?

Dataset
Other (please specify in the text field below)

Model
PatchCore

Steps to reproduce the behavior
OS information
OS information:
Expected behavior
Run the training epoch without error

Screenshots
No response

Pip/GitHub
pip

What version/branch did you use?
No response

Configuration YAML
datamodule = Folder(
name="mvtec_ad",
normal_dir="",
abnormal_dir="",
normal_test_dir ="",
task="classification",
num_workers=0,
#image_size=(330,330),
train_batch_size=1,
eval_batch_size = 32,
)
#######
engine = Engine(
normalization="min_max", #none
threshold="F1AdaptiveThreshold",
task=TaskType.CLASSIFICATION,
image_metrics=["OptimalF1","F1AdaptiveThreshold","AUROC","BinaryPrecision","BinaryRecall","F1Score"],
accelerator="auto",
check_val_every_n_epoch=1,
devices="auto",
#num_nodes = 4,
max_epochs=30,
num_sanity_val_steps=0,
val_check_interval=1.0,
logger=[logger_tensorboard,logger_comet],
log_every_n_steps = 10,
callbacks = callbacks,
limit_test_batches = 0.5
)
#####
callbacks = [
    ModelCheckpoint(
        mode="max",
        monitor="image_F1Score",
    ),
    _VisualizationCallback(
        visualizers=visualizers,
        root="images",
        save=False,
        log=True,
        show=False,
    ),
]

Logs
File "/home/sagemaker-user/anomalib/run_training_singlerun.py", line 176, in <module>
engine.fit(datamodule=datamodule, model=model)
File "/home/sagemaker-user/anomalib/src/anomalib/engine/engine.py", line 518, in fit
self.trainer.fit(model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
call._call_and_handle_interrupt(
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 989, in _run
results = self._run_stage()
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1035, in _run_stage
self.fit_loop.run()
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 202, in run
self.advance()
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 359, in advance
self.epoch_loop.run(self._data_fetcher)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 136, in run
self.advance(data_fetcher)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 240, in advance
batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 187, in run
self._optimizer_step(batch_idx, closure)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 265, in _optimizer_step
call._call_lightning_module_hook(
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 157, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/core/module.py", line 1291, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/core/optimizer.py", line 151, in step
step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 230, in optimizer_step
return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/plugins/precision/precision.py", line 117, in optimizer_step
return optimizer.step(closure=closure, **kwargs)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/torch/optim/optimizer.py", line 373, in wrapper
out = func(*args, **kwargs)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/core/optimizer.py", line 406, in step
return closure()
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/plugins/precision/precision.py", line 104, in _wrap_closure
closure_result = closure()
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 140, in __call__
self._result = self.closure(*args, **kwargs)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 126, in closure
step_output = self._step_fn()
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 315, in _training_step
training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 309, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 382, in training_step
return self.lightning_module.training_step(*args, **kwargs)
File "/home/sagemaker-user/anomalib/src/anomalib/models/image/patchcore/lightning_model.py", line 82, in training_step
embedding = self.model(batch["image"])
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sagemaker-user/anomalib/src/anomalib/models/image/patchcore/torch_model.py", line 80, in forward
embedding = self.generate_embedding(features)
File "/home/sagemaker-user/anomalib/src/anomalib/models/image/patchcore/torch_model.py", line 121, in generate_embedding
embeddings = torch.cat((embeddings, layer_embedding), 1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacty of 14.58 GiB of which 3.62 MiB is free. Process 27360 has 14.57 GiB memory in use. Of the allocated memory 13.99 GiB is allocated by PyTorch, and 468.78 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
-
PatchCore is a very memory-hungry model, so you may need to reduce the image size or the number of training samples. Also, unrelated to this problem: PatchCore is not trained for multiple epochs, so set max_epochs to 1 when using it.
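A minimal sketch of those two suggestions, assuming your anomalib build still accepts the commented-out image_size argument on Folder and that max_epochs is forwarded from the Engine to the Lightning Trainer (both are assumptions; adjust to your version):

```python
# Sketch only: image_size on Folder is assumed from the commented-out line in the
# config above; recent anomalib versions may handle resizing differently.
from anomalib.data import Folder
from anomalib.engine import Engine
from anomalib.models import Patchcore

datamodule = Folder(
    name="mvtec_ad",
    normal_dir="",              # fill in as in the original config
    abnormal_dir="",
    normal_test_dir="",
    task="classification",
    image_size=(256, 256),      # smaller than the native 330x330 to shrink the feature maps
    train_batch_size=1,
    eval_batch_size=32,
)

engine = Engine(max_epochs=1)   # PatchCore needs only a single pass over the training data
engine.fit(model=Patchcore(), datamodule=datamodule)
```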
-
Thank you for the hint! Unfortunately, I have to keep the resolution and sample size the same, but I do have access to bigger machines. The problem is that AWS instances like the ml.g5.12xlarge have 4 GPUs with 23 GB of memory each, and I have not managed to enable distributed training in Anomalib. I have tried using the strategy argument of the Engine and wrapping the torch model inside lightning_model.py with torch.nn.DataParallel, but I keep getting errors because the library tries to access attributes like "learning_type" which the DataParallel wrapper does not have. Is there any easy way to train on all GPUs?
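For context, the two attempts described above look roughly like this sketch (neither works as-is; the DataParallel wrap is what triggers the missing "learning_type" attribute errors mentioned):

```python
# Attempt 1 (sketch): pass a multi-GPU strategy through the Engine, which forwards
# these keyword arguments to the underlying Lightning Trainer.
from anomalib.engine import Engine

engine = Engine(
    accelerator="gpu",
    devices=4,        # all four GPUs of the ml.g5.12xlarge
    strategy="ddp",
    max_epochs=1,
)

# Attempt 2 (sketch): wrap the inner torch model in DataParallel inside
# lightning_model.py. DataParallel does not forward custom attributes such as
# "learning_type", which is why the library then fails to find them.
# self.model = torch.nn.DataParallel(self.model)
```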
-
@udarnicus, you could check issue #1449 for multi-GPU training. Regarding the PatchCore memory issue, you could consider reducing the size of the memory bank, or changing the layers used for feature extraction to ones with smaller dimensions. For more details, see the arguments in anomalib/src/anomalib/models/image/patchcore/lightning_model.py, lines 31 to 35 (commit aee41f2).
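A minimal sketch of the memory-bank suggestion, assuming the arguments at those lines are the backbone, layers, pre_trained, coreset_sampling_ratio and num_neighbors parameters of recent anomalib releases:

```python
from anomalib.models import Patchcore

# Sketch: shrink the memory bank by keeping a smaller coreset and extracting
# features from a single deeper layer (smaller spatial resolution, fewer patches).
# Defaults and exact argument names depend on the anomalib version.
model = Patchcore(
    backbone="wide_resnet50_2",
    layers=["layer3"],            # instead of the default ["layer2", "layer3"]
    pre_trained=True,
    coreset_sampling_ratio=0.01,  # keep fewer patch embeddings than the 0.1 default
    num_neighbors=9,
)
```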