-
Describe the bug
Hello, I am experimenting with multiple models for binary classification on a dataset of 330x330 images, using a Tesla T4 with 15 GB of memory. For models like EfficientAD or FastFlow, training runs without problems, but with PatchCore I only get to 77% of an epoch before hitting an out-of-memory error. It does not matter which batch size I use; the GPU memory in use grows with each batch. What could the issue be?

Dataset
Other (please specify in the text field below)

Model
PatchCore

Steps to reproduce the behavior
OS information
OS information:
Expected behavior
Run the training epoch without error

Screenshots
No response

Pip/GitHub
pip

What version/branch did you use?
No response

Configuration YAML
datamodule = Folder(
name="mvtec_ad",
normal_dir="",
abnormal_dir="",
normal_test_dir ="",
task="classification",
num_workers=0,
#image_size=(330,330),
train_batch_size=1,
eval_batch_size = 32,
)
#######
engine = Engine(
normalization="min_max", #none
threshold="F1AdaptiveThreshold",
task=TaskType.CLASSIFICATION,
image_metrics=["OptimalF1","F1AdaptiveThreshold","AUROC","BinaryPrecision","BinaryRecall","F1Score"],
accelerator="auto",
check_val_every_n_epoch=1,
devices="auto",
#num_nodes = 4,
max_epochs=30,
num_sanity_val_steps=0,
val_check_interval=1.0,
logger=[logger_tensorboard,logger_comet],
log_every_n_steps = 10,
callbacks = callbacks,
limit_test_batches = 0.5
)
#####
callbacks = [
    ModelCheckpoint(
        mode="max",
        monitor="image_F1Score",
    ),
    _VisualizationCallback(
        visualizers=visualizers,
        root="images",
        save=False,
        log=True,
        show=False,
    ),
]

Logs
File "/home/sagemaker-user/anomalib/run_training_singlerun.py", line 176, in <module>
engine.fit(datamodule=datamodule, model=model)
File "/home/sagemaker-user/anomalib/src/anomalib/engine/engine.py", line 518, in fit
self.trainer.fit(model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
call._call_and_handle_interrupt(
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 989, in _run
results = self._run_stage()
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1035, in _run_stage
self.fit_loop.run()
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 202, in run
self.advance()
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 359, in advance
self.epoch_loop.run(self._data_fetcher)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 136, in run
self.advance(data_fetcher)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 240, in advance
batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 187, in run
self._optimizer_step(batch_idx, closure)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 265, in _optimizer_step
call._call_lightning_module_hook(
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 157, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/core/module.py", line 1291, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/core/optimizer.py", line 151, in step
step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 230, in optimizer_step
return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/plugins/precision/precision.py", line 117, in optimizer_step
return optimizer.step(closure=closure, **kwargs)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/torch/optim/optimizer.py", line 373, in wrapper
out = func(*args, **kwargs)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/core/optimizer.py", line 406, in step
return closure()
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/plugins/precision/precision.py", line 104, in _wrap_closure
closure_result = closure()
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 140, in __call__
self._result = self.closure(*args, **kwargs)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 126, in closure
step_output = self._step_fn()
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 315, in _training_step
training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 309, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 382, in training_step
return self.lightning_module.training_step(*args, **kwargs)
File "/home/sagemaker-user/anomalib/src/anomalib/models/image/patchcore/lightning_model.py", line 82, in training_step
embedding = self.model(batch["image"])
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sagemaker-user/.conda/envs/anomalib_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sagemaker-user/anomalib/src/anomalib/models/image/patchcore/torch_model.py", line 80, in forward
embedding = self.generate_embedding(features)
File "/home/sagemaker-user/anomalib/src/anomalib/models/image/patchcore/torch_model.py", line 121, in generate_embedding
embeddings = torch.cat((embeddings, layer_embedding), 1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacty of 14.58 GiB of which 3.62 MiB is free. Process 27360 has 14.57 GiB memory in use. Of the allocated memory 13.99 GiB is allocated by PyTorch, and 468.78 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
-
PatchCore is a very memory-hungry model, so you may need to reduce the image size or the number of training samples. Also, unrelated to this problem: PatchCore is not trained for multiple epochs, so set max_epochs to 1 when using it.
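A minimal sketch of those two suggestions, assuming your anomalib build still accepts the commented-out image_size argument on Folder and that max_epochs is forwarded from the Engine to the Lightning Trainer (both are assumptions; adjust to your version):

```python
# Sketch only: image_size on Folder is assumed from the commented-out line in the
# config above; recent anomalib versions may handle resizing differently.
from anomalib.data import Folder
from anomalib.engine import Engine
from anomalib.models import Patchcore

datamodule = Folder(
    name="mvtec_ad",
    normal_dir="",              # fill in as in the original config
    abnormal_dir="",
    normal_test_dir="",
    task="classification",
    image_size=(256, 256),      # smaller than the native 330x330 to shrink the feature maps
    train_batch_size=1,
    eval_batch_size=32,
)

engine = Engine(max_epochs=1)   # PatchCore needs only a single pass over the training data
engine.fit(model=Patchcore(), datamodule=datamodule)
```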
-
Thank you for the hint! Unfortunately, I have to keep the resolution and sample size the same, but I do have access to bigger machines. The problem is that AWS instances like the ml.g5.12xlarge have 4 GPUs with 23 GB of memory each, and I have not managed to enable distributed training in Anomalib. I have tried using the strategy argument of the Engine and wrapping the torch model inside lightning_model.py with torch.nn.DataParallel, but I keep getting errors because the library tries to access attributes like "learning_type" which the DataParallel wrapper does not have. Is there any easy way to train on all GPUs?
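For context, the two attempts described above look roughly like this sketch (neither works as-is; the DataParallel wrap is what triggers the missing "learning_type" attribute errors mentioned):

```python
# Attempt 1 (sketch): pass a multi-GPU strategy through the Engine, which forwards
# these keyword arguments to the underlying Lightning Trainer.
from anomalib.engine import Engine

engine = Engine(
    accelerator="gpu",
    devices=4,        # all four GPUs of the ml.g5.12xlarge
    strategy="ddp",
    max_epochs=1,
)

# Attempt 2 (sketch): wrap the inner torch model in DataParallel inside
# lightning_model.py. DataParallel does not forward custom attributes such as
# "learning_type", which is why the library then fails to find them.
# self.model = torch.nn.DataParallel(self.model)
```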
-
@udarnicus, you could check issue #1449 for multi-GPU training. Regarding the PatchCore memory issue, you could consider reducing the size of the memory bank, or changing the layers used for feature extraction to ones with smaller dimensions. For more details, see the arguments in anomalib/src/anomalib/models/image/patchcore/lightning_model.py, lines 31 to 35 (commit aee41f2).
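A minimal sketch of the memory-bank suggestion, assuming the arguments at those lines are the backbone, layers, pre_trained, coreset_sampling_ratio and num_neighbors parameters of recent anomalib releases:

```python
from anomalib.models import Patchcore

# Sketch: shrink the memory bank by keeping a smaller coreset and extracting
# features from a single deeper layer (smaller spatial resolution, fewer patches).
# Defaults and exact argument names depend on the anomalib version.
model = Patchcore(
    backbone="wide_resnet50_2",
    layers=["layer3"],            # instead of the default ["layer2", "layer3"]
    pre_trained=True,
    coreset_sampling_ratio=0.01,  # keep fewer patch embeddings than the 0.1 default
    num_neighbors=9,
)
```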