-
Describe the task: Can I use GPU for training? I looked at the code and it doesn't seem to have this interface.
Acceptance Criteria:
Priority: High
Related Epic: No response
Estimated Time: No response
Current Status: Not Started
Additional Information: No response
-
Hi. The Engine is based on the Lightning Trainer, so you can pass any Trainer argument to the Engine. Keep in mind that multi-GPU training is not currently supported, but you can use a single GPU just as you would with the Trainer.
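For instance, a minimal sketch of what that looks like, assuming the Engine forwards the usual Lightning Trainer arguments such as accelerator and devices:

# Illustrative sketch: pass Lightning Trainer arguments through the Engine.
# Assumes Engine forwards accelerator/devices to the underlying Trainer.
from anomalib.data import MVTec
from anomalib.engine import Engine
from anomalib.models import Padim

datamodule = MVTec()
model = Padim()

# Single-GPU training; multi-GPU (e.g. devices=[0, 1]) is not supported yet.
engine = Engine(accelerator="gpu", devices=1)
engine.fit(datamodule=datamodule, model=model)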
-
Then you can't pass any Trainer argument to the Engine... devices=[0,1,2,3,4,5,6,7], for example. I've honestly found anomalib very difficult to use.
-
@vmiller987 can you check that you installed torch with the CUDA option? The GPU should be picked up automatically during Anomalib training if you have the correct torch. That said, please note that multi-GPU is currently not supported, but we are working on enabling it in v2. I would love to get your feedback on which parts of anomalib you find difficult to work with.
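As a quick sanity check (plain PyTorch, nothing anomalib-specific), you can verify that your torch build actually sees the GPU:

import torch

# True only if torch was built with CUDA support and a GPU driver is visible
print(torch.cuda.is_available())
# Number of GPUs torch can see
print(torch.cuda.device_count())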
-
@samet-akcay I have the correct torch. I can get it to run on one GPU, but I have to use an environment variable to assign Anomalib to a specific GPU; I can't pass it to the Engine. I am a novice when it comes to unsupervised learning. I am trying to learn, as I mainly have experience with supervised learning. Anomalib doesn't have a good place that explains its models and how they should be used. The notebooks mainly revolve around Padim, it seems. I am looking through the core papers/repos for the other models to try to understand them.
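For context, a minimal sketch of that environment-variable workaround, assuming the goal is to pin training to one specific GPU:

import os

# Restrict the GPUs visible to this process before torch/anomalib touch CUDA.
# Here only physical GPU 1 is exposed, so it appears as cuda:0 inside the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

from anomalib.data import MVTec
from anomalib.engine import Engine
from anomalib.models import Padim

engine = Engine()
engine.fit(datamodule=MVTec(), model=Padim())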
-
I'm going to retract this part. I was doing something very silly and have fixed it. I'm able to get quite a few of them to run, including GANomaly.
-
This is a known issue. I've created a PR for this, which has not been merged yet. We are also working on a better solution, where you will be able to choose the device ID or train on multiple GPUs.
-
1. Because I wanted to use the GPU, I changed it to the following code snippet: datamodule.setup()
2. Then I reinstalled CUDA and a torch build that supports GPU training.
3. Finally, this appeared.
4. I would like to ask: what would the code look like if I used GPU training correctly? And which torch version is required?
-
Hello, thank you for your reply! I found that I don't know how to use the GPU, and the torch version I downloaded seems to have problems as well. The default torch build only supports CPU training, so I installed other torch versions that support CUDA, but in the end it still doesn't work. How did you solve this problem?
-
How did you install torch? Regarding the version, anomalib has the torch requirement referenced below, and your torch version could also be one of the issues: Line 52 in 6ed0067
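One quick way to tell whether the installed wheel is a CPU-only build (an illustrative check, not anomalib-specific):

import torch

# A CPU-only wheel typically reports a version like "2.1.0+cpu"
print(torch.__version__)
# None on CPU-only builds; a CUDA version string (e.g. "12.1") on CUDA builds
print(torch.version.cuda)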
-
You don't need to specify the GPU as the accelerator, as it is picked up automatically. For example, here is the setup I tried.

Available GPU

❯ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:17:00.0 Off | N/A |
| 31% 38C P8 24W / 350W | 3062MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 Off | 00000000:65:00.0 Off | N/A |
| 30% 41C P8 18W / 350W | 283MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

Code

# Import the required modules
from anomalib.data import MVTec
from anomalib.engine import Engine
from anomalib.models import Patchcore
# Initialize the datamodule, model and engine
datamodule = MVTec()
model = Patchcore()
engine = Engine()
# Train the model
engine.fit(datamodule=datamodule, model=model)

Output

FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)
>>> # Look here to see if you have a GPU installed and are using it
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
<<<
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
/home/sakcay/.pyenv/versions/3.11.8/envs/anomalib/lib/python3.11/site-packages/lightning/pytorch/core/optimizer.py:181: `LightningModule.configure_optimizers` returned `None`, this fit will run with no optimizer
| Name | Type | Params
------------------------------------------------------------
0 | pre_processor | PreProcessor | 0
1 | post_processor | OneClassPostProcessor | 0
2 | model | PatchcoreModel | 24.9 M
3 | image_metrics | AnomalibMetricCollection | 0
4 | pixel_metrics | AnomalibMetricCollection | 0
------------------------------------------------------------
24.9 M Trainable params
0 Non-trainable params
24.9 M Total params
99.450 Total estimated model params size (MB)
Epoch 0: 0%| | 0/7 [00:00<?, ?it/s]/home/sakcay/.pyenv/versions/3.11.8/envs/anomalib/lib/python3.11/site-packages/lightning/pytorch/loops/optimization/automatic.py:132: `training_step` returned `None`. If this was on purpose, ignore this warning...
Epoch 0: 100%|██████████████████████████████████████████████| 7/7 [00:01<00:00, 4.84it/s]
Selecting Coreset Indices.: 16%|███ | 2685/16385 [00:03<00:17, 795.60it/s]
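As a side note, if you want to act on the Tensor Cores warning in the output above, a one-line sketch (it trades a little precision for speed, per the PyTorch docs linked in the warning):

import torch

# Allow lower-precision matmul on Tensor Core GPUs such as the RTX 3090
torch.set_float32_matmul_precision("high")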
-
I'm moving this to the Q&A section, as I don't think this is a bug in Anomalib but an installation issue on your end. Feel free to ask your questions there. Thanks.
Hello, thank you for your reply!
I know where the problem lies now. Initially, I installed on a computer with only a CPU, i.e. pip install anomalib. After it ran successfully, I copied the environment to a computer with a GPU, only to find that the GPU could not be used for training. Reinstalling has now resolved the issue.