(PaddleSeg-develop-hesesu) D:\melanoma\PaddleSeg-develop>python tools/train.py --config configs/segformer/segformer_b5_cityscapes_1024x512_160k.yml --save_interval 200 --do_eval --use_vdl --save_dir output
[2024/11/06 14:55:10] INFO: ------------Environment Information-------------
platform: Windows-10-10.0.19045-SP0
Python: 3.8.20 (default, Oct 3 2024, 15:19:54) [MSC v.1929 64 bit (AMD64)]
Paddle compiled with cuda: True
NVCC: Build cuda_11.1.relgpu_drvr455TC455_06.29069683_0
cudnn: 8.1
GPUs used: 1
CUDA_VISIBLE_DEVICES: None
GPU: ['GPU 0: RTX A4000']
PaddleSeg: 0.0.0.dev0
PaddlePaddle: 2.6.1
OpenCV: 4.5.5
------------------------------------------------
[2024/11/06 14:55:10] INFO: ---------------Config Information---------------
batch_size: 1
iters: 160000
train_dataset:
  dataset_root: data/heisesu
  mode: train
  num_classes: 2
  train_path: data/heisesu/train_list.txt
  transforms:
  - max_scale_factor: 2.0
    min_scale_factor: 0.5
    scale_step_size: 0.25
    type: ResizeStepScaling
  - crop_size:
    - 1024
    - 512
    type: RandomPaddingCrop
  - type: RandomHorizontalFlip
  - brightness_range: 0.4
    contrast_range: 0.4
    saturation_range: 0.4
    type: RandomDistort
  - type: Normalize
  type: Dataset
val_dataset:
  dataset_root: data/heisesu
  mode: val
  num_classes: 2
  transforms:
  - type: Normalize
  type: Dataset
  val_path: data/heisesu/test_list.txt
optimizer:
  beta1: 0.9
  beta2: 0.999
  type: AdamW
  weight_decay: 0.01
lr_scheduler:
  end_lr: 0
  learning_rate: 6.0e-05
  power: 1
  type: PolynomialDecay
loss:
  coef:
  - 1
  types:
  - type: CrossEntropyLoss
model:
  backbone:
    pretrained: https://bj.bcebos.com/paddleseg/dygraph/backbone/mix_vision_transformer_b5.tar.gz
    type: MixVisionTransformer_B5
  embedding_dim: 768
  num_classes: 2
  type: SegFormer
------------------------------------------------
[2024/11/06 14:55:10] INFO: Set device: gpu
[2024/11/06 14:55:10] INFO: Use the following config to build model
model:
  backbone:
    pretrained: https://bj.bcebos.com/paddleseg/dygraph/backbone/mix_vision_transformer_b5.tar.gz
    type: MixVisionTransformer_B5
  embedding_dim: 768
  num_classes: 2
  type: SegFormer
W1106 14:55:10.898733 38024 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.2, Runtime API Version: 11.2
W1106 14:55:10.899730 38024 gpu_resources.cc:164] device: 0, cuDNN Version: 8.1.
[2024/11/06 14:55:12] INFO: Loading pretrained model from https://bj.bcebos.com/paddleseg/dygraph/backbone/mix_vision_transformer_b5.tar.gz
[2024/11/06 14:55:12] INFO: There are 1052/1052 variables loaded into MixVisionTransformer.
[2024/11/06 14:55:13] INFO: Use the following config to build train_dataset
train_dataset:
  dataset_root: data/heisesu
  mode: train
  num_classes: 2
  train_path: data/heisesu/train_list.txt
  transforms:
  - max_scale_factor: 2.0
    min_scale_factor: 0.5
    scale_step_size: 0.25
    type: ResizeStepScaling
  - crop_size:
    - 1024
    - 512
    type: RandomPaddingCrop
  - type: RandomHorizontalFlip
  - brightness_range: 0.4
    contrast_range: 0.4
    saturation_range: 0.4
    type: RandomDistort
  - type: Normalize
  type: Dataset
[2024/11/06 14:55:13] INFO: Use the following config to build val_dataset
val_dataset:
  dataset_root: data/heisesu
  mode: val
  num_classes: 2
  transforms:
  - type: Normalize
  type: Dataset
  val_path: data/heisesu/test_list.txt
[2024/11/06 14:55:13] INFO: Use the following config to build optimizer
optimizer:
  beta1: 0.9
  beta2: 0.999
  type: AdamW
  weight_decay: 0.01
[2024/11/06 14:55:13] INFO: Use the following config to build loss
loss:
  coef:
  - 1
  types:
  - type: CrossEntropyLoss
W1106 14:55:13.307674 38024 gpu_resources.cc:299] WARNING: device: . The installed Paddle is compiled with CUDNN 8.2, but CUDNN version in your machine is 8.1, which may cause serious incompatible bug. Please recompile or reinstall Paddle with compatible CUDNN version.
C:\ProgramData\Anaconda3\envs\PaddleSeg-develop-hesesu\lib\site-packages\paddle\nn\layer\norm.py:824: UserWarning: When training, we now always track global mean and variance.
warnings.warn(
[2024/11/06 14:55:20] INFO: [TRAIN] epoch: 1, iter: 10/160000, loss: 0.5862, lr: 0.000060, batch_cost: 0.7082, reader_cost: 0.02049, ips: 1.4121 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 31:28:20
[2024/11/06 14:55:23] INFO: [TRAIN] epoch: 1, iter: 20/160000, loss: 0.6621, lr: 0.000060, batch_cost: 0.3775, reader_cost: 0.00010, ips: 2.6488 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 16:46:37
[2024/11/06 14:55:27] INFO: [TRAIN] epoch: 1, iter: 30/160000, loss: 0.5721, lr: 0.000060, batch_cost: 0.3955, reader_cost: 0.00000, ips: 2.5283 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 17:34:31
[2024/11/06 14:55:31] INFO: [TRAIN] epoch: 1, iter: 40/160000, loss: 0.4767, lr: 0.000060, batch_cost: 0.3954, reader_cost: 0.00010, ips: 2.5293 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 17:34:01
[2024/11/06 14:55:36] INFO: [TRAIN] epoch: 1, iter: 50/160000, loss: 0.5364, lr: 0.000060, batch_cost: 0.4156, reader_cost: 0.00020, ips: 2.4060 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 18:27:59
[2024/11/06 14:55:40] INFO: [TRAIN] epoch: 1, iter: 60/160000, loss: 0.5858, lr: 0.000060, batch_cost: 0.4184, reader_cost: 0.00010, ips: 2.3901 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 18:35:18
[2024/11/06 14:55:44] INFO: [TRAIN] epoch: 1, iter: 70/160000, loss: 0.3769, lr: 0.000060, batch_cost: 0.4556, reader_cost: 0.00020, ips: 2.1951 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 20:14:18
[2024/11/06 14:55:49] INFO: [TRAIN] epoch: 1, iter: 80/160000, loss: 0.3629, lr: 0.000060, batch_cost: 0.4400, reader_cost: 0.00010, ips: 2.2725 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 19:32:50
[2024/11/06 14:55:53] INFO: [TRAIN] epoch: 1, iter: 90/160000, loss: 0.2942, lr: 0.000060, batch_cost: 0.4412, reader_cost: 0.00020, ips: 2.2667 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 19:35:47
[2024/11/06 14:55:58] INFO: [TRAIN] epoch: 1, iter: 100/160000, loss: 0.3662, lr: 0.000060, batch_cost: 0.5093, reader_cost: 0.00020, ips: 1.9635 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 22:37:16
[2024/11/06 14:56:03] INFO: [TRAIN] epoch: 1, iter: 110/160000, loss: 0.6134, lr: 0.000060, batch_cost: 0.4707, reader_cost: 0.00000, ips: 2.1246 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 20:54:17
[2024/11/06 14:56:08] INFO: [TRAIN] epoch: 1, iter: 120/160000, loss: 0.3395, lr: 0.000060, batch_cost: 0.4629, reader_cost: 0.00010, ips: 2.1602 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 20:33:30
[2024/11/06 14:56:12] INFO: [TRAIN] epoch: 1, iter: 130/160000, loss: 0.5634, lr: 0.000060, batch_cost: 0.4603, reader_cost: 0.00030, ips: 2.1725 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 20:26:27
[2024/11/06 14:56:16] INFO: [TRAIN] epoch: 1, iter: 140/160000, loss: 0.4350, lr: 0.000060, batch_cost: 0.4332, reader_cost: 0.00010, ips: 2.3084 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 19:14:12
[2024/11/06 14:56:21] INFO: [TRAIN] epoch: 1, iter: 150/160000, loss: 0.4544, lr: 0.000060, batch_cost: 0.4518, reader_cost: 0.00081, ips: 2.2135 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 20:03:34
[2024/11/06 14:56:25] INFO: [TRAIN] epoch: 1, iter: 160/160000, loss: 0.6249, lr: 0.000060, batch_cost: 0.4323, reader_cost: 0.00020, ips: 2.3130 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 19:11:44
[2024/11/06 14:56:30] INFO: [TRAIN] epoch: 1, iter: 170/160000, loss: 0.4180, lr: 0.000060, batch_cost: 0.4412, reader_cost: 0.00000, ips: 2.2665 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 19:35:17
[2024/11/06 14:56:34] INFO: [TRAIN] epoch: 1, iter: 180/160000, loss: 0.3575, lr: 0.000060, batch_cost: 0.4379, reader_cost: 0.00010, ips: 2.2837 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 19:26:23
[2024/11/06 14:56:39] INFO: [TRAIN] epoch: 1, iter: 190/160000, loss: 0.5617, lr: 0.000060, batch_cost: 0.4454, reader_cost: 0.00000, ips: 2.2450 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 19:46:25
[2024/11/06 14:56:43] INFO: [TRAIN] epoch: 1, iter: 200/160000, loss: 0.3030, lr: 0.000060, batch_cost: 0.4752, reader_cost: 0.00040, ips: 2.1044 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 21:05:37
[2024/11/06 14:56:43] INFO: Start evaluating (total_samples: 600, total_iters: 600)...
The error starts here:
Traceback (most recent call last):
  File "tools/train.py", line 262, in <module>
    main(args)
  File "tools/train.py", line 231, in main
    train(model,
  File "d:\melanoma\paddleseg-develop\paddleseg\core\train.py", line 342, in train
    mean_iou, acc, _, _, _ = evaluate(model,
  File "d:\melanoma\paddleseg-develop\paddleseg\core\val.py", line 158, in evaluate
    pred, logits = infer.inference(
  File "d:\melanoma\paddleseg-develop\paddleseg\core\infer.py", line 171, in inference
    logits = model(im)
  File "C:\ProgramData\Anaconda3\envs\PaddleSeg-develop-hesesu\lib\site-packages\paddle\nn\layer\layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "d:\melanoma\paddleseg-develop\paddleseg\models\segformer.py", line 83, in forward
    feats = self.backbone(x)
  File "C:\ProgramData\Anaconda3\envs\PaddleSeg-develop-hesesu\lib\site-packages\paddle\nn\layer\layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "d:\melanoma\paddleseg-develop\paddleseg\models\backbones\mix_transformer.py", line 466, in forward
    x = self.forward_features(x)
  File "d:\melanoma\paddleseg-develop\paddleseg\models\backbones\mix_transformer.py", line 433, in forward_features
    x = blk(x, H, W)
  File "C:\ProgramData\Anaconda3\envs\PaddleSeg-develop-hesesu\lib\site-packages\paddle\nn\layer\layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "d:\melanoma\paddleseg-develop\paddleseg\models\backbones\mix_transformer.py", line 201, in forward
    x = x + self.drop_path(self.attn(self.norm1(x), H, W))
  File "C:\ProgramData\Anaconda3\envs\PaddleSeg-develop-hesesu\lib\site-packages\paddle\nn\layer\layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "d:\melanoma\paddleseg-develop\paddleseg\models\backbones\mix_transformer.py", line 140, in forward
    attn = (q @ k.transpose([0, 1, 3, 2])) * self.scale
MemoryError:

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
Not support stack backtrace yet.

----------------------
Error Message Summary:
----------------------
ResourceExhaustedError: Out of memory error on GPU 0. Cannot allocate 33.910150GB memory on GPU 0, 6.587891GB memory has been allocated and available memory is only 9.404297GB. Please check whether there is any other process using GPU 0.
1. If yes, please stop them, or start PaddlePaddle on another GPU.
2. If no, please decrease the batch size of your model.
(at ..\paddle\fluid\memory\allocation\cuda_allocator.cc:86)
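The 33.9 GB allocation request is consistent with the stage-1 self-attention matrix of MixVisionTransformer being materialized for a full-resolution validation image: the val_dataset transforms contain only Normalize (no resize or crop), so the (q @ k^T) product at the failing line grows quadratically with image area. A back-of-envelope sketch (the stride-4 patch embedding, single stage-1 head, and sr_ratio=8 spatial reduction are assumptions taken from the published SegFormer B-series settings, not read from this repo; float32 assumed):

```python
def stage1_attn_bytes(h, w, heads=1, sr_ratio=8, bytes_per_el=4):
    """Rough size, in bytes, of the stage-1 (q @ k^T) attention matrix
    in MixVisionTransformer for an h x w input.

    Assumed settings (from the SegFormer paper's B-series, may differ
    from this repo): stride-4 patch embedding, 1 head, sr_ratio=8,
    float32 activations.
    """
    n_q = (h // 4) * (w // 4)                             # query tokens
    n_kv = (h // (4 * sr_ratio)) * (w // (4 * sr_ratio))  # keys/values after spatial reduction
    return heads * n_q * n_kv * bytes_per_el

# The 1024x512 training crop keeps the matrix small:
print(f"crop:      {stage1_attn_bytes(1024, 512) / 2**20:.0f} MiB")   # 64 MiB
# An uncropped multi-megapixel validation image (e.g. ~4000x3000) does not:
print(f"full size: {stage1_attn_bytes(4000, 3000) / 2**30:.1f} GiB")
```

Under these assumptions the attention matrix jumps from tens of MiB on the training crop to tens of GiB on an uncropped image, which would explain why training runs fine and the OOM appears only when evaluation starts.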
I have already set the batch size to 1, but it still reports this out-of-memory error. The machine has an A4000 with 16 GB of dedicated GPU memory; training occupies about 11 GB, and at iteration 200, when evaluation starts, usage is only around 6 GB, yet it still reports out of memory. How can this error be resolved? Any help would be greatly appreciated.
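For context, one commonly suggested workaround is to bound the evaluation resolution in the config's val_dataset transforms, mirroring the crop size used for training. This is a sketch, not a verified fix; the Resize transform and target_size values below are illustrative and should be adapted to the dataset:

```yaml
val_dataset:
  dataset_root: data/heisesu
  mode: val
  num_classes: 2
  transforms:
  - type: Resize          # illustrative: cap eval resolution so attention fits in 16 GB
    target_size:
    - 1024
    - 512
  - type: Normalize
  type: Dataset
  val_path: data/heisesu/test_list.txt
```

Alternatively, if this PaddleSeg version supports it, sliding-window evaluation (is_slide with crop_size/stride) evaluates large images patch by patch without resizing.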
复现环境 Environment
anyio 4.5.2 / 4.6.2
astor 0.8.1 / 0.8.1
babel 2.16.0 / 2.11.0
bce-python-sdk 0.9.23
blinker 1.8.2 / 1.6.2
ca-certificates 2024.9.24 / 2024.9.24
certifi 2024.8.30 / 2024.8.30
charset-normalizer 3.4.0 / 3.3.2
click 8.1.7 / 8.1.7
colorama 0.4.6 / 0.4.6
contourpy 1.1.1 / 1.2.0
cycler 0.12.1 / 0.11.0
decorator 5.1.1 / 5.1.1
exceptiongroup 1.2.2 / 1.2.0
filelock 3.16.1 / 3.13.1
flask 3.0.3 / 3.0.3
flask-babel 4.0.0 / 2.0.0
fonttools 4.54.1 / 4.51.0
future 1.0.0 / 0.18.3
h11 0.14.0 / 0.14.0
httpcore 1.0.6 / 1.0.2
httpx 0.27.2 / 0.27.0
idna 3.10 / 3.7
imageio 2.35.1 / 2.33.1
importlib-metadata 8.5.0 / 7.0.1
importlib-resources 6.4.5 / 6.4.0
itsdangerous 2.2.0 / 2.2.0
jinja2 3.1.4 / 3.1.4
joblib 1.4.2 / 1.4.2
kiwisolver 1.4.7 / 1.4.4
lazy-loader 0.4
libffi 3.4.4 / 3.4.4
markupsafe 2.1.5 / 2.1.3
matplotlib 3.7.5 / 3.9.2
networkx 3.1 / 3.3
numpy 1.24.4 / 2.1.3
opencv-python 4.5.5.64
openssl 3.0.15 / 3.0.15
opt-einsum 3.3.0
packaging 24.1 / 24.1
paddlepaddle-gpu 2.6.1.post112
paddleseg 0.0.0.dev0
pandas 2.0.3 / 2.2.2
pillow 10.4.0 / 10.4.0
pip 24.2 / 24.2
prettytable 3.11.0 / 3.5.0
protobuf 3.20.2 / 4.25.3
psutil 6.1.0 / 5.9.0
pycryptodome 3.21.0 / 3.20.0
pyparsing 3.1.4 / 3.1.2
python 3.8.20 / 3.13.0
python-dateutil 2.9.0.post0 / 2.9.0.post0
pytz 2024.2 / 2024.1
pywavelets 1.4.1 / 1.7.0
pyyaml 6.0.2 / 6.0.2
rarfile 4.2
requests 2.32.3 / 2.32.3
scikit-image 0.21.0 / 0.24.0
scikit-learn 1.3.2 / 1.5.1
scipy 1.10.1 / 1.13.1
setuptools 75.1.0 / 75.1.0
six 1.16.0 / 1.16.0
sniffio 1.3.1 / 1.3.0
sqlite 3.45.3 / 3.45.3
threadpoolctl 3.5.0 / 3.5.0
tifffile 2023.7.10 / 2023.4.12
tqdm 4.66.6 / 4.66.5
typing-extensions 4.12.2 / 4.11.0
tzdata 2024.2 / 2024b
urllib3 2.2.3 / 2.2.3
vc 14.40 / 14.40
visualdl 2.5.3
vs2015_runtime 14.40.33807 / 14.40.33807
wcwidth 0.2.13 / 0.2.5
werkzeug 3.0.6 / 3.0.3
wheel 0.44.0 / 0.44.0
zipp 3.20.2 / 3.20.2
Bug描述确认 Bug description confirmation
我确认已经提供了Bug复现步骤、代码改动说明、以及环境信息,确认问题是可以复现的。I confirm that the bug replication steps, code change instructions, and environment information have been provided, and the problem can be reproduced.
是否愿意提交PR? Are you willing to submit a PR?
我愿意提交PR!I'd like to help by submitting a PR!