C++ Traceback (most recent call last): Not support stack backtrace yet. Out of memory #3841

Open
2711035086 opened this issue Nov 6, 2024 · 4 comments
Labels: bug (Something isn't working)

@2711035086

Issue confirmation (Search before asking)

Bug description (Describe the Bug)

(PaddleSeg-develop-hesesu) D:\melanoma\PaddleSeg-develop>python tools/train.py --config configs/segformer/segformer_b5_cityscapes_1024x512_160k.yml --save_interval 200 --do_eval --use_vdl --save_dir output
[2024/11/06 14:55:10] INFO: ------------Environment Information-------------
platform: Windows-10-10.0.19045-SP0
Python: 3.8.20 (default, Oct 3 2024, 15:19:54) [MSC v.1929 64 bit (AMD64)]
Paddle compiled with cuda: True
NVCC: Build cuda_11.1.relgpu_drvr455TC455_06.29069683_0
cudnn: 8.1
GPUs used: 1
CUDA_VISIBLE_DEVICES: None
GPU: ['GPU 0: RTX A4000']
PaddleSeg: 0.0.0.dev0
PaddlePaddle: 2.6.1
OpenCV: 4.5.5
------------------------------------------------
[2024/11/06 14:55:10] INFO: ---------------Config Information---------------
batch_size: 1
iters: 160000
train_dataset:
  dataset_root: data/heisesu
  mode: train
  num_classes: 2
  train_path: data/heisesu/train_list.txt
  transforms:
  - max_scale_factor: 2.0
    min_scale_factor: 0.5
    scale_step_size: 0.25
    type: ResizeStepScaling
  - crop_size:
    - 1024
    - 512
    type: RandomPaddingCrop
  - type: RandomHorizontalFlip
  - brightness_range: 0.4
    contrast_range: 0.4
    saturation_range: 0.4
    type: RandomDistort
  - type: Normalize
  type: Dataset
val_dataset:
  dataset_root: data/heisesu
  mode: val
  num_classes: 2
  transforms:
  - type: Normalize
  type: Dataset
  val_path: data/heisesu/test_list.txt
optimizer:
  beta1: 0.9
  beta2: 0.999
  type: AdamW
  weight_decay: 0.01
lr_scheduler:
  end_lr: 0
  learning_rate: 6.0e-05
  power: 1
  type: PolynomialDecay
loss:
  coef:
  - 1
  types:
  - type: CrossEntropyLoss
model:
  backbone:
    pretrained: https://bj.bcebos.com/paddleseg/dygraph/backbone/mix_vision_transformer_b5.tar.gz
    type: MixVisionTransformer_B5
  embedding_dim: 768
  num_classes: 2
  type: SegFormer
------------------------------------------------
[2024/11/06 14:55:10] INFO: Set device: gpu
[2024/11/06 14:55:10] INFO: Use the following config to build model
model:
  backbone:
    pretrained: https://bj.bcebos.com/paddleseg/dygraph/backbone/mix_vision_transformer_b5.tar.gz
    type: MixVisionTransformer_B5
  embedding_dim: 768
  num_classes: 2
  type: SegFormer
W1106 14:55:10.898733 38024 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.2, Runtime API Version: 11.2
W1106 14:55:10.899730 38024 gpu_resources.cc:164] device: 0, cuDNN Version: 8.1.
[2024/11/06 14:55:12] INFO: Loading pretrained model from https://bj.bcebos.com/paddleseg/dygraph/backbone/mix_vision_transformer_b5.tar.gz
[2024/11/06 14:55:12] INFO: There are 1052/1052 variables loaded into MixVisionTransformer.
[2024/11/06 14:55:13] INFO: Use the following config to build train_dataset
train_dataset:
  dataset_root: data/heisesu
  mode: train
  num_classes: 2
  train_path: data/heisesu/train_list.txt
  transforms:
  - max_scale_factor: 2.0
    min_scale_factor: 0.5
    scale_step_size: 0.25
    type: ResizeStepScaling
  - crop_size:
    - 1024
    - 512
    type: RandomPaddingCrop
  - type: RandomHorizontalFlip
  - brightness_range: 0.4
    contrast_range: 0.4
    saturation_range: 0.4
    type: RandomDistort
  - type: Normalize
  type: Dataset
[2024/11/06 14:55:13] INFO: Use the following config to build val_dataset
val_dataset:
  dataset_root: data/heisesu
  mode: val
  num_classes: 2
  transforms:
  - type: Normalize
  type: Dataset
  val_path: data/heisesu/test_list.txt
[2024/11/06 14:55:13] INFO: Use the following config to build optimizer
optimizer:
  beta1: 0.9
  beta2: 0.999
  type: AdamW
  weight_decay: 0.01
[2024/11/06 14:55:13] INFO: Use the following config to build loss
loss:
  coef:
  - 1
  types:
  - type: CrossEntropyLoss
W1106 14:55:13.307674 38024 gpu_resources.cc:299] WARNING: device: . The installed Paddle is compiled with CUDNN 8.2, but CUDNN version in your machine is 8.1, which may cause serious incompatible bug. Please recompile or reinstall Paddle with compatible CUDNN version.
C:\ProgramData\Anaconda3\envs\PaddleSeg-develop-hesesu\lib\site-packages\paddle\nn\layer\norm.py:824: UserWarning: When training, we now always track global mean and variance.
  warnings.warn(
[2024/11/06 14:55:20] INFO: [TRAIN] epoch: 1, iter: 10/160000, loss: 0.5862, lr: 0.000060, batch_cost: 0.7082, reader_cost: 0.02049, ips: 1.4121 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 31:28:20
[2024/11/06 14:55:23] INFO: [TRAIN] epoch: 1, iter: 20/160000, loss: 0.6621, lr: 0.000060, batch_cost: 0.3775, reader_cost: 0.00010, ips: 2.6488 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 16:46:37
[2024/11/06 14:55:27] INFO: [TRAIN] epoch: 1, iter: 30/160000, loss: 0.5721, lr: 0.000060, batch_cost: 0.3955, reader_cost: 0.00000, ips: 2.5283 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 17:34:31
[2024/11/06 14:55:31] INFO: [TRAIN] epoch: 1, iter: 40/160000, loss: 0.4767, lr: 0.000060, batch_cost: 0.3954, reader_cost: 0.00010, ips: 2.5293 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 17:34:01
[2024/11/06 14:55:36] INFO: [TRAIN] epoch: 1, iter: 50/160000, loss: 0.5364, lr: 0.000060, batch_cost: 0.4156, reader_cost: 0.00020, ips: 2.4060 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 18:27:59
[2024/11/06 14:55:40] INFO: [TRAIN] epoch: 1, iter: 60/160000, loss: 0.5858, lr: 0.000060, batch_cost: 0.4184, reader_cost: 0.00010, ips: 2.3901 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 18:35:18
[2024/11/06 14:55:44] INFO: [TRAIN] epoch: 1, iter: 70/160000, loss: 0.3769, lr: 0.000060, batch_cost: 0.4556, reader_cost: 0.00020, ips: 2.1951 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 20:14:18
[2024/11/06 14:55:49] INFO: [TRAIN] epoch: 1, iter: 80/160000, loss: 0.3629, lr: 0.000060, batch_cost: 0.4400, reader_cost: 0.00010, ips: 2.2725 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 19:32:50
[2024/11/06 14:55:53] INFO: [TRAIN] epoch: 1, iter: 90/160000, loss: 0.2942, lr: 0.000060, batch_cost: 0.4412, reader_cost: 0.00020, ips: 2.2667 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 19:35:47
[2024/11/06 14:55:58] INFO: [TRAIN] epoch: 1, iter: 100/160000, loss: 0.3662, lr: 0.000060, batch_cost: 0.5093, reader_cost: 0.00020, ips: 1.9635 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 22:37:16
[2024/11/06 14:56:03] INFO: [TRAIN] epoch: 1, iter: 110/160000, loss: 0.6134, lr: 0.000060, batch_cost: 0.4707, reader_cost: 0.00000, ips: 2.1246 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 20:54:17
[2024/11/06 14:56:08] INFO: [TRAIN] epoch: 1, iter: 120/160000, loss: 0.3395, lr: 0.000060, batch_cost: 0.4629, reader_cost: 0.00010, ips: 2.1602 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 20:33:30
[2024/11/06 14:56:12] INFO: [TRAIN] epoch: 1, iter: 130/160000, loss: 0.5634, lr: 0.000060, batch_cost: 0.4603, reader_cost: 0.00030, ips: 2.1725 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 20:26:27
[2024/11/06 14:56:16] INFO: [TRAIN] epoch: 1, iter: 140/160000, loss: 0.4350, lr: 0.000060, batch_cost: 0.4332, reader_cost: 0.00010, ips: 2.3084 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 19:14:12
[2024/11/06 14:56:21] INFO: [TRAIN] epoch: 1, iter: 150/160000, loss: 0.4544, lr: 0.000060, batch_cost: 0.4518, reader_cost: 0.00081, ips: 2.2135 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 20:03:34
[2024/11/06 14:56:25] INFO: [TRAIN] epoch: 1, iter: 160/160000, loss: 0.6249, lr: 0.000060, batch_cost: 0.4323, reader_cost: 0.00020, ips: 2.3130 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 19:11:44
[2024/11/06 14:56:30] INFO: [TRAIN] epoch: 1, iter: 170/160000, loss: 0.4180, lr: 0.000060, batch_cost: 0.4412, reader_cost: 0.00000, ips: 2.2665 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 19:35:17
[2024/11/06 14:56:34] INFO: [TRAIN] epoch: 1, iter: 180/160000, loss: 0.3575, lr: 0.000060, batch_cost: 0.4379, reader_cost: 0.00010, ips: 2.2837 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 19:26:23
[2024/11/06 14:56:39] INFO: [TRAIN] epoch: 1, iter: 190/160000, loss: 0.5617, lr: 0.000060, batch_cost: 0.4454, reader_cost: 0.00000, ips: 2.2450 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 19:46:25
[2024/11/06 14:56:43] INFO: [TRAIN] epoch: 1, iter: 200/160000, loss: 0.3030, lr: 0.000060, batch_cost: 0.4752, reader_cost: 0.00040, ips: 2.1044 samples/sec, max_mem_reserved: 7862 MB, max_mem_allocated: 7353 MB | ETA 21:05:37
[2024/11/06 14:56:43] INFO: Start evaluating (total_samples: 600, total_iters: 600)...
The error starts here:
Traceback (most recent call last):
  File "tools/train.py", line 262, in <module>
    main(args)
  File "tools/train.py", line 231, in main
    train(model,
  File "d:\melanoma\paddleseg-develop\paddleseg\core\train.py", line 342, in train
    mean_iou, acc, _, _, _ = evaluate(model,
  File "d:\melanoma\paddleseg-develop\paddleseg\core\val.py", line 158, in evaluate
    pred, logits = infer.inference(
  File "d:\melanoma\paddleseg-develop\paddleseg\core\infer.py", line 171, in inference
    logits = model(im)
  File "C:\ProgramData\Anaconda3\envs\PaddleSeg-develop-hesesu\lib\site-packages\paddle\nn\layer\layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "d:\melanoma\paddleseg-develop\paddleseg\models\segformer.py", line 83, in forward
    feats = self.backbone(x)
  File "C:\ProgramData\Anaconda3\envs\PaddleSeg-develop-hesesu\lib\site-packages\paddle\nn\layer\layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "d:\melanoma\paddleseg-develop\paddleseg\models\backbones\mix_transformer.py", line 466, in forward
    x = self.forward_features(x)
  File "d:\melanoma\paddleseg-develop\paddleseg\models\backbones\mix_transformer.py", line 433, in forward_features
    x = blk(x, H, W)
  File "C:\ProgramData\Anaconda3\envs\PaddleSeg-develop-hesesu\lib\site-packages\paddle\nn\layer\layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "d:\melanoma\paddleseg-develop\paddleseg\models\backbones\mix_transformer.py", line 201, in forward
    x = x + self.drop_path(self.attn(self.norm1(x), H, W))
  File "C:\ProgramData\Anaconda3\envs\PaddleSeg-develop-hesesu\lib\site-packages\paddle\nn\layer\layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "d:\melanoma\paddleseg-develop\paddleseg\models\backbones\mix_transformer.py", line 140, in forward
    attn = (q @ k.transpose([0, 1, 3, 2])) * self.scale
MemoryError:

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
Not support stack backtrace yet.

----------------------
Error Message Summary:
----------------------
ResourceExhaustedError: Out of memory error on GPU 0. Cannot allocate 33.910150GB memory on GPU 0, 6.587891GB memory has been allocated and available memory is only 9.404297GB.
Please check whether there is any other process using GPU 0.
1. If yes, please stop them, or start PaddlePaddle on another GPU.
2. If no, please decrease the batch size of your model.
 (at ..\paddle\fluid\memory\allocation\cuda_allocator.cc:86)
I already set the batch size to 1, yet I still get this out-of-memory error. The machine has an RTX A4000 with 16 GB of dedicated GPU memory; during training the job uses roughly 11 GB, and at iteration 200, when evaluation starts, it is only using about 6 GB, yet it still reports out of memory. How can I fix this error? Any help would be much appreciated.
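A note on why batch size 1 does not help here: the val_dataset transforms contain only Normalize, so every validation image reaches the MixVisionTransformer backbone at its full original resolution, and the self-attention matrix built in `attn = (q @ k.transpose(...))` grows quadratically with the number of tokens. That is what turns a ~7 GB training footprint (1024x512 crops) into a 33.9 GB allocation request at evaluation time. Below is a minimal config-side sketch of two common workarounds, not a confirmed fix from the maintainers; it assumes PaddleSeg's standard Resize transform and the optional test_config block for sliding-window evaluation, and the target_size/crop_size/stride values are illustrative, not values taken from this issue.

# Hypothetical edit to segformer_b5_cityscapes_1024x512_160k.yml (example values only).

# Option 1: resize validation images down to a size the 16 GB card can handle.
val_dataset:
  type: Dataset
  dataset_root: data/heisesu
  val_path: data/heisesu/test_list.txt
  mode: val
  num_classes: 2
  transforms:
    - type: Resize            # added: cap the evaluation resolution
      target_size: [1024, 512]
    - type: Normalize

# Option 2: keep full-resolution images but evaluate with a sliding window,
# so only one crop at a time passes through the attention layers.
test_config:
  is_slide: True
  crop_size: [1024, 512]
  stride: [768, 384]

Either change keeps the per-forward token count close to what the model sees during training; if memory is still tight, the attention cost scales with the crop area, so shrinking target_size or crop_size further is the next knob to try.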

Environment (package list: name, installed version, latest available version)

anyio 4.5.2 4.6.2
astor 0.8.1 0.8.1
babel 2.16.0 2.11.0
bce-python-sdk 0.9.23  
blinker 1.8.2 1.6.2
ca-certificates 2024.9.24 2024.9.24
certifi 2024.8.30 2024.8.30
charset-normalizer 3.4.0 3.3.2
click 8.1.7 8.1.7
colorama 0.4.6 0.4.6
contourpy 1.1.1 1.2.0
cycler 0.12.1 0.11.0
decorator 5.1.1 5.1.1
exceptiongroup 1.2.2 1.2.0
filelock 3.16.1 3.13.1
flask 3.0.3 3.0.3
flask-babel 4.0.0 2.0.0
fonttools 4.54.1 4.51.0
future 1.0.0 0.18.3
h11 0.14.0 0.14.0
httpcore 1.0.6 1.0.2
httpx 0.27.2 0.27.0
idna 3.10 3.7
imageio 2.35.1 2.33.1
importlib-metadata 8.5.0 7.0.1
importlib-resources 6.4.5 6.4.0
itsdangerous 2.2.0 2.2.0
jinja2 3.1.4 3.1.4
joblib 1.4.2 1.4.2
kiwisolver 1.4.7 1.4.4
lazy-loader 0.4  
libffi 3.4.4 3.4.4
markupsafe 2.1.5 2.1.3
matplotlib 3.7.5 3.9.2
networkx 3.1 3.3
numpy 1.24.4 2.1.3
opencv-python 4.5.5.64  
openssl 3.0.15 3.0.15
opt-einsum 3.3.0  
packaging 24.1 24.1
paddlepaddle-gpu 2.6.1.post112  
paddleseg 0.0.0.dev0  
pandas 2.0.3 2.2.2
pillow 10.4.0 10.4.0
pip 24.2 24.2
prettytable 3.11.0 3.5.0
protobuf 3.20.2 4.25.3
psutil 6.1.0 5.9.0
pycryptodome 3.21.0 3.20.0
pyparsing 3.1.4 3.1.2
python 3.8.20 3.13.0
python-dateutil 2.9.0.post0 2.9.0.post0
pytz 2024.2 2024.1
pywavelets 1.4.1 1.7.0
pyyaml 6.0.2 6.0.2
rarfile 4.2  
requests 2.32.3 2.32.3
scikit-image 0.21.0 0.24.0
scikit-learn 1.3.2 1.5.1
scipy 1.10.1 1.13.1
setuptools 75.1.0 75.1.0
six 1.16.0 1.16.0
sniffio 1.3.1 1.3.0
sqlite 3.45.3 3.45.3
threadpoolctl 3.5.0 3.5.0
tifffile 2023.7.10 2023.4.12
tqdm 4.66.6 4.66.5
typing-extensions 4.12.2 4.11.0
tzdata 2024.2 2024b
urllib3 2.2.3 2.2.3
vc 14.40 14.40
visualdl 2.5.3  
vs2015_runtime 14.40.33807 14.40.33807
wcwidth 0.2.13 0.2.5
werkzeug 3.0.6 3.0.3
wheel 0.44.0 0.44.0
zipp 3.20.2 3.20.2

Bug description confirmation

  • I confirm that the bug replication steps, code change instructions, and environment information have been provided, and the problem can be reproduced.

Are you willing to submit a PR?

  • I'd like to help by submitting a PR!
2711035086 added the bug (Something isn't working) label on Nov 6, 2024
TingquanGao reopened this on Nov 13, 2024
@2711035086 (Author)

So annoying, why was I asked to close this?

@TingquanGao (Collaborator)

So annoying, why was I asked to close this?

Sorry, that was an accidental action; the issue has been reopened.

@2711035086 (Author)

Great, looking forward to an effective solution.

@TingquanGao (Collaborator)

We will follow up on this issue. In the meantime, we also suggest trying PaddleX, PaddlePaddle's low-code development tool, which makes model training and deployment simpler; the SegFormer-B5 model is already integrated into PaddleX (see the related documentation).
