
Error when training IA-SSD on a single machine with two GPUs: Hint: Expected groups_need_finalize == false, but received groups_need_finalize_:1 != false:0 #245

Open
Poet-LiBai opened this issue Feb 10, 2023 · 5 comments

@Poet-LiBai

LAUNCH INFO 2023-02-10 19:05:42,469 ------------------------- ERROR LOG DETAIL -------------------------
found, try JIT build
Compiling user custom op, it will cost a few seconds.....
W0210 19:05:37.927033 32446 custom_operator.cc:723] Operator (farthest_point_sample) has been registered.
W0210 19:05:37.927069 32446 custom_operator.cc:723] Operator (grouping_operation_stack) has been registered.
W0210 19:05:37.927075 32446 custom_operator.cc:723] Operator (ball_query_stack) has been registered.
W0210 19:05:37.927080 32446 custom_operator.cc:723] Operator (voxel_query_wrapper) has been registered.
W0210 19:05:37.927084 32446 custom_operator.cc:723] Operator (grouping_operation_batch) has been registered.
W0210 19:05:37.927088 32446 custom_operator.cc:723] Operator (ball_query_batch) has been registered.
W0210 19:05:37.927093 32446 custom_operator.cc:723] Operator (gather_operation) has been registered.
2023-02-10 19:05:37,939 - INFO - roiaware_pool3d builded success!
W0210 19:05:38.199252 32446 reducer.cc:622] All parameters are involved in the backward pass. It is recommended to set find_unused_parameters to False to improve performance. However, if unused parameters appear in subsequent iterative training, then an error will occur. Please make it clear that in the subsequent training, there will be no parameters that are not used in the backward pass, and then set find_unused_parameters
Traceback (most recent call last):
File "tools/train.py", line 202, in
main(args)
File "tools/train.py", line 197, in main
trainer.train()
File "/home/t/ps/DL/Paddle3D/paddle3d/apis/trainer.py", line 284, in train
output = training_step(
File "/home/t/ps/DL/Paddle3D/paddle3d/apis/pipeline.py", line 66, in training_step
outputs = model(sample)
File "/home/t/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 948, in call
return self.forward(*inputs, **kwargs)
File "/home/t/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/fluid/dygraph/parallel.py", line 777, in forward
self.reducer.prepare_for_backward(list(
RuntimeError: (PreconditionNotMet) A serious error has occurred here. Please set find_unused_parameters=True to traverse backward graph in each step to prepare reduce in advance. If you have set, There may be several reasons for this error: 1) Please note that all forward outputs derived from the module parameters must participate in the calculation of losses and subsequent gradient calculations. If not, the wrapper will hang, waiting for autograd to generate gradients for these parameters. you can use detach or stop_gradient to make the unused parameters detached from the autograd graph. 2) Used multiple forwards and one backward. You may be able to wrap multiple forwards in a model.
[Hint: Expected groups_need_finalize_ == false, but received groups_need_finalize_:1 != false:0.] (at /paddle/paddle/fluid/distributed/collective/reducer.cc:609)

I0210 19:05:41.005930 32576 tcp_store.cc:257] receive shutdown event and so quit from MasterDaemon run loop
LAUNCH INFO 2023-02-10 19:05:42,470 Exit code 1

@Birdylx
Collaborator

Birdylx commented Feb 13, 2023

@Poet-LiBai hi, could you share your environment info, including your CUDA, paddle, and paddle3d versions, as well as your launch command?

@Poet-LiBai
Author

Hardware configuration

Memory: 125.6 GiB
Intel® Core™ i9-9900K CPU @ 3.60GHz × 16
2 × NVIDIA GeForce RTX 3060/PCIe/SSE2, 12 GB VRAM each

Environment information

------------Environment Information-------------
platform:
Linux-5.4.0-136-generic-x86_64-with-glibc2.10
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Python - 3.8.0 (default, Nov 6 2019, 21:49:08) [GCC 7.3.0]

Science Toolkits:
cv2 - 4.5.5
numpy - 1.23.5
numba - 0.56.4
pandas - 1.5.3
pillow - 8.3.2
skimage - 0.19.3

PaddlePaddle:
paddle(gpu) - 2.4.1
paddle3d - 1.0.0
paddleseg - 2.7.0
FLAGS_cudnn_deterministic - Not set.
FLAGS_cudnn_exhaustive_search - Not set.

CUDA:
cudnn - 8201
nvcc - Build cuda_11.2.r11.2/compiler.29373293_0

GPUs:
GPU 0: NVIDIA GeForce
GPU 1: NVIDIA GeForce

Launch command

export CUDA_VISIBLE_DEVICES=0,1
fleetrun tools/train.py --config configs/iassd/iassd_kitti.yaml --save_interval 1 --num_workers 2 --save_dir outputs/iassd_kitti

Temporary workaround

Even with

model = paddle.DataParallel(self.model, find_unused_parameters=True)

already set in Paddle3D/paddle3d/apis/trainer.py, the error still occurred. Following the printed error log into Paddle3D/paddle3d/apis/pipeline.py, the condition

if isinstance(model, paddle.DataParallel) and hasattr(model._layers, 'use_recompute') \
        and model._layers.use_recompute:

is never satisfied because hasattr(model._layers, 'use_recompute') is False, so execution never reaches the with model.no_sync(): branch. After I removed that condition, two-GPU training worked. Is this approach reasonable?
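For reference, a minimal sketch of what the patched training_step could look like after dropping the use_recompute checks. The surrounding structure is paraphrased rather than copied from the source, and the loss key, optimizer handling, and return value are assumptions:

```python
import paddle
from paddle.distributed.fleet.utils.hybrid_parallel_util import (
    fused_allreduce_gradients, )


def training_step(model, optimizer, sample, cur_iter):
    model.train()
    # Workaround: only check for DataParallel, instead of additionally
    # requiring hasattr(model._layers, 'use_recompute') to be True.
    if isinstance(model, paddle.DataParallel):
        with model.no_sync():
            outputs = model(sample)  # assumed to return a dict with a 'loss' key
            outputs['loss'].backward()
        # no_sync() suppressed the automatic gradient sync; all-reduce manually.
        fused_allreduce_gradients(list(model.parameters()), None)
    else:
        outputs = model(sample)
        outputs['loss'].backward()
    optimizer.step()
    optimizer.clear_grad()
    return outputs
```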

A new problem

My training machine has two RTX 3060s with 12 GB of VRAM each. With batch_size=8 in the YAML config (i.e., a total batch size of 16 across both cards), the host crashes and reboots after the first or second iteration of distributed training; repeated attempts gave the same result. Only after lowering the per-card batch_size to 4 (total batch size 8 across two cards) does two-GPU training run stably, which is effectively the same as training on one card. A single-GPU run with batch_size=8 uses over 11 GB of the 12 GB VRAM. Why does setting batch_size=8 in the YAML for two-GPU training crash and reboot the host?
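For clarity, the batch-size arithmetic above as a small sketch (numbers taken from the report; in this setup the YAML batch_size is per card):

```python
num_gpus = 2
per_gpu_batch_size = 8                             # batch_size in the YAML
total_batch_size = num_gpus * per_gpu_batch_size   # 16 -> host crashes

per_gpu_batch_size = 4
total_batch_size = num_gpus * per_gpu_batch_size   # 8 -> stable, but equal to
                                                   # one card at batch_size=8
```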

@Birdylx
Collaborator

Birdylx commented Feb 22, 2023

@Poet-LiBai hi, I have reproduced this problem. It is caused by the paddle version; for now you can fall back to paddle 2.3 and make the following changes:

  1. https://github.com/PaddlePaddle/Paddle3D/blob/release/1.0/paddle3d/__init__.py#L24 Comment out this block to skip the version check.
  2. if paddle.distributed.is_initialized():
    paddle 2.3 does not have this API; for multi-GPU runs you can simply use if True (see the sketch after this list).
  3. from .sparse_resnet import SparseResNet3D
    paddle 2.3 does not have sparse, so comment out these two lines.

This is a temporary workaround; we will fix the problem as soon as possible.
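A sketch of edits 2 and 3 under a paddle 2.3 environment (illustrative only; the exact files and line placements in Paddle3D may differ):

```python
# Edit 2: paddle 2.3 has no paddle.distributed.is_initialized(); when
# launching multi-GPU training, hard-code the branch instead:
# if paddle.distributed.is_initialized():
if True:
    pass  # ... original multi-GPU initialization code stays here ...

# Edit 3: paddle 2.3 has no paddle.sparse, so comment out the sparse
# backbone import (and its re-export, if present):
# from .sparse_resnet import SparseResNet3D
```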

Also, if you run on a single card, you need to use the linear scaling rule to adjust the learning rate according to the batch size you set, and you also need to adjust the number of steps in the config.
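As an example, a sketch of the linear scaling rule (all numbers are hypothetical; take the base values from configs/iassd/iassd_kitti.yaml):

```python
base_lr = 0.01          # hypothetical LR tuned for the default total batch
base_total_batch = 16   # e.g. 2 GPUs x batch_size 8
my_total_batch = 8      # 1 GPU x batch_size 8

# Linear scaling rule: LR scales proportionally with the total batch size.
scaled_lr = base_lr * my_total_batch / base_total_batch          # 0.005

# To cover the same number of epochs, steps scale inversely with batch size.
base_iters = 80000      # hypothetical iteration count from the config
scaled_iters = base_iters * base_total_batch // my_total_batch   # 160000
```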

@WoodwindHu

I ran into the same problem.

@Birdylx
Collaborator

Birdylx commented Mar 11, 2024

Please upgrade to paddle==2.5.2.
