Error when training IA-SSD on a single machine with two GPUs: Hint: Expected groups_need_finalize == false, but received groups_need_finalize_:1 != false:0 #245
Comments
@Poet-LiBai hi, could you share your environment information (CUDA, Paddle, and Paddle3D versions) and your launch command?
Hardware: 125.6 GiB of memory.
------------Environment Information-------------
Science Toolkits: PaddlePaddle: CUDA: GPUs:
Launch command: export CUDA_VISIBLE_DEVICES=0,1
Temporary workaround: already applied a setting in Paddle3D/paddle3d/apis/trainer.py.
New problem: my training machine has two RTX 3060s with 12 GB of VRAM each. With batch_size=8 in the yml config file (a total batch_size of 16 for two-GPU training), the host crashes and reboots automatically after the first or second iter once distributed training starts, and repeated attempts give the same result. Only after lowering the per-card batch_size to 4 (a total batch_size of 8 across two cards) does two-GPU training run stably, which is effectively the same as training on a single card. With single-card training at batch_size=8, a bit over 11 GB of the 12 GB of VRAM is used, so I don't understand why setting batch_size=8 in the yml for two-GPU training makes the host crash and reboot.
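The workaround referred to above is not spelled out in the thread; what the RuntimeError hint at the bottom of the log asks for is passing find_unused_parameters=True where trainer.py wraps the model for multi-GPU training. A minimal sketch under that assumption, not Paddle3D's actual trainer code:

import paddle
import paddle.nn as nn

# Minimal sketch, not Paddle3D's actual trainer code: a placeholder model
# stands in for IA-SSD, and the only point illustrated is where
# find_unused_parameters=True would be passed when wrapping the model.
model = nn.Linear(4, 2)  # placeholder for the real detection model
if paddle.distributed.get_world_size() > 1:
    paddle.distributed.init_parallel_env()
    # Traverse the backward graph every step so parameters that receive no
    # gradient are still finalized, as the RuntimeError hint requests.
    model = paddle.DataParallel(model, find_unused_parameters=True)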
@Poet-LiBai hi, I have reproduced this problem. It is caused by the Paddle version. For now you can use Paddle 2.3 and also make the following changes:
Also, if you train on a single card, you need to adjust the learning rate according to your batch size using the linear scaling rule (see the sketch below), and you also need to set the number of steps in the config.
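The linear scaling rule mentioned above amounts to scaling the learning rate in proportion to the total batch size. A minimal sketch; the base batch size and base learning rate below are assumptions for illustration, not the actual IA-SSD config values:

# Assumed reference point: the config's learning rate was tuned for
# 2 GPUs x batch_size=8 = 16 samples per step (assumption, not confirmed here).
base_total_batch_size = 16
base_lr = 0.01               # assumed base learning rate

my_total_batch_size = 8      # e.g. a single GPU with batch_size=8
scaled_lr = base_lr * my_total_batch_size / base_total_batch_size
print(scaled_lr)             # 0.005 -> put this value into the yml's lr settings
# The number of training iters in the config should also grow by the same
# factor the batch size shrank, so the model sees the same amount of data.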
I ran into the same problem.
Please update to paddle==2.5.2.
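A quick way to confirm the upgrade took effect (a small sketch, not part of the original reply):

import paddle

print(paddle.__version__)   # expect 2.5.2 after upgrading
paddle.utils.run_check()    # sanity-check that the GPU build is working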
LAUNCH INFO 2023-02-10 19:05:42,469 ------------------------- ERROR LOG DETAIL -------------------------
found, try JIT build
Compiling user custom op, it will cost a few seconds.....
W0210 19:05:37.927033 32446 custom_operator.cc:723] Operator (farthest_point_sample) has been registered.
W0210 19:05:37.927069 32446 custom_operator.cc:723] Operator (grouping_operation_stack) has been registered.
W0210 19:05:37.927075 32446 custom_operator.cc:723] Operator (ball_query_stack) has been registered.
W0210 19:05:37.927080 32446 custom_operator.cc:723] Operator (voxel_query_wrapper) has been registered.
W0210 19:05:37.927084 32446 custom_operator.cc:723] Operator (grouping_operation_batch) has been registered.
W0210 19:05:37.927088 32446 custom_operator.cc:723] Operator (ball_query_batch) has been registered.
W0210 19:05:37.927093 32446 custom_operator.cc:723] Operator (gather_operation) has been registered.
2023-02-10 19:05:37,939 - INFO - roiaware_pool3d builded success!
W0210 19:05:38.199252 32446 reducer.cc:622] All parameters are involved in the backward pass. It is recommended to set find_unused_parameters to False to improve performance. However, if unused parameters appear in subsequent iterative training, then an error will occur. Please make it clear that in the subsequent training, there will be no parameters that are not used in the backward pass, and then set find_unused_parameters
Traceback (most recent call last):
File "tools/train.py", line 202, in
main(args)
File "tools/train.py", line 197, in main
trainer.train()
File "/home/t/ps/DL/Paddle3D/paddle3d/apis/trainer.py", line 284, in train
output = training_step(
File "/home/t/ps/DL/Paddle3D/paddle3d/apis/pipeline.py", line 66, in training_step
outputs = model(sample)
File "/home/t/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 948, in call
return self.forward(*inputs, **kwargs)
File "/home/t/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/fluid/dygraph/parallel.py", line 777, in forward
self.reducer.prepare_for_backward(list(
RuntimeError: (PreconditionNotMet) A serious error has occurred here. Please set find_unused_parameters=True to traverse backward graph in each step to prepare reduce in advance. If you have set, There may be several reasons for this error: 1) Please note that all forward outputs derived from the module parameters must participate in the calculation of losses and subsequent gradient calculations. If not, the wrapper will hang, waiting for autograd to generate gradients for these parameters. you can use detach or stop_gradient to make the unused parameters detached from the autograd graph. 2) Used multiple forwards and one backward. You may be able to wrap multiple forwards in a model.
[Hint: Expected groups_need_finalize == false, but received groups_need_finalize_:1 != false:0.] (at /paddle/paddle/fluid/distributed/collective/reducer.cc:609)
I0210 19:05:41.005930 32576 tcp_store.cc:257] receive shutdown event and so quit from MasterDaemon run loop
LAUNCH INFO 2023-02-10 19:05:42,470 Exit code 1