
A question about distributed training on DDAD dataset #11

Open
myc634 opened this issue Oct 30, 2022 · 8 comments

@myc634

myc634 commented Oct 30, 2022

Hello! I am following your work and trying to reproduce it, but I got the errors below while running the command python -m torch.distributed.launch --nproc_per_node 8 run.py --model_name ddad --config configs/ddad.txt for distributed training on the DDAD dataset.

[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1806986 milliseconds before timing out.

[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

After training for a while, the process is automatically shut down because of this timeout.
Are there any details or training settings that I have missed? Or does the torch version matter?
Thanks!
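
(For anyone hitting the same watchdog error: the 1800000 ms in the log is the default 30-minute NCCL collective timeout. A minimal sketch of raising it is below, assuming you can edit wherever the process group is initialized; run.py's actual setup may differ.)

```python
import datetime

import torch.distributed as dist

# Minimal sketch only: extend the default 30-minute NCCL collective timeout.
# Where init_process_group is called depends on the repository's own setup;
# this just illustrates the torch.distributed API, not run.py's actual code.
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),  # default is 30 min (1800000 ms)
)
```

Whether raising the timeout is appropriate depends on why the all-gather stalls in the first place.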

@weiyithu
Owner

Is the GPU out of memory?

@myc634
Author

myc634 commented Oct 31, 2022

Maybe not, I guess. Currently we are training on A10 GPUs, and yesterday the training process went on well without changing any settings. BTW, are you using 6 × RTX 3090 GPUs for training? I found the line self.opt.batch_size = self.opt.batch_size // 6 and am wondering what this code is for.

@weiyithu
Owner

weiyithu commented Nov 2, 2022

I'm sorry that it is a little bit confusing. This code means that we reshape one batch of data from (6, 3, H, W) to (1, 6, 3, H, W) in the training step, since one frame has 6 surrounding views.
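
A minimal sketch of that reshape, with placeholder H and W rather than the actual training resolution:

```python
import torch

# Minimal sketch of the reshape described above: one DDAD frame consists of
# 6 surrounding camera views, so the tensor loaded as (6, 3, H, W) is treated
# as a single sample of shape (1, 6, 3, H, W) during training.
N, C, H, W = 6, 3, 384, 640              # H, W are placeholder values
views = torch.randn(N, C, H, W)          # (6, 3, H, W) as loaded
views = views.unsqueeze(0)               # (1, 6, 3, H, W) fed to the model
print(views.shape)                       # torch.Size([1, 6, 3, 384, 640])
```

This keeps the 6 surrounding views of one frame grouped together as a single sample.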

@myc634
Author

myc634 commented Nov 2, 2022

Thanks for your explanation! You did great and solid work! I have one last question: how long did training take on your machine for the scale-aware model with SfM pretraining? The program estimates that it will take 63 hours.

@weiyithu
Owner

weiyithu commented Nov 3, 2022

I remember that for DDAD it takes about 1.5 days on 8 RTX 3090 GPUs for scale-aware training.

@myc634
Author

myc634 commented Nov 3, 2022

Thank you very much!

@weiyithu
Owner

weiyithu commented Nov 3, 2022

You're welcome.

@myc634
Author

myc634 commented Jan 7, 2023

I noticed that you mentioned in the paper that the results of FSM are different from the original paper. Have you tried to reproduce the FSM results yourself?
