
A question about distributed training on DDAD dataset #11

Open
myc634 opened this issue Oct 30, 2022 · 8 comments

@myc634

myc634 commented Oct 30, 2022

Hello! I am following your work and trying to reproduce it, but I got the errors below while running the command python -m torch.distributed.launch --nproc_per_node 8 run.py --model_name ddad --config configs/ddad.txt for distributed training on the DDAD dataset.

[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1806986 milliseconds before timing out.

[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

After training for a while, the process is automatically shut down because of this timeout.
Are there any details or training settings that I have missed? Or does the torch version matter?
Thanks!
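
(For anyone hitting the same watchdog error: the 1800000 ms in the log is the default 30-minute NCCL collective timeout. A minimal sketch of raising it is below, assuming you can edit wherever the process group is initialized; run.py's actual setup may differ.)

```python
import datetime

import torch.distributed as dist

# Minimal sketch only: extend the default 30-minute NCCL collective timeout.
# Where init_process_group is called depends on the repository's own setup;
# this just illustrates the torch.distributed API, not run.py's actual code.
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),  # default is 30 min (1800000 ms)
)
```

Whether raising the timeout is appropriate depends on why the all-gather stalls in the first place.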

@weiyithu
Owner

Is the GPU out of memory?

@myc634
Author

myc634 commented Oct 31, 2022

Maybe not, I guess. Currently we are training on A10 GPUs, and yesterday the training process went on well without changing any settings. BTW, are you using 6 × RTX 3090 GPUs for training? I found the line self.opt.batch_size = self.opt.batch_size // 6 and am wondering what this code is for.

@weiyithu
Owner

weiyithu commented Nov 2, 2022

I'm sorry that it is a little bit confusing. This code means that we reshape one batch of data from (6, 3, H, W) to (1, 6, 3, H, W) in the training step, since one frame has 6 surrounding views.
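
A minimal sketch of that reshape, with placeholder H and W rather than the actual training resolution:

```python
import torch

# Minimal sketch of the reshape described above: one DDAD frame consists of
# 6 surrounding camera views, so the tensor loaded as (6, 3, H, W) is treated
# as a single sample of shape (1, 6, 3, H, W) during training.
N, C, H, W = 6, 3, 384, 640              # H, W are placeholder values
views = torch.randn(N, C, H, W)          # (6, 3, H, W) as loaded
views = views.unsqueeze(0)               # (1, 6, 3, H, W) fed to the model
print(views.shape)                       # torch.Size([1, 6, 3, 384, 640])
```

This keeps the 6 surrounding views of one frame grouped together as a single sample.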

@myc634
Author

myc634 commented Nov 2, 2022

Thanks for your explanation! You did great and solid work! I have one last question: how long did training take on your machine for the scale-aware model with SfM pretraining? The program estimates that it will take 63 hours.

@weiyithu
Owner

weiyithu commented Nov 3, 2022

I remember that for DDAD it takes about 1.5 days on 8 RTX 3090 GPUs for scale-aware training.

@myc634
Author

myc634 commented Nov 3, 2022

Thank you very much!

@weiyithu
Owner

weiyithu commented Nov 3, 2022

You're welcome.

@myc634
Author

myc634 commented Jan 7, 2023

I noticed that you mentioned in the paper that the results of FSM are different from the original paper. Have you tried to reproduce the FSM results yourself?
