A question about distributed training on DDAD dataset #11
Comments
Is the GPU out of memory?
Maybe not, I guess. Currently we are training on A10 GPUs. Yesterday the training process went on fine without changing any settings. BTW, are you using 6 RTX 3090s for training? I have found the code
I'm sorry that it is a little bit confusing. This code means that we reshape one batch of data from (6, 3, H, W) to (1, 6, 3, H, W) in the training step, since one frame has 6 surrounding views.
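Roughly, a minimal sketch of that reshape (H and W below are placeholder sizes, not the exact training resolution):

```python
import torch

# One DDAD "frame" consists of 6 surrounding camera views, each a 3-channel image.
H, W = 384, 640                      # placeholder spatial size
views = torch.randn(6, 3, H, W)      # (num_views, C, H, W) as loaded

# Add a leading batch dimension so the 6 views form a single sample:
# (6, 3, H, W) -> (1, 6, 3, H, W)
frame = views.unsqueeze(0)           # equivalent to views.view(1, 6, 3, H, W)
print(frame.shape)                   # torch.Size([1, 6, 3, 384, 640])
```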
Thanks for your explanation! You did great and solid work! I have one last question: how long did it take to train on your machine for the
I remember that for DDAD it takes about 1.5 days on 8 RTX 3090 GPUs for scale-aware training.
Thank you very much!
You're welcome.
Noting that you mentioned in the paper that the results of FSM differ from the original paper, have you tried to reproduce the results of FSM before?
Hello! I am following your work and trying to reproduce it, but I ran into the issue below while using the command
python -m torch.distributed.launch --nproc_per_node 8 run.py --model_name ddad --config configs/ddad.txt
for distributed training on the DDAD dataset:

[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1806986 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
After training for a while, the process is automatically shut down because the collective operation runs over the timeout.
Are there any details or training settings that I have overlooked? Or does the PyTorch version matter?
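For reference, one workaround I am considering is raising the collective timeout when the process group is created. This is only a minimal sketch assuming the standard torch.distributed API; I have not checked where run.py calls init_process_group, so the call site below is hypothetical, and it would only hide the symptom if one rank is genuinely stuck.

```python
from datetime import timedelta

import torch.distributed as dist

# Hypothetical placement: wherever the training script initializes the process group.
dist.init_process_group(
    backend="nccl",
    timeout=timedelta(hours=2),  # default is 30 minutes, matching the 1800000 ms in the log
)
```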
Thanks!