I haven't encountered this issue before. You mentioned that the problem happens in sync_params(); could you try removing that call as a quick workaround, since stage 2 only uses a single GPU for training? I will look into the code to see what caused the problem when I have more time. Thank you!
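For reference, a minimal sketch of what that workaround could look like: guarding sync_params() so the broadcast is skipped when only one process is running. The guard is an assumption on my part, not the repository's code.

```python
import torch as th
import torch.distributed as dist

def sync_params(params):
    """
    Synchronize a sequence of Tensors across ranks from rank 0.
    Skips the broadcast entirely when torch.distributed is not initialized
    or only a single process is active (e.g. single-GPU stage 2 training).
    """
    if not dist.is_initialized() or dist.get_world_size() == 1:
        return
    for p in params:
        with th.no_grad():
            dist.broadcast(p, 0)
```

Alternatively, the call to sync_params() in the training script can simply be commented out for single-GPU runs.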
Hi, thank you for providing the open-source code.
While running stage 2 training with

```shell
mpiexec -n 1 python scripts/train.py --latent_dim 64 --encoder_type resnet18 --log_dir log/stage2 --resume_checkpoint log/stage1/stage1_model050000.pt --data_dir peronsal_deca.lmdb --lr 1e-5 --p2_weight True --image_size 256 --batch_size 4 --max_steps 5000 --num_workers 8 --save_interval 5000 --stage 2
```

the code gave me the following error:

```
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
```
The message suggests replacing every in-place operation with an out-of-place one, or wrapping the operation in torch.no_grad(). However, sync_params() (which is where the error actually occurs) already uses torch.no_grad():
```python
def sync_params(params):
    """
    Synchronize a sequence of Tensors across ranks from rank 0.
    """
    for p in params:
        with th.no_grad():
            dist.broadcast(p, 0)
```
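Would a variant like the one below, which broadcasts the parameter's underlying .data tensor so the in-place write bypasses the leaf Variable that requires grad, be an acceptable workaround? This is just a sketch based on the error message, not code from the repository.

```python
# Hypothetical variant of sync_params: broadcast p.data instead of p, so the
# in-place copy performed by broadcast happens on the plain data tensor
# rather than on the leaf Variable that requires grad.
def sync_params_data(params):
    for p in params:
        with th.no_grad():
            dist.broadcast(p.data, 0)
```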
Could you give me some advice on how to handle this problem?
Thank you.