I haven't encountered this issue before. You mentioned that the problem happens in sync_params(); could you try removing that call as a quick workaround, since stage 2 only uses a single GPU for training? I will look into the code to see what caused the problem when I have more time. Thank you!
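For reference, a minimal sketch of what that workaround could look like: guarding sync_params() so the broadcast is skipped when only one process is running. The guard is an assumption on my part, not the repository's code.

```python
import torch as th
import torch.distributed as dist

def sync_params(params):
    """
    Synchronize a sequence of Tensors across ranks from rank 0.
    Skips the broadcast entirely when torch.distributed is not initialized
    or only a single process is active (e.g. single-GPU stage 2 training).
    """
    if not dist.is_initialized() or dist.get_world_size() == 1:
        return
    for p in params:
        with th.no_grad():
            dist.broadcast(p, 0)
```

Alternatively, the call to sync_params() in the training script can simply be commented out for single-GPU runs.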
Hi, thank you for providing the open-source code.
While running stage 2 training with

```shell
mpiexec -n 1 python scripts/train.py --latent_dim 64 --encoder_type resnet18 --log_dir log/stage2 --resume_checkpoint log/stage1/stage1_model050000.pt --data_dir peronsal_deca.lmdb --lr 1e-5 --p2_weight True --image_size 256 --batch_size 4 --max_steps 5000 --num_workers 8 --save_interval 5000 --stage 2
```

the code gave me the following error:

```
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
```
The message suggests replacing every in-place operation with an out-of-place one, or wrapping the operation in torch.no_grad(). However, sync_params() (which is where the error actually occurs) already uses torch.no_grad():
```python
def sync_params(params):
    """
    Synchronize a sequence of Tensors across ranks from rank 0.
    """
    for p in params:
        with th.no_grad():
            dist.broadcast(p, 0)
```
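Would a variant like the one below, which broadcasts the parameter's underlying .data tensor so the in-place write bypasses the leaf Variable that requires grad, be an acceptable workaround? This is just a sketch based on the error message, not code from the repository.

```python
# Hypothetical variant of sync_params: broadcast p.data instead of p, so the
# in-place copy performed by broadcast happens on the plain data tensor
# rather than on the leaf Variable that requires grad.
def sync_params_data(params):
    for p in params:
        with th.no_grad():
            dist.broadcast(p.data, 0)
```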
Could you give me some advice on how to handle this problem?
Thank you.