
About stage 2 training #13

Open

Rayiz3 opened this issue Nov 26, 2024 · 2 comments

Comments

Rayiz3 commented Nov 26, 2024

Hi, thank you for providing the open-source code.

While running stage 2 training with

```
mpiexec -n 1 python scripts/train.py --latent_dim 64 --encoder_type resnet18 --log_dir log/stage2 --resume_checkpoint log/stage1/stage1_model050000.pt --data_dir peronsal_deca.lmdb --lr 1e-5 --p2_weight True --image_size 256 --batch_size 4 --max_steps 5000 --num_workers 8 --save_interval 5000 --stage 2
```

the code gave me an error:

```
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
```

The traceback says I have to replace all in-place operations with out-of-place ones, or use torch.no_grad().

But the code already seems to use torch.no_grad() in sync_params(), which is where the error actually occurs:

```python
def sync_params(params):
    """
    Synchronize a sequence of Tensors across ranks from rank 0.
    """
    for p in params:
        with th.no_grad():
            dist.broadcast(p, 0)
```
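For context, the RuntimeError is PyTorch's general guard against in-place mutation of leaf tensors that require grad outside a no_grad context; a minimal standalone reproduction (not from this repo) looks like:

```python
import torch

# A leaf tensor that requires grad, like a model parameter.
p = torch.zeros(3, requires_grad=True)

try:
    p.add_(1.0)  # in-place op on a grad-requiring leaf -> RuntimeError
except RuntimeError as e:
    print(type(e).__name__)  # RuntimeError

with torch.no_grad():
    p.add_(1.0)  # allowed: autograd is not tracking inside no_grad
```

So when the broadcast is correctly wrapped in no_grad, the error should not appear, which suggests something environment-specific is bypassing that context.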

Could you give me some advice on how to manage this problem?

Thank you.

zh-ding (Collaborator) commented Nov 27, 2024

I didn't encounter this issue before. You mentioned that the problem happens in sync_params(); can you try removing that call as a quick workaround? Stage 2 only uses a single GPU for training, so the broadcast isn't needed there. I will check the code when I get more time to see what caused the problem. Thank you!

Rayiz3 (Author) commented Nov 27, 2024

I was actually working on Windows; when I tried this on WSL, it worked well. Thank you!
