On using multi-GPU workers with DistributedDataParallel (DDP) and PyTorch Lightning #135

Open · simon-ging opened this issue Aug 18, 2024 · 0 comments

Hi there! First of all, thanks for your great work.

Correct me if I am wrong, but currently there is no implemented way to run neps workers that use multiple GPUs with DistributedDataParallel (DDP) to evaluate a single config.

The problem is that DDP (or, in PyTorch Lightning's DDP strategy, the trainer.fit call) spawns an additional process for each GPU. If these processes call neps.run, each of them registers as a new worker and receives a new config. This is incorrect, since they should load the same config as rank 0. A sketch of the failure mode is shown below.
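For illustration, here is a minimal sketch of the naive pattern that breaks (this is not the linked script; MyLightningModule is a placeholder, and the neps.run keywords follow the older run_pipeline/pipeline_space API, which may differ in newer neps releases):

```python
import lightning as L
import neps


def run_pipeline(learning_rate: float):
    # Lightning's DDP strategy re-launches the whole script once per GPU,
    # so this function ends up running in 4 separate processes.
    trainer = L.Trainer(accelerator="gpu", devices=4, strategy="ddp")
    trainer.fit(MyLightningModule(learning_rate))  # hypothetical module
    return {"loss": trainer.callback_metrics["val_loss"].item()}


# Every DDP-spawned process reaches this call, registers as a separate neps
# worker, and is handed a *different* config -- ranks 1+ then train with the
# wrong hyperparameters instead of joining rank 0's run.
neps.run(
    run_pipeline=run_pipeline,
    pipeline_space={
        "learning_rate": neps.FloatParameter(lower=1e-5, upper=1e-1, log=True),
    },
    root_directory="neps_results",
    max_evaluations_total=10,
)
```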

The solution is to have only rank 0 interact with neps and get a new config. Ranks 1+ then need to ignore neps, figure out which config rank 0 is running, and start training with that same config. Finally, set max_evaluations_per_run=1, kill all processes, and restart everything to run the next config. (The last step avoids potential issues with reusing distributed process groups for a new config / new Lightning trainers.) A sketch of this workflow follows below.
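A rough sketch of that workflow, not the linked script itself: the rank check via the LOCAL_RANK / NODE_RANK environment variables, the load_config_rank0_is_running helper, and the exact neps.run keywords are assumptions for illustration.

```python
import os
import neps


def is_rank_zero() -> bool:
    # Lightning's DDP launcher sets LOCAL_RANK / NODE_RANK in the processes
    # it spawns; the initial launch of the script has neither set.
    return (
        os.environ.get("LOCAL_RANK", "0") == "0"
        and os.environ.get("NODE_RANK", "0") == "0"
    )


if is_rank_zero():
    # Only rank 0 talks to neps: it receives one new config, evaluates it
    # (trainer.fit spawns the other ranks), and then exits.
    neps.run(
        run_pipeline=run_pipeline,        # the Lightning training function from above
        pipeline_space=pipeline_space,
        root_directory="neps_results",
        max_evaluations_per_run=1,        # one config per launch, then restart everything
    )
else:
    # Ranks 1+ must NOT call neps.run. Instead they discover which config
    # rank 0 is currently evaluating (e.g. by reading the newest config
    # directory under root_directory) and call the training function directly,
    # so trainer.fit joins rank 0's distributed process group.
    config = load_config_rank0_is_running("neps_results")  # hypothetical helper
    run_pipeline(**config)

# An outer loop (e.g. a shell script) relaunches this script until the
# desired total number of evaluations has been reached.
```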

The full solution script, which contains more details, is linked below. Note that it works with the official neps master branch; there is no need to install any fork.

https://github.com/simon-ging/neps/blob/multigpu/neps_examples/experimental/lightning_multigpu_workers.py

No action from your side is necessary. I just wanted to share this in case anyone else runs into the same problem. Feel free to use it for your repository.

Best,

Simon
