Hi there! First of all, thanks for your great work.
Correct me if I am wrong, but there is currently no built-in way to run neps workers that use multiple GPUs with DistributedDataParallel (DDP) to evaluate a single config.
The problem is that DDP (or, in PyTorch Lightning's DDP strategy, the trainer.fit call) spawns an additional process for each GPU. If these spawned processes call neps.run, they each register as a new worker and receive a new config. This is incorrect: they should load the same config as rank 0.
The workaround is to have only rank 0 interact with neps and receive a new config. Ranks 1+ must ignore neps, figure out which config rank 0 is running, and start training with that same config. Finally, set max_evaluations_per_run=1, kill all processes, and restart everything to run the next config. (The last step avoids potential issues with reusing distributed process groups or Lightning trainers across configs.)
The solution script below has more details. Note that it works with the official neps master branch; there is no need to install any forks.

https://github.com/simon-ging/neps/blob/multigpu/neps_examples/experimental/lightning_multigpu_workers.py
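In case it helps, here is a minimal sketch of the pattern; it is illustrative, not a copy of the linked script. The LOCAL_RANK check for detecting spawned processes, the two placeholder helpers, and the exact neps.run / search-space argument names are assumptions and may differ from the script and from your neps version.

```python
# Minimal sketch of the rank-0-only pattern (illustrative, not the linked script).
import os

import neps


def train_with_lightning(learning_rate: float) -> float:
    # Placeholder: build your LightningModule and Trainer(strategy="ddp", ...)
    # here and return the metric to minimize (e.g. validation loss).
    ...


def load_config_rank0_is_running(root_directory: str) -> dict:
    # Placeholder: recover the config rank 0 is currently evaluating, e.g. by
    # reading the newest config written to the shared neps root directory
    # (see the linked script for a working approach).
    ...


def evaluate_pipeline(learning_rate: float) -> float:
    # All ranks end up here with the same hyperparameters; the Lightning
    # Trainer inside then sets up DDP across the available GPUs as usual.
    return train_with_lightning(learning_rate)


def main() -> None:
    # Lightning's DDP launcher sets LOCAL_RANK on the processes it spawns, so
    # only the originally launched process (rank 0) sees it unset or "0".
    is_spawned_ddp_process = os.environ.get("LOCAL_RANK", "0") != "0"

    if not is_spawned_ddp_process:
        # Rank 0: ask neps for exactly one new config, evaluate it, then exit
        # so the whole job (and its process group) is restarted cleanly
        # before the next config.
        neps.run(
            evaluate_pipeline,
            pipeline_space={"learning_rate": neps.Float(lower=1e-5, upper=1e-1, log=True)},
            root_directory="neps_results",
            max_evaluations_per_run=1,
        )
    else:
        # Ranks 1+: do NOT call neps.run (that would register a new worker
        # and draw a different config). Instead, join rank 0's run.
        config = load_config_rank0_is_running("neps_results")
        evaluate_pipeline(**config)


if __name__ == "__main__":
    main()
```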
No action from your side is necessary. I just wanted to share this in case anyone else runs into the same problem. Feel free to use it for your repository.
Best,
Simon