Hi there! First of all, thanks for your great work.
Correct me if I am wrong, but there is currently no built-in way to run neps workers that use multiple GPUs with DistributedDataParallel (DDP) to evaluate a single config.
The problem is that DDP (or, in PyTorch Lightning's DDP strategy, the trainer.fit call) spawns an additional process for each GPU. If these spawned processes call neps.run, they each register as a new worker and receive a new config. This is incorrect: they should load the same config as rank 0.
The workaround is to have only rank 0 interact with neps and receive a new config. Ranks 1+ must ignore neps, figure out which config rank 0 is running, and start training with that same config. Finally, set max_evaluations_per_run=1, kill all processes, and restart everything to run the next config. (The last step avoids potential issues with reusing distributed process groups or Lightning trainers across configs.)
The solution script below has more details. Note that it works with the official neps master branch; there is no need to install any forks.

https://github.com/simon-ging/neps/blob/multigpu/neps_examples/experimental/lightning_multigpu_workers.py
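In case it helps, here is a minimal sketch of the pattern; it is illustrative, not a copy of the linked script. The LOCAL_RANK check for detecting spawned processes, the two placeholder helpers, and the exact neps.run / search-space argument names are assumptions and may differ from the script and from your neps version.

```python
# Minimal sketch of the rank-0-only pattern (illustrative, not the linked script).
import os

import neps


def train_with_lightning(learning_rate: float) -> float:
    # Placeholder: build your LightningModule and Trainer(strategy="ddp", ...)
    # here and return the metric to minimize (e.g. validation loss).
    ...


def load_config_rank0_is_running(root_directory: str) -> dict:
    # Placeholder: recover the config rank 0 is currently evaluating, e.g. by
    # reading the newest config written to the shared neps root directory
    # (see the linked script for a working approach).
    ...


def evaluate_pipeline(learning_rate: float) -> float:
    # All ranks end up here with the same hyperparameters; the Lightning
    # Trainer inside then sets up DDP across the available GPUs as usual.
    return train_with_lightning(learning_rate)


def main() -> None:
    # Lightning's DDP launcher sets LOCAL_RANK on the processes it spawns, so
    # only the originally launched process (rank 0) sees it unset or "0".
    is_spawned_ddp_process = os.environ.get("LOCAL_RANK", "0") != "0"

    if not is_spawned_ddp_process:
        # Rank 0: ask neps for exactly one new config, evaluate it, then exit
        # so the whole job (and its process group) is restarted cleanly
        # before the next config.
        neps.run(
            evaluate_pipeline,
            pipeline_space={"learning_rate": neps.Float(lower=1e-5, upper=1e-1, log=True)},
            root_directory="neps_results",
            max_evaluations_per_run=1,
        )
    else:
        # Ranks 1+: do NOT call neps.run (that would register a new worker
        # and draw a different config). Instead, join rank 0's run.
        config = load_config_rank0_is_running("neps_results")
        evaluate_pipeline(**config)


if __name__ == "__main__":
    main()
```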
No action from your side is necessary. I just wanted to share this in case anyone else runs into the same problem. Feel free to use it for your repository.
Best,
Simon