Description

We are using the torch distributed elastic launch method to kick off training in a multi-GPU, single-node environment. It works fine when running locally, i.e. on a machine with multiple GPUs available, and it also works on a single GPU, but it hangs when we provide the WORLD_SIZE, MASTER_ADDR, and MASTER_PORT parameters. There seems to be some issue with the master address/port configuration, where the process tries to connect but keeps waiting.

Run command:

Relevant code to launch the training:
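For reference, a single-node elastic launch of this kind is commonly wired up through PyTorch's torch.distributed.launcher.api.elastic_launch. The sketch below is purely illustrative; the entrypoint, worker count, and rendezvous endpoint are placeholder assumptions, not the actual code from this issue:

```python
# Minimal single-node elastic launch sketch (placeholder values throughout).
from torch.distributed.launcher.api import LaunchConfig, elastic_launch

def train():
    # Placeholder for the real training entrypoint.
    ...

config = LaunchConfig(
    min_nodes=1,
    max_nodes=1,
    nproc_per_node=2,                 # one worker per GPU on the node
    rdzv_backend="c10d",              # built-in rendezvous, no etcd required
    rdzv_endpoint="localhost:29500",  # placeholder host:port
    run_id="example-run",
)

if __name__ == "__main__":
    # Spawns nproc_per_node worker processes and blocks until they finish.
    elastic_launch(config, train)()
```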
@gkumbhat Are you running a second instance of that command anywhere else, or what is the rationale for setting WORLD_SIZE to 2? From a cursory glance, the first process could be waiting for a second one that was never started.
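To illustrate the blocking behavior (a minimal sketch with hypothetical values): with the env:// init method, init_process_group does not return until all WORLD_SIZE ranks have joined the rendezvous at MASTER_ADDR:MASTER_PORT, so a single process launched with WORLD_SIZE=2 just sits and waits for a rank-1 peer:

```python
import os
import torch.distributed as dist

# Hypothetical values mirroring the report: WORLD_SIZE=2 but only one process started.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("WORLD_SIZE", "2")
os.environ.setdefault("RANK", "0")

# Hangs here: the env:// store waits for all WORLD_SIZE ranks to connect
# (until the rendezvous timeout), and no second process ever joins.
dist.init_process_group(backend="gloo", init_method="env://")
```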
Also, from the links provided, it appears caikit is using torch multiprocessing...
Are WORLD_SIZE, RANK, MASTER_ADDR, or MASTER_PORT set in the environment prior to running this command, and if so, what are their values?
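For example, a quick way to dump those values from inside the launching process (a trivial sketch):

```python
import os

for key in ("WORLD_SIZE", "RANK", "MASTER_ADDR", "MASTER_PORT"):
    print(f"{key}={os.environ.get(key)}")
```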