Hi,

I tried to run the `fine_tune.py` script on my lab's server, which is a normal 4-GPU Ubuntu workstation without SLURM support. When I ran it without a distributed training setup, everything was okay. Then I tried to switch to a multi-GPU setting and couldn't get it to work. I have tried the following approaches, and none of them worked:
1. `accelerate config` followed by `accelerate launch fine_tune.py --py_args`, which gave me the following error while initializing the accelerator object:

   ```
   ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
   ```

2. `torchrun fine_tune.py --py_args`, which gave me the same error as method 1 (see the debugging sketch after this list).
3. Writing another shell script that calls the `fine_tune_pascal.sh` script 4 times, passing in a different `$SLURM_ARRAY_TASK_ID` each time. This does not seem to be the correct approach, since every process claimed to be the main process, and I suspect they were all just doing duplicate work.
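To narrow down methods 1 and 2, here is a tiny probe script I can run under each launcher to see which rendezvous variables are actually exported (a minimal sketch; the variable names are the standard `torch.distributed` env:// ones, and `probe_env.py` is just a hypothetical file name, not part of this repo):

```python
# probe_env.py -- hypothetical debugging helper.
# Run as `accelerate launch probe_env.py` or `torchrun --nproc_per_node=4 probe_env.py`
# to check whether the launcher exports the env:// rendezvous variables.
import os

for var in ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"):
    # A plain `python probe_env.py` run should print None for all of these;
    # a correctly configured multi-process launch should set all five per process.
    print(f"{var} = {os.environ.get(var)}")
```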
Could you help me out with this? I'm pretty sure my accelerate setup is okay, since I'm able to run their official toy example. Could the problem be that the code inside the `if __name__ == "__main__":` block is not wrapped in a `main()` function, as instructed by Hugging Face Accelerate? Should I wrap it myself?
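For reference, the structure the Accelerate docs describe is roughly the following (a sketch of the pattern only, not the actual contents of `fine_tune.py`):

```python
from accelerate import Accelerator

def main():
    # Accelerator() picks up the RANK / LOCAL_RANK / WORLD_SIZE variables
    # that `accelerate launch` or `torchrun` sets for each process.
    accelerator = Accelerator()
    # ... the training code currently sitting under the
    # `if __name__ == "__main__":` block in fine_tune.py would move here ...

if __name__ == "__main__":
    main()
```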