I'm training a BigGAN with differentiable augmentation (DiffAug) and LeCam regularization on a custom dataset. My setup has 4 NVIDIA RTX 3070 GPUs and runs Ubuntu 20.04. I observe that training on the 4 GPUs with DistributedDataParallel takes the same time as training on a single GPU. Am I doing something wrong?

For training on a single GPU, I'm using the following command:

```
CUDA_VISIBLE_DEVICES=0 python3 src/main.py -t -hdf5 -l -std_stat -std_max 64 -std_step 64 -metrics fid is prdc -ref "train" -cfg src/configs/VWW/BigGAN-DiffAug-LeCam.yaml -data ../Datasets/vw_coco2014_96_GAN -save SAVE_PATH_VWW -mpc --post_resizer "friendly" --eval_backbone "InceptionV3_tf"
```

For training on the 4 GPUs, I'm using the following commands:

```
export MASTER_ADDR=localhost
export MASTER_PORT=1234
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 src/main.py -t -DDP -tn 1 -cn 0 -std_stat -std_max 64 -std_step 64 -metrics fid is prdc -ref "train" -cfg src/configs/VWW/BigGAN-DiffAug-LeCam.yaml -data ../Datasets/vw_coco2014_96_GAN -save SAVE_PATH_VWW -mpc --post_resizer "friendly" --eval_backbone "InceptionV3_tf"
```
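For context, here is a minimal, generic PyTorch DDP launch sketch showing what MASTER_ADDR, MASTER_PORT, and CUDA_VISIBLE_DEVICES are consumed for. This is not this repository's actual training code; the `worker` function and its arguments are illustrative. One process is spawned per visible GPU, and each process joins the group through the rendezvous address:

```python
# Generic DDP launch pattern (sketch): one process per visible GPU.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # MASTER_ADDR / MASTER_PORT are read by init_process_group for rendezvous.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "1234")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # ... build the model, wrap it with torch.nn.parallel.DistributedDataParallel,
    #     and run the training loop here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # 4 when CUDA_VISIBLE_DEVICES=0,1,2,3
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```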
MiguelCosta94 changed the title from "Using 4 GPUs is slower than using just 1" to "Using 4 GPUs for training takes the same time as using just 1" on Dec 5, 2023.
Could you please check the batch size used in training?
If you are using a batch size of 256 on 1 GPU, you should switch to 4 GPUs with a batch size of 64 each to accelerate training. Keeping a batch size of 256 on every GPU will not speed things up, because each GPU then processes just as many samples per step as the single-GPU run did.
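To make the per-GPU batch size concrete, here is a minimal generic PyTorch sketch (the `build_loader` helper and `global_batch_size` parameter are illustrative, not part of this repository): each DDP rank reads its own DistributedSampler shard and loads `global_batch_size // world_size` samples per step, so 4 GPUs with 64 each reproduce the single-GPU batch of 256 while cutting the per-step work on every card.

```python
# Sketch: divide the global batch across DDP ranks so each GPU does less work per step.
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

def build_loader(dataset, global_batch_size):
    world_size = dist.get_world_size()               # 4 in the setup above
    per_gpu_batch = global_batch_size // world_size  # e.g. 256 // 4 = 64
    sampler = DistributedSampler(dataset, shuffle=True)
    return DataLoader(dataset, batch_size=per_gpu_batch,
                      sampler=sampler, num_workers=4, pin_memory=True)
```

With that split, gradients are all-reduced across the 4 ranks, so the effective batch per optimizer step is still 256, but each step finishes faster than on a single GPU.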