I am assessing the impact of tuning low-level communication library hyperparameters on distributed training throughput, and I have selected your DLRM implementation as the benchmark workload.
However, even before altering the underlying communication libraries (i.e., with all hyperparameters at their defaults), my performance logs show substantial variance across repeated runs. The spread covers roughly two orders of magnitude, with iteration times sometimes around 20 ms and other times reaching 1500 ms, all with num-batches=50.
To identify the source of these fluctuations, I have varied the DDP launch method (both mpirun and torchrun), changed DDP parameters (num-workers, mlperf), scaled the number of participating nodes from 1 to 4, and tried restarting machines and processes and changing communication ports. The cause of the variance remains elusive.
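For concreteness, the distributed setup I am varying boils down to a skeleton like the one below (a minimal sketch with a placeholder model and random data, not the actual dlrm_s_pytorch.py code; it assumes a torchrun launch that exports RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT):

```python
# Minimal DDP skeleton in the spirit of my runs (placeholder model/data,
# not the actual DLRM code). Assumes torchrun sets the usual rendezvous
# environment variables.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")  # NCCL backend, all defaults
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # stand-in for DLRM
    ddp_model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for _ in range(10):  # mirrors num-batches=10
        x = torch.randn(2048, 1024, device=local_rank)
        loss = ddp_model(x).sum()
        opt.zero_grad()
        loss.backward()  # gradient all-reduce happens here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```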
My system specifications include:
Ubuntu 22.04.1 LTS Codename: jammy
Kernel version 5.15.0-105-generic
Miniconda, Python 3.8, torch 2.3.0
GPU and CUDA/NCCL details as follows:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-PCIE-16GB           Off |   00000000:3B:00.0 Off |                    0 |
| N/A   32C    P0             36W / 250W  |       0MiB / 16384MiB  |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla V100-PCIE-16GB           Off |   00000000:86:00.0 Off |                    0 |
| N/A   31C    P0             38W / 250W  |       0MiB / 16384MiB  |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  Tesla V100-PCIE-16GB           Off |   00000000:AF:00.0 Off |                    0 |
| N/A   31C    P0             37W / 250W  |       0MiB / 16384MiB  |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  Tesla V100-PCIE-16GB           Off |   00000000:D8:00.0 Off |                    0 |
| N/A   33C    P0             38W / 250W  |       0MiB / 16384MiB  |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
Here is the command used to start the training script on the master node:
For reproducibility, I set num-batches to 10 and consistently observed large performance differences across four repeated runs:
Run 1 (average ~1320ms)
Finished training it 1/10 of epoch 0, 0/1=1776.15 ms/it, loss 1.159998
Finished training it 2/10 of epoch 0, 0/1=1243.90 ms/it, loss 0.896872
Finished training it 3/10 of epoch 0, 0/1=1314.72 ms/it, loss 0.844668
Finished training it 4/10 of epoch 0, 0/1=1238.35 ms/it, loss 0.690382
Finished training it 5/10 of epoch 0, 0/1=1257.99 ms/it, loss 0.694125
Finished training it 6/10 of epoch 0, 0/1=1315.04 ms/it, loss 0.693931
Finished training it 7/10 of epoch 0, 0/1=1293.88 ms/it, loss 0.692185
Finished training it 8/10 of epoch 0, 0/1=1244.26 ms/it, loss 0.693441
Finished training it 9/10 of epoch 0, 0/1=1277.02 ms/it, loss 0.696777
Finished training it 10/10 of epoch 0, 0/1=1242.58 ms/it, loss 0.693220
Run 2 (average ~1073ms)
Finished training it 1/10 of epoch 0, 0/1=1575.05 ms/it, loss 1.159998
Finished training it 2/10 of epoch 0, 0/1=1076.93 ms/it, loss 0.896872
Finished training it 3/10 of epoch 0, 0/1=1116.18 ms/it, loss 0.844668
Finished training it 4/10 of epoch 0, 0/1=1094.06 ms/it, loss 0.690382
Finished training it 5/10 of epoch 0, 0/1=1098.16 ms/it, loss 0.694125
Finished training it 6/10 of epoch 0, 0/1=1071.85 ms/it, loss 0.693931
Finished training it 7/10 of epoch 0, 0/1=1128.99 ms/it, loss 0.692185
Finished training it 8/10 of epoch 0, 0/1=1127.59 ms/it, loss 0.693441
Finished training it 9/10 of epoch 0, 0/1=1106.33 ms/it, loss 0.696777
Finished training it 10/10 of epoch 0, 0/1=1093.37 ms/it, loss 0.693220
Run 3 (average ~351ms)
Finished training it 1/10 of epoch 0, 0/1=804.80 ms/it, loss 1.159998
Finished training it 2/10 of epoch 0, 0/1=297.06 ms/it, loss 0.896872
Finished training it 3/10 of epoch 0, 0/1=295.61 ms/it, loss 0.844668
Finished training it 4/10 of epoch 0, 0/1=312.81 ms/it, loss 0.690382
Finished training it 5/10 of epoch 0, 0/1=284.89 ms/it, loss 0.694125
Finished training it 6/10 of epoch 0, 0/1=316.30 ms/it, loss 0.693931
Finished training it 7/10 of epoch 0, 0/1=266.81 ms/it, loss 0.692185
Finished training it 8/10 of epoch 0, 0/1=316.45 ms/it, loss 0.693441
Finished training it 9/10 of epoch 0, 0/1=310.37 ms/it, loss 0.696777
Finished training it 10/10 of epoch 0, 0/1=313.58 ms/it, loss 0.693220
Run 4 (average ~155.54ms)
Finished training it 1/10 of epoch 0, 0/1=608.88 ms/it, loss 1.159998
Finished training it 2/10 of epoch 0, 0/1=122.61 ms/it, loss 0.896872
Finished training it 3/10 of epoch 0, 0/1=112.35 ms/it, loss 0.844668
Finished training it 4/10 of epoch 0, 0/1=76.45 ms/it, loss 0.690382
Finished training it 5/10 of epoch 0, 0/1=77.42 ms/it, loss 0.694125
Finished training it 6/10 of epoch 0, 0/1=110.31 ms/it, loss 0.693931
Finished training it 7/10 of epoch 0, 0/1=129.44 ms/it, loss 0.692185
Finished training it 8/10 of epoch 0, 0/1=62.11 ms/it, loss 0.693441
Finished training it 9/10 of epoch 0, 0/1=134.10 ms/it, loss 0.696777
Finished training it 10/10 of epoch 0, 0/1=121.83 ms/it, loss 0.693220
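To make explicit what I mean by one iteration, here is the kind of timing loop these ms/it figures correspond to (a hypothetical sketch with CUDA synchronization and the first warm-up iteration excluded from the mean, not the DLRM script's built-in timer):

```python
# Hypothetical timing cross-check (not the DLRM script's built-in timer):
# synchronize CUDA before reading the clock and exclude the first warm-up
# iteration from the mean, so ms/it reflects the actual step time.
import time
import torch

def timed_steps(step_fn, num_batches=10, warmup=1):
    times_ms = []
    for it in range(num_batches):
        torch.cuda.synchronize()
        start = time.perf_counter()
        step_fn()  # one forward/backward/optimizer step
        torch.cuda.synchronize()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if it >= warmup:
            times_ms.append(elapsed_ms)
        print(f"it {it + 1}/{num_batches}: {elapsed_ms:.2f} ms/it")
    print(f"mean over last {len(times_ms)} its: {sum(times_ms) / len(times_ms):.2f} ms/it")
    return times_ms
```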
I would greatly appreciate any insights you may have on what could be causing these performance inconsistencies. Ensuring a stable baseline is crucial before I proceed with tweaking communication library hyperparameters.
Thank you for your time and support. ❤