Significant Variability in DLRM Benchmark Performance Metrics #383

Open
SweeneyJun opened this issue May 20, 2024 · 0 comments
SweeneyJun commented May 20, 2024

I'm aiming to assess the impact of modifying low-level communication library hyperparameters on distributed machine learning training throughput, and have selected your DLRM implementation as a benchmark workload.

Curiously, even before making any changes to the underlying communication libraries (i.e., with all hyperparameters at their default values), my performance logs showed substantial variance across repeated test runs. The disparity spans roughly two orders of magnitude, with iteration times sometimes around 20 ms and other times around 1500 ms, all with num-batches=50.

To identify the source of these fluctuations, I have tried varying the DDP launch method (both mpirun and torchrun), adjusting DDP-related parameters (num-workers, mlperf), changing the number of participating nodes (from 1 to 4), restarting machines and processes, and switching communication ports. The cause of the fluctuations nevertheless remains elusive.
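
To further isolate the problem, I am also considering timing raw NCCL all-reduces across the same nodes, independently of the DLRM model, to see whether the interconnect alone shows the same run-to-run variance. Below is a minimal sketch of such a microbenchmark (a standalone script of my own, not part of the DLRM repo; it assumes it is launched with torchrun using the same --nnodes/--nproc_per_node/--master_addr settings as the training command further down, so that LOCAL_RANK and the rendezvous variables are set automatically):

# allreduce_bench.py -- minimal NCCL all-reduce timing sketch (hypothetical helper)
import os
import time

import torch
import torch.distributed as dist


def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # ~256 MB of float32, roughly the scale of a large gradient bucket
    tensor = torch.ones(64 * 1024 * 1024, device="cuda")

    # Warm-up so NCCL ring/tree setup is excluded from the timings
    for _ in range(5):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()

    times_ms = []
    for _ in range(20):
        start = time.perf_counter()
        dist.all_reduce(tensor)
        torch.cuda.synchronize()  # wait until the collective has finished on the GPU
        times_ms.append((time.perf_counter() - start) * 1000.0)

    if dist.get_rank() == 0:
        print(f"all_reduce: min {min(times_ms):.2f} ms, max {max(times_ms):.2f} ms")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()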

My system specifications include:

Ubuntu 22.04.1 LTS (codename: jammy)
Kernel 5.15.0-105-generic
Miniconda, Python 3.8, PyTorch 2.3.0
GPU and CUDA/NCCL details as follows:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-PCIE-16GB           Off |   00000000:3B:00.0 Off |                    0 |
| N/A   32C    P0             36W /  250W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla V100-PCIE-16GB           Off |   00000000:86:00.0 Off |                    0 |
| N/A   31C    P0             38W /  250W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  Tesla V100-PCIE-16GB           Off |   00000000:AF:00.0 Off |                    0 |
| N/A   31C    P0             37W /  250W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  Tesla V100-PCIE-16GB           Off |   00000000:D8:00.0 Off |                    0 |
| N/A   33C    P0             38W /  250W |       0MiB /  16384MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Here is the command used to start the training script on the master node:

NCCL_SOCKET_IFNAME=custom LD_LIBRARY_PATH=/root/openmpi/lib:/root/miniconda3/lib:/usr/local/cuda-12.4/lib64:/root/miniconda3/envs/lla/lib PATH=/root/miniconda3/bin:/root/miniconda3/condabin:/root/openmpi/ /root/miniconda3/envs/lla/bin/torchrun \
    --nproc_per_node=4 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr=192.168.245.165 \
    --master_port=1234 \
    /root/dlrm/dlrm_s_pytorch.py \
    --dist-backend nccl \
    --arch-embedding-size 1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 \
    --arch-sparse-feature-size 64 --arch-mlp-bot 512-512-64 \
    --arch-mlp-top 1024-1024-1024-1 \
    --max-ind-range 40000000 --data-generation random --loss-function bce --round-targets True --learning-rate 0.1 \
    --mini-batch-size 2048 --print-freq 1 --print-time --test-freq 0 --test-mini-batch-size 2048 \
    --use-gpu --num-batches 10  > 0520_4.txt 2>&1 &
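
For completeness, prefixing the launch command with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET should log which interface and transport NCCL actually selects on each run. In addition, a tiny script along these lines (a sketch of my own, not part of the DLRM repo), launched with the same torchrun line, can confirm that every rank on both nodes sees identical library versions and rendezvous settings:

# env_check.py -- print library versions and the env exported by torchrun
# (a hypothetical helper, not part of the DLRM repo)
import os

import torch

print("torch:", torch.__version__,
      "| CUDA:", torch.version.cuda,
      "| NCCL:", torch.cuda.nccl.version())
for key in ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR",
            "MASTER_PORT", "NCCL_SOCKET_IFNAME", "LD_LIBRARY_PATH"):
    print(f"{key}={os.environ.get(key)}")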

For reproducibility, I set the num-batches parameter to 10 and consistently observed large performance differences across four repeated runs, as follows:

Run 1 (average ~1320ms)

Finished training it 1/10 of epoch 0, 0/1=1776.15 ms/it, loss 1.159998
Finished training it 2/10 of epoch 0, 0/1=1243.90 ms/it, loss 0.896872
Finished training it 3/10 of epoch 0, 0/1=1314.72 ms/it, loss 0.844668
Finished training it 4/10 of epoch 0, 0/1=1238.35 ms/it, loss 0.690382
Finished training it 5/10 of epoch 0, 0/1=1257.99 ms/it, loss 0.694125
Finished training it 6/10 of epoch 0, 0/1=1315.04 ms/it, loss 0.693931
Finished training it 7/10 of epoch 0, 0/1=1293.88 ms/it, loss 0.692185
Finished training it 8/10 of epoch 0, 0/1=1244.26 ms/it, loss 0.693441
Finished training it 9/10 of epoch 0, 0/1=1277.02 ms/it, loss 0.696777
Finished training it 10/10 of epoch 0, 0/1=1242.58 ms/it, loss 0.693220

Run 2 (average ~1073ms)

Finished training it 1/10 of epoch 0, 0/1=1575.05 ms/it, loss 1.159998
Finished training it 2/10 of epoch 0, 0/1=1076.93 ms/it, loss 0.896872
Finished training it 3/10 of epoch 0, 0/1=1116.18 ms/it, loss 0.844668
Finished training it 4/10 of epoch 0, 0/1=1094.06 ms/it, loss 0.690382
Finished training it 5/10 of epoch 0, 0/1=1098.16 ms/it, loss 0.694125
Finished training it 6/10 of epoch 0, 0/1=1071.85 ms/it, loss 0.693931
Finished training it 7/10 of epoch 0, 0/1=1128.99 ms/it, loss 0.692185
Finished training it 8/10 of epoch 0, 0/1=1127.59 ms/it, loss 0.693441
Finished training it 9/10 of epoch 0, 0/1=1106.33 ms/it, loss 0.696777
Finished training it 10/10 of epoch 0, 0/1=1093.37 ms/it, loss 0.693220

Run 3 (average ~351ms)

Finished training it 1/10 of epoch 0, 0/1=804.80 ms/it, loss 1.159998
Finished training it 2/10 of epoch 0, 0/1=297.06 ms/it, loss 0.896872
Finished training it 3/10 of epoch 0, 0/1=295.61 ms/it, loss 0.844668
Finished training it 4/10 of epoch 0, 0/1=312.81 ms/it, loss 0.690382
Finished training it 5/10 of epoch 0, 0/1=284.89 ms/it, loss 0.694125
Finished training it 6/10 of epoch 0, 0/1=316.30 ms/it, loss 0.693931
Finished training it 7/10 of epoch 0, 0/1=266.81 ms/it, loss 0.692185
Finished training it 8/10 of epoch 0, 0/1=316.45 ms/it, loss 0.693441
Finished training it 9/10 of epoch 0, 0/1=310.37 ms/it, loss 0.696777
Finished training it 10/10 of epoch 0, 0/1=313.58 ms/it, loss 0.693220

Run 4 (average ~155.54ms)

Finished training it 1/10 of epoch 0, 0/1=608.88 ms/it, loss 1.159998
Finished training it 2/10 of epoch 0, 0/1=122.61 ms/it, loss 0.896872
Finished training it 3/10 of epoch 0, 0/1=112.35 ms/it, loss 0.844668
Finished training it 4/10 of epoch 0, 0/1=76.45 ms/it, loss 0.690382
Finished training it 5/10 of epoch 0, 0/1=77.42 ms/it, loss 0.694125
Finished training it 6/10 of epoch 0, 0/1=110.31 ms/it, loss 0.693931
Finished training it 7/10 of epoch 0, 0/1=129.44 ms/it, loss 0.692185
Finished training it 8/10 of epoch 0, 0/1=62.11 ms/it, loss 0.693441
Finished training it 9/10 of epoch 0, 0/1=134.10 ms/it, loss 0.696777
Finished training it 10/10 of epoch 0, 0/1=121.83 ms/it, loss 0.693220
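
A small helper along the following lines (a sketch of my own, not part of the DLRM repo) can extract the printed ms/it values from logs like these and report the mean over all iterations, as well as the mean and standard deviation with the first, warm-up-heavy iteration excluded:

# parse_itertimes.py -- recompute per-run iteration-time statistics from a DLRM log
# (a hypothetical helper, not part of the DLRM repo)
import re
import statistics
import sys

# Matches the "=1776.15 ms/it" part of each "Finished training it ..." line
PATTERN = re.compile(r"=([\d.]+) ms/it")


def main(path):
    times = []
    with open(path) as f:
        for line in f:
            m = PATTERN.search(line)
            if m:
                times.append(float(m.group(1)))
    if not times:
        print("no ms/it lines found")
        return
    print(f"{len(times)} iterations, mean (all): {statistics.mean(times):.2f} ms/it")
    if len(times) > 2:
        rest = times[1:]  # drop the first iteration
        print(f"mean (skip first): {statistics.mean(rest):.2f} ms/it, "
              f"stdev (skip first): {statistics.stdev(rest):.2f} ms")


if __name__ == "__main__":
    main(sys.argv[1])

It can be run against the redirected log file from the command above, e.g. python parse_itertimes.py 0520_4.txt.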

I would greatly appreciate any insights you may have on what could be causing these performance inconsistencies. Ensuring a stable baseline is crucial before I proceed with tweaking communication library hyperparameters.

Thank you for your time and support. ❤
